Course No. 11 "Next Generation Sequencing Data Analysis"

 

Problem 2

Massively parallel sequencing, also known as next generation sequencing, is a technology enabling high-throughput sequencing of genomes or loci of interest.  Sequence reads from a single locus are provided for three samples, on which variant detection is performed.  Here we use a different set of samples and demonstrate shortcuts that are now available in Galaxy.

Outline:

  • Examine the quality of the sequence reads
  • Optionally trim the reads
  • Map the reads to a reference chromosome
  • Examine sequence variation
  • Visualize the mapping and variants on the chromosome

The steps are as follows.

  1. Logging in:  Start Galaxy at the instructor-provided link.  (Students who are not in the classroom can use http://usegalaxy.org.)  First-time users should register by clicking on User and selecting Register.  This automatically logs in the user.  Otherwise, log in by clicking on User and selecting Login.

  2. Data import:  Import the data by clicking on “Shared Data” and selecting “Published Histories”.  Search for the Name “Source Data 2” and click on the “Source Data 2” link.  Next click on the link “Import history”.  (A copy of the dataset is also provided:  Sourcedata2.zip.)

  3. Viewing the data:  In the Galaxy bar, click on “Analyze Data”.  The imported files should appear on the right with a green background. To view any of the files click on the icon that looks like an eye.  The files containing reads have the .fastq extension, and the reference sequence has the .fa extension.  
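
For readers who want to inspect the .fastq files outside Galaxy, the following is a minimal Python sketch of a reader for the common four-lines-per-record FASTQ layout (header, sequence, separator, quality).  The names FastqRecord and read_fastq are illustrative and are not part of Galaxy; the sketch assumes records are not wrapped across multiple lines.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # header text without the leading "@"
    sequence: str  # base calls
    quality: str   # one encoded quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream.

    Reading stops at end of input or at the first empty line.
    Records are split by line count, not by looking for "@",
    because "@" is also a valid quality character.
    """
    while True:
        header = handle.readline().rstrip("\r\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


if __name__ == "__main__":
    # A made-up record, used only to show the call pattern.
    example = io.StringIO("@read1\nACGT\n+\nIIII\n")
    for record in read_fastq(example):
        print(record.name, len(record.sequence))
```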

  4. Fastq grooming shortcut:  When this dataset was uploaded into Galaxy, the format for the fastq files was manually set to fastqsanger.  When this format is chosen, the grooming step is not required. 
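
As background on why the format matters: the fastqsanger format stores each base quality as a single character whose ASCII code is the Phred score plus 33.  The sketch below decodes such a quality string and converts a Phred score to an error probability.  The function names are illustrative only.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into Phred scores (Sanger offset is 33)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for q in phred_scores("!5I"):
        print(q, error_probability(q))
```

A Phred score of 20, the trimming threshold used later in this exercise, corresponds to an estimated error probability of 1 in 100.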

  5. Quality control:  In the left panel, find the section “NGS: QC and manipulation”, and find the tool “FastQC:Read QC”; click on the link to activate the tool.  In the central panel, for “Short read data from your current history”, select the first fastq file.  Enter the title for the output file as FastQC.  Click the Execute button.  To view the results, click on the eye icon next to the FastQC results in the right panel.

  6. Optional trimming:  (Please note that some mapping tools have optional built-in trimming.)  In the same section, find the tool “FastQ Quality Trimmer”, clicking on the link to activate the tool. Choose the first fastq file, and accept the defaults, with one exception:  set the Quality Score to 20.0.  Click Execute.  Repeat for the other two files.  In the next step below, use the trimmed files for mapping. 
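
To illustrate the idea behind quality trimming, here is a deliberately simplified Python sketch that removes bases from the 3' end of a read while their quality is below a threshold.  It is not the algorithm used by the Galaxy “FastQ Quality Trimmer” tool, which offers additional options; the function name trim_3prime and the fastqsanger offset of 33 are assumptions made for this example.

```python
from typing import Tuple


def trim_3prime(sequence: str, quality: str,
                threshold: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Drop trailing bases whose Phred score is below `threshold`.

    Simplified illustration: each base is judged on its own,
    and only the 3' end of the read is trimmed.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < threshold:
        end -= 1
    return sequence[:end], quality[:end]
```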

  7. Read mapping:  In the left panel, click on the section “NGS: Mapping” to expand the list of mapping tools.  Click on the link “Map with BWA for Illumina” to activate the mapping tool.  The reference genome is the fasta file we imported, chr21.fa.  It is in our History in the right panel.  In the central panel, for the option “Will you select a reference genome from your history or use a built-in index”, select “Use one from the history”.  Then choose chr21.fa for “Select a reference from history”.  For the option, “Is this library mate-paired”, select “Single-end”, and for the “FASTQ file” option, choose the first fastq file.  Click on the Execute button.  Repeat these steps for the second and third fastq files.  Finally, view the SAM (Sequence Alignment/Map format) output in the right panel by clicking on the eye icon next to one of the “Map with BWA for Illumina” results.
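
When viewing the SAM output, it helps to know that each alignment line has eleven mandatory tab-separated fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) and that header lines start with “@”.  The following Python sketch, with illustrative names, pulls out the fields most relevant to this exercise.

```python
from typing import NamedTuple, Optional


class SamAlignment(NamedTuple):
    qname: str  # read name
    flag: int   # bitwise flags
    rname: str  # reference name, e.g. chr21
    pos: int    # 1-based leftmost mapping position
    mapq: int   # mapping quality
    cigar: str  # alignment description
    seq: str    # read sequence
    qual: str   # encoded base qualities

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & 0x10)


def parse_sam_line(line: str) -> Optional[SamAlignment]:
    """Parse one SAM line; return None for header lines."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamAlignment(fields[0], int(fields[1]), fields[2], int(fields[3]),
                        int(fields[4]), fields[5], fields[9], fields[10])
```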

  8. Format conversion shortcut:  Automatic SAM to BAM conversion is now found in many of the tools in Galaxy. 

  9. Pooling data:  The variant caller FreeBayes can operate on pooled data.  To merge the BAM files, click on “Convert, Merge, Randomize BAM datasets” under “NGS: BAM Tools” in the left panel.  In the central panel, select the first SAM-to-BAM dataset.  Use “+Insert BAM datasets to filter” to add the second file, and repeat this step for the third file.  Then click the Execute button.

  10. Variant detection:  In the blue Tools panel on the left, click on the section “NGS: Call Variant Detection” to display the tools in this section.  Or scroll to the top of this panel, and type FreeBayes in the search box.  In the search results, click on FreeBayes to set up the variant detection calculation.  Use version 0.0.3.  For the option “Choose the source for the reference list”, select History.  For the option “BAM file”, choose the merged file from the previous step.  The option “Using reference file” should be chr21.fa, and the Basic options should be chosen.  Click the Execute button.  View the results by clicking in the right panel on the eye icon next to the FreeBayes results.

  11. Sorting the results:  In the Tools panel, click on the section “Filter and Sort” to see the list of tools.  (If you used the search box, click the x next to the query term to clear the search results.)  Click on the Sort tool and look at the central panel.  Set the “Sort Query” option to the FreeBayes result.  The “on column” option should be c6, since the QUAL column is the sixth column.  “with flavor” should be set to “Numerical sort”, and “everything in” should be “Descending order”.  Click the Execute button, and view the results by clicking on the eye icon next to the Sort results in the right panel.
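
The same sort can be sketched in Python for readers working with the VCF file outside Galaxy.  The function name sort_by_qual is illustrative; keeping the header lines (those starting with “#”) at the top and placing records with a missing QUAL value (“.”) last are choices made for this example.

```python
from typing import Iterable, List


def sort_by_qual(vcf_lines: Iterable[str]) -> List[str]:
    """Return VCF lines with records ordered by QUAL, highest first."""
    lines = [line.rstrip("\r\n") for line in vcf_lines]
    header = [line for line in lines if line.startswith("#")]
    records = [line for line in lines if line and not line.startswith("#")]

    def qual(line: str) -> float:
        value = line.split("\t")[5]  # QUAL is the sixth column (index 5)
        return float("-inf") if value == "." else float(value)

    return header + sorted(records, key=qual, reverse=True)
```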

  12. Viewing the mapped reads:  Click on the merged file in the right panel to reveal the display option “display at UCSC main”.  Click on main.  This opens the UCSC Genome Browser with MergedBams.bam displayed.  To see the top scoring variant, type chr21:27,818,520-27,818,550 in the search box, and click the go button.  Then, to view the reads, go down to the “Custom Tracks” section and select full in the menu labeled Convert, Merge, Randomize.  Then click refresh.  The variant in the reads is now visible.

  13. Viewing the variant analysis results (vcf file):  Return to the Galaxy window and, in the panel on the right, click on “FreeBayes on …” to reveal the display option “display at UCSC main”.  Click on main.  This opens another UCSC Genome Browser with the FreeBayes results added to the display.  Scroll down to the “Custom Tracks” section and change the FreeBayes menu to pack.  Change the Convert, Merge, Randomize menu to dense.  This yields a track displaying only the variants.

  14. Examining the biological context:  To view a variant in an exon, go to the textbox at the top of the page, type chr21:27,061,793-27,063,241, and click the go button.  Scroll down to the bar labeled “Genes and Gene Prediction Tracks”; ensure that the menu for “UCSC Genes …” is set to pack.  Now the variants can be seen in the context of the tracks chosen for display in the UCSC Genome Browser.  Zoom in on the first variant in this view by holding down the shift key and the left mouse button to make a rectangle enclosing it.  Finally, click on a variant in the FreeBayes track to view detailed information about it.
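
The position strings typed into the browser, such as chr21:27,061,793-27,063,241, name a chromosome and a 1-based, inclusive coordinate range.  As a small illustration, the Python sketch below parses such a string and checks whether a variant position from the VCF file (also 1-based) falls inside it.  The function names are illustrative only.

```python
from typing import Tuple


def parse_region(region: str) -> Tuple[str, int, int]:
    """Split 'chrom:start-end' (commas allowed) into its three parts."""
    chrom, _, span = region.partition(":")
    start_text, _, end_text = span.partition("-")
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if not chrom or start > end:
        raise ValueError(f"invalid region: {region!r}")
    return chrom, start, end


def in_region(chrom: str, pos: int, region: str) -> bool:
    """True if a 1-based position lies within the inclusive region."""
    region_chrom, start, end = parse_region(region)
    return chrom == region_chrom and start <= pos <= end
```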

Questions, Comments:  Medha Bhagwat, PhD