Course No. 11 "Next Generation Sequencing Data Analysis"

 

Problem 1

Massively parallel sequencing, also known as next generation sequencing, is a technology that enables high-throughput sequencing of genomes or loci of interest.  In this problem, sequence reads from a single locus are provided for three samples, and variant detection is performed on them.  

Outline:

  • Examine the quality of the sequence reads
     
  • Map the reads to a reference chromosome
     
  • Examine sequence variation
     
  • Visualize the mapping and variants on the chromosome

The steps are as follows.

  1. Logging in:  Start Galaxy at the instructor-provided link.  (Students who are not in the classroom can use http://usegalaxy.org.)  First-time users should register by clicking on User and selecting Register.  This automatically logs in the user.  Returning users should log in by clicking on User and selecting Login.
  2. Data import:  Import the data by clicking on “Shared Data” and selecting “Published Histories”.  Search for the Name “Source Data” and click on the “Source Data” link.  Next click on the link “Import history”.  (A copy of the dataset is also provided:  SourceData1.zip.)
     
  3. Viewing the data:  In the Galaxy bar, click on “Analyze Data”.  The imported files should appear on the right with a green background. To view any of the files click on the icon that looks like an eye.  The files containing reads have the .fastq extension, and the reference sequence has the .fa extension.
  4. Fastq grooming:  Go to the blue panel on the left, and click on the section “NGS:  QC and manipulation” to expand it.  In the list of tools, click on “FASTQ Groomer”.  For the option “File to groom”, choose the first fastq file.  Use the defaults Sanger for “Input FASTQ quality scores type” and “Hide Advanced Options” for “Advanced Options”.  Click the Execute button.  Repeat these steps for the second and third fastq files.
  5. Quality control:  In the same tool section, find the tool “FastQC:Read QC” and click on the link to activate the tool.  In the central panel, for “Short read data from your current history”, select the first groomed fastq file.  Enter FastQC as the title for the output file.  Leave “Contaminant list” set to “Selection is Optional”.  Click the Execute button.  To view the results, click on the eye icon next to the FastQC results in the right panel.
  6. Read mapping:  In the left panel, click on the section “NGS: Mapping” to expand the list of mapping tools.  Click on the link “Map with BWA for Illumina” to activate the mapping tool.  The reference genome is the fasta file we imported, chr21.fa.  It is in our History in the right panel.  In the central panel, for the option “Will you select a reference genome from your history or use a built-in index”, select “Use one from the history”.  Then choose chr21.fa for “Select a reference from history”.  For the option, “Is this library mate-paired”, select “Single-end”, and for the “FASTQ file” option, choose the first groomed fastq file.  “BWA settings to use” are “Commonly Used”.  Click on the Execute button.  Repeat these steps for the second and third groomed fastq files.  Finally, view the SAM (Sequence Alignment/Map format) output in the right panel by clicking on the eye icon next to one of the “Map with BWA for Illumina” results.
  7. Format conversion:  Many tools require BAM, the binary version of SAM.  To convert SAM to BAM, go to the left panel, and click on “NGS:SAM Tools”.  Select the tool SAM-to-BAM.  “Choose the source for the reference list” should be “History”.  “Convert SAM file” should be the first “Map with BWA …” file.  “Using reference file” should be “chr21.fa”.  Click the Execute button.  Repeat these steps for the second and third “Map with BWA …” files.
     
  8. Pooling data:  The variant caller FreeBayes can operate on pooled data.  To merge the BAM files, click on “Convert, Merge, Randomize BAM datasets” under “NGS:BAM Tools” in the left panel.  In the central panel, select the first SAM-to-BAM dataset.  Use “+Insert BAM datasets to filter” to add the second file, and repeat this step for the third file.  Then click the Execute button.

  9. Variant detection:  In the blue Tools panel on the left, click on the section “NGS: Call Variant Detection” to display the tools in this section.  Or scroll to the top of this panel, and type FreeBayes in the search box.  In the search results, click on FreeBayes to set up the variant detection calculation.  Use version 0.0.3.  For the option “Choose the source for the reference list”, select History.  For the option “BAM file”, choose the merged file from the previous step.  The option “Using reference file” should be chr21.fa, and the Basic options should be chosen.  Click the Execute button.  View the results by clicking in the right panel on the eye icon next to the FreeBayes results.

  10. Sorting the results:  In the Tools panel, click on the section “Filter and Sort” to see the list of tools.  (If you used the search box, click the x next to the query term to clear the search results.)  Click on the Sort tool and look at the central panel.  Set the “Sort Query” option to the FreeBayes result.  The “on column” option should be c6, since the QUAL column is the sixth column.  “with flavor” should be set to “Numerical sort”, and “everything in” should be “Descending order”.  Click the Execute button, and view the results by clicking on the eye icon next to the Sort results in the right panel.
  11. Viewing the mapped reads:  Click on the MergedBams.bam file in the right panel to reveal the display option “display at UCSC main”.  Click on main.  This opens the UCSC Genome Browser with MergedBams.bam displayed.  To see the top scoring variant, type chr21:27,818,520-27,818,550 in the search box, and click the go button.  Then view the reads by scrolling down to the “Custom Tracks” section and selecting full in the menu labeled Convert, Merge, Randomize.  Then click refresh.  The variant in the reads is now visible.

  12. Viewing the variant analysis results (vcf file):  Return to the Galaxy window, go to the panel on the right, and click on “FreeBayes on …” to reveal the display option “display at UCSC main”.  Click on main.  This opens another UCSC Genome Browser with the FreeBayes results added to the display.  Scroll down to the “Custom Tracks” section and change the FreeBayes menu to pack.  Change the Convert, Merge, Randomize menu to dense.  This yields a track displaying only the variants.
  13. Examining the biological context:  To view a variant in an exon, go to the textbox at the top of the page, type chr21:27,061,784-27,067,061, and click the go button.  Scroll down to the bar labeled “Genes and Gene Prediction Tracks”; ensure that the menu for “UCSC Genes …” is set to pack.  Now the variants can be seen in the context of the tracks chosen for display in the UCSC Genome Browser.  Zoom in on the first variant in this view by holding down the shift key and the left mouse button to make a rectangle enclosing it.  Finally, click on a variant in the FreeBayes track to view detailed information about it.
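
For readers who want to see what the Sanger quality encoding chosen in the FASTQ Groomer step means, the following is a minimal Python sketch, not part of the Galaxy workflow.  It assumes plain four-line FASTQ records and the Sanger (Phred+33) offset; the function names and the example read are made up for illustration.

```python
# Minimal sketch: decode Sanger (Phred+33) quality strings from FASTQ records.
# Assumes each record occupies exactly four lines (no wrapped sequences).
from io import StringIO

SANGER_OFFSET = 33  # ASCII offset used by the Sanger quality encoding


def read_fastq(handle):
    """Yield (name, sequence, quality_string) tuples from a FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:          # end of file (or a blank line) stops reading
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()       # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@' from the name


def phred_scores(quality_string, offset=SANGER_OFFSET):
    """Convert a quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def mean_quality(quality_string):
    """Average Phred score of one read (quality string must be non-empty)."""
    scores = phred_scores(quality_string)
    return sum(scores) / len(scores)


# Invented example record, for illustration only.
example = StringIO("@read1\nACGT\n+\nII5!\n")
for name, seq, qual in read_fastq(example):
    print(name, phred_scores(qual), mean_quality(qual))
```

In this encoding the character "I" corresponds to a Phred score of 40 and "!" to a score of 0, which is the per-base scale that FastQC plots.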
 
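The sorting step above (numerical sort on column c6, descending) can also be sketched outside Galaxy.  The Python below is an illustrative approximation under stated assumptions, not the Galaxy Sort tool itself: it assumes tab-separated VCF text, treats a missing QUAL value (".") as lowest, and uses invented records and a hypothetical function name.

```python
# Minimal sketch: order VCF records by the QUAL column (column 6), highest first.
# Header lines (starting with '#') are kept apart and not sorted.


def sort_vcf_by_qual(lines):
    """Return (header_lines, record_lines sorted by QUAL in descending order)."""
    header = [ln for ln in lines if ln.startswith("#")]
    records = [ln for ln in lines if ln.strip() and not ln.startswith("#")]

    def qual(line):
        field = line.split("\t")[5]  # zero-based index 5 is the sixth column, QUAL
        # Assumption: a missing QUAL ('.') sorts after every numeric value.
        return float(field) if field != "." else float("-inf")

    return header, sorted(records, key=qual, reverse=True)


# Invented records, for illustration only (not results from the course data).
example_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr21\t100\t.\tA\tG\t12.5\t.\t.",
    "chr21\t200\t.\tC\tT\t250.0\t.\t.",
    "chr21\t300\t.\tG\tA\t.\t.\t.",
]

header, records = sort_vcf_by_qual(example_vcf)
for line in header + records:
    print(line)
```

The first record printed after the header is the highest-scoring variant, which is the one to look up in the genome browser.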

Questions, Comments:  Medha Bhagwat, PhD