Course No. 14 "Bioinformatics Data Integration Using Galaxy"

 


Problem 2

Topics to be covered:

  • Access the UCSC Table Browser using Galaxy to retrieve data

  • Join two files on their genomic coordinates

  • Count the number of variants per gene

  • Compare two datasets by joining on identifiers

  • Export data

For classroom training, access the cloud server using the instructions provided in class. Everyone else should use the public Galaxy server (http://usegalaxy.org). Click on the User menu at the top of the page and register for an account. Registering automatically logs you in to Galaxy. Click on Analyze Data at the top of the page.

Access the UCSC Table Browser using Galaxy to retrieve data

Retrieving Craig Venter’s SNP data for chr19 

Click on the link “Get Data” on the left side of the page.  Then click on “UCSC Main”.  
Set the following menus. clade: Mammal; genome: Human; assembly: Mar. 2006 (NCBI36/hg18); group: Variation and Repeats; track: Genome Variants; table: Venter (pgVenter); region: position (type in chr19); output format: BED – browser extensible data.  Ensure that “Send output to Galaxy” is checked.  

Click on the button “describe table schema” to see the description for each column in this table.  Click on the browser back button (<-).  

Click on the button “get output”.

On the page that appears, Whole Gene should be the only option selected.  Click on the button “Send query to Galaxy”.

View the dataset by clicking on the image of the eye on the right side next to the retrieved pgVenter title.  Note the types of variants in this table.  
Does this table have more classes of variants available than in the table for James Watson (Problem 1)?

Rename the dataset to “VVARs” by clicking on the image of the pencil on the right side next to the retrieved pgVenter title.  
Type VVARs in the Name text box.  Click on the button Save at the bottom of the page.

Retrieving the list of genes on chr19 in tabular format

This is an alternative approach to that in Problem 1.  It demonstrates the conversion of a general tabular format to an interval format that will be recognized by the “Operate on Genomic Intervals” tools.  This is useful for data integration on coordinates when one is uploading data in non-standard formats.

Click on the link “Get Data” on the left side of the page.  Then click on “UCSC Main”.  Set the following menus. clade: Mammal; genome: Human; assembly: Mar. 2006 (NCBI36/hg18); group: Genes and Gene Predictions; track: RefSeq Genes; table: refGene; region: position (type in chr19); output format: selected fields from primary and related tables.  Ensure that “Send output to Galaxy” is checked.  

Click on the button “describe table schema” to see the description for each column in this table.  Click on the browser back button (<-).  

Click on the button “get output”. 

Check name, chrom, strand, txStart, txEnd, and name2.  Click the button “done with selections”.

Click the button “Send query to Galaxy”.

View the dataset by clicking on the image of the eye on the right side next to the retrieved refGene title.  Note that this time, the gene symbols are already in the table.  
Why do we see multiple RefSeq identifiers for the same gene symbol?

Rename the dataset to “geneNamesCoords” by clicking on the image of the pencil on the right side next to the retrieved refGene title.  
Type geneNamesCoords in the Name text box.  

At the top of this editor, click on the tab “Datatype”, and select “New Type: interval” from the menu.  Click the Save button. Edit the resulting Attributes page as follows.  
Chrom column: 2; Start column: 4; End column: 5; Strand column (click box & select): 3 with box checked; Name/Identifier column (click box & select): 6 with box checked.

Click the button Save.
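
The column assignments above tell Galaxy how to read each tabular row as a genomic interval. The following Python sketch illustrates that idea only; it is not Galaxy's own code. The helper name as_interval and the example row are made up, and the row is assumed to hold the fields in the order selected earlier (name, chrom, strand, txStart, txEnd, name2).

```python
# Column numbers are 1-based, matching the settings on the Attributes page.
INTERVAL_COLUMNS = {"chrom": 2, "start": 4, "end": 5, "strand": 3, "name": 6}


def as_interval(line, columns=INTERVAL_COLUMNS):
    """Read one tab-delimited row as an interval (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    record = {key: fields[col - 1] for key, col in columns.items()}
    # Coordinates are stored as text in the file; convert them to integers.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


# Made-up row in the order: name, chrom, strand, txStart, txEnd, name2.
example = "NM_000000\tchr19\t+\t1000\t5000\tGENE1\n"
interval = as_interval(example)
print(interval)
```

The underlying data are unchanged; only the interpretation of the columns is recorded, which is what lets the interval tools find the coordinates.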

Join two files on their genomic coordinates

Joining the datasets

Click on the link “Operate on Genomic Intervals” on the left side of the page.  Then click on the link Join.  
Configure the menus to Join VVARs with: geneNamesCoords with min overlap: 1 (bp).  Return: Only records that are joined.  
Click on the button Execute.  View the results as above.
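
To make the Join settings concrete, here is a minimal Python sketch of joining two interval lists on overlapping coordinates. It is an illustration, not how Galaxy implements the tool. It assumes coordinates where the start is included and the end is excluded (as in BED), and the function names and example intervals are made up.

```python
def overlap_bp(start_a, end_a, start_b, end_b):
    """Number of bases shared by two intervals (0 if they do not overlap)."""
    return max(0, min(end_a, end_b) - max(start_a, start_b))


def join_intervals(first, second, min_overlap=1):
    """Return only the joined records, like 'Only records that are joined'."""
    joined = []
    for a in first:
        for b in second:
            same_chrom = a[0] == b[0]
            if same_chrom and overlap_bp(a[1], a[2], b[1], b[2]) >= min_overlap:
                # Columns of the first record, then columns of the second.
                joined.append(a + b)
    return joined


# Made-up intervals: (chrom, start, end, name).
variants = [("chr19", 1200, 1201, "varA"), ("chr19", 9000, 9001, "varB")]
genes = [("chr19", 1000, 5000, "GENE1"), ("chr19", 4000, 8000, "GENE2")]

pairs = join_intervals(variants, genes)
for pair in pairs:
    print(pair)
```

In this example only varA falls inside a gene, so a single joined record is returned; varB is dropped because it overlaps nothing.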

Rename the dataset to “VVAR-GeneNames” by clicking on the image of the pencil on the right side next to the retrieved Join title.  
Type VVAR-GeneNames in the Name text box.  Click on the button Save.

Count the number of variants per gene

On the left side of the page, click on the link Statistics.  Click on Count.  Configure the menus as follows. From dataset: VVAR-GeneNames; Count occurrences of values in column(s): c10; Delimited by: Tab.  Click the Execute button.  View the results as above. Looking at the results from Problem 1, how many genes contain SNPs from the Watson dataset?  In this problem, how many genes contain variants from the Venter dataset?
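
The Count step tallies how often each value occurs in one column. A small Python sketch of the same idea follows. The rows are made up, and the layout (four variant columns followed by the six gene columns, so that column 10 holds the gene symbol) is an assumption for illustration.

```python
from collections import Counter


def count_column(rows, column):
    """Count occurrences of each value in a 1-based column."""
    return Counter(row[column - 1] for row in rows)


# Made-up joined rows: variant columns first, then gene columns.
rows = [
    ("chr19", 1200, 1201, "varA", "NM_000001", "chr19", "+", 1000, 5000, "GENE1"),
    ("chr19", 1300, 1301, "varB", "NM_000001", "chr19", "+", 1000, 5000, "GENE1"),
    ("chr19", 6000, 6001, "varC", "NM_000002", "chr19", "-", 5500, 7000, "GENE2"),
]

counts = count_column(rows, 10)
for gene, n in counts.items():
    print(n, gene)
```

The number of distinct keys in counts corresponds to the number of genes that contain at least one variant.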

Rename the dataset to “CountVVARsPerGene” by clicking on the image of the pencil on the right side next to the retrieved Count title.  
Type CountVVARsPerGene in the Name text box.  Click on the button Save.

Compare two datasets by joining on identifiers

On the left side of the page, click on the link Join, Subtract, and Group.  Click Join two Datasets.  Configure the menus as follows.  
Join: CountSNPsPerGene Using column: c2 with: CountVVARsPerGene and column: c2; Keep lines of first input that do not join with second input: Yes;
Keep lines of first input that are incomplete: No; Fill empty columns: No.  Click Execute.  View the results as above.
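
The settings above describe a join on a shared identifier that also keeps unmatched lines from the first input. A minimal Python sketch of that behaviour is shown below. It is not Galaxy's implementation; the function name and the counts are made up, and each row is assumed to hold the count in column 1 and the gene symbol in column 2, consistent with joining on c2.

```python
def join_on_column(first, second, col_first, col_second, keep_unmatched=True):
    """Join rows whose identifier columns (1-based) are equal."""
    # Index the second input by identifier for quick lookup.
    index = {}
    for row in second:
        index.setdefault(row[col_second - 1], []).append(row)

    joined = []
    for row in first:
        matches = index.get(row[col_first - 1], [])
        if matches:
            joined.extend(row + match for match in matches)
        elif keep_unmatched:
            # No padding is added, as with 'Fill empty columns: No'.
            joined.append(row)
    return joined


# Made-up counts: (count, gene symbol).
watson = [(12, "GENE1"), (3, "GENE2")]
venter = [(20, "GENE1"), (7, "GENE3")]

compared = join_on_column(watson, venter, 2, 2)
for row in compared:
    print(row)
```

GENE2 appears only in the first input, so its line is kept but has no columns from the second input; GENE3 appears only in the second input and is dropped.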

Rename the dataset to “CmpSNP-VVARGeneCounts” by clicking on the image of the pencil on the right side next to the retrieved Join title.  
Type CmpSNP-VVARGeneCounts in the Name text box.  Click on the button Save.

Following the instructions in Problem 1, sort this file twice in descending order: first on the counts of Watson SNPs per gene, then on the counts of Venter variants per gene.  Note that this comparison should be examined only at a qualitative level.  Explain the reason for this.

The publication for Venter’s genome uses a population mutation score to characterize variants.  For indels, this score is highest for chromosome 19 (see the section “Initial Characterization of Variants”, PLOS Biology, 4 September 2007, doi: 10.1371/journal.pbio.0050254, http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.0050254).  
Explore other Galaxy tools to determine the number of indels in the Venter dataset VVARs for chromosome 19.  What pattern could be used to select the indels?

Export Data

On the right side of the window, click the link CountVVARsPerGene to expand the dataset.  The Download icon is the first icon in the expanded view.
Click on the Download icon, then select Save File and click OK.

To get access to the downloaded file, for example, in Firefox, click on the down arrow in the upper right corner.  Click on the small folder icon to open the folder containing the file.

Open a spreadsheet application such as Excel and drag the file into it.  In this way, datasets from Galaxy can be downloaded for input to other applications.  
(Mac users can open the file from within Excel by setting the Enable menu to All Files.)

Returning to Galaxy, on the right side of the window, click the link CountVVARsPerGene again to collapse the view.

 

 

Questions, Comments:  Lynn Young, PhD (lynny@mail.nih.gov)