Course No. 14 "Bioinformatics Data Integration Using Galaxy"

 


Problem 1

Topics to be covered:

  • Introduce Galaxy

  • Access the UCSC Table Browser using Galaxy to retrieve data

  • Join two files on their genomic coordinates

  • Add gene names to a table containing RefSeq identifiers

  • Count the number of SNPs per gene

  • Export data

Introduce Galaxy

For classroom training, access the cloud server using the instructions provided in class.  All others should use the public Galaxy server (http://usegalaxy.org).
Click on the User menu at the top of the page and register for an account.  This will automatically log you in to Galaxy. Click on Analyze Data at the top of the page.

Access the UCSC Table Browser using Galaxy to retrieve data

Retrieving James Watson’s SNP data for chr19

Click on the link “Get Data” on the left side of the page.  Then click on “UCSC Main”.  
Set the following menus. clade: Mammal; genome: Human; assembly: Mar. 2006 (NCBI36/hg18); group: Variation and Repeats;
track: Genome Variants; table: Watson (pgWatson); region: position (type chr19 in the text box); output format: BED – browser extensible data.  Ensure that “Send output to Galaxy” is checked.  

Click on the button “describe table schema” to see the description for each column in this table.  Click on the browser back button (<-).  

Click on the button “get output”.

The only selection should be Whole Gene.  Click on the button “Send query to Galaxy”.

View the dataset by clicking on the image of the eye on the right side next to the retrieved pgWatson title.  Note the types of variants in this table.
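
To inspect the same kind of records outside Galaxy, the short Python sketch below may help.  It assumes only the standard BED layout (tab-separated chromosome, start, end, and an optional name in the fourth column).  The example line is made up and is not taken from the pgWatson table.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: str   # fourth BED column, empty string if absent


def parse_bed_line(line: str) -> BedRecord:
    """Split one tab-separated BED line into its first four fields."""
    fields = line.rstrip("\n").split("\t")
    name = fields[3] if len(fields) > 3 else ""
    return BedRecord(fields[0], int(fields[1]), int(fields[2]), name)


# Hypothetical example line; the coordinates and name are invented.
example = "chr19\t1000\t1001\tA/G"
record = parse_bed_line(example)
print(record.chrom, record.start, record.end, record.name)
```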

Rename the dataset to “SNPs” by clicking on the image of the pencil on the right side next to the retrieved pgWatson title.  Type SNPs in the Name text box.  
Click on the button “Save” at the bottom of the page.

Retrieving the list of genes on chr19 

Click on the link “Get Data” on the left side of the page.  Then click on “UCSC Main”.  
Set the following menus. clade: Mammal; genome: Human; assembly: Mar. 2006 (NCBI36/hg18); group: Genes and Gene Predictions;
track: RefSeq Genes; table: refGene; region: position (type chr19 in the text box); output format: BED – browser extensible data.  Ensure that “Send output to Galaxy” is checked.  

Click on the button “describe table schema” to see the description for each column in this table.  Click on the browser back button (<-).  

Click on the button “get output”. As before, the only selection should be Whole Gene.  Click on the button “Send query to Galaxy”.

View the dataset by clicking on the image of the eye on the right side next to the retrieved refGene title.

Rename the dataset to “genes” by clicking on the image of the pencil on the right side next to the retrieved refGene title.  Type genes in the Name text box.  Click on the button Save.

Join two files on their genomic coordinates

Click on the link “Operate on Genomic Intervals” on the left side of the page.  Then click on the link “Join”.  
Configure the menus as follows. Join: SNPs; with: genes; with min overlap: 1 (bp); Return: Only records that are joined.  Click on the button “Execute”.  View the results as above.

Rename the dataset to “SNP-Genes” by clicking on the image of the pencil on the right side next to the retrieved Join title.  Type SNP-Genes in the Name text box.  Click on the button Save.
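
The idea behind the interval join can be sketched in a few lines of Python.  This is not how Galaxy implements the tool.  It is a simple pairwise comparison, assuming half-open BED coordinates and made-up records, that keeps a SNP-gene pair only when the two intervals are on the same chromosome and share at least min_overlap bases.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def join_intervals(snps, genes, min_overlap=1):
    """Return snp + gene tuples for every pair overlapping by >= min_overlap bp.

    Each record is a tuple (chrom, start, end, name).  The nested loop is
    quadratic and is meant only to illustrate the logic.
    """
    joined = []
    for snp in snps:
        for gene in genes:
            same_chrom = snp[0] == gene[0]
            if same_chrom and overlap_bp(snp[1], snp[2], gene[1], gene[2]) >= min_overlap:
                joined.append(snp + gene)
    return joined


# Hypothetical records; names and coordinates are invented.
snps = [("chr19", 150, 151, "A/G"), ("chr19", 900, 901, "C/T")]
genes = [("chr19", 100, 200, "NM_EXAMPLE1"), ("chr19", 300, 400, "NM_EXAMPLE2")]

for row in join_intervals(snps, genes):
    print("\t".join(str(field) for field in row))
```

Only the first SNP falls inside a gene, so a single joined row is printed.  The second SNP is dropped, which mirrors the "Only records that are joined" setting.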

Add gene names to a table containing RefSeq identifiers

Retrieving the gene name dataset

Click on the link “Get Data” on the left side of the page.  Then click on “UCSC Main”.  
Set the following menus. clade: Mammal; genome: Human; assembly: Mar. 2006 (NCBI36/hg18); group: Genes and Gene Predictions;
track: RefSeq Genes; table: refFlat; region: position (type chr19 in the text box); output format: selected fields from primary and related tables.  Ensure that “Send output to Galaxy” is checked.  

Click on the button “describe table schema” to see the description for each column in this table.  Click on the browser back button (<-).  

Click on the button “get output”.

The only selections should be geneName and name.  Click on the button “done with selections”.  

Click on the button “Send query to Galaxy”.

View the dataset by clicking on the image of the eye on the right side next to the retrieved refFlat title.

Rename the dataset to “geneNames” by clicking on the image of the pencil on the right side next to the retrieved refFlat title.  Type geneNames in the Name text box.  Click on the button Save. 

Joining the gene names to the SNP-Genes table

On the left side of the page, click on the link “Join, Subtract, and Group”.  Click on “Join two Datasets”.  
Configure the menus as follows.  Join: geneNames; using column: c2; with: SNP-Genes; and column: c8. The remaining menus should be set to No. Click the Execute button.  View the results as above.

Rename the dataset to “SNP-GeneNames” by clicking the image of the pencil on the right side next to the retrieved Join title.  Type SNP-GeneNames in the Name text box.  Click the button Save.
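
As a rough illustration of what a column join does, the Python sketch below matches rows of one table to rows of another on a shared identifier.  It is an assumption-laden sketch rather than Galaxy's own code: the rows are invented, the second table is shortened to eight columns so that the RefSeq identifier sits in c8, and the output is assumed to be the first table's columns followed by the second's.

```python
from collections import defaultdict


def join_on_columns(left_rows, left_col, right_rows, right_col):
    """Inner join of two tables; columns are 1-based like Galaxy's c1, c2, ..."""
    # Index the right-hand table by its key column for fast lookup.
    index = defaultdict(list)
    for row in right_rows:
        index[row[right_col - 1]].append(row)

    joined = []
    for left in left_rows:
        for right in index.get(left[left_col - 1], []):
            joined.append(left + right)
    return joined


# Hypothetical rows: (geneName, RefSeq name) and a shortened SNP-gene record.
gene_names = [["GENE_A", "NM_EXAMPLE1"], ["GENE_B", "NM_EXAMPLE2"]]
snp_genes = [
    ["chr19", "150", "151", "A/G", "chr19", "100", "200", "NM_EXAMPLE1"],
]

for row in join_on_columns(gene_names, 2, snp_genes, 8):
    print("\t".join(row))
```

Because the gene name comes from the first table, it ends up in column c1 of the joined rows.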

Count the number of SNPs per gene

On the left side of the page, click on the link “Statistics”.  Click on “Count”.  
Configure the menus as follows. From dataset: SNP-GeneNames; Count occurrences of values in column(s): c1; Delimited by: Tab.  Click the Execute button.  View the results as above.

Rename the dataset to “CountSNPsPerGene” by clicking on the image of the pencil on the right side next to the retrieved Count title.  Type CountSNPsPerGene in the Name text box.  
Click on the button Save.
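
Counting occurrences of a value in one column can be sketched with Python's collections.Counter.  The rows below are invented, and the function name count_column is a hypothetical helper, not part of Galaxy.

```python
from collections import Counter


def count_column(rows, column):
    """Count how often each value appears in a 1-based column."""
    return Counter(row[column - 1] for row in rows)


# Hypothetical joined rows with the gene name in the first column.
rows = [
    ["GENE_A", "NM_EXAMPLE1", "chr19", "150", "151"],
    ["GENE_A", "NM_EXAMPLE1", "chr19", "160", "161"],
    ["GENE_B", "NM_EXAMPLE2", "chr19", "350", "351"],
]

counts = count_column(rows, 1)
for gene, n in counts.items():
    print(n, gene, sep="\t")
```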

Sorting the counts

On the left side of the page, click on the link “Filter and Sort”.  Click on “Sort”.  
Configure the menus as follows. Sort Dataset: CountSNPsPerGene; on column: c1; with flavor: Numerical sort; everything in: Descending order.  Click on the button Execute.  View the results as above.  
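
The reason for choosing a numerical rather than an alphabetical sort can be seen in a small Python sketch.  The rows are invented, and the layout (count in the first column, gene name in the second) is an assumption about the Count output.

```python
def sort_by_count(rows):
    """Sort rows by the integer value of the first column, largest first."""
    return sorted(rows, key=lambda row: int(row[0]), reverse=True)


def sort_as_text(rows):
    """Sort rows by the first column compared as text, for contrast."""
    return sorted(rows, key=lambda row: row[0], reverse=True)


# Hypothetical (count, gene name) rows read from a tab-delimited file.
rows = [["2", "GENE_B"], ["10", "GENE_A"], ["1", "GENE_C"]]

print(sort_by_count(rows))  # counts in the order 10, 2, 1
print(sort_as_text(rows))   # counts in the order 2, 10, 1
```

Compared as text, "2" sorts ahead of "10", so the gene with the most SNPs would not appear at the top of the list.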

The publication for Watson’s genome contains a table “SNPs matching HGMD mutations causing disease or other phenotypes.”  
[Table 3, Nature 452, 872-876 (17 April 2008) http://www.nature.com/nature/journal/v452/n7189/full/nature06884.html].  
The two SNPs listed for chromosome 19 are located in two genes, IL12RB1 and NPHS1, respectively.  Are these genes found in our list?  How many SNPs are found in each of these genes?

Export data

On the right side of the window, click the link “CountSNPsPerGene” to view the Download icon, which is the first icon in the expanded view.  
Click on the Download icon and then select Save File and click OK.

To locate the downloaded file in Firefox, for example, click on the down arrow in the upper right corner, then click on the small folder icon to open the folder containing the file.

Open a spreadsheet application such as Excel and drag the file into it.  In this way, datasets from Galaxy can be downloaded for input to other applications.  
(Mac users can open the file from within Excel by setting the Enable menu to All Files.) 
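
A downloaded dataset can also be read by a script instead of a spreadsheet.  The Python sketch below assumes a tab-delimited file with no header line, holding the count in the first column and the gene name in the second.  It writes a small stand-in file with invented values so that it runs on its own; replace demo_path with the path of the real download.

```python
import csv
import os
import tempfile


def read_counts(path):
    """Read (count, gene name) pairs from a two-column tab-delimited file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return [(int(row[0]), row[1]) for row in reader if row]


# Stand-in for the downloaded file; the contents are invented.
with tempfile.NamedTemporaryFile("w", suffix=".tabular", delete=False, newline="") as tmp:
    tmp.write("10\tGENE_A\n2\tGENE_B\n")
    demo_path = tmp.name

demo_counts = read_counts(demo_path)
os.remove(demo_path)

# Look up the count for one gene of interest.
by_gene = {gene: n for n, gene in demo_counts}
print(by_gene.get("GENE_A", 0))
```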

Returning to Galaxy, on the right side of the window, click the link CountSNPsPerGene again to collapse the view.

 

 

Questions, Comments:  Lynn Young, PhD (lynny@mail.nih.gov)