Course No. 12 "Gene Expression Omnibus (GEO)"


Problem 2

Gene Expression Omnibus

This class demonstrates how to search for an expression record in GEO, obtain differentially expressed genes and information about their pathway enrichment.

Topics to be covered include:

  • Types of databases (GEO DataSets and GEO Profiles)

  • Types of entries in GEO DataSets (Platform, Sample, Series and Dataset)

  • Searching options for GEO DataSets

  • Obtaining a differentially expressed gene list for an experiment (using analysis tools in GEO DataSet or using GEO2R)

  • Links to accessing or downloading data, profiles and pathway enrichment

Access the NCBI home page.  Click on the Genes and Expression link on the left side of the page.  Notice three listings: Gene Expression Omnibus (GEO) Database, Gene Expression Omnibus (GEO) DataSets and Gene Expression Omnibus (GEO) Profiles.  Click on the Gene Expression Omnibus (GEO) Database link.  Click on the Overview link listed under Documentation.  Note the different types of submitted entries, the NCBI-curated records and their accession number prefixes.  Also note the contents of the two databases, GEO DataSets and GEO Profiles.
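The accession prefixes noted on the Overview page can be checked programmatically.  The following is a minimal sketch, not an NCBI tool: the function names classify_accession and is_curated are hypothetical, and the only facts assumed are the four standard prefixes (GPL for Platform, GSM for Sample, GSE for Series, GDS for DataSet).

```python
# Accession prefixes for the four GEO record types covered in this class.
GEO_PREFIXES = {
    "GPL": "Platform",
    "GSM": "Sample",
    "GSE": "Series",
    "GDS": "DataSet",
}


def classify_accession(accession: str) -> str:
    """Return the GEO record type for an accession such as 'GSE4806'."""
    acc = accession.strip().upper()
    prefix, digits = acc[:3], acc[3:]
    # A valid accession is a known prefix followed by digits only.
    if prefix not in GEO_PREFIXES or not digits.isdigit():
        raise ValueError(f"not a recognised GEO accession: {accession!r}")
    return GEO_PREFIXES[prefix]


def is_curated(accession: str) -> bool:
    """DataSets (GDS) are curated by NCBI; the other types are submitter supplied."""
    return classify_accession(accession) == "DataSet"


if __name__ == "__main__":
    # Accessions that appear in this exercise.
    for acc in ("GDS2563", "GSE4806", "GSM108351"):
        print(acc, classify_accession(acc), is_curated(acc))
```

For the accessions used in this class, GDS2563 is the only curated record; GSE4806 and GSM108351 are submitter-provided.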

Searching entries in GEO DataSets and downloading data

Go back to the GEO home page.  Click on the Search for studies at GEO DataSets link.  Click on the Advanced link.  Note the options listed under All Fields for restricting your query, such as DataSet Type, Entry Type, Subset Variable Type, and Platform Technology Type.  Use the Show index list link to display the values available for the selected field.  In this example, however, we will not use any restriction.  Go back to the DataSets main page.  Type smoking AND "T lymphocytes" (including the double quotes) in the Search box at the top, and click on the Search button.  Note the number of different entry types, study types and organisms listed on the search results page.  Note the first entry, Cigarette smoking effect on T lymphocytes.  Note the links to its Platform and Series records and a link to download data.  Click on the Series GSE4806 link.  Note the summary, overall design, number of samples and links to various download options.  Click on the Query GEO DataSets for GSE4806 link at the top of the page to access all 9 associated entries.  The first one is the curated DataSet, and the rest are submitter provided:  1 Series, 1 Platform and 6 Sample entries.  Note the link to GEO Profiles from the GDS2563 Accession entry and the Analyze with GEO2R link from the GSE4806 Accession entry.
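The same search can also be expressed as a request to the NCBI E-utilities ESearch service, which exposes GEO DataSets under the database name gds.  The sketch below only builds the request URL and does not contact NCBI.  The helper names build_gds_search_url and restrict are hypothetical, and the field value gse used in the usage lines is an assumption about how the Entry Type field is spelled in a query; confirm it with the Show index list link before relying on it.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Base address of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The query typed into the Search box in this exercise.
QUERY = 'smoking AND "T lymphocytes"'


def build_gds_search_url(term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that searches GEO DataSets (db=gds) for `term`."""
    params = {"db": "gds", "term": term, "retmax": retmax, "retmode": "json"}
    # urlencode percent-encodes the spaces and double quotes in the term.
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def restrict(term: str, field: str, value: str) -> str:
    """Narrow a query with a fielded clause, e.g. value[Entry Type]."""
    return f"({term}) AND {value}[{field}]"


if __name__ == "__main__":
    url = build_gds_search_url(QUERY)
    print(url)
    # Decoding the URL recovers the original query text unchanged.
    print(parse_qs(urlsplit(url).query)["term"][0])
    # Assumed field value: 'gse' for Series entries.
    print(restrict(QUERY, "Entry Type", "gse"))
```

Opening the printed URL in a browser should return the matching record identifiers; the web interface remains the easier way to browse the results.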

Analyzing data using tools in GEO DataSet

Click on the title Cigarette smoking effect on T lymphocytes to access the DataSet Record GDS2563.  To get information about the samples, color coding and value distribution, click on the Experiment design and value distribution link, then on the click for details link.  Note that there are only 5 samples in the curated dataset even though 6 samples were submitted; GSM108351 was removed during curation.  We will see why later when we use GEO2R.

To obtain a list of differentially expressed genes, use the Compare 2 sets of samples link.  Select the test and a significance level of 0.01.  Click on Select which samples to put in Group A and Group B.  Assign samples to Groups A and B by clicking on them (Smoker in Group A and non-smoker in Group B).  Click on the OK button.  Click on Query Group A vs. B.  You can download the profile data by using the Download profile data button at the top right.  (Links to the top 200 genes with similar profiles can be obtained from the Profile neighbours link.)  You may wish to sort the results by Subgroup effect under Display Settings and then click on the Apply button.  Information about pathways enriched in these genes can be obtained by using the Find pathways button.  Alternatively, the gene list (without fold change) can be downloaded using the Database menu under Find related data (Select -> Gene) for input into your choice of pathway analysis resource.
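To see what a two-group comparison computes for each gene, consider the sketch below.  It calculates Welch's t statistic and the difference in mean log2 expression between Group A and Group B, then ranks genes by the size of the statistic.  This is an illustration only: the gene names and expression values are made up, the group sizes are arbitrary, and the statistic is a plain Welch t, which is not necessarily the test GEO applies.  No p-values are computed, so the 0.01 significance level is not reproduced here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two independent groups (unequal variances)."""
    # Standard error of the difference between the two group means.
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    if se == 0:
        raise ValueError("both groups have zero variance")
    return (mean(group_a) - mean(group_b)) / se


def log2_difference(group_a, group_b):
    """Mean(A) - mean(B); on log2 values this is the log2 fold change."""
    return mean(group_a) - mean(group_b)


# Made-up log2 expression values, NOT taken from GDS2563.
# "A" stands for the smoker group, "B" for the non-smoker group.
TOY_PROFILES = {
    "gene_up": {"A": [9.0, 9.5, 10.0], "B": [7.0, 7.5]},
    "gene_flat": {"A": [8.0, 8.1, 7.9], "B": [8.1, 7.9]},
}


def rank_genes(profiles):
    """Return gene names ordered from largest to smallest |t|."""
    scores = {g: welch_t(p["A"], p["B"]) for g, p in profiles.items()}
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)


if __name__ == "__main__":
    for gene in rank_genes(TOY_PROFILES):
        p = TOY_PROFILES[gene]
        print(gene, round(welch_t(p["A"], p["B"]), 3),
              round(log2_difference(p["A"], p["B"]), 3))
```

With only a handful of samples per group, as in this DataSet, such statistics are noisy, which is one reason to inspect the individual profiles as well as the ranked list.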

Go back to the DataSet Browser page.  Click on the Cluster heatmaps link.  Select the method and click on the Display button.  Select a particular area of interest by clicking in the region and then adjusting the location and size of the selection by dragging the mouse.  Once selected, double click on the selection to see the list of genes.  Go back to the GDS cluster analysis page.  Access the genes in this region in the GEO Profiles database by using the View in Entrez button.

Go back to the DataSet Browser page.  Click on the Find genes link.  Type GZMA in the Find gene name or symbol box and click on the Go button.  GZMA is reported to be upregulated in smokers in the publication associated with the study. 

GEO2R:  The above analysis links are present only in the curated DataSet.  Many uncurated series are available in GEO.  You can use the GEO2R link provided on the Series page to obtain a list of differentially expressed genes for any series.  Access the GSE4806 page.  Click on the Analyze with GEO2R link.  Click on the Value distribution tab and then on the View button to view the distribution of value data for all 6 samples.  Note how the value distribution in sample GSM108351 differs from the others.  This explains why the sample was removed during curation.  Click on the GEO2R tab.  Click on the Define groups link to create the groups listed under the Treatment column.  Sort the samples by clicking on Treatment.  Select samples using the control key and assign them to a group by clicking on the appropriate group.  Do not use sample GSM108351 in this assignment.  Use the default options or choose options by clicking on the Options tab.  Click on Top 250 or on Save all results.
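The Value distribution check can be mimicked with a simple rule: compare each sample's median value with the median of all sample medians, and flag samples that lie unusually far away.  The sketch below is one possible way to do this and is not how GEO2R or the GEO curators decide which samples to drop.  The sample names, values and the threshold of 3 are all hypothetical; sample_6 merely plays the role of a sample with a shifted distribution.

```python
from statistics import median


def flag_outlier_samples(values_by_sample, threshold=3.0):
    """Flag samples whose median is far from the median of all sample medians.

    Distance is measured in units of the median absolute deviation (MAD).
    """
    medians = {s: median(v) for s, v in values_by_sample.items()}
    centre = median(medians.values())
    mad = median(abs(m - centre) for m in medians.values())
    if mad == 0:
        # No spread at all: anything off-centre counts as an outlier.
        return [s for s, m in medians.items() if m != centre]
    return [s for s, m in medians.items() if abs(m - centre) / mad > threshold]


# Made-up expression values, NOT taken from GSE4806.
TOY_VALUES = {
    "sample_1": [6.8, 7.0, 7.2],
    "sample_2": [6.9, 7.1, 7.3],
    "sample_3": [6.7, 6.9, 7.1],
    "sample_4": [7.0, 7.2, 7.4],
    "sample_5": [6.8, 7.0, 7.1],
    "sample_6": [3.9, 4.1, 4.4],  # deliberately shifted distribution
}

if __name__ == "__main__":
    flagged = flag_outlier_samples(TOY_VALUES)
    # Keep only the remaining samples for group assignment.
    kept = [s for s in TOY_VALUES if s not in flagged]
    print("flagged:", flagged)
    print("kept for group assignment:", kept)
```

Leaving the flagged sample out before defining groups corresponds to the instruction above not to assign GSM108351 to either group.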


Questions, Comments:  Medha Bhagwat, PhD