Course No. 10 "Gene Expression Microarray Data Analysis"


Problem 1

The NOTCH signaling pathway is involved in intercellular communication and uses a family of NOTCH proteins that take part in gene regulation.  Researchers created a peptide, SAHM1, which disrupts this pathway.  Students will find genes affected by this disruption using microarray data from the NCBI public database Gene Expression Omnibus (GEO), under the identifier GSE18198.  They will use BRB-ArrayTools from NCI to analyze the data.

The steps are as follows (the first three steps can be skipped if these files are already provided for you).

  1. Download the data from the GEO page, under the section “Supplementary file”, by clicking on the link “http”.  (A direct link to the data is provided here.)  Extract the data from the zip file.
  1. Download the gene annotation file from the GEO page under the section “Platforms” by clicking on the link GPL570. The page for GPL570, the platform used in this study, has a link at the bottom called “Download full table…”.  This link can be used to download the annotation file.  (A direct link to the data is provided here - HG-U133_Plus_2na32AnnotBrief.txt)
  1. Create an experiment description table with information about the experiment, using the file names of the microarray data files as the array identifiers in the first column (file name extensions such as “.cel” should not be included).  The second column should contain the name of the treatment.  This information can be found at the GEO web site for GSE18198 under the heading “Samples”.  Use only the KOPT-K1 cell line.  Alternatively, one can use the file provided here - ExpDescFile.xlsx.
  2. Import and preprocess the data:  Open Excel and click on “Add-Ins” to access “ArrayTools”.  Click on “ArrayTools”, and select “Import data” and then “Data import wizard”.  In the Data Import Wizard, select “Data Type” “Affymetrix .CEL Data”.  Select “File Type” as “The expression data are in separate files stored in one folder”.  Click on the browse button.  Find the Desktop folder “KOPT-K1Subset”.  Select this folder and click “OK”.  In the “Data Import Wizard”, also click “OK”.  Click “Yes” when asked if 6 arrays are the correct number of arrays.  Select “justRMA” as the “method to analyze your Affymetrix CEL files”, and then click “OK”.

  1. Annotate the data:  Continuing with the “Options for Annotation”, select “Import your own annotation file”, and click “OK”.  At this point you may get a question about installing a package from BioConductor.  If this occurs, click “Yes”.  Next, for “Please specify the location of your gene identifiers”, select “The identifiers are stored in a separate file.”  Click “Browse” to select your “Gene Identifiers file”.  On the Desktop, select the file HG-U133_Plus_2na32AnnotBrief.txt and click “Open”.  For the “Gene Name, Title, or Description”, select “Col 5: Gene Title”.  For the “GenBank Accession”, select “Col 2: Representative Public ID”.  For the “Map Location”, select “Col 4:  Alignments”.  Then click the button “Next”.
  1. Open the experiment description file:  Continuing with the Experiment Descriptor File (file ExpDescFile.xlsx), click the “Browse” button.  Select the Desktop file ExpDescFile.xlsx and click “Open”.  Then click the button “Next”.
  1. Manage the output file location:  Name the project folder by typing “KOPTK1-Project” into the text box to the right of “Project folder”.  Name the project by typing “KOPTK1Project.xls” into the text box to the right of “Project name”.  Then click the button “Next”.  Please wait a few minutes for the large data sets to upload.  Note that, currently on Windows XP with BRB-ArrayTools 4.2, no progress feedback appears until one clicks on the window and moves it.
  2. Filtering:  Note that the analysis of Affymetrix data does not use the interfaces for “1. Spot filters” or “2. Normalization” except for two default parameters.  Click on “3.  Gene filters”.  Accept the defaults by clicking the button “OK”.  This executes the data preparation stage of the analysis.  Acknowledge the “number of genes passing the filtering and subsetting criteria” by clicking on the “OK” button.  Acknowledge the number of arrays and the number of arrays for which data are shown by clicking on the button “OK”.  Check the import of the annotation file by selecting the Excel worksheet labeled “Gene identifiers”.  Its tab is at the bottom of the Excel workbook.
  1. Quality control:  Next, perform a quality control step, such as clustering the samples to ensure that they group into the expected sets.  Click on the “ArrayTools” menu, and select “Graphics” and then “Visualization of samples”.  For “Class variable for coloring rotating scatterplot”, select “Treatment”.  Accept the rest of the defaults.  Click on the button “OK”.  Note that samples cluster according to treatment groups, with the DMSO group colored blue and the SAHM1 group colored green.  Click on the button “Close 3D plot”.
  1. Statistical tests:  Next, perform statistical tests to find genes that are expressed differently between the two sets of samples.  Go to the “ArrayTools” menu, and select “Analysis wizard”.  Click on the button “Gene Finding”.  Select “Single label (e.g. Affy)”.  Click on the button “Comparing Classes”.  Click on the button “Class Comparison”.  Click on the “OK” button.  Under the “Experiment Design” section, use the menu to select the “Column defining classes” as “Treatment”.  Under “Find gene lists determined by”, choose “Restriction on proportion of false discoveries”.  Accept the rest of the defaults.  Click on the “OK” button to execute the gene finding calculation.  Note that the results will appear in a web browser such as Internet Explorer or Firefox.  Note that the table of genes can be copied and pasted into a spreadsheet application such as Excel.  For further analysis, it is useful to save the list of “AffymetrixProbeSet” identifiers from this table into a file (file KOPT-K1GeneListIDsOnly.txt).
  1. Visualization:  “View a clustered heatmap of significant genes” by clicking on the link with that name on the results page under the section “Contents”.
  1. Functional analysis:  Use a web application called DAVID, from the National Institute of Allergy and Infectious Diseases, to aid in the functional interpretation of the list of differentially expressed genes.
    1. Click on the “Start Analysis” link.  To upload the list of differentially expressed genes, use option “B. Choose From a File” under the tab “Upload”.  Click on “Browse”.  Select the Desktop file “KOPT-K1GeneListIDsOnly.txt”.  Click the “Open” button. 
    2. Ensure that “Step 2: Select Identifier” shows the menu item “AFFYMETRIX_3PRIME_IVT_ID”. 
    3. Under “Step 3: List Type”, choose “Gene List”. 
    4. For “Step 4: Submit List”, click the button “Submit List”.  Next select the tab “Background”.  Under this “Population Manager”, go to the section “Affymetrix 3’ IVT Backgrounds” and look for “Human Genome U133 Plus 2 Array”.  (It is listed below one of the “Focus” arrays.)  Select “Human Genome U133 Plus 2 Array”.  On the “List” tab, under the section “Select to limit annotations by one or more species” select only “Homo sapiens”.  Click on the button “Select Species”. 
    5. Note the name of the successfully submitted gene list and background listed on the right under “Step 1. Successfully submitted gene list”.  Now the analysis can be executed. 
    6. Gene Ontology: Click on the link “Functional Annotation Tool”.  Shown in dark red are the DAVID annotation categories and in parentheses, the number of annotations in each category which are associated with genes from the submitted list. 
    7. Click on the button “Functional Annotation Clustering”.  The result is a list of annotation terms, including a count of how many genes from the list are associated with each term. 
    8. To see a heatmap representing associations of the genes and terms, click on the heatmap icon.  Click the “Run” button if asked whether you want to run the application.  Click on the “Yes” button if given a security warning.  On the resulting heatmap page, click on the “Zoom Out” link.  In the heatmap, choose a gene from the descriptions on the right of the map.  For example, ribosomal protein L38 has only one association with a term.  Hovering over the green square in the row to the left of this gene description highlights the associated term in the list at the bottom of the table.  It is “GO:0006412~translation”. 
    9. Pathway:  Go back to the “Functional Annotation Result” tab of the web browser.  This is the page that showed in dark red the DAVID annotation categories.  Click on the button, “Functional Annotation Table”.  This table gives a list of all of the terms associated with each gene. 
    10. Go back to the “Functional Annotation Result” window of the web browser (the page with the DAVID annotation categories shown in dark red).  Find the annotation category “Pathways”, and click on the “+” to the left of it to expand the category.  Locate the “KEGG_PATHWAY” term and click on the blue bar to the right of the “Chart” button.  In this “Functional Annotation Table”, under the section “Notch homolog 2 (Drosophila)”, click on the link “Notch Signaling Pathway”.  Note the genes from the list shown in red. 
    11. Return to the microarray data analysis output web page obtained in step 10 to verify that treatment with SAHM1 (Class 2 in the output) caused a disruption in this pathway, possibly decreasing the expression of the Notch 2 (1557543_at) and Deltex (227336_at, DTX1) genes shown in the figure.
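
The experiment description table from the third step can also be drafted with a short script rather than by hand.  The following is a minimal Python sketch, not part of BRB-ArrayTools.  The file names (array1.CEL, array2.CEL), the column header “Array”, and the function name are hypothetical placeholders; the real array identifiers and treatments must be taken from the “Samples” section of the GEO page for GSE18198.

```python
import csv
import sys
from pathlib import Path


def build_descriptor_rows(cel_names, treatment_by_array):
    """Return a header row plus one (array id, treatment) row per CEL file.

    The array id is the file name without its extension, as the
    experiment description table requires.
    """
    rows = [("Array", "Treatment")]  # "Array" is an assumed header name
    for name in sorted(cel_names):
        array_id = Path(name).stem  # drops a trailing ".CEL" or ".cel"
        rows.append((array_id, treatment_by_array[array_id]))
    return rows


if __name__ == "__main__":
    # Hypothetical file names; replace with the extracted KOPT-K1 CEL files.
    cel_files = ["array1.CEL", "array2.CEL"]
    # Treatment names as used in this exercise (DMSO control, SAHM1).
    treatments = {"array1": "DMSO", "array2": "SAHM1"}
    # Write the table as tab-delimited text to standard output.
    csv.writer(sys.stdout, delimiter="\t").writerows(
        build_descriptor_rows(cel_files, treatments)
    )
```

The tab-delimited output can be pasted into Excel and saved as the experiment description file.  A missing entry in the treatment mapping raises a KeyError, which flags an array that has not been assigned to a treatment.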

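Saving the probe set identifiers from the class comparison table (the statistical tests step) into an identifiers-only file for DAVID can be sketched as follows.  This assumes the table was first copied into a tab-delimited text form with a header row; the column names “ProbeSet” and “Symbol” and the function name are assumptions for illustration, so adjust them to match the header of the actual table.

```python
import csv
import io


def extract_ids(table_text, id_column):
    """Return the unique, non-empty values of id_column, in table order."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    ids = []
    seen = set()
    for row in reader:
        probe_id = (row.get(id_column) or "").strip()
        if probe_id and probe_id not in seen:
            seen.add(probe_id)
            ids.append(probe_id)
    return ids


if __name__ == "__main__":
    # Two probe sets named in this exercise; the column layout is assumed.
    table = (
        "ProbeSet\tSymbol\n"
        "1557543_at\tNotch 2\n"
        "227336_at\tDTX1\n"
    )
    # One identifier per line, as expected for an uploaded list file.
    print("\n".join(extract_ids(table, "ProbeSet")))
```

Redirecting the printed lines to a text file gives a list in the one-identifier-per-line form used for the “Choose From a File” upload.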

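The option “Restriction on proportion of false discoveries” limits the expected fraction of false positives among the genes reported as significant.  The calculation BRB-ArrayTools performs is not described in this handout and is not reproduced here.  As a stand-in that illustrates the general idea only, the sketch below applies the Benjamini-Hochberg step-up procedure, a different and widely used method, to a list of per-gene p-values.  The p-values in the usage example are made up for illustration and are not results from GSE18198.

```python
def benjamini_hochberg(p_values, fdr):
    """Return indices of tests declared significant at the given FDR level.

    Benjamini-Hochberg step-up: sort the p-values, find the largest rank k
    such that p_(k) <= (k / m) * fdr, and keep the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank  # remember the largest rank that passes
    return sorted(order[:cutoff_rank])


if __name__ == "__main__":
    # Made-up p-values, one per gene, for illustration only.
    example_p = [0.01, 0.04, 0.03, 0.20]
    print(benjamini_hochberg(example_p, fdr=0.1))
```

With these made-up values, the first three genes pass at a false discovery rate of 0.1, while a stricter rate of 0.01 selects none of them.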

Questions, Comments:  Medha Bhagwat, PhD