Course No. 15 "The ENCODE (Encyclopedia of DNA elements) Virtual Machine"

  Problem 1

Topics to be covered 

  • Install the virtual machine
  • Reproduce a figure
  • Reproduce a computational analysis

Installation of the Virtual Machine

This virtual machine allows researchers to run Linux as an application on other platforms such as Windows or Mac operating systems.

Download and install the 18 GB virtual machine as follows:

  1. The virtual machine file to download is ENCODE.OVA (18 GB; an MD5 checksum is provided).  It runs within an application called VirtualBox.  Depending on one's internet connection, the file can take many hours to download.
  2. Next, download and install VirtualBox.  Instructions can be found on the VirtualBox website.  For example, if one is using Windows, the download would be VirtualBox 4.3.20 for Windows hosts.
    1. On Windows, after clicking the download link, click Save; then click Run.  The installation of VirtualBox will begin.
    2. Click Next; accept the defaults and click Next; click the checkboxes and click Next; and click Yes for Network Interfaces.
    3. Click Install; if asked whether you want to install, click Yes.
    4. If asked if you want to install device software, click Install.  This may occur more than once.
    5. Click the checkbox to Start VirtualBox after installation and click Finish.
  3. When VirtualBox starts, click the File menu and select "Import Appliance".
  4. Click Open appliance.
  5. Navigate to the folder containing the file ENCODE.OVA; select the file; click Open; and then click Next.
  6. Click Import.  (This step can take 30 minutes to an hour.)
  7. The setup instructions on the ENCODE VM (virtual machine) page, especially the Note at the bottom of that page, should also be followed.
  8. Once the ENCODE.OVA file has been loaded into VirtualBox, go to the VirtualBox window, and click Start.
  9. Troubleshooting
    1. Test whether hardware virtualization is supported on your computer.
    2. Search the internet for "enable hardware virtualization" and the model of your computer.
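
Before importing, the download from step 1 can be checked against the published MD5 value. The following is a minimal sketch, assuming a Unix-like shell on the host with the md5sum tool available (on a Mac the tool is called md5 and prints a different format; on Windows a shell such as Git Bash would be needed). The function name verify_md5 is hypothetical, and the expected value must be taken from the download page; it is not reproduced here.

```shell
# verify_md5 FILE EXPECTED
# Succeeds only if the MD5 digest of FILE equals EXPECTED (case-insensitive).
# Hypothetical helper; assumes md5sum is installed.
verify_md5() {
  file="$1"
  # Normalize the expected digest to lowercase, since md5sum prints lowercase.
  expected=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  # md5sum prints "<digest>  <filename>"; keep only the digest field.
  actual=$(md5sum "$file" | awk '{print $1}')
  if [ "$actual" = "$expected" ]; then
    echo "OK: checksum matches"
  else
    echo "MISMATCH: expected $expected, got $actual" >&2
    return 1
  fi
}

# Usage (substitute the MD5 value published next to the download link):
#   verify_md5 ENCODE.OVA "<published-md5-value>"
```

A mismatch usually means the download was interrupted or corrupted, in which case the file should be downloaded again before step 3.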

Starting the Virtual Machine

  1. Double click on the Oracle VM VirtualBox icon.
  2. Select ENCODE.
  3. Click Start.
  4. If asked about the keyboard option, click OK, and be prepared to wait a few minutes.
  5. If asked about the mouse pointer integration, click OK each time.
  6. A Desktop should appear after several minutes.
  7. If a window with "Upgrade Available" appears, click "Don't Upgrade".  Then click OK.  When the verification appears, click OK.
    1. Note that the VM display may be larger than some laptop screens; in that case, use the scroll bar on the VM window to find the buttons at the bottom of the upgrade window.
    2. On smaller screens, the VM scroll bar must also be used to see the confirmation message at the top of the VM.
  8. Double click the README file to open it in the Text Editor called gedit.  This file contains information about the location of files in the folder called figures which is on the Desktop.  Throughout the ENCODE VM, README files are found in folders, providing detailed information about files.
  9. To close the README file, click on the X at the top left of the gedit window.

Reproducing a Figure

This figure shows the overlap of predicted genome binding segments with transcription factor loci.  The R Statistical Package is included in the VM and is used to generate the figure.  Files with the extension .R are scripts written in the R language.

  1. On the Desktop, double click on the folder labeled figures and the subfolder labeled 5.  
  2. Double click on the file called README.
  3. Scroll down to the bottom of the section for Panel B; notice the R commands for the command line:
    1. R --no-save --quiet --slave < bin/panelB_TF.R
    2. R --no-save --quiet --slave < bin/panelB_rna.R
    3. Later, we will copy the first command and run it to produce a figure.
  4. Let's have a look at these files with extension .R which are for input into the R Statistical Package.
    1. Minimize the README window by clicking in the top left of the gedit window on the small horizontal line in the gray circle next to the X.
    2. Double click on the folder bin and then double click on the file panelB_TF.R.
    3. Update the R file by replacing the number 7 in the first two lines with the number 5.
    4. In the second line, note the name of the source file: TF_matrix.R.
    5. Click on the Save icon to save the file.
    6. Note that the text editor has tabs near the top under the icons.  
    7. On the tab labeled panelB_TF.R, click on the X to close the file.  Note that the README file remains; leave it open.
  5. In the remaining README file, copy the R command by positioning the cursor at the beginning of the command and left clicking and dragging the cursor to highlight the line. With the line highlighted, right click to access the menu for Copy, and select Copy.
  6. Minimize the README file window.
  7. Minimize the bin folder.
  8. Near the bottom left, locate the terminal icon; it has this symbol: >
  9. Click on the terminal icon.
  10. Type the following in this window to change to the folder for Figure 5:  cd figures/5
  11. Paste or type the R command from the README file (or from here):  R --no-save --quiet --slave < bin/panelB_TF.R
    1. To paste, right click inside the window and select Paste.
  12. On the keyboard, press the Enter or return key.
  13. You should see the output "null device" and the number 1.
  14. Type ls -ltr to see a list of the files in the folder.  Note that they are sorted by date with the newest file at the bottom.  This allows us to verify that our command generated a figure in a pdf file called Composite.TF.heatmaps.pdf.  Check the date and time to make sure it was created from the recent R command.
  15. Now we will return to the window view of this file.
    1. Minimize the terminal window.
    2. Double click on the Desktop folder, figures.  Double click on the subfolder, 5.
  16. View the output from the R command by double clicking on the output file Composite.TF.heatmaps.pdf.
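
The check in step 14 (confirming that the newest file in the folder is the figure) can also be done without reading the whole listing. This is a minimal sketch, assuming the same terminal inside the VM; the function name newest_file is hypothetical and not part of the ENCODE VM.

```shell
# newest_file DIR
# Print the name of the most recently modified entry in DIR.
# Hypothetical helper: -t sorts by modification time and -r reverses the
# order, so the newest entry comes last, as in the `ls -ltr` listing.
newest_file() {
  ls -tr "$1" | tail -n 1
}

# Usage inside the VM, after running the R command from the figure folder:
#   cd ~/figures/5
#   R --no-save --quiet --slave < bin/panelB_TF.R
#   newest_file .
# If the figure was just written, the name printed should be
# Composite.TF.heatmaps.pdf.
```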

Reproducing a Computational Analysis

We will integrate genomic features from two different files, one with predicted binding segments and one with TF (transcription factor) sites.  Both files give the chromosome and the start and stop positions of each element.  The program we will run looks for overlapping elements.
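
To make the idea of overlapping elements concrete, here is a minimal sketch of an overlap count written with awk. It is not the perl script used in this course and does not compute the observed/expected coverage; it only illustrates the interval test. It assumes BED-style input (chromosome, start, stop in the first three columns, 0-based half-open coordinates) and, as a further assumption, that column 4 of the segmentation file holds the state label. The function name count_overlaps is hypothetical.

```shell
# count_overlaps SEGMENTS.bed PEAKS.bed
# For each segment label, print how many (segment, peak) pairs overlap.
# Two intervals on the same chromosome overlap when
#   start_a < stop_b  and  start_b < stop_a.
# Naive all-pairs comparison: fine for a small illustration, slow on
# genome-scale files.
count_overlaps() {
  awk '
    # First file: remember every segment.
    NR == FNR {
      n++
      chr[n] = $1; s[n] = $2 + 0; e[n] = $3 + 0; lab[n] = $4
      next
    }
    # Second file: compare each peak against every stored segment.
    {
      ps = $2 + 0; pe = $3 + 0
      for (i = 1; i <= n; i++)
        if (chr[i] == $1 && s[i] < pe && ps < e[i])
          count[lab[i]]++
    }
    END { for (l in count) print l "\t" count[l] }
  ' "$1" "$2" | sort
}

# Usage with two small hypothetical files:
#   count_overlaps segments.bed peaks.bed
```

The output has one line per segment label with its overlap count, which is the same kind of per-state summary that the csv file produced later in this section contains.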

  1. As above, we will find the commands in the README file in the subfolder 5 within the Desktop folder figures.  Double click on the Desktop folder, figures, and on the subfolder, 5.  Double click on the README file.
  2. Go to the Panel B section of the README file and find the first command we will use.  It integrates the segments in the file chromhmm.segway.gm12878.comb11.concord4.bed with the TF sites in the file spp.optimal.wgEncodeUwTfbsGm12878CtcfStdAlnRep0_VS_wgEncodeUwTfbsGm12878InputStdAlnRep1.narrowPeak.  These files are in other folders.
  3. Copy the command, which may span several lines.  It ends with the word "temp".
  4. Minimize the README window.
  5. Near the bottom left, locate the terminal icon; it has this symbol: >
  6. Click on the terminal icon.
  7. Change to the folder for Figure 5 by entering the command: cd ~/figures/5
  8. Paste the previously copied command into the terminal window and press the Enter or return key on the keyboard.
    1. ../../commonData/segmentations/Combined_7_state/chromhmm.segway.gm12878.comb11.concord4.bed ../../commonData/peaks/jan2011/spp/optimal/spp.optimal.wgEncodeUwTfbsGm12878CtcfStdAlnRep0_VS_wgEncodeUwTfbsGm12878InputStdAlnRep1.narrowPeak > temp
  9. Type the command, ls -ltr to view the long listing of the files in the folder.  Note the most recently written file with extension .csv.
    1. The name of the file is very long; it begins with segment_comp.chromhmm.segway. and ends with .csv.
  10. Minimize the Terminal Window.
  11. Double click on the folder figures and the subfolder 5.
  12. Double click on the output csv file.  This is a summary file indicating the number of overlaps for each type of segmentation state.
    1. If asked, select "Separated by" and "Comma" for the Separator options.  Then click on OK.
  13. Minimize the open windows, and return to the Terminal window.
  14. Now we will run a batch script to repeat this analysis over all of the transcription factors.
  15. In the terminal, make sure you are in the folder ~/figures/5.  If in doubt, type
    1. cd ~/figures/5
  16. Run the batch script by typing 
    1. ../../commonData/segmentations/Combined_7_state/chromhmm.segway.gm12878.comb11.concord4.bed ../../commonData/peaks/jan2011/spp/optimal/spp*Gm12878*.narrowPeak
    2. This command is found in the README file in the subfolder 5 of the desktop folder figures.
    3. Warning messages can be ignored.
  17. To view the perl script, minimize the terminal window, and click on the Home Folder near the top left (the second icon).  Ensure that Home is selected, then click on the folder called bin.  Next click on the folder perl.  Finally, click on the folder, and click the display button.  The equations for observed/expected coverage are found in the line beginning with my ($obs, $exp).
  18. To run the analysis, return to the terminal window by minimizing the open windows, and then click on the terminal icon on the left.  You should still be in the folder ~/figures/5. Type
    1. segment_comp.chromhmm.segway.*.csv > TF_matrix.R
    2. The results are conveniently written into an R script for display as a heatmap.
    3. To verify that the results have been written, type
      1. ls -ltr
    4. Check that the date and time of the file TF_matrix.R match the time the perl script was run.
    5. To view the results, go to the folder on the desktop called figures.  Once inside this folder, click on the subfolder 5.
    6. Scroll to the bottom and click on TF_matrix.R
      1. Two sets of numbers are given, one for coverage and one for counts.
      2. Each set of numbers has seven comma-separated columns and 67 rows, one per line.  The columns represent seven types of ENCODE elements: CTCF (transcriptional repressor), E (predicted enhancers), PF (predicted promoter flanking regions), R (predicted repressed or low-activity regions), T (transcribed or active gene bodies), TSS (transcription start sites), and WE (predicted weak enhancers).
      3. The rows represent transcription factors.
      4. Note that this is similar to the data plotted in the previous section.

Stopping the Virtual Machine

  1. To stop the VM, click on the menu Machine at the top left of the VM window, and select Close… Then select Power off the machine, and click OK.
  2. To exit the VM VirtualBox Manager, click on File and select Exit.
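
VirtualBox also ships a command-line tool, VBoxManage, that can start and power off a machine from a terminal on the host computer (not inside the VM). The following is a minimal sketch, assuming VBoxManage is on the PATH and that the imported appliance is registered under the name ENCODE, as shown in the VirtualBox window. The helper functions run, start_vm, and stop_vm are hypothetical; setting DRY_RUN=1 prints each command instead of executing it.

```shell
# Assumption: the VM is registered under this name in VirtualBox.
VM_NAME="ENCODE"

# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Start the VM (equivalent to selecting ENCODE and clicking Start).
start_vm() {
  run VBoxManage startvm "$VM_NAME"
}

# Power off the VM (equivalent to Machine > Close > Power off the machine).
stop_vm() {
  run VBoxManage controlvm "$VM_NAME" poweroff
}

# Usage on the host:
#   DRY_RUN=1 start_vm   # show the command without running it
#   start_vm             # start the VM
#   stop_vm              # power it off
```

As with the Power off option in the menu, this stops the machine immediately, so files open inside the VM should be saved first.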



Questions, Comments:  Medha Bhagwat, PhD