Course No. 19 "Beginning Python for Bioinformatics"

 

 

Problem 1

Python is a scripting language commonly used for learning computer programming and automating tasks such as reformatting output from one application for input into another; exploring sequence alignments; or building workflows. This class seeks to provide the users with a taste of python and enough skills and understanding to use pre-built python tools to examine data. The class uses python 3.5 and Jupyter. Jupyter is a web application in which python statements can be typed and executed.

Topics Covered

  • Basics
  • Collections
  • Control Statements
  • Genome Tools package visualization

Students Will Learn

  • Basic programming
  • Importing and using packages
  • Importing tabular data
  • Data displaying heatmap

For example, we can import gene expression data for multiple samples, cluster both the samples and genes, and display the results in a heatmap.

Steps for Setup

In the steps below, “>” is used to indicate the command line prompt and “>>>” is used to indicate the python prompt.

1.      Install miniconda

a.     https://conda.io/miniconda.html for Python 3.6

b.     Open the Anaconda Prompt application

c.     > conda update conda

2.     Create and activate a Python 3.5 environment

a.     conda create –n pyclass python=3.5

b.     Windows

                     i.   > activate pyclass

c.     Mac, Linux

                     i.   > source activate pyclass

3.     Install and test python packages

a.     > pip install jupyter

 

b.     Test

                     i.   > jupyter notebook

                                                   ii.     Trouble shooting: If the notebook does not open automatically, check the output of this command. It should return a secure, password tokened URL which can be pasted into a browser.

                                                 iii.     To exit the notebook, go back to the Anaconda Prompt window, and use ctrl-c to end the notebook.

 

c.     > activate pyclass [should already be active]

 

d.     For Windows scipy, the author recommends the conda install instead of pip for windows. This works fine for Mac OS, as well.

                     i.   > conda install scipy

                                                   ii.     Test

1.  > python

2.  >>> import scipy

 

e.     > pip install scikit-learn

                                                    i.     Test

1.  > python

2.  >>> import sklearn

 

f.      > pip install pandas

                                                    i.     Test

1.  > python

2.  >>> import pandas

 

g.     > pip install plotly

                                                    i.     Test

1.  > python

2.  >>> import plotly

 

h.     > pip install genometools

i.       Test

                     i.   > python

                   ii.   >>> import genometools

 

4.     Upon completion of the training

a.     Windows

                     i.   deactivate pyclass

b.     Mac, Linux

                     i.   source deactivate pyclass

Setup for Pre-Installed Computers

For users who already have the installation, go to the Anaconda Prompt application, and

for Windows, type

> activate pyclass

and for Mac, type

> source activate pyclass

Then change to the folder containing the data, and type

> jupyter notebook

The data folder pythonclass used in this training can be downloaded: http://nihlibrary.ors.nih.gov/bioinfo/python/pythonclass.zip.

Introduction

The computer is waiting for instruction. At the basic level, python instructions look like simple math and search commands. The ability to store multiple items of information in a single instruction and use python-defined tools to act on this information can substitute for many database and spreadsheet software uses. Control statements can be added to filter data or to use information about the data to determine the next step in its handling. They can also be used to automate tasks that would normally have to be typed by hand. Finally, prebuilt tools such as GenomeTools (https://pypi.python.org/pypi/genometools) provide instructions to the computer for visualizing gene expression data.  

To begin the hands-on class, return to the Jupyter notebook, and click on the menu labeled New.  Select Python 3.  For the remainder of the class, type commands in this notebook. To better understand the notebook, go to the Help menu, and select User Interface Tour.

Basics

Numbers

 

>>> 42

>>> -999

>>> 42.5

>>> .05

>>> 5e-2

Strings

 

>>> 'CAATATTACTGTGCTGCTTTA'

 

Operations

Numerical

 

>>> 9 + 8

>>> 9 – 8

>>> 9 * 8

>>> 9 ** 2

>>> 8 / 2

>>> 9 / 2

 

String

 

>>> 'TT' in 'CAATATTACTGTGCTGCTTTA'

>>> 'GG' in 'CAATATTACTGTGCTGCTTTA'

>>> 'GG' not in 'CAATATTACTGTGCTGCTTTA'

>>> 'AT' + 'GC'

>>> 'GTG' + 'GTG'

>>> 'GTG' * 2

>>> 'GTG' * 10

>>> 10 * 'GTG'

 

Indexing

 

>>> 'CAATATTACTGTGCTGCTTTA'[0]

>>> 'CAATATTACTGTGCTGCTTTA'[1]

>>> 'CAATATTACTGTGCTGCTTTA' [-1]

>>> 'CAATATTACTGTGCTGCTTTA'[-5]

Slicing

 

>>> 'CAATATTACTGTGCTGCTTTA'[1:4]

>>> 'CAATATTACTGTGCTGCTTTA'[5:15]

>>> 'CAATATTACTGTGCTGCTTTA' [:15]

>>> 'CAATATTACTGTGCTGCTTTA' [5:]

>>> 'CAATATTACTGTGCTGCTTTA'[1:10:2]

>>> 'CAATATTACTGTGCTGCTTTA'[::-1]

Useful calls

 

Functions

The Help menu Python selection can lead to more information about builtin functions: https://docs.python.org/3.5/library/functions.html.

>>> len('CAATATTACTGTGCTGCTTTA')

>>> print('MKRISTTITTITITTGNGAG', 'CAATATTACTGTGCTGCTTTA')

>>> print('MKRISTTITTITITTGNGAG', 'CAATATTACTGTGCTGCTTTA', sep='\n')

 
Methods

https://docs.python.org/3.5/library/stdtypes.html?highlight=count#string-methods

>>> 'CAATATTACTGTGCTGCTTTA'.count('TT')

>>> 'CAATATTACTGTGCTGCTTTA'.find('ACTG')

Names

Instead of typing or pasting the various numbers, sequences, and operations repeatedly, we can assign names to them. Type the name to see the results of any of the operations.

>>> ageNow = 9

>>> ageNextYear =  ageNow + 1

>>> trpOp = 'MKRISTTITTITITTGNGAG'

>>> trpOpLen = len(trpOp)

>>> trpOpCpy = trpOp

Exercise:

1.     Use the string method find to get the location of 'KRIS' in the string 'MKRISTTITTITITTGNGAG'.

2.     Use the location and slicing to extract 'KRIS' from 'MKRISTTITTITITTGNGAG'.

Collections

Lists

 

Defining

 

>>> letters1 = ['a', 'b', 'c', 'd']

>>> letters2 = ['e', 'f', 'g']

 

Combining

 

>>> letters = letters1 + letters2

>>> letters

>>> letters1

>>> letters2

>>> letters1.extend(letters2)

>>> letters1

>>> letters2

Accessing List Items

 

>>> geneIDs = [
'132261066',
'132259081',
'132266121',
'132262686',
'132258067',
'132266050',
'132258516',
'132276892',
'132256009',
'132260517',
'132277332',
'132269888',
'132257862',
'132257849',
'132286112',
'132256050',
'132261869']
 
The first item in the list is
 
>>> geneIDs[0]
 
The last item in the list is
 
>>> geneIDs[-1]

Mappings

 
>>> geneIDmapping = { 
# GDS6010
# ID           geneName
'132261066': 'CTGF',
'132259081': 'CHAC1',
'132266121': 'SESN2',
'132262686': 'RGS16',
'132258067': 'SLCO2A1',
'132266050': 'ATF3',
'132258516': 'KRTAP1',
'132276892': 'CKLF',
'132256009': 'MSMO1',
'132260517': 'LINC00312',
'132277332': 'LIF',
'132269888': 'PDE5A',
'132257862': 'CSF2',
'132257849': 'HMGCS1',
'132286112': 'TRAF1',
'132256050': 'STC2',

'132261869': 'LDLR'

}

 

>>> geneIDmapping.get('132257849')

 

>>> geneIDmapping.get(geneIDs[0])

Files

>>> samfile = open('sample.sam', 'r')

The character '\n' represents the end of a line. So we can split on the end of line character.

>>> text = samfile.read()

>>> text.split('\n')[1]

It will look better if we use the print function.

>>> print(text.split('\n')[1])

>>> samfile.close()

Automatic closing is accomplished using the with statement.

>>> with open('sample.sam', 'r') as samfile:

        print(samfile.read().split('\n')[1])

show the data for the first read.

>>> with open('sample.sam', 'r') as samfile:

        print(samfile.read().split('\n')[2])

 

Exercise

1.  >>> classes1 = ['bioinformatics', 'biology','statistics']

>>> classes2 = ['history', 'English']

>>> classes = classes1 + classes2 

 

Given the above code, what is the result for classes? Are classes1 and classes2 changed?

 

>>> classes1.extend(classes2)

                                            

2.     Suppose we have patient identifiers, p2348, p38942, p9876, p6876, with the respective tissue types, normal, tumor, tumor, normal. Create a mapping for this information. 

Write a python statement to retrieve the tissue type for the third patient.

Control Statements

Conditionals

For example, if we had used python to read flags from a SAM formatted file, we can check the file for unmapped reads using this conditional.

Simple

>>> flag = 4

>>> if flag == 4:

        print('unmapped')

>>> flag = 83

Provide alternative

>>> if flag == 4:

        print('unmapped')

    else:

        print('mapped')

>>> flag = 1

Multiple conditions

>>> if flag == 4:

        print('unmapped')

    elif flag == 1:

        print('read paired')

    else:

        print('For more information see Decoding SAM flags: http://broadinstitute.github.io/picard/explain-flags.html')  

>>> flag = 99

Rerun the multiple conditions example.

Loops

>>> flag = [163, 83, 4, 145, 99, 163, 4, 147, 4, 161]

>>> for f in flag:

>>>     print(f)

>>> number_unmapped = 0

>>> for f in flag:

     if f == 4:

        number_unmapped = number_unmapped + 1

>>> number_unmapped

We want to extract only the flags from the file sample.sam .

We can do it by reading lines in a file and keeping only the columns of interest.

>>> column = []

>>> with open('sample.sam', 'r') as samfile:

        for line in samfile:

           column.append(line.split()[1])

>>> column

Add a filter using a conditional for the flags.

>>> flag = []

>>> with open('sample.sam', 'r') as samfile:

    for line in samfile:

        if line[0] != '@':

            flag.append(line.split()[1])

>>> flag

Exercise:

1.     Write a loop to map all of the geneIDs in the list from the earlier section to geneNames in the mapping in the earlier section, and print the geneNames.

 

2.     Add a conditional statement in the loop so that geneNames are printed only for geneNames beginning with 'C'.

GenomeTools

The Genome Tools package was written by …

You can modify the scripts to change its behavior.

We will read a table of gene expression data located in the data folder. If you open the spreadsheet, you will see that the rows are gene names and the columns are sample names.

Example genes vs samples

Hierarchical clustering of expression matrix

Using a python package plotly

Classes for working with expression data: http://genometools.readthedocs.io/en/latest/api/expression.html?highlight=ExpMatrix#genometools.expression.ExpMatrix

>>># setup

>>> from plotly.offline import init_notebook_mode, iplot

>>> import plotly.graph_objs as go

>>> init_notebook_mode()

>>> import os

>>> from genometools.expression import ExpMatrix

>>> from genometools.expression.cluster import bicluster

>>> expression_file = os.path.join('data', 'brca_expression_5yr_survive.tsv')

>>> # read expression data

>>> matrix = ExpMatrix.read_tsv(expression_file)

>>> print('matrix shape: ', matrix.shape)

>>> # variance filtering

>>> matrix = matrix.filter_variance(top=2000)

>>> # center expression for each gene

>>> matrix = matrix.center_genes(use_median = True)

>>> # perform hierarchical clustering on both genes and samples

>>> matrix = bicluster(matrix)

>>> # visualize the clustered matrix as a heatmap

>>> fig = matrix.get_figure(

>>>       height = 800, width = 900,

>>>       emin=-2.0, emax=2.0,

>>>       margin_bottom = 100,

>>>       heatmap_kw = dict(colorbar_label =

'Centered Expression<br>(log<sub>2</sub>-scale) '))

>>> iplot(fig)

 

Exercises

1.     Modify the above code to show fewer genes: the top 100 genes.

 

2.     Explore the genome tools module more by typing

 

help("genometools")

 

Look at the details of additional genometools packages by typing

 

help("genometools.xxx")

 

where xxx is name of package.

 

For example, if enrichment is the package of interest, type

 

help("genometools.enrichment")

 

3.     Explore the data folder for brca expression files to analyze.

References

Bioinformatics Programming Using Python, M.L. Model, O’Reilly, 2010

Software Carpentry Python Lesson, http://swcarpentry.github.io/python-novice-inflammation/

The Python Tutorial, https://docs.python.org/3.5/tutorial/index.html

 

 

Questions, Comments:  Lynn Young