NIH Library's Bioinformatics Courses

The NIH Library offers several bioinformatics courses that describe the effective use and practical applications of available bioinformatics resources. 
Each course is two hours long and includes both a lecture/demo and a hands-on session. 

Please refer to the Bioinformatics web page for the training schedule and other bioinformatics resources offered by the NIH Library.

Course No. 1 "Sequence Analysis:  Making Sense of DNA and Protein Sequences"

In this class, students will find a gene within a eukaryotic DNA sequence. They will then learn how to predict the function of the implied protein product by seeking sequence similarities to proteins of documented function using BLAST and other tools. Finally, students will find a 3D modeling template for this protein sequence using a Conserved Domain Database Search. During the first hour, the instructor will walk students through an analysis of an uncharacterized genomic sequence from a GenBank record. During the second hour of the class, students will perform the same analysis on another genomic sequence.

Notebook 1      Notebook 2

Course No. 2 "Gene Resources: From Transcription Factor Binding Sites to Function"

This course describes how to obtain information about a human gene at every level of the central dogma (genome, transcript, and protein), along with the transcription factors that regulate its expression. It also covers the single nucleotide polymorphisms (SNPs) in the gene and which of them are known to be associated with disease.  For more information....

During the first hour, an instructor will walk you through an analysis of a human gene found in Problem 1. During the second hour of the class, you will perform a similar analysis on another human gene, as described in Problem 2.

Problem 1      Problem 2

Course No. 3 "Sequence Similarity Search: BLAST"

This course offers a practical introduction to nucleotide and protein sequence similarity searching using NCBI's BLAST family of programs.
Exercises range from simple searches to creative uses of the BLAST programs.

Topics to be covered include:

A. advantages of the different BLAST programs, such as blastn, blastp, and tblastn, and when to use each one
B. how to limit your searches to make them more specific
C. how to understand the results
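
As a rough illustration of topic A, the choice of program follows from the type of the query and the type of the database being searched. The sketch below is not part of the course materials; the function name `choose_blast_program` and the type labels are made up for this example.

```python
# Map (query type, database type) to the standard BLAST program for that pair.
# Note: tblastx also takes a nucleotide query and a nucleotide database, but it
# translates both in six frames; it is left out of this simple lookup.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # the query is translated
    ("protein", "nucleotide"): "tblastn",  # the database is translated
}


def choose_blast_program(query_type, database_type):
    """Return the BLAST program name for a query/database type pair."""
    key = (query_type, database_type)
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"unsupported combination: {key}")
    return BLAST_PROGRAMS[key]


# A protein query searched against a nucleotide database calls for tblastn.
program = choose_blast_program("protein", "nucleotide")
```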


Course No. 4 "Sequence Similarity Search: BLAST-Like Alignment Tool (BLAT)"

This course demonstrates how to use BLAT to map a cDNA/mRNA sequence to a genome to identify exon-intron locations in the genomic sequence, and to map a protein sequence to a genome to search for gene family members.  It also demonstrates how to visualize the alignment in the UCSC Genome Browser and compare the results to a similar search done using NCBI's BLAST. 

Problem 1      Problem 2

Course No. 5 "Protein Structural Analysis: Binding Sites to Distant Homologs"

This course covers how to visualize and annotate 3D protein structures using NCBI's Cn3D program, identify conserved domain(s) and ligand binding sites in a protein, search for other proteins containing similar domain(s), explore a 3D modeling template for the query protein and find distant sequence homologs that may not be identified by BLAST.

Problem 1      Problem 2

Course No. 6 "Genome Browsers"

In this course, we will use the genome browsers from NCBI, UCSC and Ensembl. Used to view the assembly of the complete human genome, these browsers are valuable tools to identify and localize genes, and obtain information about them. In this course, we will see how to view different human genome maps/tracks and make best use of them. For example, the EST map can be used to identify undocumented exons or generate the alternative splice products of genes.

Problem 1      Problem 2

Course No. 7 "Identification of Disease Genes"

This course deals with identification of a disease gene using NCBI's human genome assembly. The reference genome assembly, along with integrated maps, literature, and expression information comprises a powerful discovery system for exploring candidate human disease genes.

We will start with expressed sequences obtained from a patient, identify the gene(s) expressing them, download their sequences and identify known SNPs in the expressed sequences, if any, that may contribute to the disease phenotype.

Problem 1      Problem 2

Course No. 8 "Correlation of Disease Genes to Phenotypes"

This course focuses on the correlation of a disease gene to the phenotype. It demonstrates how bioinformatics resources such as literature, expression and structure information can help provide potential functional information for disease genes.

This course describes how to determine what is known about a disease, the gene(s) associated with it and its genetic testing. We will then elucidate the biochemical and structural basis for the phenotype caused by the mutant protein.

Problem 1      Problem 2

Course No. 9 "Microbial Genome Analysis"

This course describes how to access the microbial genome sequences and annotations, explains how to navigate and download the gene and protein datasets, and introduces the available genomic and comparative genomic analysis tools from NCBI, IMG and EcoCyc.

During the first hour, an overview will be given using E. coli as an example, as described in Problem 1. During the second hour of class, you will perform a similar analysis on another organism.

Problem 1      Problem 2

Course No. 10 "Gene Expression Microarray Data Analysis"

This course describes the analysis of microarray data for gene expression.  It shows how to set up the information needed to describe the experiment, how to analyze the data, and how to interpret the results of the analysis.

Problem 1      Problem 2

Course No. 11 "Next Generation Sequencing Data Analysis"

Massively parallel sequencing, also known as next generation sequencing, is a technology enabling high-throughput sequencing of genomes or loci of interest.  This course focuses on a single locus.  It examines the quality of the sequence reads, the mapping of reads, and visualization.  It also examines sequence variation.
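
To make "quality of the sequence reads" concrete, here is a minimal sketch that computes the mean Phred quality of each read in FASTQ-formatted text. It assumes the common Phred+33 encoding and four-line FASTQ records; the read names and sequences are made up, and the course itself may use different tools for this step.

```python
def mean_phred_quality(quality_string, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoded qualities."""
    if not quality_string:
        raise ValueError("empty quality string")
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def read_fastq(lines):
    """Yield (name, sequence, quality) from FASTQ lines, four lines per record."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        sequence = next(it).strip()
        next(it)  # the '+' separator line
        quality = next(it).strip()
        yield header[1:], sequence, quality


# Made-up reads: 'I' encodes Phred 40 and '5' encodes Phred 20 under Phred+33.
EXAMPLE_FASTQ = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGT", "+", "5555",
]

qualities = {
    name: mean_phred_quality(qual) for name, _seq, qual in read_fastq(EXAMPLE_FASTQ)
}
```

Reads whose mean quality falls below a chosen threshold are often trimmed or discarded before mapping; the threshold is a judgment call and is not specified here.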

Problem 1      Problem 2

Course No. 12 "Gene Expression Omnibus (GEO)"

This class demonstrates how to search for an expression record in Gene Expression Omnibus (GEO), obtain differentially expressed genes and information about their pathway enrichment.

Problem 1      Problem 2   

Course No. 13 "Introduction to Clinical Genomics"

This class describes how to access information about genes and their variants associated with diseases and the impact of variants on drug response and dosing guidelines. The class also provides an introduction to determination of the impact of the variants on function, pathogenicity or deleteriousness.

Problem 1

Course No. 14 "Bioinformatics Data Integration Using Galaxy"

Galaxy utilities can be used for data integration and are especially useful for integrating files with genomic coordinates.  For example, a file with a list of SNPs and their genome locations can be joined with a file of genes and their locations to determine overlap.  Another type of data integration is for files containing the same types of identifiers.  For example, a file of gene identifiers and expression values can be joined with a file of gene identifiers and annotations.
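
The two kinds of join described above can be sketched in plain Python. This is not Galaxy itself, only an illustration of the logic; the SNP and gene names are hypothetical, and coordinates are assumed to be 0-based with half-open gene intervals (as in BED files).

```python
def snps_in_genes(snps, genes):
    """Join SNPs to genes by genomic overlap.

    snps:  iterable of (chrom, position, snp_id)
    genes: iterable of (chrom, start, end, gene_name), half-open [start, end)
    Returns a list of (snp_id, gene_name) pairs.
    """
    hits = []
    for chrom, pos, snp_id in snps:
        for g_chrom, start, end, name in genes:
            if chrom == g_chrom and start <= pos < end:
                hits.append((snp_id, name))
    return hits


def join_by_identifier(expression, annotation):
    """Join two {gene_id: value} tables on the shared gene identifier."""
    return [
        (gene_id, value, annotation[gene_id])
        for gene_id, value in expression.items()
        if gene_id in annotation
    ]


# Hypothetical input data.
snps = [("chr1", 150, "snp1"), ("chr1", 900, "snp2"), ("chr2", 150, "snp3")]
genes = [("chr1", 100, 200, "geneA"), ("chr2", 100, 200, "geneB")]
overlaps = snps_in_genes(snps, genes)

expression = {"geneA": 2.5, "geneB": 0.1}
annotation = {"geneA": "hypothetical kinase"}
annotated = join_by_identifier(expression, annotation)
```

The nested loop is fine for small files; for genome-scale data, sorted or indexed interval structures are the usual choice.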

Problem 1      Problem 2 

Course No. 15 "The ENCODE (Encyclopedia of DNA elements) Virtual Machine"

One of the difficulties in reading journal articles is finding enough information to reproduce the results.  The ENCODE Project Consortium addressed this problem by making the ENCODE Virtual Machine available as supplemental material to their publication, "An integrated encyclopedia of DNA elements in the human genome".  In this way, readers who want to reproduce figures in the paper and understand details of some of the analyses can access open source software and data within a Linux environment that is already set up for running the software.  In this course, commands are given to reproduce a figure in the article and to integrate data to generate the results for the figure.  The chosen figure compares predicted genome binding segments with transcription factor loci.

Problem 1

Course No. 16 "TCGA Data Analysis"

This course gives background information on how to access TCGA data and understand the different data types and levels. More importantly, it introduces various publicly available online tools for analyzing the data to derive biologically meaningful information.   The course will demonstrate two approaches to analysis: cancer(s)-centric (from a cancer to its significantly mutated genes) and gene(s)-centric (from one or more genes to several cancers).

Problem 1

Course No. 17 "UNIX Command Line"

The UNIX command line interface (CLI) provides powerful access to computer files, especially for complex operations on several files at once. This exercise assumes no prior knowledge of UNIX or Linux. It covers the basics of file and folder operations. Additional topics include finding files and information in files, loops, and shell scripts. The example data are based on files from UniProt and the Sequence Read Archive. For example, we can find out which organism name occurs most frequently in a UniProt file, or count the mismatches for aligned reads in a BWA SAM file.
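
The course itself works with shell commands. As a language-neutral illustration of the two example questions, the Python sketch below counts organism names in UniProt-style FASTA headers (the `OS=` field) and reads the `NM` tag from SAM alignment lines. All records are made up. Note that `NM` is the edit distance to the reference, so it counts inserted and deleted bases as well as mismatches.

```python
import re
from collections import Counter

# Made-up lines in the UniProt FASTA header layout (not real records).
UNIPROT_HEADERS = [
    ">sp|ACC1|PROT1_HUMAN Example protein one OS=Homo sapiens OX=9606 GN=EX1 PE=1 SV=1",
    ">sp|ACC2|PROT2_HUMAN Example protein two OS=Homo sapiens OX=9606 GN=EX2 PE=1 SV=1",
    ">sp|ACC3|PROT3_MOUSE Example protein three OS=Mus musculus OX=10090 GN=Ex3 PE=1 SV=1",
]

# Made-up SAM lines: one header line followed by two alignment records.
SAM_LINES = [
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:1",
    "read2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0",
]

# The organism name runs from "OS=" up to the next two-letter "XX=" field.
ORGANISM_PATTERN = re.compile(r"OS=(.*?)(?= [A-Z]{2}=|$)")


def count_organisms(header_lines):
    """Count how often each organism name appears in FASTA header lines."""
    counts = Counter()
    for line in header_lines:
        if not line.startswith(">"):
            continue  # sequence lines carry no organism field
        match = ORGANISM_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def edit_distances(sam_lines):
    """Return {read_name: NM value} for alignment lines that carry an NM tag."""
    result = {}
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        for tag in fields[11:]:  # optional fields start at column 12
            if tag.startswith("NM:i:"):
                result[fields[0]] = int(tag[5:])
    return result


most_common_organism = count_organisms(UNIPROT_HEADERS).most_common(1)[0]
nm_by_read = edit_distances(SAM_LINES)
```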

Problem 1

Course No. 18 "Bioinformatics Introduction to SQL"

Bioinformatics projects often involve files with tabular data. The ability to examine these files by filtering and summarizing, or to manipulate them by joining, is provided by SQL (Structured Query Language). In particular, the application SQLite provides a minimal platform for learning the basics of SQL. Because SQL can be run from the command line, it can easily be incorporated into data analysis pipelines. Examples will be taken from bioinformatics projects.
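
The three operations named above (filtering, summarizing, and joining) can be tried without any setup by using Python's built-in `sqlite3` module with an in-memory database. This is only a sketch: the table names, column names, and values are hypothetical, and the course may run SQLite directly from the command line instead.

```python
import sqlite3

# An in-memory SQLite database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE expression (gene_id TEXT, sample TEXT, value REAL);
    CREATE TABLE annotation (gene_id TEXT, description TEXT);
    INSERT INTO expression VALUES
        ('geneA', 's1', 10.0), ('geneA', 's2', 14.0),
        ('geneB', 's1', 1.0),  ('geneB', 's2', 3.0);
    INSERT INTO annotation VALUES
        ('geneA', 'hypothetical kinase'),
        ('geneB', 'hypothetical transporter');
    """
)

# Filtering: keep only rows with an expression value above 5.
filtered = conn.execute(
    "SELECT gene_id, sample, value FROM expression "
    "WHERE value > 5 ORDER BY sample"
).fetchall()

# Summarizing: mean expression per gene.
summary = conn.execute(
    "SELECT gene_id, AVG(value) FROM expression "
    "GROUP BY gene_id ORDER BY gene_id"
).fetchall()

# Joining: attach each gene's description to its mean expression.
joined = conn.execute(
    "SELECT e.gene_id, a.description, AVG(e.value) "
    "FROM expression AS e JOIN annotation AS a ON a.gene_id = e.gene_id "
    "GROUP BY e.gene_id, a.description ORDER BY e.gene_id"
).fetchall()

conn.close()
```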

Problem 1