Course No. 18 "Bioinformatics Introduction to SQL"

  Problem 1

 

Bioinformatics projects often involve files with tabular data. The ability to examine these files by filtering and summarizing, or to manipulate them by joining, is provided by SQL (Structured Query Language). In particular, the application SQLite provides a minimal platform for learning the basics of SQL. Because SQL can be run from the command line, it can easily be incorporated into data analysis pipelines. Examples will be taken from bioinformatics projects.

Topics Covered

  • Terms used in database design
  • Key columns
  • Syntax of Structured Query Language (SQL)
  • SQLite

Students Will Learn

  • Filter tables
  • Summarize tables
  • Join tables
  • Administer novice databases

For example, we can find out the number of samples associated with an experiment, or we can find out the average number of samples per experiment for a particular researcher.

Steps for Setup

  1. The database application we will use, SQLite, is preinstalled on both Mac and Linux computers. To access it, a terminal app is available. For example, on Mac, go to Applications → Utilities → Terminal.app. Then type sqlite3.
  2. For Windows computers, the SQLite application can be downloaded from this site: https://www.sqlite.org/download.html, under the section “Precompiled Binaries for Windows.” In this class, we are working with https://www.sqlite.org/2016/sqlite-tools-win32-x86-3130000.zip. Double-click the zip file, and click on “Extract all files,” which is located in the Windows file explorer window beneath the menus, usually near the upper left of the window. Move the extracted folder to the Desktop. Inside the folder, double-click on sqlite3 to start it.

  3. The data file for this lesson is already installed on the training room computers, but those using other computers can download the file here: http://nihlibrary.ors.nih.gov/bioinfo/SQL/bioinf.zip. Double-click the zip file, and then click on “Extract all files,” which is located in the Windows file explorer window beneath the menus, usually near the upper left of the window. Move the extracted file, bioinf.db, to the sqlite-tools folder on the Desktop.

Introduction

SQL and databases are usually taught from the point of view of designing and building systems for long-term use. Our focus is using it as a utility for operations on tabular data. 

From the design and build stance, the need for databases is signaled by a table with a column that has the same information repeated on multiple rows. Why not create two separate tables, one containing the columns with unique information, and a second, smaller one to compact the columns where information is repeating on multiple rows? This idea leads to the introduction of two terms: database normalization and relational database.

Database normalization organizes the repetitive entries into separate tables, keeping track of the rows to which the entries relate. This gives birth to the concept of relational databases comprising tables with enough information in common to define their relationships and combine information as needed.

In the relational database context, columns are referred to as fields, and rows, as records. SQL is the most popular language for manipulating the information.

Since the focus of this class is SQL, we have chosen a database application that is simple to install and manage: SQLite. It provides all of the commands we will need for the functionality demonstrated in this class.

The demonstration data was inspired by GEOmetadb [1] and is from the Gene Expression Omnibus (GEO) [2] annotations. For example, samples are extracted from patients and assayed for gene expression using a specific technology or platform (GEO prefix GPL). All of the assayed samples (GEO prefix GSM) belonging to a single experiment are called a series (GEO prefix GSE). Datasets (GEO prefix GDS) are for the entire research project, possibly comprising more than one series.

We will look at a subset of this information to determine the types of experiments researchers were performing and the number of samples associated with them. Some researchers performed multiple types of experiments, such as gene expression for both RNA-Seq and microarray platforms. We will take advantage of the many built-in commands of databases to explore the data.

Our database comprises the following tables:

Contact: people associated with the data

Site: location associated with the data

Setting: date associated with the release of the data from a specific Site

Experiments: number of samples associated with an experiment of a specific type with a specific Contact

SQLite Setup

The database is stored in the file bioinf.db.

SQLite should already be open from the setup. (Windows users double-clicked on sqlite3, and Mac and Linux users typed sqlite3 from the terminal command line.)

Success: seeing the SQLite command prompt

sqlite>

It is called a prompt, because the program awaits input.

Load the database into SQLite by typing, at the prompt,

sqlite> .open bioinf.db

Notice the dot “.” in front of the open command. In SQLite, database manager commands begin with a dot.

To view all of the SQLite database manager commands, use:

sqlite> .help

The results of the commands will be easier to read if we includ the field headings.

sqlite> .headers on

Also, invoking the column format is useful for viewing results on the screen.

sqlite> .mode column

Finally, setting the column width for this mode is necessary.

Sqlite> .width 25

Filter Tables

Selecting Data

Begin by displaying the contact names.

sqlite> SELECT family, personal FROM Contact;      

Note the semi-colon at the end of the command.

To view all of the fields, use:

sqlite> SELECT * FROM Contact;

Sorting and Removing Duplicates

We want to know the types of experiments that were run; who the contacts are; and where the data originated.

To determine the types, we can use the Experiments table.

sqlite> SELECT type FROM Experiments;

We see some repetition here, but we just want a list of the unique types.

The DISTINCT keyword will remove the duplicates.

sqlite> SELECT DISTINCT type FROM Experiments;

Operating on both the person field and the type field yields the types for each person.

sqlite> SELECT DISTINCT person, type FROM Experiments;

Duplicates are removed regardless of where they are in the table. The data does not need to be sorted first.

Next we want to list the contacts in alphabetical order by their identifier field (id) using the key phrase ORDER BY.

sqlite> SELECT * FROM Contact ORDER BY id;

Reverse the order by adding the keyword DESC, which means descending.

sqlite> SELECT * FROM Contact ORDER BY id DESC;

We can sort the experiments in ascending order (keyword ASC) by contact and descending order by s_id.

sqlite> SELECT s_id, person, type FROM Experiments ORDER BY s_id DESC, person ASC;

Here we see the origin and people for the experiment types.

Some people seem to specialize in specific types of experiments. To see this more clearly, we can remove duplicates.

sqlite> SELECT DISTINCT type, person FROM Experiments ORDER BY type ASC;

Exercise 1

Retrieve the full names of the people in the Contact table, ordered by the family name.

Filtering

We want to see when data was released from a particular point of origin. The keyword WHERE introduces our filtering clause for the query.

sqlite> SELECT * FROM Setting WHERE site=’Institut Curie’;

The database manager executes the WHERE clause first and then retrieves the fields to display. This allows filtering on columns which are not displayed.

sqlite> SELECT s_id FROM Setting WHERE site=’Institut Curie’;

Several operators are also available: AND, OR, <, >, IN, LIKE.

We can find out how many release dates occurred before 2009.

sqlite> SELECT * FROM Setting WHERE site=’Institut Curie’ AND date<’2009-01-01’;

sqlite> SELECT * FROM Experiments WHERE person=’zwe’ OR person=’ern’;

An alternative to OR is IN.

sqlite> SELECT * FROM Experiments WHERE person IN (‘zwe’, ‘ern’);

Combining AND with OR requires care, aided by the use of parentheses.

If we want expression array experiments by either Delattre or Ernst, we should use parentheses.

sqlite> SELECT * FROM Experiments WHERE type=’expr_array’ AND (person=’del’ OR person=’ern’);

Note that removing the parentheses retrieves any experiment types by Ernst.

sqlite> SELECT * FROM Experiments WHERE type=’expr_array’ AND person=’del’ OR person=’ern’;

Suppose we are trying to remember a site beginning with the letter U. We can use a query that includes the LIKE keyword and has a wildcard symbol, %, which matches any characters in that place.

sqlite> SELECT * FROM Setting WHERE site LIKE ‘U%’;

When DISTINCT and WHERE are combined another level of filtering occurs.

sqlite> SELECT DISTINCT person, type FROM Experiments WHERE person=’del’ OR person=’ern’;

Note that DISTINCT only applies to the fields chosen for retrieval.

Exercise 2

Write a query to determine which experiments have samples numbering between 10 and 100.

Summarize Tables

Earlier we used DISTINCT to remove duplicates in fields. However, we would like to display the number of duplicates or perform other operations on this information. Aggregation provides the ability to perform such tasks: counting, summing, and averaging. Other operations are finding the minimum and maximum of data in a field.

Use the keyword COUNT to find out how many records are in a query.

sqlite> SELECT COUNT(s_id) FROM Experiments;

sqlite> SELECT COUNT(s_id) FROM Experiments WHERE samples > 50;

Use the keyword SUM to find out the total number of samples if all experiments are combined.

sqlite> SELECT SUM(samples) FROM Experiments;

AVG give the average number of samples.

sqlite> SELECT AVG(samples) FROM Experiments;

The earliest and latest release dates are found using the keywords MIN and MAX, respectively.

sqlite> SELECT min(date) FROM Setting;

sqlite> SELECT max(date) FROM Setting;

We would like to find out the number of experiments per person.

For one person, we already know how to write the query.

sqlite> SELECT person, count(s_id), type, samples

> FROM Experiments

> WHERE person=’del’;

To get the same information for all persons, the GROUP BY phrase is useful.

sqlite> SELECT person, count(s_id)

> FROM Experiments

> GROUP BY person;

Next, include the average number of samples per person.

sqlite> SELECT person, count(s_id), avg(samples)

> FROM Experiments

> GROUP BY person;

Finally, the phrase GROUP BY may be used on multiple columns.

sqlite> SELECT person, count(s_id), avg(samples)

> FROM Experiments

> GROUP BY person, type;

Exercise 3

How many experiments were submitted by Ernst, and what was the average number of samples?

Join Tables

We need to include the release dates and place of origin with our experiments information. However, this information is in another table. The SQL JOIN command can accomplish this.

sqlite> SELECT * from Setting JOIN Experiments;

The result is that each record of Setting is joined with each record of Experiments. However, the desired outcome is to see the setting relevant to the experiment. For this, we can use the ON keyword, which is similar to WHERE, especially in the context of this lesson.

sqlite> SELECT * FROM Setting JOIN Experiments ON Setting.s_id=Experiments.s_id;

Note the syntax Table.field to specify the field names.

Using this notation allows selection of fields from both tables for display.

sqlite> SELECT Setting.date, Setting.site, Experiments.type, Experiments.samples

> FROM Setting JOIN Experiments

> ON Setting.s_id=Experiments.s_id;

To add the full name of the person, multiple joins are required.

sqlite> SELECT Setting.date, Contact.personal, Contact.family, Setting.site, Experiments.type, Experiments.samples

> FROM Experiments

> JOIN Setting ON Experiments.s_id=Setting.s_id

> JOIN Contact On Experiments.person=Contact.id;

The joins are possible because the tables have primary keys and foreign keys. The primary key is a value unique for each record in the table. The foreign key is a value in another table that corresponds to the primary key. For example, the last JOIN above, Experiments.person, is a foreign key for the primary key Contact.id.

JOIN is useful for annotating gene lists. Given a list of gene identifiers in one table and a list of gene names in another table, the gene names table will also need the list of gene identifiers for a successful join yielding a single table the same size as the gene list with the gene names added.

Novice Database Administration

To import data into SQLite, the tables should first be created using the CREATE TABLE command. Included in this command is all of the information about the fields, including the column name and type.

For the bioinf.db database, the following create table commands were used.

To try them, first start a new sqlite window.

sqlite> CREATE TABLE Contact(id text, personal text, family text);

sqlite> CREATE TABLE Site(name text, lat real, long real);

sqlite> CREATE TABLE Setting(s_id integer, site text, date text);

sqlite> CREATE TABLE Experiments(s_id integer, person text, type text, samples real);

To remove tables, the DROP TABLE command is used.

sqlite> DROP TABLE Contact;

This command cannot be undone!

If we have files formatted as comma separated values (csv), then the SQLite .import command can be used to insert the data into the tables. Note the dot “.” which begins the command. (If you tried the DROP command, create the contact table again.)

First, tell SQLite that the data is formatted in comma separated values.

sqlite> .mode csv

sqlite> .import Contact.csv Contact

sqlite> .import Site.csv Site

sqlite> .import Setting.csv Setting

sqlite> .import Experiments.csv Experiments

Finally, tables can be modified by inserting, updating, or deleting records using the commands INSERT INTO, UPDATE SET, and DELETE, respectively.

sqlite> INSERT INTO Contact values(‘doe’, ‘John’, ‘Doe’);

sqlite> UPDATE Contact SET personal=’Jane’, WHERE id=’doe’;

sqlite> DELETE FROM Contact WHERE id=’doe’;

Additional Topics

Missing Data

Add missing data to a table.

sqlite> INSERT INTO Contact values(‘dee’, null, ‘Deer’);

A special phrase is required to find null values: IS NULL.

sqlite> SELECT * FROM Contact WHERE personal IS NULL;

To retrieve all records with values for personal, the phrase IS NOT NULL is used.

sqlite> SELECT * FROM Contact WHERE personal IS NOT NULL;

References

  1. GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus, Yuelin Zhu, Sean Davis, Robert Stephens, Paul S. Meltzer, and Yidong Chen, Bioinformatics (2008) 24 (23): 2798-2800
  2. Gene Expression Omnibus, http://ncbi.nlm.nih.gov/geo

Bioinformatics SQLite Databases

DisGeNET, http://www.disgenet.org/ds/DisGeNet/files/current/disgenet_2016.db.gz

GEOmetadb, https://gbnci-abcc.ncifcrf.gov/geo/GEOmetadb.sqlite.gz

SRAmetadb, http://gbnci.abcc.ncifcrf.gov/backup/SRAmetadb.sqlite.gz

Acknowledgements

This training is based on a module from Software Carpentry: Databases and SQL, http://swcarpentry.github.io/sql-novice-survey/.

 

Bioinformatics Program Main Page

 

Questions, Comments:  Lynn Young, PhD