Course No. 17 "UNIX Command Line"

  Problem 1

Topics Covered

     Files and folders

     Finding files

     Finding and analyzing information in files

     Loops

     Shell scripts

Students will learn

     Filter information in files

     Sort filtered information from files

     Condense multiple occurrences of repetitive lines

     Count number of repetitive lines

     Batch operations

     Save batch operations for future projects

For example, we can find out which organism name occurs most frequently in a UNIPROT file. Or we can find out the number of mismatches for aligned reads in a bwa SAM file.

Steps for setup (not required for training room computers)

 

1.   The data files for this lesson are already installed on the training room computers, but those using other computers can download the archive file here: http://nihlibrary.ors.nih.gov/bioinfo/UNIX/cli.zip. Save the file to the Desktop, and extract the contents there.

2.    Both Mac and Linux computers have UNIX-like command line interfaces. To access the command line on these computers, a terminal app is usually available. For example, on Macs, go to Applications → Utilities → Terminal.app.

3.    For Windows computers, the UNIX command line can be emulated using the application Git BASH. It is installed as part of Git for Windows: https://git-for-windows.github.io/. Click on Download and download the version appropriate for your computer. (To find out whether you need 32-bit or 64-bit, click on the Windows icon or pearl in the lower left corner; then right-click on Computer in the list on the right-hand side. Select Properties at the bottom of the menu. Under the section System, the line System type indicates 32-bit or 64-bit.) For Windows users who do not have administrator rights, PortableGit can be used.

To begin the lesson, open the UNIX command line application. For example, Windows users should open the git-bash program, and Mac users should open Terminal.app.

The dollar sign ($) is called a prompt and indicates that the application is waiting for you to type a command. In this lesson, the words after the dollar sign ($) are the commands to be typed; the dollar sign itself is not typed.

To complete the setup, navigate to the files to be used for the class by typing the following commands.

$ cd

$ cd Desktop/cli

Find out the name the computer has for the current user.

$ whoami

Files and Folders

Folders are directories. We will learn to view the files in the directory and to navigate between directories.

Display the location of the present working directory; that is, find out where you are.

$ pwd

Show contents of a directory.

$ ls

The following command displays contents in a way that distinguishes between files and directories by adding the symbol “/” to the end of the directory name. Directories created within existing directories are called subdirectories.

$ ls -F

Show the contents of a subdirectory.

$ ls sam

Move to a subdirectory.

$ cd sam

$ ls

Move up one level above the current directory; that is, move to the parent directory.

$ cd ..

Verify your location with the command we learned earlier.

$ pwd
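The navigation steps above can be tried safely in a scratch directory. The following is a minimal sketch, not part of the lesson data: the directory names demo and sam and the file readme.txt are made up for illustration, and mktemp is assumed to be available (it is on most Linux, Mac, and Git BASH installations).

```shell
# Build a throwaway directory tree (names are illustrative only).
workdir=$(mktemp -d)
mkdir -p "$workdir/demo/sam"
touch "$workdir/demo/readme.txt"

cd "$workdir/demo"
ls -F            # the sam directory is listed with a trailing "/"
cd sam
inside=$(pwd)    # location after moving into the subdirectory
cd ..
parent=$(pwd)    # location after moving back up one level
echo "$inside"
echo "$parent"
```

The two locations printed at the end differ only by the final /sam, which is the level that "cd .." removes.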

Viewing, Creating, and Removing

A useful command for opening and creating files is vi. The command rm will remove a file. Files can also be copied by using the cp command and moved using the mv command.

For directories, creation is accomplished with mkdir and removal with rmdir.

Navigate to the Desktop/cli directory if you are not already there. (Hint: type pwd to display your location.)

Files

Open a file.

$ vi partial_uniprot_sprot.dat

Close a file.

Type :q

Create a file.

$ vi note.txt

Insert text:

Type i

then begin typing, and when finished, press the “esc” key on the upper left of the keyboard.

Save the text.

Type :w

Close the file.

Type :q

Delete a file.

$ ls

$ rm junk.txt

$ ls

Directories

Create a directory.

$ mkdir docs

$ ls -F

Remove a directory.

$ rmdir docs

$ ls -F

More file operations

Copy a file.

$ cp note.txt notecopy.txt

Move a file (can also be used for renaming).

$ mv note.txt notes.txt

$ ls

$ mkdir docs

$ ls -F

$ mv notecopy.txt docs

$ ls -F

$ ls docs
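To check what each of these operations leaves behind, the same sequence can be run in a scratch directory. This is a sketch under assumptions: the file is created with echo and the ">" operator (covered later in this lesson) instead of vi, and the scratch directory comes from mktemp.

```shell
# Work in a scratch directory so no lesson files are touched.
cd "$(mktemp -d)"
echo "first note" > note.txt      # create a small file without an editor

cp note.txt notecopy.txt          # copy: both files now exist
mv note.txt notes.txt             # rename: note.txt is gone, notes.txt exists
mkdir docs
mv notecopy.txt docs              # move: docs is a directory, so the file goes inside it
ls -F
ls docs
```

The same mv command renames when its last argument is a new file name and moves when its last argument is an existing directory.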

Exercise 1

The directory structure keeps track of all of the levels of subdirectories. This is called a tree structure. Suppose we type the following commands. How many levels of subdirectories will we have if we begin counting from the cli directory?

mkdir project

cd project

mkdir experiment1

mkdir experiment2

cd experiment1

mkdir sample1

mkdir sample2

cd ..

cd experiment2

mkdir sample1

mkdir sample2

Exercise 2

Building on the previous exercise, suppose a file has been placed in the project subdirectory, and suppose it was generated from sample 1 of experiment 2 and needs to be moved to the sample1 subdirectory in the directory structure.  List the command(s) needed to accomplish this.

Finding Files

Files

The current directory can be referred to by the symbol “.” which is the period.

$ find . -name notecopy.txt -print

Note the resulting location is given relative to the symbol or location used in the find request. The result is called the path and file name.
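A short sketch of how the starting location changes the reported path. The directory and file are recreated in a scratch directory here so that the example does not depend on earlier steps.

```shell
cd "$(mktemp -d)"
mkdir -p docs
touch docs/notecopy.txt

# Starting from ".", the result is reported relative to ".".
from_dot=$(find . -name notecopy.txt -print)
# Starting from "docs", the result is reported relative to "docs".
from_docs=$(find docs -name notecopy.txt -print)
echo "$from_dot"     # ./docs/notecopy.txt
echo "$from_docs"    # docs/notecopy.txt
```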

Finding and Analyzing Information in Files

Information in files

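Searching for information in a file uses grep. For example, in UNIPROT, the organism name is found on lines beginning with 'OS'. If we want to find all of the organism names in the file, we can use grep as follows. (The pattern 'OS ' matches those characters anywhere in a line; the pattern '^OS ' would restrict the match to the beginning of a line.)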

$ grep 'OS ' partial_uniprot_sprot.dat
 

Directing and analyzing results

Structured data, non-tabular

To direct the output to a file, use the operator “>”.

$ grep 'OS ' partial_uniprot_sprot.dat > sprotOS.txt

To analyze the results we use the sort and uniq commands. The sort command sorts alphabetically. The uniq command removes successive identical lines, so the file is sorted first to bring identical lines together. Commands can have options, usually designated by a hyphen “-”. The option -c for the uniq command will prefix each line with its number of occurrences. The -g option for the sort command will sort according to a general numerical value.

$ sort sprotOS.txt > sprotOSsorted.txt

$ uniq -c sprotOSsorted.txt > sprotOSsummary.txt

To combine multiple commands in one line use the pipe operator “|”.

$ grep 'OS ' partial_uniprot_sprot.dat | sort > sprotOSsorted.txt

$ grep 'OS ' partial_uniprot_sprot.dat | sort | uniq -c > sprotOSsummary.txt
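The following sketch shows why sort comes before uniq, and how a second sort on the counts finds the most frequent line. The three organism lines are made up for illustration rather than taken from the UNIPROT file, and sort -n (plain integer sort) is used in place of -g; for whole-number counts the two give the same order.

```shell
# Hypothetical organism lines, deliberately out of order.
lines=$(printf 'OS   Homo sapiens\nOS   Mus musculus\nOS   Homo sapiens\n')

# Without sort, the two "Homo sapiens" lines are not adjacent, so uniq keeps both.
unsorted_count=$(printf '%s\n' "$lines" | uniq | wc -l)
echo "$unsorted_count"

# With sort, identical lines become adjacent and are condensed and counted.
summary=$(printf '%s\n' "$lines" | sort | uniq -c)
echo "$summary"

# Sorting the counts numerically puts the most frequent line last.
top=$(printf '%s\n' "$lines" | sort | uniq -c | sort -n | tail -n 1)
echo "$top"
```

The last line printed is the "Homo sapiens" line with a count of 2.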

To count the number of lines in the file, use the “wc” command.

$ wc sprotOSsummary.txt

The first number in the results is the number of lines. Also given are the number of words and the number of bytes.

$ wc -l sprotOSsummary.txt

prints only the number of lines.
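A small sketch on a made-up three-line file. It also uses the "<" operator, which is not covered in this lesson: it feeds the file to wc as input, so that only the number is printed, without the file name.

```shell
tmpfile=$(mktemp)
printf 'one two\nthree\nfour five six\n' > "$tmpfile"

wc "$tmpfile"                    # lines, words, bytes, then the file name
nlines=$(wc -l < "$tmpfile")     # reading from standard input omits the file name
echo "$nlines"
rm "$tmpfile"
```

Capturing the count in a variable, as nlines does here, is convenient when the number is needed later in a script.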

Tabular data

$ cd sam

$ vi SRR016865srtd27to28M.sam

The @PG entry in the header shows that bwa is the program that generated the alignment.

Column 16 of the file is an XM tag which gives the number of mismatches in the alignment.

Here we examine only column 16 by using the cut command. We then summarize the number of mismatches over the entire file using pipes and the sort and uniq commands.

$ cut -f16 SRR016865srtd27to28M.sam | sort | uniq -c

XO is the number of gap openings and can be found in column 17.

$ cut -f17 SRR016865srtd27to28M.sam | sort | uniq -c

MD gives the position of each mismatch and the reference base at that position. For example, MD:Z:0A40 indicates that the one mismatch is in the first position, where the reference base is A, and MD:Z:40A0 indicates that the one mismatch is in the last position, where the reference base is A. The MD tag is in column 19.

$ cut -f19 SRR016865srtd27to28M.sam | sort | uniq -c | sort -g
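The cut, sort, and uniq pipeline can be checked on a few records small enough to count by hand. The records below are invented for illustration and have only three tab-separated columns, with the third column imitating an XM tag.

```shell
# Three made-up tab-separated records; the third column imitates an XM tag.
records=$(printf 'read1\t0\tXM:i:0\nread2\t0\tXM:i:1\nread3\t16\tXM:i:0\n')

# Keep only column 3, then count how often each value occurs.
summary=$(printf '%s\n' "$records" | cut -f3 | sort | uniq -c)
echo "$summary"
```

The summary has one line per distinct value: XM:i:0 with a count of 2 and XM:i:1 with a count of 1.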

Exercise 3

Column 2 in the SAM format is the FLAG that provides mapping information about the read. (The numbers are bit flags that are explained at https://broadinstitute.github.io/picard/explain-flags.html.)

For example, if the flag is set at 4, the read is unmapped.
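Because the FLAG is a sum of bit values, a read can be unmapped even when its flag is not exactly 4. As a sketch, shell arithmetic with the "&" operator (not otherwise covered in this lesson) can test whether the 4 bit is part of a flag; the flag values 77 and 16 are chosen only as examples.

```shell
# A flag of 77 is the sum 64 + 8 + 4 + 1, so the 4 bit (read unmapped) is included.
# "&" keeps only the bits present in both numbers.
bit_in_77=$(( 77 & 4 ))
echo "$bit_in_77"    # prints 4: the bit is set

# A flag of 16 (read on the reverse strand) does not include the 4 bit.
bit_in_16=$(( 16 & 4 ))
echo "$bit_in_16"    # prints 0: the bit is not set
```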

Write a command or commands to summarize the types and counts of flags for the file SRR016865srtd27to28M.sam.

Loops

Suppose that we want to operate on all of the files instead of typing the command separately for each file.

We can use the for;do;done combination, as follows.

$ for samfile in *.sam

>  do

>       echo $samfile

>  done

When the line beginning with “for” is typed at the dollar sign prompt “$”, a new prompt appears: “>”, the greater-than symbol, also known as the right angle bracket. The remaining lines of the loop are typed after this prompt; the “>” itself is not typed.

The word following “for” is used to represent each file in the for loop.   It is called a variable, because it changes each time around the loop.  In subsequent lines, it is prepended with a dollar sign to tell the command line that it is a variable.  Next the “do” command indicates that subsequent lines are commands to be executed until the “done” command appears.

In the example above, “samfile” is the variable, and only one command is given between “do” and “done”: the “echo” command, which prints the file name. The wildcard character * with the extension .sam is expanded to the list of matching files, SRR016861srtd27to28M.sam, SRR016862srtd27to28M.sam, and SRR016865srtd27to28M.sam. Each time around the loop, $samfile is replaced with the next file name in that list until all files have been used.

The “echo” command is useful for testing loops.
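As a sketch of such a test, the loop below only prints what it would process. The file names are made up and created in a scratch directory; the variable is written inside double quotes, which keeps a file name containing spaces together as one word.

```shell
cd "$(mktemp -d)"
touch a.sam b.sam notes.txt      # made-up file names

# Collect the names the loop visits; *.sam matches a.sam and b.sam only.
visited=""
for samfile in *.sam
do
    echo "would process: $samfile"   # dry run: print instead of acting
    visited="$visited $samfile"
done
echo "visited:$visited"
```

Once the printed names are the expected ones, the echo line can be replaced or followed by the real command.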

Now we will perform the “cut” operation we used previously over all of the .sam files using the for loop.

$ for samfile in *.sam

> do

>      echo $samfile

>      cut -f16 $samfile | sort | uniq -c

> done

View the results.

Exercise 4

Building on the previous exercise, write a loop to summarize the results in column 2 for all of the sam files in the sam directory.

Shell Scripts

This information can be saved in a file to be run on data from other experiments.

We will use the extension .sh to indicate a shell script.

$ vi samTagSummary.sh

Type i

then begin typing the commands below.

for samfile in *.sam

do

     echo $samfile

     cut -f16 $samfile | sort | uniq -c

done

Press the “esc” key on the upper left of your keyboard.

To save and exit

type :wq

To run the shell script

$ bash samTagSummary.sh
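A saved script operates on whatever directory it is run from, which is what makes it reusable for other experiments. The sketch below illustrates this with invented directories (exp1, exp2) and a hypothetical script name (listSam.sh); the script is written with cat and a "here document" instead of vi so that the example runs unattended.

```shell
top=$(mktemp -d)
mkdir "$top/exp1" "$top/exp2"
touch "$top/exp1/a.sam" "$top/exp2/b.sam" "$top/exp2/c.sam"

# Save a small script without opening an editor.
# The quoted 'EOF' keeps $samfile from being expanded while the file is written.
cat > "$top/listSam.sh" <<'EOF'
for samfile in *.sam
do
    echo "$samfile"
done
EOF

# *.sam is matched against the directory the script is run from.
cd "$top/exp1"
out1=$(bash ../listSam.sh)
cd "$top/exp2"
out2=$(bash ../listSam.sh)
echo "$out1"
echo "$out2"
```

The first run lists only a.sam and the second lists b.sam and c.sam, although the script file is the same.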

Exercise 5

Create a shell script to rename the sam files in the sam directory by adding the prefix nottrimmed- .

References

1.    Software Carpentry: The UNIX Shell, http://swcarpentry.github.io/shell-novice/

 

Questions, Comments:  Lynn Young, PhD