Mapping tutorial BME

The first step in nearly every next-gen sequence analysis pipeline is to map sequencing reads to a reference genome. In this tutorial we'll run some common mapping tools on TACC.

The world of read mappers seems to be settling down a bit after being a bioinformatics Wild West where there was a new gun in town every week that promised to be a faster and more accurate shot than the current record holder. Things seem to have reached the point where there is mainly a trade-off between speed, accuracy, and configurability among read mappers that have remained popular.

There are over 50 read mapping programs listed here. We're going to (mainly) stick to BWA in this class.

Each mapper has its own set of limitations (on the lengths of reads it accepts, on how it outputs read alignments, on how many mismatches there can be, on whether it produces gapped alignments, on whether it supports SOLiD colorspace data, etc.).

Mapping tools summary

Tool

TACC

Version

Download

Manual

BWA

module load bwa/0.6.2

0.6.1; 0.6.2

link

link

See if you can find some other modules on Lonestar that pertain to Alignment

 Solution
login1$ module key Alignment
----------------------------------------------------------------------------
The following modules match your search criteria: "Alignment"
----------------------------------------------------------------------------

  beast: beast/1.7, beast/1.7.2
    Tool for Bayesian MCMC analysis of molecular sequences

  bismark: bismark/0.7.2, bismark/0.7.4
    bismark - A tool to map bisulfite converted sequence reads and
    determine cytosine

  blast: blast/2.2.26
    NCBI BLAST+ sequence alignment package. The program compares nucleotide
    or protein sequences to sequence databases and calculates the
    statistical significance of matches.

  bowtie: bowtie/0.12.8, bowtie/2.0.0b6
    Ultrafast, memory-efficient short read aligner

  bwa: bwa/0.6.1, bwa/0.6.2
    bwa - Burrows-Wheeler Alignment Tool

  clustalw2: clustalw2/2.1
    Tool for multiple sequence alignment

  gsnap: gsnap/20120320, gsnap/20120620
    gsnap - Genomic Short-read Nucleotide Alignment Program

  mafft: mafft/6.864, mafft/6.903
    Multiple alignment program for amino acid or nucleotide sequences

  mummer: mummer/3.23
    MUMmer - A modular system for the rapid whole genome alignment of
    finished or draft sequence

  plink: plink/1.07
    plink - Whole genome association analysis toolset

  tophat: tophat/1.4.1, tophat/2.0.4
    TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns
    RNA-Seq reads to mammalian-sized genomes using the ultra
    high-throughput short read aligner Bowtie, and then analyzes the
    mapping results to identify splice junctions between exons.

Obviously, the word Alignment pops up the secription of a few non-NGS modules, but you can see the utility in modules' keyword search capabilities.

Example: E. coli genome re-sequencing data

Before We Start

In order to save a lot of typing, and to allow us some flexibility in designing these courses, we will establish a UNIX shell variable BI to point to the current filesystem location of the BioITeam directory. For any shell you open that accesses Lonestar during today's tutorial, please enter the following command:

export BI=/corral-repl/utexas/BioITeam

We also have some handy scripts for you to use, but we need to add them to your path for convenience:

export PATH=$PATH:$BI/bin

Note: to see what that did, you can type

echo $PATH

and you should see the new BioITeam folder at the end. The PATH variable contains all the places Linux looks for all the commands you type.

Finally, we will use bioperl in our scripts, so let's load that into our environment:

module load perl bioperl

Data

The data files for this example are in the path:

$BI/ngs_course/intro_to_mapping/data

File Name

Description

Sample

SRR030257_1.fastq

Paired-end Illumina, First of pair, FASTQ format

Re-sequenced E. coli genome

SRR030257_2.fastq

Paired-end Illumina, Second of pair, FASTQ format

Re-sequenced E. coli genome

NC_012967.1.gbk

Reference Genome in Genbank format

E. coli B strain REL606

The easiest way to run the tutorial is to copy this entire directory to your $SCRATCH space and then run all of the commands from inside that directory. See if you can figure out how to do that. When you're in the right place, you should get output like this from the ls command.

login1$ ls
NC_012967.1.gbk  SRR030257_1.fastq  SRR030257_2.fastq
 If you need a little help click the triangle...
cds
cp -r $BI/ngs_course/intro_to_mapping/data intro_to_mapping
cd intro_to_mapping

Exercises

  • What basic linux commands could help us quickly peek at the files that were just copied to get an idea of their contents?
     Some commands that might be helpful...
    head
    tail
    wc -l
    
  • How many sequences are in the file SRR030257_1.fastq?
     Use a linux one-liner to get the answer
    grep ^@SRR030257 SRR030257_1.fastq | wc -l
    grep --count ^@SRR030257 SRR030257_1.fastq
    

    Or you could just count the number of lines and divide by 4!

    wc -l SRR030257_1.fastq
    
  • How many bases long are the reads in SRR030257_1.fastq?
     Use a linux one-liner to get the answer
    sed -n 2p SRR030257_1.fastq | awk -F"[ATCGatcg]" '{print NF-1}'
    

    ... and there must be more clever ways than this.

Converting sequence file formats

Occasionally you might download a sequence or have it emailed to you by a collaborator in one format, and then the program that you want to use demands that it be in another format. Why do they have to be so picky?

The bp_seqconvert.pl script that is installed as part of Bioperl is one helpful utility for converting between many common sequence formats. On TACC, the Bioperl modules are installed, but the helper script isn't. So, we've put it in a place that you can run it from for your convenience. However, remember that any time that you use the script you must have the bioperl module loaded. We did this earlier.

Run the script without any arguments to get the help message:

bp_seqconvert.pl

Exercises

The file NC_012967.1.gbk is in Genbank format. The files SRR030257_*.fastq are in FASTQ format.

  • Convert NC_012967.1.gbk to FASTA format. Call the output NC_012967.1.fasta.
    • Does EMBL format have sequence features (like genes) annotated?
       Example commands. Don't peek unless you have to...
      bp_seqconvert.pl --from genbank --to fasta < NC_012967.1.gbk > NC_012967.1.fasta
      head -n 100 NC_012967.1.fasta
      

      You might get an error or a warning, even if the script executed correctly.

    • What information was lost by this conversion?
       Example commands. Don't peek unless you have to...
      head SRR030257_1.fastq
      head SRR030257_1.fasta
      

      The line of funny ASCII characters was lost. Those are your "base quality scores".

Extra reading

 What's in a Genbank file?
 What's the difference between Genbank and a FASTA files?

The first portion of a Genbank file contains information about "features" of the genome, like genes. The second part contains the actual bases of the reference sequence. Therefore, a Genbank essentially has an embedded FASTA file inside it. There are also a lot of nice statistics and metadata, like the size of the sequence and its base composition in the GenBank header.

Mapping with BWA

BWA (the Burrows-Wheeler Aligner) is a fast mapping program. It's the successor to another aligner you might have used or heard of called MAQ (Mapping and Assembly with Quality).

To use BWA, we first need to load the software into our environment using TACC's module system.

Load the module:

module load bwa

There are multiple versions of BWA on TACC, so you might want to check which one you have loaded for when you write up your awesome publication that was made possible by your analysis of next-gen sequencing data.

 Here are some commands that could help...
module key bwa
module list
bwa

Make sure you're in the intro_to_mapping directory (or wherever you put the data files):

cd $SCRATCH/intro_to_mapping

Try to figure out how to index and map from the command line help:

bwa

You will need to run this set of commands (with options that you should try to figure out) in this order:

bwa index
bwa aln
bwa samse or sampe

What's going on at each step?

Remember to use the option that enables multithreading, if there is one, for each BWA command.

Indexing

First, run the index command (index) on the reference file. This is fast, so you can run it interactively.

 Why do we create an index?

Like an index for a book (in the olden days before Kindles and Nooks), creating an index for a computer database allows quick access to any "record" given a short "key". In the case of mapping programs, creating an index for a reference sequence allows it to more rapidly place a read on that sequence at a location where it knows at least a piece of the read matches perfectly or with only a few mismatches. By jumping right to these spots in the genome, rather than trying to fully align the read to every place in the genome, it saves a ton of time.

Indexing is a separate step in running most mapping programs because it can take a LONG time if you are indexing a very large genome (like our own overly complicated human genome). Furthermore, you only need to index a genome sequence once, no matter how many samples you want to map. Keeping it as a separate step means that you can skip it later when you want to align a new sample to the same reference sequence.

BWA doesn't give you a choice of where to create your index files. It creates them in the same directory as the FASTA that you input. Run the index command using the copied FASTA as input.

 Here's the full command line if you can't figure it out...
bwa index NC_012967.1.fasta

Take a look at your output directory using the ls command to see what new files appear after indexing.

Mapping

Then, run the mapping command (aln). Note that you need to map each set of reads in the pairs separately with BWA because of how it separates the initial mapping and the later alignment steps.

Submit to the TACC queue

CPU intensive jobs must run on compute nodes, not the head node. We typically run jobs on compute nodes by submitting them to the queue.

Make a file called bwa_job.sge, and put this into the file:

#!/bin/bash

#$ -V
#$ -cwd
#$ -pe 12way 12
#$ -q normal
#$ -l h_rt=00:30:00
#$ -A BME_2012
#$ -o output.$JOB_ID
#$ -e error.$JOB_ID
##$ -m be
##$ -M <your email would go here>
#$ -N align_bwa_01

module load bwa/0.6.2

bwa aln -t 12 -f SRR030257_1.sai NC_012967.1.fasta SRR030257_1.fastq
bwa aln -t 12 -f SRR030257_2.sai NC_012967.1.fasta SRR030257_2.fastq

bwa sampe -f SRR030257.sam NC_012967.1.fasta SRR030257_1.sai SRR030257_2.sai SRR030257_1.fastq SRR030257_2.fastq

Why did we use -t 12 for multithreading? Lonestar has 12 CPU cores per node, so we want to use them all.

Submit the job to the queue

The file we just made is a job submission script. It lets us run jobs on powerful compute nodes. To actually submit the job, we use the qsub command to send it to the scheduler. The scheduler then runs our job on the first available compute node(s). Your qsub command will look something like this:

qsub bwa_job.sge

To see if our job is in the queue, try typing this:

showq -u

or alternately, you can use:

qstat

Using one of those, you can see if you job is waiting or running. When it is complete, it will disappear from the list, and you will see the new output in your directory

After the job runs

Take a look at your output directory using ls to see what new files have appeared. What is a *.sai file? It's a file containing "alignment seeds" in a file format specific to BWA. Many programs produce this kind of "intermediate" file in their own format and then at the end have tools for converting things to a "community" format shared by many downstream programs.

Alignment

Beyond aligning the sequences, we need to extend these seed matches into alignments of entire reads, choose the best matches, and convert the output to SAM format. We already put this in the job script above so that we only have to submit one job, but let's explain why.

Do we use sampe or samse? What's the difference?

 Answer

sam*pe* is for paired end reads, and sam*se* is single end.

Your final output file is in SAM format. It's just a text file, so you can peek at it and see what it's like inside. Two warnings though:

  1. SAM files can be enormously humongous text files (maybe >1 gigabytes). Attempting to open the entire file at once can cause your computer to lock up or your text editor to crash. You are generally safer only looking at a portion at a time using linux commands like head or grep or using a viewer like IGV, which we will cover later.
  2. SAM files have some rather complicated information encoded as text, like a binary encoded FLAGS field and CIGAR strings. We'll take a look at some of these later, if we have time.

Still, you should recognize some of the information on a line in a SAM file from the input FASTQ, and some of the other information is relatively straightforward to understand, like the position where the read mapped. Give this a try:

head SRR030257.sam

What do you think the 4th and 8th columns mean?

More reading about SAM files

Exercises

  • In our examples, we mapped in paired-end mode. Try to figure out how to map the reads in single-end mode and create this output.
    • Are there any options you can change so that more reads are mapped or that speed up performance without causing many fewer reads to be mapped?

What's Next

In this section, we've taken raw reads from a sequencer and have mapped them to a reference genome. Next, we'll look at variant calling to see how our sample that we sequenced differs from the reference. Some differences can cause important changes to the proteins they encode. Finding variants in the genome can give insight into health problems and their pathways, they can help diagnose cancer and issues, etc.