Mapping with bowtie2 Tutorial
Overview
The first step in nearly every next-gen sequence analysis pipeline is to map sequencing reads to a reference genome. In this tutorial we'll run some common mapping tools on TACC.
The world of read mappers seems to be settling down a bit after being a bioinformatics Wild West where there was a new gun in town every week that promised to be a faster and more accurate shot than the current record holder. Things seem to have reached the point where there is mainly a trade-off between speed, accuracy, and configurability among read mappers that have remained popular. There are over 50 read mapping programs listed here. Each mapper has its own set of limitations (on the lengths of reads it accepts, on how it outputs read alignments, on how many mismatches there can be, on whether it produces gapped alignments, on whether it supports SOLiD colorspace data, etc.). As evidence of how things are settling down, we're going to (mainly) stick to just bowtie2 in this course.
Other read mappers
Learning Objectives
This tutorial covers the commands necessary to use several common read mapping programs.
- Become comfortable with the basic steps of indexing a reference genome, mapping reads, and converting output to
SAM/BAM
format for downstream analysis. - Use bowtie2 and BWA to map reads from an E. coli Illumina data set to a reference genome and compare the output.
Theory
Please see the Introduction to mapping presentation for more details of the theory behind read mapping algorithms and critical considerations for using these tools correctly.
Mapping tools summary
The tutorial currently available on the Lonestar cluster at TACC is as follows:
Modules also exist at the current time for: bwa, bowtie, and SHRiMP.
Tutorial: E. coli genome re-sequencing data
The following DNA sequencing read data files were downloaded from the NCBI Sequence Read Archive via the corresponding European Nucleotide Archive record. They are Illumina Genome Analyzer sequencing of a paired-end library from a (haploid) E. coli clone that was isolated from a population of bacteria that had evolved for 20,000 generations in the laboratory as part of a long-term evolution experiment (Barrick et al, 2009). The reference genome is the ancestor of this E. coli population (strain REL606), so we expect the read sample to have differences from this reference that correspond to mutations that arose during the evolution experiment.
Transferring Data
We have already downloaded data files for this example and put them in the path:
$BI/gva_course/mapping/data
File Name | Description | Sample |
---|---|---|
| Paired-end Illumina, First of pair, FASTQ format | Re-sequenced E. coli genome |
| Paired-end Illumina, Second of pair, FASTQ format | Re-sequenced E. coli genome |
| Reference Genome in Genbank format | E. coli B strain REL606 |
The easiest way to run the tutorial is to copy this entire directory into a new folder called "bowtie2mapping" on your $SCRATCH space and then run all of the commands from inside that directory. See if you can figure out how to do that. When you're in the right place, you should get output like this from the ls
command.
tacc:~$ ls NC_012967.1.gbk SRR030257_1.fastq SRR030257_2.fastq
Useful commands
Often you will have general questions about your sequencing files that you want to answer before or after starting your actual analysis. Here we show you some very handy commands after a warning:
Beware the cat command when working with NGS data
NGS data can be quite large, a single lane of an Illumina Hi-Seq run generates 2 files each with 100s of millions of lines. Printing all of that can take an enormous amount of time and may crash your terminal long before it finishes. If you find yourself in a seemingly endless scroll of sequence (or anything else for that matter) remember ctrl+c will kill whatever command you just executed
Linux 1 liners
Converting sequence file formats
Occasionally you might download a sequence or have it emailed to you by a collaborator in one format, and then the program that you want to use demands that it be in another format. Why do they have to be so picky?
The bp_seqconvert.pl
script that is installed as part of Bioperl is one helpful utility for converting between many common sequence formats. On TACC, the Bioperl modules are installed, but the helper script isn't. So, we've put it in a place that you can run it from for your convenience. However, remember that any time that you use the script you must have the bioperl module loaded. We also took care of this for you when we edited your ~/.profile_user
file in the Linux introduction.
Run the script without any arguments to get the help message:
bp_seqconvert.pl
Exercises
The file NC_012967.1.gbk
is in Genbank
format. The files SRR030257_*.fastq
are in FASTQ
format.
- Convert
NC_012967.1.gbk
toEMBL
format. Call the outputNC_012967.1.embl
.Does EMBL format have sequence features (like genes) annotated?
- Convert only the first 10,000 lines of
SRR030257_1.fastq
toFASTA
format.What information was lost by this conversion?
Mapping with bowtie2
Bowtie2 is a complete rewrite of bowtie. It is currently the latest and greatest in the eyes of one very picky instructor (and his postdoc/gradstudent) in terms of configurability, sensitivity, and speed.
Create a fresh output directory named bowtie2. We are going to create a specific output directory for the bowtie2 mapper within the directory that has the input files so that you can compare the results of other mappers if you choose to do the other tutorials.
Next, load the bowtie2 module (we use module spider
to get the current name, which may not be bowtie/2.1.0
if you re-run this tutorial):
Now convert the reference file from GenBank to FASTA using what you learned above. Name the new output file NC_012967.1.fasta
and put it in the same directory as NC_012967.1.gbk
.
Generally speaking, the first step in mapping is quite often indexing the reference file regardless of what mapping program is used. Put the output of this command into the bowtie
directory we created a minute ago. The command you need is:
bowtie2-build
Try typing this alone in the terminal and figuring out what to do from the help show just from typing the command by itself.
Why do so many different mapping programs create an index as a first step you may be wondering?
Like an index for a book (in the olden days before Kindles and Nooks), creating an index for a computer database allows quick access to any "record" given a short "key". In the case of mapping programs, creating an index for a reference sequence allows it to more rapidly place a read on that sequence at a location where it knows at least a piece of the read matches perfectly or with only a few mismatches. By jumping right to these spots in the genome, rather than trying to fully align the read to every place in the genome, it saves a ton of time.
Indexing is a separate step in running most mapping programs because it can take a LONG time if you are indexing a very large genome (like our own overly complicated human genome). Furthermore, you only need to index a genome sequence once, no matter how many samples you want to map. Keeping it as a separate step means that you can skip it later when you want to align a new sample to the same reference sequence.
Finally, map the reads! The command you need is:
bowtie2
Try reading the help to figure out how to run the command yourself. This command takes a while (~5 minutes). This is longer than we want to run a job on the head node (especially when all of us are doing it at once). In fact, TACC noticed the spike in usage last time we taught the class and we got in trouble.
So, you will want to submit the full job to the cluster like you learned in the introduction.
But first, try to figure out the command and start it in interactive mode. Remember these are paired-end reads. Use control-c
to stop the job once you are sure it is running without an immediate error! Then, submit your command that is working to the TACC queue.
Submit to the TACC queue or run in an idev shell
Create a commands
file and use launcher_creator.py followed by qsub.
Your final output file is in SAM format. It's just a text file, so you can peek at it and see what it's like inside. Two warnings though:
- SAM files can be enormously humongous text files (maybe >1 gigabytes). Attempting to open the entire file at once can cause your computer to lock up or your text editor to crash. You are generally safer only looking at a portion at a time using linux commands like
head
orgrep
or using a viewer like IGV, which we will cover later. - SAM files have some rather complicated information encoded as text, like a binary encoded FLAGS field and CIGAR strings. We'll take a look at some of these later, if we have time.
Still, you should recognize some of the information on a line in a SAM file from the input FASTQ, and some of the other information is relatively straightforward to understand, like the position where the read mapped. Give this a try:
head bowtie2/SRR030257.sam
More reading about SAM files
Multithreaded execution
We have actually massively under-utilized Lonestar in this example. We submitted a job that reserved a single node on the cluster, but that node has 12 processors. Bowtie was only using one of those processors (a single "thread")! For programs that support multithreaded execution (and most mappers do because they are obsessed with speed) we could have sped things up by using all 12 processors for the bowtie process.
If you want to launch many processes as part of one job, so that they are distributed one per node and use the maximum number of processors available, then you need to learn about the "wayness" of how you request nodes on Lonestar and possibly edit your *.sge script.
One consequence of using multithreading that might be confusing is that the aligned reads might appear in your output SAM file in a different order than they were in the input FASTQ. This happens because small sets of reads get continuously packaged, "sent" to the different processors, and whichever set "returns" fastest is written first. You can force them to appear in the same order (at a slight cost in speed) by adding the --reorder
flag to your command, but is typically only necessary if the reads are already ordered or you intend to do some comparison between the input and output.
Optional Exercises
In the bowtie2 example, we mapped in
--local
mode. Try mapping in--end-to-end
mode (aka global mode).- Do the BWA tutorial so you can compare their outputs.
- Did bowtie2 or BWA map more reads?
- In our examples, we mapped in paired-end mode. Try to figure out how to map the reads in single-end mode and create this output.
- Which aligner took less time to run? Are there any options you can change that:
- Lead to a larger percentage of the reads being mapped? (increase sensitivity)
- Speed up performance without causing many fewer reads to be mapped? (increase performance)
Next steps...
The next steps are often to view the output using a specific viewer on your local machine, or to begin identifying variant locations where the reads differ from the reference sequence. These will be the next things we cover in the course. Here is a link to help you return to the GVA 2015 course schedule.
Welcome to the University Wiki Service! Please use your IID (yourEID@eid.utexas.edu) when prompted for your email address during login or click here to enter your EID. If you are experiencing any issues loading content on pages, please try these steps to clear your browser cache.