Overview

Once you know you are working with the best quality data possible (see the Evaluating Raw Sequencing data tutorial), the first step in nearly every NGS analysis pipeline is to map sequencing reads to a reference genome. In this tutorial we'll explore the basic principles of read mapping using bowtie2 on TACC.

The world of read mappers is settling down after being a bioinformatics Wild West where there was a new gun in town every week that promised to be a faster and more accurate shot than the current record holder. Things seem to have reached the point where there is mainly a trade-off between speed, accuracy, and configurability among read mappers that have remained popular. There are over 50 read mapping programs listed here. Each mapper has its own set of limitations (on the lengths of reads it accepts, on how it outputs read alignments, on how many mismatches there can be, on whether it produces gapped alignments, etc). It is possible a different read mapper would be better for your set of experiments. More will be discussed about selecting a good tool on Friday.

Learning Objectives

This tutorial covers the commands necessary to use bowtie2 to map reads to a reference genome, and concepts applicable to many more mappers.

Become comfortable with the basic steps of indexing a reference genome, mapping reads, and converting output to SAM/BAM format for downstream analysis.
Use bowtie2 to map reads from an E. coli Illumina data set to a reference genome and compare the output.

Theory

Please see the Introduction to mapping presentation on the course outline for more details of the theory behind read mapping algorithms and critical considerations for using these tools and references correctly.

Tutorial: E. coli genome re-sequencing data

The following DNA sequencing read data files were downloaded from the NCBI Sequence Read Archive via the corresponding European Nucleotide Archive record. They are Illumina Genome Analyzer sequencing of a paired-end library from a (haploid) E. coli clone that was isolated from a population of bacteria that had evolved for 20,000 generations in the laboratory as part of a long-term evolution experiment (Barrick et al, 2009). The reference genome is the ancestor of this E. coli population (strain REL606), so we expect the read sample to have differences from this reference that correspond to mutations that arose during the evolution experiment.

Transferring Data

Rather than having to download these files from the SRA (NCBI) or the ENA, these data files are available in the following directory:

$BI/gva_course/mapping/data

You may recognize these as the same files we used for the fastqc and cutadapt tutorial. If you chose to improve the quality of R2 reads using cutadapt as you did for R1 in the tutorial, you could use the improved reads in this tutorial to see what a difference the improved reads can make for read mapping.

File Name	Description	Sample
`SRR030257_1.fastq`	Paired-end Illumina, First of pair, FASTQ format	Re-sequenced E. coli genome
`SRR030257_2.fastq`	Paired-end Illumina, Second of pair, FASTQ format	Re-sequenced E. coli genome
`NC_012967.1.gbk`	Reference Genome in Genbank format	E. coli B strain REL606

The easiest way to run the tutorial is to copy this entire directory into a new folder called "GVA_bowtie2_mapping" on your $SCRATCH space and then run all of the commands from inside that directory. See if you can figure out how to do that. When you're in the right place, you should get output like this from the ls command.

tacc:/scratch/<#>/<UserName>/GVA_bowtie2_mapping$ ls
NC_012967.1.gbk  SRR030257_1.fastq  SRR030257_2.fastq  SRR030257_2.fastq.gz

Remember that copying an entire folder requires the recursive (-r) option.

cds
cp -r $BI/gva_course/mapping/data GVA_bowtie2_mapping
cd GVA_bowtie2_mapping
ls
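
If you want to see what -r does before touching the course data, the same copy pattern can be tried on a throwaway directory. This is a minimal sketch: the directory and file names are hypothetical stand-ins created with mktemp, not the course files.

```shell
# Build a small throwaway source directory (hypothetical stand-in for the course data folder).
workdir=$(mktemp -d)
mkdir -p "$workdir/data"
touch "$workdir/data/reads_1.fastq" "$workdir/data/reads_2.fastq"

# -r copies the directory and everything inside it under the new name.
cp -r "$workdir/data" "$workdir/my_copy"

# List the copy; both files should be present.
copied_files=$(ls "$workdir/my_copy")
echo "$copied_files"

# Clean up the throwaway directory.
rm -rf "$workdir"
```

Without -r, cp refuses to copy a directory and reports that it was omitted.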

Reminders about working with sequencing files

Beware the cat command when working with NGS data

NGS data can be quite large: a single lane of an Illumina Hi-Seq run generates 2 files, each with 100s of millions of lines. Printing all of that can take an enormous amount of time and will likely crash your terminal long before it finishes. If you find yourself in a seemingly endless scroll of sequence (or anything else for that matter), remember control+c will kill whatever command you just executed.

If hitting control+c several times doesn't work, control+z will stop the process; you then need to kill the process using kill %1. If control+z doesn't work, you may be best off closing the window, opening a new window, logging back in, and picking up where you left off. Note that for the purpose of this class, you should make sure to restart an idev session.

Remember, from the introduction tutorial, there are multiple ways to look at our sequencing files without using cat:

Command	useful for	bad if
head	seeing the first lines of a file (10 by default)	file is binary
tail	seeing the last lines of a file (10 by default)	file is binary
cat	print all lines of a file to the screen	the file is big and/or binary
less	opens the entire file in a separate program but does not allow editing	if you are going to type a new command based on the content, or forget the q key exits the view, or file is binary
more	prints 1 page worth of a file to the screen, can hold enter key down to see next line repeatedly. Contents will remain when you scroll back up.	you forget that the q key stops looking at the file, or file is binary

How to determine how many reads are in a fastq file

grep -c "^+$" SRR030257_1.fastq
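
The grep command above counts separator lines that are exactly "+". Some FASTQ files repeat the read name after the "+", in which case grep would undercount. A minimal alternative sketch, assuming every record is exactly 4 lines, divides the line count by 4. The tiny file below is made up for illustration.

```shell
# Write a tiny three-read FASTQ file (made-up reads, for illustration only).
fastq=$(mktemp)
cat > "$fastq" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
TTGCAAGC
+read2
+IIIIIII
@read3
GGGTTTAA
+
IIIIIIII
EOF

# Each FASTQ record is 4 lines, so reads = total lines / 4.
total_lines=$(wc -l < "$fastq")
read_count=$(( total_lines / 4 ))
echo "reads by line count: $read_count"

# The grep approach only counts separator lines that are exactly "+",
# so it misses read2, whose separator line is "+read2".
grep_count=$(grep -c '^+$' "$fastq")
echo "reads by grep: $grep_count"

rm -f "$fastq"
```

On the course files both approaches should agree; if they ever disagree, the file is worth a closer look with head.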

How to determine how long the reads are in a fastq file

sed -n 2p SRR030257_1.fastq | awk -F"[ATCGNatcgn]" '{print NF-1}'
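
The command above reports the length of the first read only. If reads have been trimmed (for example with cutadapt), lengths can vary, so it may help to tally every read. This is a sketch assuming 4-line records, shown on a made-up file.

```shell
# Tiny made-up FASTQ file with reads of two different lengths.
fastq=$(mktemp)
cat > "$fastq" <<'EOF'
@read1
ACGTACGT
+
IIIIIIII
@read2
ACGTAC
+
IIIIII
@read3
TTGCAAGC
+
IIIIIIII
EOF

# Line 2 of every 4-line record is the sequence; print its length,
# then tally how many reads have each length (output: length <tab> count).
length_table=$(awk 'NR % 4 == 2 { print length($0) }' "$fastq" | sort -n | uniq -c | awk '{ print $2 "\t" $1 }')
echo "$length_table"

rm -f "$fastq"
```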

Converting sequence file formats

Occasionally you might download a sequence or have it emailed to you by a collaborator in one format, and then the program that you want to use demands that it be in another format. Why do programs have to be so picky? Everybody has their own favorite formats and/or those that they are most familiar with, but humans can typically pick the information they need out of comparable formats. Programs can only be written to assume a single format (or allow you to specify a format if the author is particularly generous), and can only find things in specific locations based on that format.

While you could write your own sequence converter, hopefully it jumps out at you that this is something someone else must have done before. In situations like this, you can often spend a few minutes on Google finding a Stack Overflow question/answer that deals with something like this. Some will be in reference to how to code such things, but the particularly useful ones will be the ones that point to a program or repository where someone has already done this for you.

In this case the bp_seqconvert.pl perl script is included as part of the bioperl module package. Rather than attempt to find it as part of a conda package or in some other repository, we will use the module version. If you need this script in the future outside of TACC, see https://metacpan.org/dist/BioPerl/view/bin/bp_seqconvert.

Load the bioperl module and run the script without any options to display the help contents

module load bioperl
bp_seqconvert.pl

The information in this box is related to the path variable, perl programming libraries, having multiple copies of a script/file available in your path, and computer architecture. If you are not interested in this, you can skip this box.

On the head node, after you have loaded the bioperl module, there are actually 2 instances of bp_seqconvert.pl available to you.

module load bioperl/1.007002
which -a bp_seqconvert.pl

If you run this on an idev node you get 1 result, related to the bioperl module, but if you run it on the head node (outside idev) you get 2 results. On the head node, 1 points to the BioITeam space near where you keep finding your data (/corral-repl/utexas/BioITeam/), specifically the "bin" folder, which is full of binaries and (typically small) bash/python/perl/R scripts that someone has written to help the TACC community. The other is in a folder specifically associated with the bioperl module. You can load and unload the bioperl module to see the difference.

Why do you get 2 different results depending on whether you are inside or outside of an idev node?

This has to do with how compute nodes are configured. On stampede2 compute nodes, /corral-repl/ and all of its subdirectories are not accessible, so even though the BioITeam is in your $PATH, the command line on the compute node can't access it. This is why in later tutorials you have to log out of the idev session to copy new raw data files to work with.

If you try to run the BioITeam version of the script (/corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl) from the head node without the bioperl module loaded, you get an error message similar to the following:

module unload bioperl
bp_seqconvert.pl

Can't locate Bio/SeqIO.pm in @INC (@INC contains: /corral-repl/utexas/BioITeam//local/share/perl5 /corral-repl/utexas/BioITeam//perl5/lib/perl5/x86_64-linux-thread-multi /corral-repl/utexas/BioITeam//perl5/lib/perl5 /corral-repl/utexas/BioITeam//perl5/lib64/perl5/auto /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl line 8.
BEGIN failed--compilation aborted at /corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl line 8.

Deciphering error messages

The above error message is pretty helpful, but much less so if you are not familiar with perl. As I doubt anyone in the class is more familiar with perl than I am, and I am hardly familiar with perl at all, this is a great opportunity to explain how I would tackle the error message to figure out what is going on.

"compilation aborted at /corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl line 8."
1. The last line here actually tells us that the script did not get very far, only to line 8.
2. My experience with other programming languages tells me that the beginning of a script is all about checking that the script has access to all the things it needs to do what it is intended to do, so this has me thinking some kind of package might be missing.
"(@INC contains: ..."
1. This reads like the PATH variable, but is locations I don't recognize as being in my path, suggesting this is not some external binary or other program.
2. Many of the individual pathways list "lib" in one form or another. This reinforces the idea from above that some kind of package is missing.
"Can't locate Bio/SeqIO.pm in @INC"
1. "Can't locate" reads like a plain text version of something being missing, and like something generic that is not related to my system/environment (like all the listed directories), and not related to the details of the script I am trying to run (like the last line that details the name of the script we tried to invoke).
2. This is what should be googled for help solving the problem.
  1. The google results list similar error messages associated with different repositories/programs (github issues), suggesting some kind of common underlying problem.
  2. The 3rd result https://www.biostars.org/p/345331/ reads like a generic problem and sure enough the answers detail needing to have the Bio library installed from cpan (perl's package management system)

We get this error message because while perl is installed on stampede2, the required SeqIO.pm library is not available by default, but it is easily made available by loading the bioperl module. As it is likely rare that you will need to convert sequence files between different formats, bioperl is not listed as one of the modules in the .bashrc file in your $HOME directory that you set up yesterday, but if you find yourself using the command `module load bioperl` often, you may want to add it.

On the head node, after loading the bioperl module, you have access to the program in 2 different locations.

module load bioperl
which -a bp_seqconvert.pl

How does the computer know which location to use?

It will use whatever location it finds earliest in the $PATH,
which is the same as the top line in the which -a command output,
which is the same as the line printed if you run the `which` command without the "-a".

Using just the script name by itself will run whichever copy is found first, but you can always force the computer to use a given copy by specifying the full path to the copy you want. Thus, the following 2 commands are not equal:

/corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl 
/home1/apps/bioperl/1.007002/bin/bp_seqconvert.pl

While the commands are different, both copies can use the same bioperl library SeqIO.pm when the bioperl module is loaded and thus work.
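
The PATH lookup described above can be tried safely without the BioITeam or bioperl copies. The sketch below uses two throwaway directories and a hypothetical script name (demo_tool.sh), and changes PATH only inside a subshell so your real environment is untouched.

```shell
# Create two throwaway directories, each holding a script with the same (hypothetical) name.
dir_a=$(mktemp -d)
dir_b=$(mktemp -d)
printf '#!/bin/sh\necho "copy A"\n' > "$dir_a/demo_tool.sh"
printf '#!/bin/sh\necho "copy B"\n' > "$dir_b/demo_tool.sh"
chmod +x "$dir_a/demo_tool.sh" "$dir_b/demo_tool.sh"

# Put dir_a ahead of dir_b in PATH (inside the $( ) subshell only),
# then call the script by its bare name: the earliest PATH entry wins.
first_found=$(PATH="$dir_a:$dir_b:$PATH"; demo_tool.sh)
echo "bare name runs: $first_found"

# A full path always runs exactly the copy you name.
forced=$("$dir_b/demo_tool.sh")
echo "full path runs: $forced"

rm -rf "$dir_a" "$dir_b"
```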

Convert a gbk reference to an embl reference

Convert the Genbank file NC_012967.1.gbk to EMBL format, and name this new file NC_012967.1.embl.

Try reading through the program help that is displayed when you run bp_seqconvert.pl without any options to see the syntax required

module load bioperl
bp_seqconvert.pl --from genbank --to embl < NC_012967.1.gbk > NC_012967.1.embl
head -n 100 NC_012967.1.embl

It is somewhat frustrating or confusing that this command does not give us any output saying it was successful. The fact that you get your prompt back is often the only sign the computer has finished doing something.
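
When a command is silent, two quick checks can give some reassurance: the exit status of the command, and whether the output file is non-empty. The sketch below uses tr purely as a stand-in for a quiet command (it is not a sequence converter), and the temporary file names are placeholders.

```shell
# Stand-in for a quiet converter: 'tr' is only a placeholder for a command
# that prints nothing on success.
infile=$(mktemp)
outfile=$(mktemp)
printf 'example input\n' > "$infile"

tr 'a-z' 'A-Z' < "$infile" > "$outfile"
status=$?   # 0 means the command reported success

# -s is true only when the file exists and is not empty.
if [ "$status" -eq 0 ] && [ -s "$outfile" ]; then
  result="ok"
else
  result="failed"
fi
echo "conversion check: $result (exit status $status)"

rm -f "$infile" "$outfile"
```

The same two checks could be applied right after the bp_seqconvert.pl command, using NC_012967.1.embl as the output file. A zero exit status and a non-empty file do not prove the content is correct, so looking at the file is still worthwhile.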

Does EMBL format have sequence features (like genes) annotated? The answer is near the top of the file but not within the first 10 lines. DO NOT check with the cat command.

Using the head command to check the first 100 lines

head -n 100 NC_012967.1.embl

Using the less command

less NC_012967.1.embl

Using the more command

more NC_012967.1.embl

Remember that you can quit the less and more views with the q key.

Converting from fastq to fasta format

Sometimes you only want to work with a subset of a full data file to check for overall trends, or to try out a new piece of code. Convert only the first 10,000 lines of SRR030257_1.fastq to FASTA format.

Remember you can use the "|" character to have the output of head feed directly into the bp_seqconvert.pl script.

head -n 10000 SRR030257_1.fastq | bp_seqconvert.pl --from fastq --to fasta > SRR030257_1.fasta

What information was lost by this conversion? Use the head command to look at the top of both the .fastq and .fasta file

head SRR030257_1.fastq
head SRR030257_1.fasta

The line of ASCII characters was lost. Remember, those are your "base quality scores". Many mappers will use the base quality scores to improve how the reads are aligned by not placing as much emphasis on poor bases.
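
To make the loss concrete, here is a minimal awk sketch of the same kind of conversion on a made-up two-read file. It is not how bp_seqconvert.pl works internally and its output layout may differ from that script's; it only shows which of the 4 lines per record survive.

```shell
# Tiny made-up FASTQ file.
fastq=$(mktemp)
cat > "$fastq" <<'EOF'
@read1
ACGTACGT
+
IIIIHHGF
@read2
TTGCAAGC
+
IIIIIIII
EOF

# Keep line 1 (header, with "@" swapped for ">") and line 2 (sequence) of each record;
# lines 3 and 4 (separator and quality string) are dropped.
fasta=$(awk 'NR % 4 == 1 { print ">" substr($0, 2) } NR % 4 == 2 { print }' "$fastq")
echo "$fasta"

rm -f "$fastq"
```

Because the quality string is simply discarded, there is no way to go back from FASTA to the original FASTQ.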

Mapping with bowtie2

Bowtie2 is a complete rewrite of an older program bowtie. In terms of configurability, sensitivity, and speed it is useful for a wide range of projects. After years of teaching bwa mapping along with bowtie2, bowtie2 alone is now taught as I never recommend anyone use bwa, and based on positive feedback we continue with this set up. For some more details about how read mappers work see the bonus presentation about read mapping details and file formats on the course home page, and if you find a compelling reason to use bwa (or any other read mapper) rather than bowtie2 after the course is over, I'd love to hear from you.

Create a fresh output directory named bowtie2. We are going to create a specific output directory for the bowtie2 mapper within the directory that has the input files so that you can compare the results of other mappers if you choose to do the other tutorials.

Commands for making a directory named bowtie2

mkdir bowtie2

First you need to convert the reference file from GenBank to FASTA using what you learned above. Name the new output file NC_012967.1.fasta and put it in the same directory as NC_012967.1.gbk.

Use what you learned above working with the bp_seqconvert.pl script to convert the entire .gbk file into a .fasta file

bp_seqconvert.pl --from genbank --to fasta < NC_012967.1.gbk > NC_012967.1.fasta

Bowtie2 installation

While you could consult a previous year's tutorial for installing bowtie2 via the module system, this year's course will use the conda system to install it. The bowtie2 home page can be found here, and if you needed to download the program itself, version 2.5.1 could be downloaded here. Instead, we want to make sure bowtie2 version 2.5.1 is installed via conda, like we did for fastqc and cutadapt. See if you can figure out how to install bowtie2 into a new conda environment named "GVA-bowtie2-mapping". Note that "2" is actually part of the program name, neither a typo nor a comment on the program version.

Click here for a hint before the answer.

Remember that we want to use the https://anaconda.org/ search function and end up at the bowtie2 page: https://anaconda.org/bioconda/bowtie2. Like we discussed with the ls command in the introduction tutorial, we can combine creating a new environment while at the same time, telling it what programs we want to access inside that environment.

Remember, you likely need to specify the channel bowtie2 is in using the -c option

conda create -n GVA-bowtie2-mapping -c bioconda bowtie2
# enter "n" to cancel the installation and read on for more information

As mentioned in explaining why cutadapt installed version 1.18 instead of 4.4, the default anaconda channel and the bioconda channel do not always have all necessary program requirements to install the latest version of programs. In the list of new packages that were to be installed the following line lists that the bowtie2 version that will be installed will be 2.4.1:

bowtie2 bioconda/linux-64::bowtie2-2.4.1-py38he513fc3_0

While it may seem like installing a different version of the program is bad behavior, this is actually a huge benefit of conda. Often changes from version to version of a program are small and only affect subsets of the program, and the conda package installer is designed to find whatever way it can to get you a working version of the program. If we know there is a particular version we want (be it the newest version, or a previous version you want to use to maintain consistent behavior in a given data set), we can tell conda that we want that version. If conda can't install that version, it won't prompt you to proceed; it will just fail.

What would this look like for bowtie2 2.5.1?

Command specifying that bowtie2 2.5.1 is required

conda create -n GVA-bowtie2-mapping -c bioconda bowtie2=2.5.1

Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: | 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed                                                                                                                                                                                                                                        

UnsatisfiableError:

Since we don't have a lot of information about what is causing the conflict with bowtie2 version 2.5.1, a simple step to try is to give the installation access to conda-forge. Similar to bioconda, conda-forge is a channel that is community run rather than company run; it is both more nimble at including new things and more expansive in the tools it includes. More information about conda-forge can be found here.

Command specifying that bowtie2 2.5.1 is required, with access to conda-forge

conda create -n GVA-bowtie2-mapping -c bioconda bowtie2=2.5.1 -c conda-forge
# enter Y to verify and proceed
conda activate GVA-bowtie2-mapping

If conda-forge and bioconda are such important channels, why are they not included by default?

Remember conda is a computational tool for all disciplines. In an optional tutorial on Friday, there will be information about how to permanently add different channels to be searched. In the meantime, specifying channels manually will help you gauge whether adding them is actually something you will benefit from in your own work.

How to check what version of bowtie2 was loaded?

We just went through a lot of work to make sure we installed the version that we wanted, but sometimes we need to work the other way and figure out what version we have already been working with, such as yesterday with cutadapt.

Recall many programs have a --version flag that can be used to retrieve this information, and conda has a 'list' function that takes additional optional arguments

bowtie2 --version
conda list bowtie2

The above should show you now have access to version 2.5.1. If you have a different version listed (such as 2.3.5 or 2.3.2) make sure you are using the conda installation with access to conda forge and not relying on the TACC module, and then get my attention for clarification.
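
A common reason for seeing an unexpected version (or no bowtie2 at all) is that the conda environment is not activated in the current terminal. The guarded check below is a small sketch that reports that case in plain words instead of a "command not found" error; the wording of the messages is my own.

```shell
# Report the bowtie2 version if it is on the PATH; otherwise say so instead of failing.
if command -v bowtie2 >/dev/null 2>&1; then
  # Only the first line of the --version output is kept here.
  version_report=$(bowtie2 --version | head -n 1)
else
  version_report="bowtie2 not found on PATH (is the conda environment activated?)"
fi
echo "$version_report"
```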

Building an index

IMPORTANT

The following command is extremely taxing to the head node and thus means we should not run it on the head node (especially when all of us are doing it at once). In fact, in previous years, TACC has noticed the spike in usage when multiple students forgot to make sure they were on idev nodes and complained pretty forcefully to us about it. Let's not have this be one of those years. Use the hostname or showq -u command to make sure you are on an idev node.

Command	on idev node	on head node
hostname	lists a compute node starting with a C followed by a number before "stampede2.tacc.utexas.edu"	lists a login node plus number before "stampede2.tacc.utexas.edu"
showq -u	-bash: showq: command not found	shows you a summary of jobs you have. (very likely empty during these tutorials)

If you are not sure if you are on an idev node or are seeing other output with one or both commands, speak up on zoom and I'll show(q) -u what you are looking for. Yes, your instructor likes bad puns. My apologies.

If you are not on an idev node, and need help to relaunch it, click over to the idev tutorial.
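
The hostname check in the table above can be wrapped in a small helper if you want a one-word answer. This is a sketch only: node_kind is a hypothetical function, the patterns simply follow the table's description (compute nodes start with a C and a number, login nodes with "login" and a number), and the example host names are made up.

```shell
# node_kind is a hypothetical helper, not a TACC command.
# It classifies a host name using the patterns described in the table.
node_kind() {
  case "$1" in
    [cC][0-9]*)  echo "compute" ;;
    login[0-9]*) echo "login" ;;
    *)           echo "unknown" ;;
  esac
}

# Check the machine you are on right now (falls back to uname -n if hostname is unavailable).
this_host=$(hostname 2>/dev/null || uname -n)
echo "this host ($this_host) looks like: $(node_kind "$this_host")"

# Made-up host names showing the two expected cases.
example_compute=$(node_kind "c123-456.stampede2.tacc.utexas.edu")
example_login=$(node_kind "login1.stampede2.tacc.utexas.edu")
echo "examples: $example_compute $example_login"
```

If the answer for your own host is "login" or "unknown", do not run the indexing command until you have confirmed you are in an idev session.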

For many read mappers, the first step in mapping reads to a genome is indexing the reference file. Put the output of this command into the bowtie2 directory we created a minute ago. The command you need is:

bowtie2-build

Try typing this alone in the terminal and figuring out what to do from the help shown when you type the command by itself.

If you're stuck click here for an explanation of what arguments the command does need

The command requires 2 arguments. The first argument is the reference sequence in FASTA format. The second argument is the "base" file name to use for the created index files. It will create a bunch of files beginning bowtie2/NC_012967.1*.

Click here to check your work, or for the answer if needed

bowtie2-build NC_012967.1.fasta bowtie2/NC_012967.1

Take a look at your output directory using ls bowtie2 to see what new files have appeared. These files are binary files, so looking at them with head or tail isn't instructive and can cause issues with your terminal. If you insist on looking at them (or accidentally do so before you read this) and your terminal begins behaving oddly, simply close it and log back into stampede2 with a new terminal, and start a new idev session.

You may be wondering why creating an index is a common first step for many different mapping programs.

Like an index for a book (in the olden days before Kindles and Nooks), creating an index for a computer database allows quick access to any "record" given a short "key". In the case of mapping programs, creating an index for a reference sequence allows it to more rapidly place a read on that sequence at a location where it knows at least a piece of the read matches perfectly or with only a few mismatches. By jumping right to these spots in the genome, rather than trying to fully align the read to every place in the genome, it saves a ton of time.

Indexing is a separate step in running most mapping programs because it can take a LONG time if you are indexing a very large genome (like our own overly complicated human genome). Furthermore, you only need to index a genome sequence once, no matter how many samples you want to map. Keeping it as a separate step means that you can skip it later when you want to align a new sample to the same reference sequence.
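
Since the index only needs to be built once, a script can check for it before calling bowtie2-build again. The sketch below assumes the small-genome index naming, where bowtie2-build writes files such as <prefix>.1.bt2; index_exists is a hypothetical helper, and the status messages are my own wording.

```shell
# index_exists is a hypothetical helper: it uses the first index file,
# <prefix>.1.bt2, as a marker that bowtie2-build has already been run.
index_exists() {
  [ -e "$1.1.bt2" ]
}

prefix="bowtie2/NC_012967.1"
if index_exists "$prefix"; then
  index_status="index already present, skipping bowtie2-build"
elif command -v bowtie2-build >/dev/null 2>&1 && [ -e NC_012967.1.fasta ]; then
  # Only reached when the program and the FASTA reference are both available.
  if bowtie2-build NC_012967.1.fasta "$prefix"; then
    index_status="index built"
  else
    index_status="bowtie2-build failed"
  fi
else
  index_status="no index found, and bowtie2-build or the FASTA file is not available here"
fi
echo "$index_status"
```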

Mapping reads

Again, try reading the help for the bowtie2 command to figure out how to run the command yourself. Remember these are paired-end reads.

bowtie2

It is important that you use 8 processors when doing this mapping due to course time constraints.

Solution

bowtie2 -t -p 8 -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam  
# the -t option is not required for the mapping, but it can be particularly informative when you begin comparing different mappers

Command break down

Command portion	Purpose
-t	Print wall clock time each step takes.
-p 8	Use 8 processors. As discussed above and below this is selected so the command will finish in a reasonable amount of time
-x bowtie2/NC_012967.1	listing the location and name of the index we created above with the bowtie2-build command
-1 SRR030257_1.fastq	Read 1 file name (note if not using the -1 and -2 options reads would not be mapped in paired end mode)
-2 SRR030257_2.fastq	Read 2 file name (note if not using the -1 and -2 options reads would not be mapped in paired end mode)
-S bowtie2/SRR030257.sam	Output mapped reads in sam format at given location with given name
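
If you expect to map more than one sample, it can help to assemble the command from variables and print it before running anything. This is a sketch: the variable names are my own, while the options and file names are the ones from the table above.

```shell
# Hypothetical variable names; the values are the ones used in this tutorial.
sample="SRR030257"
index="bowtie2/NC_012967.1"
threads=8

# Assemble the command as text first so it can be inspected before running.
map_cmd="bowtie2 -t -p $threads -x $index -1 ${sample}_1.fastq -2 ${sample}_2.fastq -S bowtie2/${sample}.sam"
echo "$map_cmd"

# When the printed command looks right, it could be run (on an idev node) with:
#   eval "$map_cmd"
```

Changing only the sample variable would then produce the matching read and output file names for another data set that follows the same naming pattern.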

Your final output file is in SAM format. It's just a text file, so you can peek at it and see what it's like inside. Two warnings though:

SAM files can be enormously humongous text files (potentially measured in gigabytes). Attempting to open the entire file at once can cause your computer to lock up or your text editor to crash. You are generally safer only looking at a portion at a time using linux commands like head or grep or more or using a viewer like IGV, which we will cover in a later tutorial.
SAM files have some rather complicated information encoded as text, like a binary encoded FLAGS field and CIGAR strings. We'll take a look at some of these later, if we have time, or they are covered in the bonus presentation about read mapping and file formats which you can find on the home page.
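
As a small taste of the FLAG field mentioned above: it is a single number in which each bit answers a yes/no question about the read. The sketch below tests the commonly used bits with shell arithmetic. describe_flag is a hypothetical helper written for this example; the bit meanings follow the SAM format specification.

```shell
# describe_flag is a hypothetical helper that tests individual bits of a SAM FLAG value.
describe_flag() {
  flag=$1
  out=""
  [ $(( flag & 1 ))   -ne 0 ] && out="$out paired"
  [ $(( flag & 2 ))   -ne 0 ] && out="$out proper_pair"
  [ $(( flag & 4 ))   -ne 0 ] && out="$out unmapped"
  [ $(( flag & 8 ))   -ne 0 ] && out="$out mate_unmapped"
  [ $(( flag & 16 ))  -ne 0 ] && out="$out reverse_strand"
  [ $(( flag & 32 ))  -ne 0 ] && out="$out mate_reverse"
  [ $(( flag & 64 ))  -ne 0 ] && out="$out first_in_pair"
  [ $(( flag & 128 )) -ne 0 ] && out="$out second_in_pair"
  # Strip the leading space before printing.
  echo "${out# }"
}

# Two flag values that often appear together for a properly mapped pair.
flag_99=$(describe_flag 99)
flag_147=$(describe_flag 147)
echo "99:  $flag_99"
echo "147: $flag_147"
```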

Still, you should recognize some of the information on a line in a SAM file from the input FASTQ, and some of the other information is relatively straightforward to understand, like the position where the read mapped. Give this a try:

head bowtie2/SRR030257.sam

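What do you think the 4th and 8th columns mean (click for answer)?

If you thought the answer was the mapping coordinates of the read pairs, you were right! Column 4 is the position where the read itself mapped, and column 8 is the position where its mate mapped.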
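
To look at just those columns without scrolling through header lines, awk can skip lines that start with "@" and print selected fields. The sketch below runs on a made-up two-line SAM file (hypothetical reference name and positions); on your own data you would point the awk command at bowtie2/SRR030257.sam and pipe the result through head.

```shell
# Made-up SAM records for illustration; fields are tab-separated as in a real SAM file.
sam=$(mktemp)
printf '@HD\tVN:1.0\tSO:unsorted\n' > "$sam"
printf '@SQ\tSN:ref1\tLN:5000\n' >> "$sam"
printf 'read1\t99\tref1\t1000\t42\t8M\t=\t1200\t208\tACGTACGT\tIIIIIIII\n' >> "$sam"
printf 'read1\t147\tref1\t1200\t42\t8M\t=\t1000\t-208\tTTGCAAGC\tIIIIIIII\n' >> "$sam"

# Skip header lines (they start with "@"), then print read name,
# column 4 (read position) and column 8 (mate position).
positions=$(awk -F'\t' '!/^@/ { print $1, $4, $8 }' "$sam")
echo "$positions"

rm -f "$sam"
```

Notice that the two reads of the pair mirror each other: each one's column 8 is the other's column 4.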

Multithreaded execution

We have actually massively under-utilized stampede2 in this example: we ran the command using only 8 processors rather than the 48 we have available in our idev session. If we increase to 48 total processors and rerun the analysis, how long do you expect the command to take?

Modify the previous mapping command to re-run this analysis using all 48 cores.

You need to increase the -p (for "processors") option from 8 to 48.

click here to check your answer

bowtie2 -t -p 48 -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam

Try it out and compare the speed of execution by looking at the times listed at the end of each command

How much faster was it using all 48 processors?

With 8 processors the mapping took a little over 5 minutes; with 68 processors it took ~1 minute. Can you think of any reasons why it was ~5x faster rather than ~8x faster? (Note: these times come from the 68-core benchmark in the table below, so they will not match your 48-processor run, but the principle is the same.)

Answer

Anytime you use multiprocessing correctly things will go faster, but even if a program can divide the input perfectly among all available processors, and combine the outputs back together perfectly, there is "overhead" in dividing things up and recombining them. These are the types of considerations you may have to make with your data: When is it better to give more processors to a single sample? How fast do I actually need the data to come back?

An additional note from the stampede2 user manual is that while there are 68 cores available, and each core is capable of hyperthreading 4 processors per core, using all 272 processors is rarely the go-to solution. While I am sure that this is more rigorously and appropriately tested in some other manner, I ran a test using different numbers of processors with the following results:

-p option	time (min:sec)
272	1:54
136	1:13
68	0:57
34	1:14
17	2:25
8	5:12
4	9:01
2	18:13
1	35:01

Again, while there are almost certainly better ways to benchmark this, there are 2 things of note that are illustrated here:

Doubling the number of processors does not cut the time in half, and while some applications may use hyperthreading on the individual cores appropriately, assuming a program can or will do so can actually make things take longer (compare 68 with 136 and 272 above).
Working on your laptop (which likely has at most 4-8 processors available) would significantly increase the amount of time these tutorials take.
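
The timings in the table can be turned into speedup (time with 1 processor divided by time with N) and efficiency (speedup divided by N) to make both points easier to see. This is a rough sketch using only the numbers from the table above; it says nothing about how your own runs will behave.

```shell
# Timings copied from the table above, one per line: "<-p value> <min:sec>".
timings="1 35:01
2 18:13
4 9:01
8 5:12
17 2:25
34 1:14
68 0:57
136 1:13
272 1:54"

# Convert min:sec to seconds, then compute speedup and efficiency
# relative to the 1-processor run (the first line).
speedup_table=$(echo "$timings" | awk '
  {
    split($2, t, ":")
    secs = t[1] * 60 + t[2]
    if (NR == 1) base = secs
    speedup = base / secs
    printf "%d\t%d\t%.1f\t%.0f%%\n", $1, secs, speedup, 100 * speedup / $1
  }')
printf 'procs\tseconds\tspeedup\tefficiency\n%s\n' "$speedup_table"
```

Efficiency drops steadily as processors are added, and beyond 68 the speedup itself goes down, which matches the hyperthreading note above.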

One consequence of using multithreading that might be confusing is that the aligned reads might appear in your output SAM file in a different order than they were in the input FASTQ. This happens because small sets of reads get continuously packaged, "sent" to the different processors, and whichever set "returns" fastest is written first. You can force them to appear in the same order (at a slight cost in speed) by adding the --reorder flag to your command, but this is typically only necessary if the reads are already ordered or you intend to do some comparison between the input and output (something I have never done in my own work).

What comes after mapping?

The next steps are often to view the output using a specific viewer on your local machine, or to begin identifying variant locations where the reads differ from the reference sequence. These will be the next things we cover in the course.

Optional exercises

In the bowtie2 example, we mapped in the default --end-to-end mode (aka global mode). Try mapping in --local mode.
Do the BWA tutorial so you can compare their outputs (note BWA has a conda package making it even easier to try).
- Did bowtie2 or BWA map more reads?
- In our examples, we mapped in paired-end mode. Try to figure out how to map the reads in single-end mode and create this output.
- Which aligner took less time to run? Are there any options you can change that:
  - Lead to a larger percentage of the reads being mapped? (increase sensitivity)
  - Speed up run time without causing many fewer reads to be mapped? (increase performance)

Here is a link to help you return to the GVA 2023 course schedule.

Bioinformatics Team (BioITeam) at the University of Texas

Read Mapping with bowtie2 Tutorial GVA2023
