Become comfortable with the basic steps of indexing a reference genome, mapping reads, and converting output to SAM/BAM format for downstream analysis.
Use bowtie2 to map reads from an E. coli Illumina data set to a reference genome and compare the output.

Theory

Please see the Introduction the Introduction to mapping presentation on presentation on the course outline for more details of the theory behind read mapping algorithms and critical considerations for using these tools and references correctly.

...

The easiest way to run the tutorial is to copy this entire directory into a new folder called "GVA_bowtie2_mapping" on your $SCRATCH space and then run all of the commands from inside that directory. See if you can figure out how to do that. When you're in the right place, you should get output like this from the ls command.

Code Block
(GVA2021) tacc:/scratch/<#>/<UserName>/GVA_bowtie2_mapping$ ls NC_012967.1.gbk SRR030257_1.fastq SRR030257_2.fastq SRR030257_2.fastq.gz

...

While you could write your own sequence converter, hopefully it jumps out at you that this is something someone else must have done before. In situations like this, you can often spend a few minutes on google finding a stackoverlow question/answer that deals with something like this. Some will be in reference to how to code such things, but the particularly useful ones will be the ones that point to a program or repository where someone has already done this for you.

In our case on TACC, the BioITeam has uploaded such a conversion script to one of the shared spaces. Using the which command, can you figure out where the this case the bp_seqconvert.pl script is located?

...

perl script is included as part of the bioperl module package. Rather than attempt to find it as part of a conda package, or in some other repository we will use the module version. If needing this script in the future outside of TACC, https://metacpan.org/dist/BioPerl/view/bin/bp_seqconvert.

Code Block

language	bash
title	Recall that we have used the which command to determine where executable files are located, and only take 2 pieces of information.

which bp_module load bioperl/1.007002
which -a bp_seqconvert.pl

By the end of the class, hopefully you will recognize the "If you run the which command above outside of an idev session you should see 2 results. If you run from inside an idve node you get 1 result. On the head node (outside idev) 1 that points to the BioITeam near where you keep finding your data (/corral-repl/utexas/BioITeam/" ) part of the answer as the BioITeam. The "bin" folder specifically is full of binary or (typically small) bash/python/perl/R scripts that someone has written to help the TACC community.

The bp_seqconvert.pl script is a common script written in bioperl that is a helpful utility for converting between many common sequence formats. If you try to run the script however, you get the following error message:

...

The other is in a folder specifically associated with the bioperl module.

Info

title	Why do you get 2 different results depending on if you are inside or outside of an idev node

This has to do with how compute nodes are configured. On stampede2 /corral-repl/ and all of its subdirectories are not accessible so even though the BioITeam is in your $PATH, on the compute node, the command line can't access it. This is why in later tutorials you have to log out of the idev session to copy new raw data files to work with.

If you try to run the BioITeam version of the script from the head node /corral-repl/utexas/BioITeam/

...

bin/bp_seqconvert.pl , you get the following error message:

No Format

Can't locate Bio/SeqIO.pm in @INC (@INC contains: /corral-repl/utexas/BioITeam//local/share/perl5 /corral-repl/utexas/BioITeam//perl5/lib/perl5/x86_64-linux-thread-multi /corral-repl/utexas/BioITeam//perl5/lib/perl5 /corral-repl/utexas/BioITeam//perl5/lib64/perl5/auto /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl line 8.
BEGIN failed--compilation aborted at /corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl line 8.

Info

title	Deciphering error messages

The above error message is pretty helpful, but much less so if you are not familiar with perl. As I doubt anyone in the class is more familiar with perl than I am, and I am not familiar with perl hardly at all, this is a great opportunity to explain how I would tackle the error message to figure out what is going on.

"compilation aborted at /corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl line 8."
1. The last line here actually tells us that the script did not get very far, only to line 8.
2. My experience with other programing language tells me that the beginning of scripts is all about checking that the script has access to all the things it needs to do what it is intended to do, so this has me thinking some kind of package might be missing.
"(@INC contains: ..."
1. This reads like the PATH variable, but is locations I don't recognize as being in my path, suggesting this is not some external binary or other program.
2. Many of the individual pathways list "lib" in one form or another. This reinforces the idea from above that some kind of package is missing.
"Can't locate Bio/SeqIO.pm in @INC"
1. "Can't locate" reads like a plain text version of something being missing, and like something generic that is not related to my system/environment (like all the listed directories), and not related to the details of the script I am trying to run (like the last line that details the name of the script we tried to envoke)
2. This is what should be googled for help solving the problem.
  1. the google results list similar error messages associated with different repositories/programs (github issues) suggesting some kind of common underlying problem.
  2. The 3rd result https://www.biostars.org/p/345331/ reads like a gneric generic problem and sure enough the answers detail needing to have the Bio library installed from cpan (perl's package management system)

...

We get this error message because on TACCstampede2, while perl is installed, but the bioperl library isn'trequired bioperl module is not loaded by default. Rather than having to actually install the library yourself (as we will do for several libraries in the SVDetect tutorial later today), we are lucky that TACC has a bioperl module. So, you must have the bioperl module loaded before the script will run. As it is fairly rare that you need to convert sequence files between different format, bioperl is actually not listed as one of the modules on your .bashrc file in your $HOME directory that you set up yesterday. Additionally it gives an opportunity to have you working with the module system.

After loading the bioperl library to get past the error mesagemessage, run the script from the BioITeam without any arguments to get the help message:

Code Block
module load bioperl /corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl

If you find yourself needing to do lots of sequence conversions, and find the bp_seqconvert.pl script useful to do them, you may want to add a 'module load bioperl/1.007002' line to your .bashrc file.

Convert a gbk reference to a embl reference

Convert the Genbank file NC_012967.1.gbk to EMBL format, and Recall that because the bp_seqconvert.pl script exists in 2 different locations as 2 different copies only the first one in the PATH variable will be used. Using the which -a command you see the copy used will be the module version unless you specifically envoke the BioITeam version. In this case it does not matter as the scripts are the same.

Convert a gbk reference to a embl reference

Convert the Genbank file NC_012967.1.gbk to EMBL format, and name this new file NC_012967.1.embl.

...

Info

It is somewhat frustrating or confusing that this command does not give us any output saying it was successful. The fact that you get your prompt back is often the only sign the computer has finished doing something.

Last year some students were getting the following error message when they execute this command even though the new file seems to be generated correctly. As I am not able to reconstruct the error, please send a message or say something on zoom if you do encounter it so I know it is still present

Code Block

Use of uninitialized value in substitution (s///) at /opt/apps/bioperl/1.6.901/Bio/SeqIO/embl.pm line 777, <STDIN> line 164674.
Use of uninitialized value in concatenation (.) or string at /opt/apps/bioperl/1.6.901/Bio/SeqIO/embl.pm line 779, <STDIN> line 164674.

head -n 100

Expand

title	Does EMBL format have sequence features (like genes) annotated? The answer is near the top of the file but not within the first 10 lines. DO NOT check with the cat command.

Code Block

language	bash
title	Using the head to check the first 100 lines

Expand

title	Does EMBL format have sequence features (like genes) annotated? The answer is near the top of the file but not within the first 10 lines. DO NOT check with the cat command.

Code Block

language	bash
title	Using the head to check the first 100 lines

head -n 100 NC_012967.1.embl

Code Block

language	bash
title	Using the less command

less NC_012967.1.embl

Code Block

language	bash
title	Using the more command

more NC_012967.1.embl

Code Block

language	bash
title	Using the less command

less NC_012967.1.embl

Code Block

language	bash
title	Using the more command

more NC_012967.1.embl

remember that you can quit the less and more views remember that you can quit the less and more views with the q key.

Converting from fastq to fasta format

...

Warning

title	IMPORTANT

This command can take a while (~5 minutes) and is extremely taxing. This is longer than we want to run a job on the head node (especially when all of us are doing it at once). In fact, in previous years, TACC has noticed the spike in usage when multiple students forgot to make sure they were on idev nodes and complained pretty forcefully to us about it. Let's not have this be one of those years. Use the hostname or showq -u command to make sure you are on an idev node. If you are not sure if you are on an idev node, speak up on zoom and I'll show(q) -u what you are looking for. Yes, your instructor likes bad puns. My apologies.

If you are not on an idev node, and need help to relaunch it, click over to the idev tutorial.

Code Block

language	bash
title	Solution
collapse	true

bowtie2 -t -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam  
# the -t command is not required for the mapping, but it can be particularly informative when you begin comparing different mappers

Your final output file is in SAM format. It's just a text file, so you can peek at it and see what it's like inside. Two warnings though:

SAM files can be enormously humongous text files (potentially >1 gigabytes). Attempting to open the entire file at once can cause your computer to lock up or your text editor to crash. You are generally safer only looking at a portion at a time using linux commands like head or grep or more or using a viewer like IGV, which we will cover in a later tutorial.
SAM files have some rather complicated information encoded as text, like a binary encoded FLAGS field and CIGAR strings. We'll take a look at some of these later, if we have time, or they are covered in the bonus presentation about read mapping and file formats which you can find on the home page.

Still, you should recognize some of the information on a line in a SAM file from the input FASTQ, and some of the other information is relatively straightforward to understand, like the position where the read mapped. Give this a try:

Code Block
head bowtie2/SRR030257.sam

Expand

title	What do you think the 4th and 8th columns mean(click for answer)?

If you thought the answer was the mapping coordinates of the read pairs you were right!

Multithreaded execution

We have actually massively under-utilized stampede2 in this example. We ran the command using only a single processor (a single "thread") rather than the 272 we have available. For programs that support multithreaded execution (and most mappers do because they are obsessed with speed) we could have sped things up by using all 272 processors for the bowtie process.

1 processor took a little over 5 minutes, 48 processors took ~ 15 seconds. Can you think of any reasons why it was ~ 16x faster rather than 272x faster?

Anytime you use multiprocessing correctly things will go faster, but even if a program can divide the input perfectly among all available processors, and combine the outputs back together perfectly, there is "overhead" in dividing things up and recombining them. These are the types of considerations you may have to make with your data: When is it better to give more processors to a single sample? How fast do I actually need the data to come back?
An additional note from the stampede2 user manual is that using all 272 processors is rarely the go to solution. In my own work, if I am working with a single sample, or expect samples to finish at different rates yet fully distribute tasks throughout the duration of the program (as is true for read mapping like we are doing here, and read trimming like with trimmomatic or cutadapt, and actually very little else), I do use 272 as the default and only reduce it if there is a problem.

Expand

title	See if you can figure out how to re-run this using all 272 processors. Click here for a hint

You need to use the -p, for "processors" option. Since we had 272 processors available to our job.

Code Block

language	bash
title	click here to check your answer
collapse	true

bowtie2 -t -p 272 -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam

Try it out and compare the speed of execution by looking at the log files.

Expand

title	How much faster was it using all 272 processors?

Expand

title	Answer

Command	on idev node	on head node
hostname	lists a compute node starting with a C followed by a number before "stampede2.tacc.utexas.edu"	lists a login node plus number before "stampede2.tacc.utexas.edu"
showq -u	-bash: showq: command not found	shows you a summary of jobs you have. (very likely empty during these tutorials)

Versions Compared

Old Version 2

New Version Current

Key

Theory

Convert a gbk reference to a embl reference

Convert a gbk reference to a embl reference

Converting from fastq to fasta format

More reading about SAM files

Multithreaded execution

More reading about SAM files

Multithreaded execution

Page Comparison

Versions Compared

Old Version 2

New Version Current

Key

Theory

Convert a gbk reference to a embl reference

Convert a gbk reference to a embl reference

Converting from fastq to fasta format

More reading about SAM files

Multithreaded execution

More reading about SAM files

Multithreaded execution