Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Please see the Introduction to mapping presentation on the course outline for more details of the theory behind read mapping algorithms and critical considerations for using these tools and references correctly.

Mapping tools summary

While you could consult last year's tutorial for installing bowtie2 via the module system, this year's course will be using the conda system to install it. The bowtie2 home page can be found here, and if you needed to download the program itself, version 2.4.4 could be downloaded here



Tutorial: E. coli genome re-sequencing data

...

Warning
titleBeware the cat command when working with NGS data

NGS data can be quite large, a single lane of an Illumina Hi-Seq run generates 2 files each with 100s of millions of lines. Printing all of that can take an enormous amount of time and will likely crash your terminal long before it finishes. If you find yourself in a seemingly endless scroll of sequence (or anything else for that matter) remember control+c will kill whatever command you just executed.

If hitting control+c several times doesn't work, control +z will stop the process, you then need to kill the process using kill %1 if control+z doesn't work, you may be best off closing the window, opening a new window, logging back in, and picking up where you left off. Note that for the purpose of this class, you should make sure to restart an idev session.


Remember, from the introduction tutorial, there are multiple ways to look at our sequencing files without using cat:

...

Occasionally you might download a sequence or have it emailed to you by a collaborator in one format, and then the program that you want to use demands that it be in another format. Why do they have to be so picky? Everybody has own favorite formats and/or those that they are the most familiar with but humans can typically pick the information they need out of comparable formats. Programs can only be written to assume a single type of format (or allow you to specify a format if the author is particularly generous), and can only find things in single locations based on that format. 

The bp_seqconvert.pl script is a common script written in Bioperl that is a helpful utility for converting between many common sequence formats. On TACC, the Bioperl modules are installed, but the helper script isn't. So, we've put it in a place that you can run it from for your convenience. However, remember that any time that you use the script you must have the bioperl module loaded. As it is fairly rare that you need to convert sequence files between different format, bioperl is actually not listed as one of the modules on your .bashrc file in your $HOME directory that you set up yesterday. Additionally it gives an opportunity to have you working with the module system. If you find yourself needing to do lots of sequence conversions, you may want to add a 'module load bioperl/1.007002' line to your .bashrc file.

Run the script without any arguments to get the help message:

Code Block
module load gcc
module load bioperl
bp_seqconvert.pl

Convert a gbk reference to a embl reference

Convert the Genbank file NC_012967.1.gbk to EMBL format, and name this new file NC_012967.1.embl.

Code Block
languagebash
titleTry reading through the program help when you run the bp_seqconvert.pl without any options to see the syntax required
collapsetrue
bp_seqconvert.pl --from genbank --to embl < NC_012967.1.gbk > NC_012967.1.embl
head -n 100 NC_012967.1.embl
Info

It is somewhat frustrating or confusing that this command does not give us any output saying it was successful. The fact that you get your prompt back is often the only sign the computer has finished doing something.

Last year some students were getting the following error message when they execute this command even though the new file seems to be generated correctly. As I am not able to reconstruct the error, please send a message or say something on zoom if you do encounter it so I know it is still present

Code Block
Use of uninitialized value in substitution (s///) at /opt/apps/bioperl/1.6.901/Bio/SeqIO/embl.pm line 777, <STDIN> line 164674.
Use of uninitialized value in concatenation (.) or string at /opt/apps/bioperl/1.6.901/Bio/SeqIO/embl.pm line 779, <STDIN> line 164674.
Expand
titleDoes EMBL format have sequence features (like genes) annotated? The answer is near the top of the file but not within the first 10 lines. DO NOT check with the cat command.
Code Block
languagebash
titleUsing the head to check the first 100 lines
head -n 100 NC_012967.1.embl
Code Block
languagebash
titleUsing the less command
less NC_012967.1.embl
Code Block
languagebash
titleUsing the more command
more NC_012967.1.embl

remember that you can quit the less and more views with the q key.

Converting from fastq to fasta format

Sometimes you only want to work with a subset of a full data file to check for overall trends, or to try out a new piece of code. Convert only the first 10,000 lines of SRR030257_1.fastq to FASTA format.

Code Block
languagebash
titleRemember you can use the "|" character to have the output of head feed directly into the bp_seqconvert.pl script.
head -n 10000 SRR030257_1.fastq | bp_seqconvert.pl --from fastq --to fasta > SRR030257_1.fasta
Expand
titleWhat information was lost by this conversion? Use the head command to look at the top of both the .fastq and .fasta file
Code Block
languagebash
head SRR030257_1.fastq
head SRR030257_1.fasta

The line of ASCII characters was lost. Remember, those are your "base quality scores". Many mappers will use the base quality scores to improve how the reads are aligned by not placing as much emphasis on poor bases.

Mapping with bowtie2

Bowtie2 is a complete rewrite of an older program bowtie. In terms of configurability, sensitivity, and speed it is useful for a wide range of projects. After years of teaching bwa mapping along with bowtie2, bowtie2 alone is now taught as I never recommend anyone use bwa, and based on positive feedback we continue with this set up. For some more details about how read mappers work see the bonus presentation about read mapping details and file formats on the course home page, and if you find a compelling reason to use bwa (or any other read mapper) rather than bowtie2 after the course is over, I'd love to hear from you.

Create a fresh output directory named bowtie2. We are going to create a specific output directory for the bowtie2 mapper within the directory that has the input files so that you can compare the results of other mappers if you choose to do the other tutorials.

Code Block
languagebash
titleCommands for making a directory named bowtie2
collapsetrue
mkdir bowtie2

First you need to convert the reference file from GenBank to FASTA using what you learned above. Name the new output file NC_012967.1.fasta and put it in the same directory as NC_012967.1.gbk.

Code Block
languagebash
titleUse the information you you learned above working with the bp_seqconvert.pl script to convert the entire .gbk file into a .fa file
collapsetrue
bp_seqconvert.pl --from genbank --to fasta < NC_012967.1.gbk > NC_012967.1.fasta

Next, we want to make sure the bowtie2 module is loaded (we use module spider to get the current name, which may not be bowtie/2.3.4 if you re-run this tutorial later):

...

titleClick here for a hint without the answer and some links back to where you would have learned this previously

...

While you could write your own sequence converter, hopefully it jumps out at you that this is something someone else must have done before. In situations like this, you can often spend a few minutes on google finding a stackoverlow question/answer that deals with something like this. Some will be in reference to how to code such things, but the particularly useful ones will be the ones that point to a program or repository where someone has already done this for you. 

In our case on TACC, the BioITeam has uploaded such a conversion script to one of the shared spaces. Using the which command, can you figure out where the bp_seqconvert.pl script is located?

Code Block
languagebash
titleRecall that we have used the which command to determine where executable files are located, and only take 2 pieces of information.
which bp_seqconvert.pl

By the end of the class, hopefully you will recognize the "/corral-repl/utexas/BioITeam/" part of the answer as the BioITeam. The "bin" folder specifically is full of binary or (typically small) bash/python/perl/R scripts that someone has written to help the community. 


The bp_seqconvert.pl script is a common script written in bioperl that is a helpful utility for converting between many common sequence formats. If you try to run the script however, you get the following error message:

No Format
Can't locate Bio/SeqIO.pm in @INC (@INC contains: /corral-repl/utexas/BioITeam//local/share/perl5 /corral-repl/utexas/BioITeam//perl5/lib/perl5/x86_64-linux-thread-multi /corral-repl/utexas/BioITeam//perl5/lib/perl5 /corral-repl/utexas/BioITeam//perl5/lib64/perl5/auto /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl line 8.
BEGIN failed--compilation aborted at /corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl line 8.
Info
titleDeciphering error messages

The above error message is pretty helpful, but much less so if you are not familiar with perl. As I doubt anyone in the class is more familiar with perl than I am, and I am not familiar with perl hardly at all, this is a great opportunity to explain how I would tackle the error message to figure out what is going on.

  1. "compilation aborted at /corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl line 8." 
    1. The last line here actually tells us that the script did not get very far, only to line 8.
    2. My experience with other programing language tells me that the beginning of scripts is all about checking that the script has access to all the things it needs to do what it is intended to do, so this has me thinking some kind of package might be missing.
  2. "(@INC contains: ..."
    1. This reads like the PATH variable, but is locations I don't recognize as being in my path, suggesting this is not some external binary or other program.
    2. Many of the individual pathways list "lib" in one form or another. This reinforces the idea from above that some kind of package is missing.
  3. "Can't locate Bio/SeqIO.pm in @INC"
    1. "Can't locate" reads like a plain text version of something being missing, and like something generic that is not related to my system/environment (like all the listed directories), and not related to the details of the script I am trying to run (like the last line that details the name of the script we tried to envoke)
    2. This is what should be googled for help solving the problem. 
      1.  the google results list similar error messages associated with different repositories/programs (github issues) suggesting some kind of common underlying problem.
      2. The 3rd result https://www.biostars.org/p/345331/ reads like a gneric problem and sure enough the answers detail needing to have the Bio library installed from cpan (perl's package management system)


We get this error message because on TACC, perl is installed, but the bioperl library isn't. Rather than having to actually install the library yourself, we are lucky that TACC has a bioperl module. So, you must have the bioperl module loaded before the script will run. As it is fairly rare that you need to convert sequence files between different format, bioperl is actually not listed as one of the modules on your .bashrc file in your $HOME directory that you set up yesterday. Additionally it gives an opportunity to have you working with the module system. 

After loading the bioperl library to get past the error mesage, run the script without any arguments to get the help message:

Code Block
module load bioperl
bp_seqconvert.pl

If you find yourself needing to do lots of sequence conversions, and find the bp_seqconvert.pl script useful to do them, you may want to add a 'module load bioperl/1.007002' line to your .bashrc file.

Convert a gbk reference to a embl reference

Convert the Genbank file NC_012967.1.gbk to EMBL format, and name this new file NC_012967.1.embl.

Code Block
languagebash
titleTry reading through the program help when you run the bp_seqconvert.pl without any options to see the syntax required
collapsetrue
bp_seqconvert.pl --from genbank --to embl < NC_012967.1.gbk > NC_012967.1.embl
head -n 100 NC_012967.1.embl
Info

It is somewhat frustrating or confusing that this command does not give us any output saying it was successful. The fact that you get your prompt back is often the only sign the computer has finished doing something.

Last year some students were getting the following error message when they execute this command even though the new file seems to be generated correctly. As I am not able to reconstruct the error, please send a message or say something on zoom if you do encounter it so I know it is still present

Code Block
Use of uninitialized value in substitution (s///) at /opt/apps/bioperl/1.6.901/Bio/SeqIO/embl.pm line 777, <STDIN> line 164674.
Use of uninitialized value in concatenation (.) or string at /opt/apps/bioperl/1.6.901/Bio/SeqIO/embl.pm line 779, <STDIN> line 164674.
Expand
titleDoes EMBL format have sequence features (like genes) annotated? The answer is near the top of the file but not within the first 10 lines. DO NOT check with the cat command.
Code Block
languagebash
titleUsing the head to check the first 100 lines
head -n 100 NC_012967.1.embl
Code Block
languagebash
titleUsing the less command
less NC_012967.1.embl
Code Block
languagebash
titleUsing the more command
more NC_012967.1.embl

remember that you can quit the less and more views with the q key.


Converting from fastq to fasta format

Sometimes you only want to work with a subset of a full data file to check for overall trends, or to try out a new piece of code. Convert only the first 10,000 lines of SRR030257_1.fastq to FASTA format.

Code Block
languagebash
titleRemember you can use the "|" character to have the output of head feed directly into the bp_seqconvert.pl script.
head -n 10000 SRR030257_1.fastq | bp_seqconvert.pl --from fastq --to fasta > SRR030257_1.fasta
Expand
titleWhat information was lost by this conversion? Use the head command to look at the top of both the .fastq and .fasta file
Code Block
languagebash
head SRR030257_1.fastq
head SRR030257_1.fasta

The line of ASCII characters was lost. Remember, those are your "base quality scores". Many mappers will use the base quality scores to improve how the reads are aligned by not placing as much emphasis on poor bases.


Mapping with bowtie2

Bowtie2 is a complete rewrite of an older program bowtie. In terms of configurability, sensitivity, and speed it is useful for a wide range of projects. After years of teaching bwa mapping along with bowtie2, bowtie2 alone is now taught as I never recommend anyone use bwa, and based on positive feedback we continue with this set up. For some more details about how read mappers work see the bonus presentation about read mapping details and file formats on the course home page, and if you find a compelling reason to use bwa (or any other read mapper) rather than bowtie2 after the course is over, I'd love to hear from you.

Create a fresh output directory named bowtie2. We are going to create a specific output directory for the bowtie2 mapper within the directory that has the input files so that you can compare the results of other mappers if you choose to do the other tutorials.

Code Block
languagebash
title
click here for the best answer
collapsetrue
module list bowtie
Code Block
languagebash
titleUnexpected output
# Currently Loaded Modules Matching: bowtie
#  None found.
# Inactive Modules Matching: bowtie
#   1) bowtie/2.3.4

...

Commands for making a directory named bowtie2
collapsetrue
mkdir bowtie2

First you need to convert the reference file from GenBank to FASTA using what you learned above. Name the new output file NC_012967.1.fasta and put it in the same directory as NC_012967.1.gbk.

When we loaded the bioperl module we first loaded the gcc compiler which unloaded several other modules (such as bowtie) which require the intel compiler to function. If you now try to load bioperl you'll see that it loads without requiring the gcc compiler. This was done deliberately to introduce you to another quirk of the module system. Hopefully the error messages were informative enough to help you work through how to get the modules to work.

Despite these quirks, this is still far easier to deal with than issues that can arise when installing other programs on tacc or your personal computer.
Code Block
Expand
titleYou may be wondering how bowtie was inactivated.
languagebash
Lmod has detected the following error:  These module(s) or extension(s) exist but cannot be loaded as requested: "bowtie/2.3.4"
   Try: "module spider bowtie/2.3.4" to see how to load the module(s).

See if you can figure out how to load bowtie using the information above.

Code Block
languagebash
titleClick here for answer
collapsetrue
module load intel/18.0.2
module load bowtie/2.3.4
titleUse the information you you learned above working with the bp_seqconvert.pl script to convert the entire .gbk file into a .fa file
collapsetrue
bp_seqconvert.pl --from genbank --to fasta < NC_012967.1.gbk > NC_012967.1.fasta

Bowtie2 installation

While you could consult last year's tutorial for installing bowtie2 via the module system, this year's course will be using the conda system to install it. The bowtie2 home page can be found here, and if you needed to download the program itself, version 2.4.4 could be downloaded here. Instead, we want to make sure the bowtie2 is installed via conda Like we did for fastqc and cutadapt. See if you can figure out how to install bowtie2. Note that "2" is actually part of the name, neither a typo nor a comment on the program version.

Expand
titleClick here for a hint before the answer.

Remember that we want to use the https://anaconda.org/ search function and end up at the bowtie2 page: https://anaconda.org/bioconda/bowtie2

Code Block
languagebash
titleRemember, you likely need to specify the channel bowtie2 is in using the -c command
collapsetrue
conda install -c bioconda bowtie2
# enter "y" to approve the installation
Expand
titleOPTIONAL -- How to check what version of bowtie2 was loaded?

Here are a few of the possibilities that will workRemember yesterday we did the same thing with cutadapt.

module list
Code Block
languagebash
titleIn this case all of these methods will work, that may not be true of all programs
Recall many programs have a --version flag that can be used to retrieve this information, and conda has a 'list' function that takes additional optional arguments
collapsetrue
bowtie2 --version
which bowtie2

Many programs have --version or -v options that can be passed to them to specifically print the version of the program and sometimes information about what papers the authors would like you to cite if you use the program.

which can be very useful for making sure you are running the executable that you think you are running, especially if you install your own programs. In particular make sure that the path matches up to what you expect. The most common situations arise from wanting to run a simplistically named script in your $HOME directory conflicting with something of the same name in the $BI directories or TACC modules
conda list bowtie2

The above should show you now have access to version 2.3.5.1. If you have a different version listed (such as 2.3.2) make sure you are using the conda installation and not relying on the TACC module, and then get my attention for clarification.

For many read mappers, the first step is quite often indexing the reference file regardless of what mapping program is used. Put the output of this command into the bowtie directory we created a minute ago. The command you need is:

...

Expand
titleIf you're stuck click here for an explanation of what arguments the command does need

The command requires 2 arguments. The first argument is the reference sequence in FASTA format. The second argument is the "base" file name to use for the created index files. It will create a bunch of files beginning bowtie/NC_012967.1*.

...

Take a look at your output directory using ls bowtie2 to see what new files have appeared. These files are binary files, so looking at them with head or tail isn't instructive and can cause issues with your terminal. If you insist on looking at them (or accidentally do so before you read this) and your terminal begins behaving oddly, simply close it and log back into lonestar stampede2 with a new terminal, and start a new idev session.

Info
titleyou may be wondering why creating an index is a common first step for many different mapping programs.

Like an index for a book (in the olden days before Kindles and Nooks), creating an index for a computer database allows quick access to any "record" given a short "key". In the case of mapping programs, creating an index for a reference sequence allows it to more rapidly place a read on that sequence at a location where it knows at least a piece of the read matches perfectly or with only a few mismatches. By jumping right to these spots in the genome, rather than trying to fully align the read to every place in the genome, it saves a ton of time.

Indexing is a separate step in running most mapping programs because it can take a LONG time if you are indexing a very large genome (like our own overly complicated human genome). Furthermore, you only need to index a genome sequence once, no matter how many samples you want to map. Keeping it as a separate step means that you can skip it later when you want to align a new sample to the same reference sequence.

...

Warning
titleIMPORTANT

This command can take a while (~5 minutes) and is extremely taxing. This is longer than we want to run a job on the head node (especially when all of us are doing it at once). In fact, in previous years, TACC has noticed the spike in usage when multiple students forgot to make sure they were on idev nodes and complained pretty forcefully to us about it. Let's not have this be one of those years. Use the showq -u command to make sure you are on an idev node. If you are not sure if you are on an idev node, speak up on zoom and I'll show(q) -u what you are looking for. Yes, your instructor likes bad puns. My apologies.

If you are not on an idev node, and need help to relaunch it, click over to the idev tutorial.

Code Block
languagebash
titleSolution
collapsetrue
bowtie2 -t -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam  
# the -t command is not required for the mapping, but it can be particularly informative when you begin comparing different mappers

...

We have actually massively under-utilized Lonestar stampede2 in this example. We ran the command using only a single processor (a single "thread") rather than the 48 272 we have available. For programs that support multithreaded execution (and most mappers do because they are obsessed with speed) we could have sped things up by using all 48 272 processors for the bowtie process.

Expand
titleSee if you can figure out how to re-run this using all 48 272 processors. Click here for a hint

You need to use the -p, for "processors" option. Since we had 48processors 272 processors available to our job.

Code Block
languagebash
titleclick here to check your answer
collapsetrue
bowtie2 -t -p 48272 -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam

Try it out and compare the speed of execution by looking at the log files.

Expand
titleHow much faster was it using all 48 272 processors?

1 processor took a little over 5 minutes, 48 processors took ~ 15 seconds. Can you think of any reasons why it was ~ 16x faster rather than 48x 272x faster?

Expand
titleAnswer

Anytime you use multiprocessing correctly things will go faster, but even if a program can divide the input perfectly among all available processors, and combine the outputs back together perfectly, there is "overhead" in dividing things up and recombining them. These are the types of considerations you may have to make with your data: When is it better to give more processors to a single sample? How fast do I actually need the data to come back?

...


An additional note from the stampede2 user manual is that using all 272 processors is rarely the go to solution. In my own work, if I am working with a single sample, or expect samples to finish at different rates yet fully distribute tasks throughout the duration of the program (as is true for read mapping like we are doing here, and read trimming like with trimmomatic or cutadapt, and actually very little else), I do use 272 as the default and only reduce it if there is a problem.  


One consequence of using multithreading that might be confusing is that the aligned reads might appear in your output SAM file in a different order than they were in the input FASTQ. This happens because small sets of reads get continuously packaged, "sent" to the different processors, and whichever set "returns" fastest is written first. You can force them to appear in the same order (at a slight cost in speed) by adding the --reorder flag to your command, but is typically only necessary if the reads are already ordered or you intend to do some comparison between the input and output (something I have never done in my own work)

What comes after mapping?

...

  • In the bowtie2 example, we mapped in --local mode. Try mapping in --end-to-end mode (aka global mode).

  • Do the BWA tutorial so you can compare their outputs (note BWA has a conda package making it even easier to try).
    • Did bowtie2 or BWA map more reads?
    • In our examples, we mapped in paired-end mode. Try to figure out how to map the reads in single-end mode and create this output.
    • Which aligner took less time to run? Are there any options you can change that:
      • Lead to a larger percentage of the reads being mapped? (increase sensitivity)
      • Speed up run time without causing many fewer reads to be mapped? (increase performance)

...

Here is a link to help you return to the GVA 2020 2021 course schedule.