Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. Become comfortable with the basic steps of indexing a reference genome, mapping reads, and converting output to SAM/BAM format for downstream analysis.
  2. Use bowtie2 to map reads from an E. coli Illumina data set to a reference genome and compare the output.

Theory

Please see the Introduction the Introduction to mapping presentation on presentation on the course outline for more details of the theory behind read mapping algorithms and critical considerations for using these tools and references correctly.

...

If you try to run the BioITeam version of the script from the head node /corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl , you get an error message similar to the following error message:

No Format
Can't locate Bio/SeqIO.pm in @INC (@INC contains: /corral-repl/utexas/BioITeam//local/share/perl5 /corral-repl/utexas/BioITeam//perl5/lib/perl5/x86_64-linux-thread-multi /corral-repl/utexas/BioITeam//perl5/lib/perl5 /corral-repl/utexas/BioITeam//perl5/lib64/perl5/auto /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl line 8.
BEGIN failed--compilation aborted at /corral-repl/utexas/BioITeam/bin/bp_seqconvert.pl line 8.

...

We get this error message because on stampede2, while perl is installed, the required bioperl module is not loaded by default. Rather than having to actually install the library yourself (as we will do for several libraries in the SVDetect tutorial later today), we are lucky that TACC has a bioperl module. So, you must have the bioperl module loaded before the script will run. As it is fairly rare that you need to convert sequence files between different format, bioperl is actually not listed as one of the modules on your .bashrc file in your $HOME directory that you set up yesterday. Additionally it gives an opportunity to have you working with the module system, and give you something to compare installing libraries manually for the SVDecect tutorial against

After loading the bioperl library to get past the error message, run the script from the BioITeam without any arguments to get the help message:

...

Bowtie2 is a complete rewrite of an older program bowtie. In terms of configurability, sensitivity, and speed it is useful for a wide range of projects. After years of teaching bwa mapping along with bowtie2, bowtie2 alone is now taught as I never recommend anyone use bwa, and based on positive feedback we continue with this set up. For some more details about how read mappers work see the bonus presentation about read mapping details and file formats on the course home page, and if you find a compelling reason to use bwa (or any other read mapper) rather than bowtie2 after the course is over, I'd love to hear from you.

...

Expand
titleClick here for a hint before the answer.

Remember that we want to use the https://anaconda.org/ search function and end up at the bowtie2 page: https://anaconda.org/bioconda/bowtie2. Like we discussed with the ls command in the introduction tutorial, we can combine creating a new environment while at the same time, telling it what programs we want to access inside that environment.

Code Block
languagebash
titleRemember, you likely need to specify the channel bowtie2 is in using the -c command
collapsetrue
conda create -n GVA-bowtie2-mapping -c bioconda bowtie2
# enter "n" to cancel the installation and read on for more information

As mentioned in explaining why cutadapt installed version 1.18 instead of 3.5, the default anaconda channel and the bioconda channel do not always have all necessary program requirements to install the latest version of programs. In the list of new packages that were to be installed the following line lists that the bowtie2 version that will be installed will be 2.2.5:

  bowtie2            bioconda/linux-64::bowtie2-2.2.5-py37h6bb024c_5
While it may seem like installing a different version of the program is bad behavior, this is actually a huge benefit of the conda program. Often changes from version to version of a program are small and only effect subsets of the program, and the conda package installer is designed to find whatever way it can to get you a working version of the program. If we know that there is a particular version we want (be it the newest version, or a previous version you want to use to maintain consistent behavior in a given data set) and we tell conda that we want that version, if conda can't install that version it wont prompt you to proceed it will just fail.

Expand
titleWhat would this look like for bowtie 2.4.5?


Code Block
languagebash
titleCommand specifying that bowtie 2.4.5 is required
conda create -n GVA-bowtie2-mapping -c bioconda bowtie2=2.4.5


No Format
Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: | 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed                                                                                                                                                                                                                                        

UnsatisfiableError: 


Since we dont have a lot of information about what is causing the conflict with bowtie2 version 2.4.5, a simple step to try is to give the installation access to conda-forge. Similar to bioconda, conda-forge is a channel that is community run rather than company run and is both more nimble at including new things, and expansive for including more tools. More information about conda-forge can be found here.

Code Block
languagebash
titleCommand specifying that bowtie 2.4.5 is required
conda create -n GVA-bowtie2-mapping -c bioconda bowtie2=2.4.5 -c conda-forge
# enter Y to verify and proceed
conda activate GVA-bowtie2-mapping


Info
titleif conda-forge and bioconda are such important channels, why are they not included by default

Remember conda is a computational tool for all disciplines. In an optional tutorial on Friday, there will be information about how to permanently add different channels to be searched. In the mean time, forcing you to interact with different channels more manually, will help you gauge if adding them is actually something you will benefit from in your own work.



Expand
titleOPTIONAL -- How to check what version of bowtie2 was loaded?

We just went thought a lot of work to make sure we installed the version that we wanted, but sometimes we need to work the other way and figure out what version we have already been working with such as  yesterday with cutadapt.

Code Block
languagebash
titleRecall many programs have a --version flag that can be used to retrieve this information, and conda has a 'list' function that takes additional optional arguments
collapsetrue
bowtie2 --version
conda list bowtie2

The above should show you now have access to version 2.4.5. If you have a different version listed (such as 2.3.5 or 2.3.2) make sure you are using the conda installation with access to conda forge and not relying on the TACC module, and then get my attention for clarification.

Building an index

For many read mappers, the first step in mapping reads to a genome is quite often indexing the reference file. Put the output of this command into the bowtie directory we created a minute ago. The command you need is:

Code Block
bowtie2-build

Try typing this alone in the terminal and figuring out what to do from the help show just from typing the command by itself.

Expand
titleIf you're stuck click here for an explanation of what arguments the command does need

The command requires 2 arguments. The first argument is the reference sequence in FASTA format. The second argument is the "base" file name to use for the created index files. It will create a bunch of files beginning bowtie/NC_012967.1*.

Code Block
languagebash
titleClick here to check your work, or for the answer if needed
collapsetrue
bowtie2-build NC_012967.1.fasta bowtie2/NC_012967.1

Take a look at your output directory using ls bowtie2 to see what new files have appeared. These files are binary files, so looking at them with head or tail isn't instructive and can cause issues with your terminal. If you insist on looking at them (or accidentally do so before you read this) and your terminal begins behaving oddly, simply close it and log back into stampede2 with a new terminal, and start a new idev session.

Info
titleyou may be wondering why creating an index is a common first step for many different mapping programs.

Like an index for a book (in the olden days before Kindles and Nooks), creating an index for a computer database allows quick access to any "record" given a short "key". In the case of mapping programs, creating an index for a reference sequence allows it to more rapidly place a read on that sequence at a location where it knows at least a piece of the read matches perfectly or with only a few mismatches. By jumping right to these spots in the genome, rather than trying to fully align the read to every place in the genome, it saves a ton of time.

Indexing is a separate step in running most mapping programs because it can take a LONG time if you are indexing a very large genome (like our own overly complicated human genome). Furthermore, you only need to index a genome sequence once, no matter how many samples you want to map. Keeping it as a separate step means that you can skip it later when you want to align a new sample to the same reference sequence.

Mapping reads

Code Block
languagebash
titleAgain, try reading the help for the bowtie2 command to figure out how to run the command yourself. Remember these are paired-end reads.
collapsetrue
bowtie2

This command can take a while (~5 minutes) and is extremely taxing. This is longer than we want to run a job on the head node (especially when all of us are doing it at once). In fact, in previous years, TACC has noticed the spike in usage when multiple students forgot to make sure they were on idev nodes and complained pretty forcefully to us about it. Let's not have this be one of those years. Use the hostname or  showq -u command to make sure you are on an idev node.

If you are not sure if you are on an idev node or are seeing other output with one or both commands, speak up on zoom and I'll show(q) -u what you are looking for. Yes, your instructor likes bad puns. My apologies.

If you are not on an idev node, and need help to relaunch it, click over to the idev tutorial.
Warning
titleIMPORTANT
Commandon idev nodeon head node
hostnamelists a compute node starting with a C followed by a number before "stampede2.tacc.utexas.edu"lists a login node plus number before "stampede2.tacc.utexas.edu"
showq -u

-bash: showq: command not found

shows you a summary of jobs you have. (very likely empty during these tutorials)
Warning
titleIMPORTANT

The following command is extremely taxing to the head node and thus means we should not run it on the head node (especially when all of us are doing it at once). In fact, in previous years, TACC has noticed the spike in usage when multiple students forgot to make sure they were on idev nodes and complained pretty forcefully to us about it. Let's not have this be one of those years. Use the hostname or  showq -u command to make sure you are on an idev node.

Commandon idev nodeon head node
hostnamelists a compute node starting with a C followed by a number before "stampede2.tacc.utexas.edu"lists a login node plus number before "stampede2.tacc.utexas.edu"
showq -u

-bash: showq: command not found

shows you a summary of jobs you have. (very likely empty during these tutorials)

If you are not sure if you are on an idev node or are seeing other output with one or both commands, speak up on zoom and I'll show(q) -u what you are looking for. Yes, your instructor likes bad puns. My apologies.

If you are not on an idev node, and need help to relaunch it, click over to the idev tutorial.

For many read mappers, the first step in mapping reads to a genome is quite often indexing the reference file. Put the output of this command into the bowtie directory we created a minute ago. The command you need is:

Code Block
bowtie2-build

Try typing this alone in the terminal and figuring out what to do from the help show just from typing the command by itself.

Expand
titleIf you're stuck click here for an explanation of what arguments the command does need

The command requires 2 arguments. The first argument is the reference sequence in FASTA format. The second argument is the "base" file name to use for the created index files. It will create a bunch of files beginning bowtie/NC_012967.1*.


Code Block
languagebash
titleClick here to check your work, or for the answer if needed
collapsetrue
bowtie2-build NC_012967.1.fasta bowtie2/NC_012967.1

Take a look at your output directory using ls bowtie2 to see what new files have appeared. These files are binary files, so looking at them with head or tail isn't instructive and can cause issues with your terminal. If you insist on looking at them (or accidentally do so before you read this) and your terminal begins behaving oddly, simply close it and log back into stampede2 with a new terminal, and start a new idev session.

Info
titleyou may be wondering why creating an index is a common first step for many different mapping programs.

Like an index for a book (in the olden days before Kindles and Nooks), creating an index for a computer database allows quick access to any "record" given a short "key". In the case of mapping programs, creating an index for a reference sequence allows it to more rapidly place a read on that sequence at a location where it knows at least a piece of the read matches perfectly or with only a few mismatches. By jumping right to these spots in the genome, rather than trying to fully align the read to every place in the genome, it saves a ton of time.

Indexing is a separate step in running most mapping programs because it can take a LONG time if you are indexing a very large genome (like our own overly complicated human genome). Furthermore, you only need to index a genome sequence once, no matter how many samples you want to map. Keeping it as a separate step means that you can skip it later when you want to align a new sample to the same reference sequence.


Mapping reads

Code Block
languagebash
titleAgain, try reading the help for the bowtie2 command to figure out how to run the command yourself. Remember these are paired-end reads.
collapsetrue
bowtie2

It is important that you use 8 processors when doing this mapping due to course time constraints.

...

  1. SAM files can be enormously humongous text files (potentially measured in gigabytes). Attempting to open the entire file at once can cause your computer to lock up or your text editor to crash. You are generally safer only looking at a portion at a time using linux commands like head or grep or more or using a viewer like IGV, which we will cover in a later tutorial.
  2. SAM files have some rather complicated information encoded as text, like a binary encoded FLAGS field and CIGAR strings. We'll take a look at some of these later, if we have time, or they are covered in the bonus presentation about read mapping and file formats which you can find on the home page.

...

  • In the bowtie2 example, we mapped in --local mode. Try mapping in --end-to-end mode (aka global mode).

  • Do the BWA tutorial so you can compare their outputs (note BWA has a conda package making it even easier to try).
    • Did bowtie2 or BWA map more reads?
    • In our examples, we mapped in paired-end mode. Try to figure out how to map the reads in single-end mode and create this output.
    • Which aligner took less time to run? Are there any options you can change that:
      • Lead to a larger percentage of the reads being mapped? (increase sensitivity)
      • Speed up run time without causing many fewer reads to be mapped? (increase performance)

...