...
- Become comfortable with the basic steps of indexing a reference genome, mapping reads, and converting output to
SAM/BAM
format for downstream analysis. - Use bowtie2 to map reads from an E. coli Illumina data set to a reference genome and compare the output.
Theory
Please see the Introduction the Introduction to mapping presentation on presentation on the course outline for more details of the theory behind read mapping algorithms and critical considerations for using these tools and references correctly.
...
The easiest way to run the tutorial is to copy this entire directory into a new folder called "GVA_bowtie2_mapping" on your $SCRATCH space and then run all of the commands from inside that directory. See if you can figure out how to do that. When you're in the right place, you should get output like this from the ls
command.
Code Block |
---|
(GVA2021) tacc:/scratch/<#>/<UserName>/GVA_bowtie2_mapping$ ls
NC_012967.1.gbk SRR030257_1.fastq SRR030257_2.fastq SRR030257_2.fastq.gz
|
...
Info |
---|
It is somewhat frustrating or confusing that this command does not give us any output saying it was successful. The fact that you get your prompt back is often the only sign the computer has finished doing something. Last year some students were getting the following error message when they execute this command even though the new file seems to be generated correctly. As I am not able to reconstruct the error, please send a message or say something on zoom if you do encounter it so I know it is still present Code Block |
---|
Use of uninitialized value in substitution (s///) at /opt/apps/bioperl/1.6.901/Bio/SeqIO/embl.pm line 777, <STDIN> line 164674.
Use of uninitialized value in concatenation (.) or string at /opt/apps/bioperl/1.6.901/Bio/SeqIO/embl.pm line 779, <STDIN> line 164674. |
|
Expand |
---|
title | Does EMBL format have sequence features (like genes) annotated? The answer is near the top of the file but not within the first 10 lines. DO NOT check with the cat command. |
---|
|
Code Block |
---|
language | bash |
---|
title | Using the head to check the first 100 lines |
---|
| head -n 100 NC_012967.1.embl
|
Code Block |
---|
language | bash |
---|
title | Using the less command |
---|
| less NC_012967.1.embl
|
Code Block |
---|
language | bash |
---|
title | Using the more command |
---|
| more NC_012967.1.embl
|
remember that you can quit the less and more views with the q key. |
Converting from fastq to fasta format
Sometimes you only want to work with a subset of a full data file to check for overall trends, or to try out a new piece of code. Convert only the first 10,000 lines of SRR030257_1.fastq
to FASTA
format.
Code Block |
---|
language | bash |
---|
title | Remember you can use the "|" character to have the output of head feed directly into the bp_seqconvert.pl script. |
---|
|
head -n 10000 SRR030257_1.fastq | bp_seqconvert.pl --from fastq --to fasta > SRR030257_1.fasta
|
Expand |
---|
title | What information was lost by this conversion? Use the head command to look at the top of both the .fastq and .fasta file |
---|
|
Code Block |
---|
| head SRR030257_1.fastq
head SRR030257_1.fasta |
The line of ASCII characters was lost. Remember, those are your "base quality scores". Many mappers will use the base quality scores to improve how the reads are aligned by not placing as much emphasis on poor bases. |
Mapping with bowtie2
Bowtie2 is a complete rewrite of an older program bowtie. In terms of configurability, sensitivity, and speed it is useful for a wide range of projects. After years of teaching bwa mapping along with bowtie2, bowtie2 alone is now taught as I never recommend anyone use bwa, and based on positive feedback we continue with this set up. For some more details about how read mappers work see the bonus presentation about read mapping details and file formats on the course home page, and if you find a compelling reason to use bwa (or any other read mapper) rather than bowtie2 after the course is over, I'd love to hear from you.
...
Expand |
---|
title | Does EMBL format have sequence features (like genes) annotated? The answer is near the top of the file but not within the first 10 lines. DO NOT check with the cat command. |
---|
|
Code Block |
---|
language | bash |
---|
title | Using the head to check the first 100 lines |
---|
| head -n 100 NC_012967.1.embl
|
Code Block |
---|
language | bash |
---|
title | Using the less command |
---|
| less NC_012967.1.embl
|
Code Block |
---|
language | bash |
---|
title | Using the more command |
---|
| more NC_012967.1.embl
|
remember that you can quit the less and more views with the q key. |
Converting from fastq to fasta format
Sometimes you only want to work with a subset of a full data file to check for overall trends, or to try out a new piece of code. Convert only the first 10,000 lines of SRR030257_1.fastq
to FASTA
format.
Code Block |
---|
language | bash |
---|
title | Remember you can use the "|" character to have the output of head feed directly into the bp_seqconvert.pl script. |
---|
|
head -n 10000 SRR030257_1.fastq | bp_seqconvert.pl --from fastq --to fasta > SRR030257_1.fasta
|
Expand |
---|
title | What information was lost by this conversion? Use the head command to look at the top of both the .fastq and .fasta file |
---|
|
Code Block |
---|
| head SRR030257_1.fastq
head SRR030257_1.fasta |
The line of ASCII characters was lost. Remember, those are your "base quality scores". Many mappers will use the base quality scores to improve how the reads are aligned by not placing as much emphasis on poor bases. |
Mapping with bowtie2
Bowtie2 is a complete rewrite of an older program bowtie. In terms of configurability, sensitivity, and speed it is useful for a wide range of projects. After years of teaching bwa mapping along with bowtie2, bowtie2 alone is now taught as I never recommend anyone use bwa, and based on positive feedback we continue with this set up. For some more details about how read mappers work see the bonus presentation about read mapping details and file formats on the course home page, and if you find a compelling reason to use bwa (or any other read mapper) rather than bowtie2 after the course is over, I'd love to hear from you.
Create a fresh output directory named bowtie2. We are going to create a specific output directory for the bowtie2 mapper within the directory that has the input files so that you can compare the results of other mappers if you choose to do the other tutorials.
Code Block |
---|
language | bash |
---|
title | Commands for making a directory named bowtie2 |
---|
collapse | true |
---|
|
mkdir bowtie2 |
First you need to convert the reference file from GenBank to FASTA using what you learned above. Name the new output file NC_012967.1.fasta
and put it in the same directory as NC_012967.1.gbk
.
Code Block |
---|
language | bash |
---|
title | Commands for making a directory named bowtie2 |
---|
collapse | true |
---|
|
mkdir bowtie2 |
First you need to convert the reference file from GenBank to FASTA using what you learned above. Name the new output file NC_012967.1.fasta
and put it in the same directory as NC_012967.1.gbk
.
Code Block |
---|
language | bash |
---|
title | Use the information you you learned above Use the information you you learned above working with the bp_seqconvert.pl script to convert the entire .gbk file into a .fa file |
---|
collapse | true |
---|
|
bp_seqconvert.pl --from genbank --to fasta < NC_012967.1.gbk > NC_012967.1.fasta
|
...
Warning |
---|
|
This command can take a while (~5 minutes) and is extremely taxing. This is longer than we want to run a job on the head node (especially when all of us are doing it at once). In fact, in previous years, TACC has noticed the spike in usage when multiple students forgot to make sure they were on idev nodes and complained pretty forcefully to us about it. Let's not have this be one of those years. Use the showq -u command to make sure you are on an idev node. If you are not sure if you are on an idev node, speak up on zoom and I'll show(q) -u what you are looking for. Yes, your instructor likes bad puns. My apologies. If you are not on an idev node, and need help to relaunch it, click over to the idev tutorial. |
Code Block |
---|
It is important that you use 8 processors when doing this mapping due to course time constraints.
Code Block |
---|
language | bash |
---|
title | Solution |
---|
collapse | true |
---|
|
bowtie2 -t -p 8 -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam
# the -t command is not required for the mapping, but it can be particularly informative when you begin comparing different mappers |
Your final output file is in SAM format. It's just a text file, so you can peek at it and see what it's like inside. Two warnings though:
- SAM files can be enormously humongous text files (potentially >1 gigabytes). Attempting to open the entire file at once can cause your computer to lock up or your text editor to crash. You are generally safer only looking at a portion at a time using linux commands like
head
or grep
or more or using a viewer like IGV, which we will cover in a later tutorial. - SAM files have some rather complicated information encoded as text, like a binary encoded FLAGS field and CIGAR strings. We'll take a look at some of these later, if we have time, or they are covered in the bonus presentation about read mapping and file formats which you can find on the home page.
Still, you should recognize some of the information on a line in a SAM file from the input FASTQ, and some of the other information is relatively straightforward to understand, like the position where the read mapped. Give this a try:
Code Block |
---|
head bowtie2/SRR030257.sam |
Expand |
---|
title | What do you think the |
---|
Expand |
---|
|
Command portion | Purpose |
---|
-t | Print wall clock time each step takes. | -p 8 | Use 8 processors. As discussed above and below this is selected so the command will finish in a reasonable amount of time | -x bowtie2/NC_012967.1 | listing the location and name of the index we created above with the bowtie2-build command | -1 SRR030257_1.fastq | Read 1 file name (note if not using the -1 and -2 options reads would not be mapped in paired end mode) | -2 SRR030257_2.fastq | Read 2 file name (note if not using the -1 and -2 options reads would not be mapped in paired end mode) | -S bowtie2/SRR030257.sam | Output mapped reads in sam format at given location with given name |
|
Your final output file is in SAM format. It's just a text file, so you can peek at it and see what it's like inside. Two warnings though:
- SAM files can be enormously humongous text files (potentially measured in gigabytes). Attempting to open the entire file at once can cause your computer to lock up or your text editor to crash. You are generally safer only looking at a portion at a time using linux commands like
head
or grep
or more or using a viewer like IGV, which we will cover in a later tutorial. - SAM files have some rather complicated information encoded as text, like a binary encoded FLAGS field and CIGAR strings. We'll take a look at some of these later, if we have time, or they are covered in the bonus presentation about read mapping and file formats which you can find on the home page.
Still, you should recognize some of the information on a line in a SAM file from the input FASTQ, and some of the other information is relatively straightforward to understand, like the position where the read mapped. Give this a try:
Code Block |
---|
head bowtie2/SRR030257.sam |
Expand |
---|
title | What do you think the 4th and 8th columns mean(click for answer)? |
---|
|
If you thought the answer was the mapping coordinates of the read pairs you were right! |
...
Expand |
---|
title | See if you can figure out how to re-run this using all 272 processors68 cores. Click here for a hint |
---|
|
You need to use the -p , for "processors" option. Since we had 272 processors 68 cores available to our job. Code Block |
---|
language | bash |
---|
title | click here to check your answer |
---|
collapse | true |
---|
| bowtie2 -t -p 27268 -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam
|
Try it out and compare the speed of execution by looking at the log files. Expand |
---|
title | How much faster was it using all 272 68 processors? |
---|
| 1 8 processor took a little over 5 minutes, 48 68 processors took ~ 15 57 seconds. Can you think of any reasons why it was ~ 16x 5x faster rather than 272x ~8x faster? Expand |
---|
| Anytime you use multiprocessing correctly things will go faster, but even if a program can divide the input perfectly among all available processors, and combine the outputs back together perfectly, there is "overhead" in dividing things up and recombining them. These are the types of considerations you may have to make with your data: When is it better to give more processors to a single sample? How fast do I actually need the data to come back?
An additional note from the stampede2 user manual is that using all 272 processors is rarely the go to solution. In my own work, if I am working with a single sample, or expect samples to finish at different rates yet fully distribute tasks throughout the duration of the program (as is true for read mapping like we are doing here, and read trimming like with trimmomatic or cutadapt, and actually very little else), I do use 272 as the default and only reduce it if there is a problem. manual is that while there are 68 cores available, and each core is capable of hyperthreading 4 x processors per core using all 272 processors is rarely the go to solution. While I am sure that this is more rigorously and appropriately tested in some other manner, I ran a test using different numbers of processors with the following results: -p option | time (min:sec) |
---|
272 | 1:54 | 136 | 1:13 | 68 | 0:57 | 34 | 1:14 | 17 | 2:25 | 8 | 5:12 | 4 | 9:01 | 2 |
| 1 |
|
Again while there are almost certainly better ways to benchmark this, there are 2 things of note that are illustrated here: - ~doubling the number of processors does not reduce the time in half, and while some applications may use hyperthreading on the individual cores appropriately, and assuming a program can/will actually makes things take longer.
- Working on your laptop (which likely has at most 4-8 processors available) would significantly increase the amount of time these tutorials take.
|
|
|
One consequence of using multithreading that might be confusing is that the aligned reads might appear in your output SAM file in a different order than they were in the input FASTQ. This happens because small sets of reads get continuously packaged, "sent" to the different processors, and whichever set "returns" fastest is written first. You can force them to appear in the same order (at a slight cost in speed) by adding the --reorder
flag to your command, but is typically only necessary if the reads are already ordered or you intend to do some comparison between the input and output (something I have never done in my own work).
...