...
Code Block |
---|
language | bash |
---|
title | How to determine the total number of sequences in a fastq file |
---|
collapse | true |
---|
|
wc -l * # and then divide by 4 using the your knowledge of fastq files
# OR
grep ^@SRR030257 SRR030257_1.fastq | wc -l
# OR
grep --count ^@SRR030257 SRR030257_1.fastq
# OR
grep --count "^+$" SRR030257_1.fastq |
...
Bowtie2 is a complete rewrite of bowtie. It is currently the latest and greatest in the eyes of one very picky instructor professor (and his postdoc/gradstudent) in terms of configurability, sensitivity, and speed. After years of teaching bwa mapping along with bowtie2, last year was the first class to use only bowtie2 since we bowtie2 alone is now taught as I never recommend anyone use bwa, and based on positive feedback we continue with this set up. For some more details about how read mappers work see the bonus presentation, and if you find a compelling reason to use bwa (or any other read mapper) rather than bowtie2, weI'd love to hear from you.
...
Code Block |
---|
language | bash |
---|
title | Commands for making a directory and changing into it |
---|
collapse | true |
---|
|
mkdir bowtie2 |
Next, make sure the bowtie2 module is loaded (we use module spider
to get the current name, which may not be bowtie/2.2.6
if you re-run this tutorial):
...
First you need to convert the reference file from GenBank to FASTA using what you learned above. Name the new output file NC_012967.1.fasta
and put it in the same directory as NC_012967.1.gbk
.
Expand |
---|
title | If you're stuck and want a hint without the answer |
---|
| Remember in our earlier tutorial we discussed the use of lonestar's module
commands "spider" and "load" to install new functionality
Code Block |
---|
|
Use the bp_seqconvert.pl script Here are a few of the possibilities that will work Code Block |
---|
language | bash |
---|
title | click Click here for to get the answer without having to go back through the previous tutorialor to check your command |
---|
collapse | true |
---|
| module list bowtie
# This should return the following 2 lines:
# Currently Loaded Modules Matching: bowtie
# 1) bowtie/2.2.6
# If you do not see the above output, run these commands, and repeat this block
module spider bowtie
module load bowtie/2.2.6 |
|
Expand |
---|
title | OPTIONAL -- How to check what version of bowtie2 was loaded? |
---|
|
bp_seqconvert.pl --from genbank --to fasta < NC_012967.1.gbk > NC_012967.1.fasta
|
|
Next, we want to make sure the bowtie2 module is loaded (we use module spider
to get the current name, which may not be bowtie/2.3.4
if you re-run this tutorial later):
Expand |
---|
title | Click here for a hint without the answer |
---|
|
Remember in our earlier tutorial we discussed the use of lonestar's module commands "spider" and "load" to install new functionality and "list", "keyword", and "avail" to find different modules. |
Expand |
---|
title | If you're stuck and want a hint without the answer... |
---|
|
Use the bp_seqconvert.pl script Code Block |
---|
language | bash |
---|
title | Click here to get the answer or to check your command |
---|
collapse | true |
---|
| bp_seqconvert.pl --from genbank --to fasta < NC_012967.1.gbk > NC_012967.1.fasta
language | bash |
---|
title | In this case all of these methods will work, that may not be true of all programs |
---|
| bowtie2 --version
module list
which bowtie2 |
Note that which can be very useful for making sure you are running the executable that you think you are running, especially if you install your own programs. In particular make sure that the path matches up to what you expect. The most common situations arise from wanting to run a simplistically named script in your $HOME directory conflicting with something of the same name in the $BI directories or TACC modules. |
Now convert the reference file from GenBank to FASTA using what you learned above. Name the new output file NC_012967.1.fasta
and put it in the same directory as NC_012967.1.gbk
.
click here for the best answer | collapse | true |
---|
| module list bowtie |
|
Code Block |
---|
language | bash |
---|
title | Unexpected output |
---|
|
# Currently Loaded Modules Matching: bowtie
# None found.
# Inactive Modules Matching: bowtie
# 1) bowtie/2.3.4 |
Further, when we try to load bowtie/2.3.4 we get an error message.
Code Block |
---|
|
Lmod has detected the following error: These module(s) exist but
cannot be loaded as requested: "bowtie/2.3.4"
Try: "module spider bowtie/2.3.4" to see how to load the module(s). |
See if you can figure out how to load bowtie using the information above.
Code Block |
---|
language | bash |
---|
title | Click here for answer |
---|
collapse | true |
---|
|
module load intel/18.0.2
module load bowtie/2.3.4 |
Expand |
---|
title | You may be wondering how bowtie was inactivated. |
---|
|
When we loaded the bioperl module we first loaded the gcc compiler which unloaded several other modules (such as bowtie) which require the intel compiler to function. If you now try to load bioperl you'll see that it loads without requiring the gcc compiler. This was done deliberately to introduce you to another quirk of the module system. Hopefully the error messages were informative enough to help you work through how to get the modules to work. Despite these quirks, this is still far easier to deal with than issues that can arise when installing other programs on tacc or your personal computer. |
Expand |
---|
title | OPTIONAL -- How to check what version of bowtie2 was loaded? |
---|
|
Here are a few of the possibilities that will work. Code Block |
---|
language | bash |
---|
title | In this case all of these methods will work, that may not be true of all programs |
---|
| bowtie2 --version
module list
which bowtie2 |
Note that which can be very useful for making sure you are running the executable that you think you are running, especially if you install your own programs. In particular make sure that the path matches up to what you expect. The most common situations arise from wanting to run a simplistically named script in your $HOME directory conflicting with something of the same name in the $BI directories or TACC modules. |
For many read mappers, the first step is quite often indexing the reference file regardless of what mapping program is used. Put the output of this command into the bowtie
directory we created a minute ago. The command you need is:
...
Expand |
---|
title | If you're stuck click here for an explanation of what arguments the command does need |
---|
|
The command requires 2 arguments. The first argument is the reference FASTA. The second argument is the "base" file name to use for the created index files. It will create a bunch of files beginning bowtie/NC_012967.1*. Code Block |
---|
language | bash |
---|
title | Click here to check your work, or for the answer if needed |
---|
collapse | true |
---|
| bowtie2-build NC_012967.1.fasta bowtie2/NC_012967.1
|
Take a look at your output directory using ls bowtie2 to see what new files have appeared. These files are binary files, so looking at them with head or tail isn't instructive and can cause issues with your terminal. If you insist on looking at them (or accidentally do so before you read this) and your terminal begins behaving oddly, simply close it and log back into lonestar with a new terminal, and start a new idev session. |
Why do so many different mapping programs create an index as a first step you may be wondering?
...
Warning |
---|
|
This command can take a while (~5 minutes) and is extremely taxing. This is longer than we want to run a job on the head node (especially when all of us are doing it at once). In fact, in previous years, TACC has noticed the spike in usage when multiple students forgot to make sure they were on idev nodes and complained pretty forcefully to us about it. Let's not have this be one of those years. Use the showq -u command to make sure you are on an idev node. If you are not scroll up to the start of this tutorial or see yesterday's tutorial for more information. If you are not sure if you are on an idev node, raise your hand and we'll show(q) -u what you are looking for. Yes, at least one of your instructors like your instructor likes bad puns. Our My apologies. |
Code Block |
---|
language | bash |
---|
title | Solution |
---|
collapse | true |
---|
|
bowtie2 -t -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam # the -t command is not required for the mapping, but it can be particularly informative when you begin comparing different mappers |
...
- SAM files can be enormously humongous text files (potentially >1 gigabytes). Attempting to open the entire file at once can cause your computer to lock up or your text editor to crash. You are generally safer only looking at a portion at a time using linux commands like
head
or grep
or more or using a viewer like IGV, which we will cover in a later tutorial. - SAM files have some rather complicated information encoded as text, like a binary encoded FLAGS field and CIGAR strings. We'll take a look at some of these later, if we have time, or they are covered in the bonus presentation.
...
Expand |
---|
title | See if you can figure out how to re-run this using all 48 processors. Click here for a hint |
---|
|
You need to use the -p , for "processors" option. Since we had 48processors available to our job. Code Block |
---|
language | bash |
---|
title | click here to check your answer |
---|
collapse | true |
---|
| bowtie2 -t -p 48 -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam
|
Try it out and compare the speed of execution by looking at the log files. Expand |
---|
title | How much faster was it using all 48 processors? |
---|
| 1 processor took ~5 minutes, 48 processors took ~ 1 15 seconds minute. Can you think of any reasons why it was ~ 5x 16x faster rather than 48x faster? Expand |
---|
| Anytime you use multiprocessing correctly things will go faster, but even if a program can divide the input perfectly among all available processors, and combine the outputs back together perfectly, there is "overhead" in dividing things up and recombining them. These are the types of considerations you may have to make with your data: When is it better to give more processors to a single sample? How fast do I actually need the data to come back? |
|
|
...
...
The next steps are often to view the output using a specific viewer on your local machine, or to begin identifying variant locations where the reads differ from the reference sequence. These will be the next things we cover in the course. Here is a link to help you return to the GVA 2017 2019 course schedule.