Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Expand
titleIf you need a little help but don't want the answer yet, click the triangle...

Remember that to copy an entire folder requires the use of the recursive (-r) option.

Code Block
languagebash
titleStill stuck? click here for the correct code
collapsetrue
cds
mkdir bowtie2mapping
cp -r $BI/gva_course/mapping/data mappingbowtie2MappingTutorial
cd mapping
bowtie2MappingTutorial
ls

Useful commands

Often you will have general questions about your sequencing files that you want to answer before or after starting your actual analysis. Here we show you some very handy commands after a warning:

...

Warning
titleSubmit to the TACC queue or run in an idev shell

Create a commands file and use launcher_creator.py followed by qsub.

Put this in your commands file:
Expand
titleNot sure what to do...
click here for a hint before getting the answer

 

launcher_creator.py -h will give you insight to how to use that command.

Code Block
languagebash
titlecommands for the commands file if you can't work them out yourself
collapsetrue
bowtie2 -t -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam  # 
What does
the -t
option do?

Your final output file is in SAM format. It's just a text file, so you can peek at it and see what it's like inside. Two warnings though:

  1. SAM files can be enormously humongous text files (maybe >1 gigabytes). Attempting to open the entire file at once can cause your computer to lock up or your text editor to crash. You are generally safer only looking at a portion at a time using linux commands like head or grep or using a viewer like IGV, which we will cover later.
  2. SAM files have some rather complicated information encoded as text, like a binary encoded FLAGS field and CIGAR strings. We'll take a look at some of these later, if we have time.

Still, you should recognize some of the information on a line in a SAM file from the input FASTQ, and some of the other information is relatively straightforward to understand, like the position where the read mapped. Give this a try:

Code Block
head bowtie2/SRR030257.sam

What do you think the 4th and 8th columns mean?

More reading about SAM files

Multithreaded execution

We have actually massively under-utilized Lonestar in this example. We submitted a job that reserved a single node on the cluster, but that node has 12 processors. Bowtie was only using one of those processors (a single "thread")! For programs that support multithreaded execution (and most mappers do because they are obsessed with speed) we could have sped things up by using all 12 processors for the bowtie process.

...

It's -p, for "processors". Since we had 12 processors available to our job, the better bowtie alignment commands file would look like this.

Code Block
bowtie2 -t -p 12 -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam

Try it out and compare the speed of execution by looking at the log files.

If you want to launch many processes as part of one job, so that they are distributed one per node and use the maximum number of processors available, then you need to learn about the "wayness" of how you request nodes on Lonestar and possibly edit your *.sge script.

...

 command is not required for the mapping, but it can be particularly informative when you begin comparing different mappers
Code Block
languagebash
titleClick here for the specific launcher_creator.py commands
collapsetrue
launcher_creator.py -n "bowtie2" -t 00:10:00

 

 

 

 

Your final output file is in SAM format. It's just a text file, so you can peek at it and see what it's like inside. Two warnings though:

  1. SAM files can be enormously humongous text files (maybe >1 gigabytes). Attempting to open the entire file at once can cause your computer to lock up or your text editor to crash. You are generally safer only looking at a portion at a time using linux commands like head or grep or using a viewer like IGV, which we will cover later.
  2. SAM files have some rather complicated information encoded as text, like a binary encoded FLAGS field and CIGAR strings. We'll take a look at some of these later, if we have time.

Still, you should recognize some of the information on a line in a SAM file from the input FASTQ, and some of the other information is relatively straightforward to understand, like the position where the read mapped. Give this a try:

Code Block
head bowtie2/SRR030257.sam
Expand
titleWhat do you think the 4th and 8th columns mean(click for answer)?
If you thought the answer was the mapping coordinates of the read pairs you were right!

More reading about SAM files

Multithreaded execution

We have actually massively under-utilized Lonestar in this example. We submitted a job that reserved a single node on the cluster, but that node has 12 processors. Bowtie was only using one of those processors (a single "thread")! For programs that support multithreaded execution (and most mappers do because they are obsessed with speed) we could have sped things up by using all 12 processors for the bowtie process.

Expand
titleSee if you can figure out how to re-run this using all 12 processors. Click here for a hint

You need to use the -p, for "processors" option. Since we had 12 processors available to our job.

Code Block
languagebash
titleclick here to check your answer
collapsetrue
bowtie2 -t -p 12 -x bowtie2/NC_012967.1 -1 SRR030257_1.fastq -2 SRR030257_2.fastq -S bowtie2/SRR030257.sam

Try it out and compare the speed of execution by looking at the log files.

If you want to launch many processes as part of one job, so that they are distributed one per node and use the maximum number of processors available, then you need to learn about the "wayness" of how you request nodes on Lonestar and possibly edit your *.sge script.

One consequence of using multithreading that might be confusing is that the aligned reads might appear in your output SAM file in a different order than they were in the input FASTQ. This happens because small sets of reads get continuously packaged, "sent" to the different processors, and whichever set "returns" fastest is written first. You can force them to appear in the same order (at a slight cost in speed) by adding the --reorder flag to your command, but is typically only necessary if the reads are already ordered or you intend to do some comparison between the input and output.

 

Optional Exercises

  • In the bowtie2 example, we mapped in --local mode. Try mapping in --end-to-end mode (aka global mode).

  • Do the BWA tutorial so you can compare their outputs.
    • Did bowtie2 or BWA map more reads?
    • In our examples, we mapped in paired-end mode. Try to figure out how to map the reads in single-end mode and create this output.
    • Which aligner took less time to run? Are there any options you can change that:
      • Lead to a larger percentage of the reads being mapped? (increase sensitivity)
      • Speed up performance without causing many fewer reads to be mapped? (increase performance)

Next steps...

The next steps are often to view the output using a specific viewer on your local machine, or to begin identifying variant locations where the reads differ from the reference sequence. These will be the next things we cover in the course. Here is a link to help you return to the GVA 2015 course schedule.