Read mapping with BWA and BOWTIE
Before We Start
In order to save a lot of typing, and to allow us some flexibility in designing these courses, we will establish a UNIX shell variable BASE to point to the current filesystem location of the TACC NGS tutorial material. For any shell you open that accesses Lonestar during today's tutorial, please enter the following command:
export BASE=/corral-repl/utexas/BioITeam/tacc_ngs
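Every path used later in this tutorial depends on BASE, so it can help to confirm that the variable is set in each new shell. The following is a minimal sketch; the function name check_base is hypothetical and not part of the tutorial material.

```shell
# check_base: hypothetical helper (not part of the tutorial files).
# Succeeds only if BASE is set and names an existing directory.
check_base() {
    if [ -z "${BASE:-}" ]; then
        echo "BASE is not set; run the export command first" >&2
        return 1
    fi
    if [ ! -d "$BASE" ]; then
        echo "BASE=$BASE is not a directory visible from this host" >&2
        return 1
    fi
    echo "BASE OK: $BASE"
}
```

Paste the function into your shell and run check_base after the export command; a non-zero exit status means the variable still needs attention.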
Read Mapping Tutorial
Your objective today is to map the paired-end sequence data in the files below to a reference sequence, then convert the results to a sorted, indexed BAM file. You can use BWA or Bowtie to do the alignment, and SAMtools handles the other tasks. In the latter half of today's work, you will use these result files to call some variants in the human genome.
$BASE/human_variation/allseqs_R1.fastq
$BASE/human_variation/allseqs_R2.fastq
$BASE/human_variation/ref/hs37d5.fa
BWA and Bowtie (as well as most other throughput-oriented aligners) require that you provide not just the reference sequence, but an index of that sequence as well. These can be generated by bwa index or bowtie-build, respectively. To save time, we have provided indices formatted for BWA and Bowtie.
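For reference, building the indices yourself might look like the sketch below. The names run, build_indices and DRY_RUN are hypothetical, and the choice of Bowtie output prefix (the FASTA name without .fa) is an assumption rather than the convention used for the provided indices. The -a bwtsw option selects the BWA indexing algorithm intended for large genomes such as human. BWA writes its index files next to the FASTA file, so the sketch assumes a copy of the reference in a directory you can write to.

```shell
# run: hypothetical wrapper. Prints the command when DRY_RUN=1, otherwise executes it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# build_indices: hypothetical helper that indexes one FASTA file for both aligners.
# $1 = path to a reference FASTA in a directory you can write to
build_indices() {
    # BWA index; bwtsw is the algorithm meant for large genomes
    run bwa index -a bwtsw "$1" || return 1
    # Bowtie index; ${1%.fa} strips the .fa suffix to form the output prefix
    run bowtie-build "$1" "${1%.fa}"
}
```

With DRY_RUN=1 set, build_indices only prints the two commands, which is a cheap way to check them before spending hours of compute time on a human-sized reference.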
Look in $BASE/human_variation/ref to see the additional index files alongside hs37d5.fa
See if you can find other modules on Lonestar that pertain to alignment
Submit some alignment jobs to Lonestar
First, let's create a working directory:
mkdir $WORK/bwa-align && cd $WORK/bwa-align
Now, we will demonstrate how to submit a BWA alignment job to Lonestar. It's up to you to figure out how to do the same for Bowtie. Using a text editor, create the following file. Feel free to copy and paste it from the Wiki or copy it from $BASE/align_bwa_01.sh
#!/bin/bash
#$ -V
#$ -cwd
#$ -pe 1way 12
#$ -q normal
#$ -l h_rt=01:00:00
#$ -A 20121008-NGS-ACES
#$ -m be
#$ -M vaughn@tacc.utexas.edu
#$ -N align_bwa_01

module load bwa/0.6.1

BASE="/corral-repl/utexas/BioITeam/tacc_ngs"

time bwa aln $BASE/human_variation/ref/hs37d5.fa $BASE/human_variation/allseqs_R1.fastq > r1.sai
time bwa aln $BASE/human_variation/ref/hs37d5.fa $BASE/human_variation/allseqs_R2.fastq > r2.sai
time bwa sampe $BASE/human_variation/ref/hs37d5.fa r1.sai r2.sai $BASE/human_variation/allseqs_R1.fastq $BASE/human_variation/allseqs_R2.fastq > hs37d5_allseqs_bwa.sam
Once you've edited the job file, submit it to Lonestar's queue, then check the status of the job to make sure it went in.
Submit your job file using qsub
Check the status of your job in the queue
Now, figure out how to align the same sequences using Bowtie
You're now running two computing jobs out of the same directory. This will work, but you must make sure that no file created by one task has the same name as a file created by the other; otherwise one or both jobs may fail or report erroneous results. In our case, we've ensured this by using two different aligners (BWA and Bowtie) and by choosing different output names: hs37d5_allseqs_bwa.sam and hs37d5_allseqs_bowtie.sam
Setting up BAM conversion as a dependency
We're going to use SAMtools to convert the results of your alignment jobs from the text version of the SAM format to the more efficient and useful binary BAM format. Along the way we'll sort the reads by chromosome position and create an index of the BAM file. "But wait!", you say, "our alignments haven't run yet!" We're going to kill two birds with one stone: we will show you the basics of setting up SAM -> sorted BAM conversion, and we will also demonstrate how to orchestrate a simple workflow using job dependencies on Lonestar. This means the downstream conversion tasks won't kick off until the alignment tasks have completed successfully.
First, let's assemble the job file we need to convert the BWA output file hs37d5_allseqs_bwa.sam from SAM to BAM. Using a text editor, create the following file. Feel free to copy/paste from the Wiki or copy it from /corral-repl/utexas/BioITeam/tacc_ngs/samtools_bwa_01.sh .
Don't submit samtools_bwa_01.sh to the Lonestar queue until you've been shown how to establish a job dependency, or the task will fail.
#!/bin/bash
#$ -V
#$ -cwd
#$ -pe 1way 12
#$ -q normal
#$ -l h_rt=02:00:00
#$ -A 20121008-NGS-ACES
#$ -m be
#$ -M vaughn@tacc.utexas.edu
#$ -N samtools_bwa_01

# Load the samtools module
module load samtools

# Set up a variable so we don't have to
# keep typing in all this path
BASE="/corral-repl/utexas/BioITeam/tacc_ngs"

# For readability, these are listed one command per line. If you
# want to chain them all together into a set of linearly dependent
# commands, check out the samtools_bwa_02.sh file for an example
# of using the && operator to do just that
#
# samtools has a suite of sub-commands documented here
# http://samtools.sourceforge.net/samtools.shtml
#
# 'view' is used for filtering as well as simple conversion
samtools view -S -b hs37d5_allseqs_bwa.sam > hs37d5_allseqs_bwa.bam
# 'sort' lets you sort by position or read name
samtools sort hs37d5_allseqs_bwa.bam hs37d5_allseqs_bwa.sorted
# 'index' lets you create an index of a BAM file
samtools index hs37d5_allseqs_bwa.sorted.bam
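The script comments mention chaining the three steps with the && operator. The contents of samtools_bwa_02.sh are not shown here, so the following is only a sketch of the idea, wrapped in a hypothetical function named sam_to_sorted_bam. With &&, each command runs only if the previous one exited successfully, so a failed conversion does not go on to sort or index a bad file. The sort syntax matches the usage in the script above; newer SAMtools releases expect samtools sort -o out.bam in.bam instead.

```shell
# sam_to_sorted_bam: hypothetical helper, not the contents of samtools_bwa_02.sh.
# $1 = file prefix, e.g. hs37d5_allseqs_bwa (expects $1.sam in the current directory)
sam_to_sorted_bam() {
    # convert SAM text to BAM, then sort by position, then index the sorted BAM;
    # && stops the chain at the first command that fails
    samtools view -S -b "$1.sam" > "$1.bam" &&
    samtools sort "$1.bam" "$1.sorted" &&
    samtools index "$1.sorted.bam"
}
```

Inside a job script, after module load samtools, the call would be sam_to_sorted_bam hs37d5_allseqs_bwa.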
Setting up a job dependency
Check on the status of your BWA job (using qstat). It's probably still running. If it is, and you do an 'ls' on the output directory, the 'hs37d5_allseqs_bwa.sam' file will either be absent or still be in the process of being written. So, if we start running the SAMtools commands above on the assumption that the file exists and is complete, the conversion will fail or work from incomplete data. We want to tell Lonestar "Don't start running the samtools tasks until AFTER bwa has completed". To do this, you just need to know the job-ID of the BWA task.
Find out the job-ID of your BWA task
Establishing a job dependency is easy. Instead of just typing:
qsub samtools_bwa_01.sh
to tell Lonestar to start running SAMtools ASAP, you tell it this instead:
qsub -hold_jid 773416 samtools_bwa_01.sh
where you would replace 773416 with the job-ID of your BWA task.
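If you would rather not copy the job-ID by hand, the two submissions can be scripted. The sketch below is hypothetical (the function names job_id_from_qsub and submit_chain are not part of the tutorial material) and assumes that qsub prints the usual Grid Engine confirmation line of the form: Your job 773416 ("align_bwa_01") has been submitted. Any other lines that qsub prints are ignored.

```shell
# job_id_from_qsub: hypothetical helper. Reads qsub's output on stdin and
# prints the numeric job-ID found on the "Your job <id> ..." line.
job_id_from_qsub() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# submit_chain: hypothetical helper. Submits job file $1, then submits job
# file $2 with a hold on the job-ID of $1.
submit_chain() {
    first_id=$(qsub "$1" | job_id_from_qsub)
    if [ -z "$first_id" ]; then
        echo "could not read a job-ID from the qsub output" >&2
        return 1
    fi
    qsub -hold_jid "$first_id" "$2"
}
```

For a fresh pair of jobs, the call would be submit_chain align_bwa_01.sh samtools_bwa_01.sh. If the alignment job is already in the queue, use the qsub -hold_jid form above with the job-ID reported by qstat.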
Challenge Mode
- Create a script to process your Bowtie results using SAMtools
- Submit it as a dependency to your Bowtie alignment task