...
Expand | |||||||
---|---|---|---|---|---|---|---|
| |||||||
Reports these alignment statistics:
Interestingly, the local alignment rate here is lower than we saw with the global alignment. |
Exercise #5: BWA-MEM - Human mRNA-seq
After bowtie2 came out with a local alignment option, it wasn't long before bwa developed its own local alignment algorithm called BWA-MEM (for Maximal Exact Matches), implemented by the bwa mem command. bwa mem has the following advantages:
- It combines much of the simplicity of using bwa with the flexibility of local alignment, enabling straightforward alignment of datasets like the mirbase data we just examined
- It can align different portions of a read to different locations on the genome
  - In a total RNA-seq experiment, some reads will span a splice junction themselves,
  - or the two reads of a pair in a paired-end library will fall on either side of a splice junction.
  - We want to be able to align these splice-adjacent reads for many reasons, from accurate transcript quantification to novel fusion transcript discovery.
This exercise will align a human total RNA-seq dataset composed (by design) almost exclusively of reads that cross splice junctions.
A word about real splice-aware aligners
Using BWA mem for RNA-seq alignment is sort of a "poor man's" RNA-seq alignment method. Real splice-aware aligners like tophat2, hisat2 or STAR have more complex algorithms (as shown below) – and take a lot more time!
In the transcriptome-aware alignment above, reads that span splice junctions are reported in the SAM file with genomic coordinates that start in the first exon and end in the second exon (the CIGAR string uses the N operator, e.g. 30M1000N60M).
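To make the CIGAR example concrete, here is a minimal sketch that computes how many reference bases an alignment spans. The helper name cigar_ref_span is hypothetical (it is not part of samtools or bwa); it simply sums the lengths of the operations that consume reference bases (M, D, N, = and X).

```shell
# Sum the lengths of CIGAR operations that consume reference bases (M, D, N, =, X).
# cigar_ref_span is a hypothetical helper written for this illustration.
cigar_ref_span() {
    echo "$1" | grep -oE '[0-9]+[MIDNSHP=X]' | awk '
        /[MDN=X]$/ { span += substr($0, 1, length($0) - 1) }
        END { print span + 0 }'
}

# 30 bases in the first exon + a 1000-base intron skip + 60 bases in the second exon
cigar_ref_span 30M1000N60M   # prints 1090
```

So a 90-base read with this CIGAR string covers 1090 bases of the genome, because the N operation accounts for the skipped intron.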
BWA MEM does not know about the exon structure of the genome. But it can align different sub-sections of a read to two different locations, producing two alignment records from one input read (one of the two will be marked as secondary with the 0x100 flag).
BWA MEM splits junction-spanning reads into two alignment records |
---|
Setup for BWA mem
First set up our working directory for this alignment. Since it takes a long time to build a bwa index for a large genome (here human hg38/GRCh38), we'll use one that the BioITeam maintains in its /work2/projects/BioITeam/ref_genome area.
Code Block | ||||
---|---|---|---|---|
| ||||
# Make sure you're in an idev session
idev -m 120 -p normal -A UT-2015-05-18 -N 1 -n 68
# Load the modules we'll need
module load biocontainers
module load bwa
module load samtools
# Copy over the FASTQ data if needed
mkdir -p $SCRATCH/core_ngs/alignment/fastq
cp $CORENGS/alignment/*.gz $SCRATCH/core_ngs/alignment/fastq/
# Make a new alignment directory for running these scripts
mkdir -p $SCRATCH/core_ngs/alignment/bwamem
cd $SCRATCH/core_ngs/alignment/bwamem
ln -sf ../fastq
ln -sf /work2/projects/BioITeam/ref_genome/bwa/bwtsw/hg38
|
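Before aligning, it can be worth confirming that the symbolic link to the reference index resolves. The sketch below assumes the index prefix is hg38/hg38.fa (the prefix used in the alignment command later in this exercise) and that the index consists of the five files bwa index normally writes (.amb, .ann, .bwt, .pac, .sa). The function name check_bwa_index is hypothetical.

```shell
# Report whether all five bwa index files exist for a given index prefix.
# check_bwa_index is a hypothetical helper, not part of bwa.
check_bwa_index() {
    local prefix=$1 ext missing=0
    for ext in amb ann bwt pac sa; do
        if [ ! -e "$prefix.$ext" ]; then
            echo "missing: $prefix.$ext" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Assumed prefix: hg38/hg38.fa, reached through the hg38 symbolic link made above
if check_bwa_index hg38/hg38.fa; then
    echo "bwa index looks complete"
else
    echo "bwa index is incomplete or the hg38 link is broken"
fi
```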
Now take a look at bwa mem usage (type bwa mem with no arguments). The most important parameters are the following:
Option | Effect |
---|---|
-k | Controls the minimum seed length (default = 19) |
-w | Controls the "gap bandwidth", or the length of a maximum gap. This is particularly relevant for MEM, since it can determine whether a read is split into two separate alignments or is reported as one long alignment with a long gap in the middle (default = 100) |
-M | For split reads, mark the shorter alignment(s) as secondary (0x100 flag) |
-r | Controls how long an alignment must be relative to its seed before it is re-seeded to try to find a best-fit local match (default = 1.5, i.e. the value of -k multiplied by 1.5) |
-c | Controls how many matches a MEM must have in the genome before it is discarded (default = 10000) |
-t | Controls the number of threads to use |
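As an illustration of how these options fit together, the sketch below spells out the defaults from the table explicitly and works out the re-seeding length implied by -k and -r. The thread count of 8 is an assumed example value, and the command is only assembled and printed, not run.

```shell
# Defaults taken from the option table; the thread count is an assumed example value.
seed_len=19        # -k
band_width=100     # -w
reseed_factor=1.5  # -r
threads=8          # -t (assumption: adjust to the cores available on your node)

# Re-seeding length implied by the table: the value of -k multiplied by -r.
reseed_len=$(awk -v k="$seed_len" -v r="$reseed_factor" 'BEGIN { print k * r }')
echo "re-seeding threshold: $reseed_len bp"

# Assemble the command without running it, so it can be inspected first.
bwa_cmd="bwa mem -M -t $threads -k $seed_len -w $band_width -r $reseed_factor hg38/hg38.fa fastq/human_rnaseq.fastq.gz"
echo "$bwa_cmd"
```

Writing the defaults out this way changes nothing about the alignment itself; it just makes it obvious which values to adjust when experimenting.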
RNA-seq alignment with bwa mem
Based on its help info, this is the structure of the bwa mem command we will use:
Code Block |
---|
bwa mem -M <ref.fa> <reads.fq> > outfile.sam |
After performing the setup above, execute the following command in your idev session:
Code Block | ||
---|---|---|
| ||
cd $SCRATCH/core_ngs/alignment/bwamem
bwa mem -M hg38/hg38.fa fastq/human_rnaseq.fastq.gz 2>hs_rna.bwamem.log |
samtools view -b | \
samtools sort -O BAM -T human_rnaseq.tmp > human_rnaseq.sort.bam |
This multi-pipe command performs three steps:
- The bwa mem alignment
  - the program's progress output (on standard error) is redirected to a log file (2>hs_rna.bwamem.log)
  - its alignment records (on standard output) are piped to the next step (conversion to BAM)
- Conversion of bwa mem's SAM output to BAM format
  - recall that the -b option to samtools view says to output in BAM format
- Sorting the BAM file
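The way standard error and standard output are separated in this pipeline can be tried without any alignment tools. In the sketch below, fake_aligner is a made-up stand-in for bwa mem, and the file names are placeholders in a temporary directory.

```shell
# Toy stand-in for an aligner: records go to stdout, progress goes to stderr.
fake_aligner() {
    echo "[progress] processed 2 reads" >&2
    printf 'read2\t16\tchr1\nread1\t0\tchr1\n'
}

demo_dir=$(mktemp -d)

# 2> applies only to fake_aligner, so only its standard output enters the pipe.
fake_aligner 2> "$demo_dir/demo.log" | sort -k1,1 > "$demo_dir/demo.sorted.txt"

cat "$demo_dir/demo.sorted.txt"   # the two records, sorted by name
cat "$demo_dir/demo.log"          # the progress message only
```

The same separation is why the BAM file in the real command contains only alignment records while all of bwa mem's progress messages end up in hs_rna.bwamem.log.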
Because the progress output is being redirected to a log file, we won't see anything until the command completes. Then you should have a human_rnaseq.sort.bam file and an hs_rna.bwamem.log logfile.
Exercise: Compare the number of original FASTQ reads to the number of alignment records.
Expand | ||
---|---|---|
| ||
Use the zcat | wc -l | awk idiom to count FASTQ reads. Use samtools flagstat to report alignment statistics. |
Expand | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Count the FASTQ file reads:
The file has 100,000 reads. Generate alignment statistics from the sorted BAM file:
Output will look like this:
There were 133,570 alignment records reported for the 100,000 input reads. Because bwa mem can split reads and report more than one alignment record for the same read, there are 33,570 secondary alignment records reported here (133,570 - 100,000). |
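The zcat | wc -l | awk idiom mentioned in the hint can be sketched as follows. The two-read demo file and the count_fastq_reads function name are made up for illustration; the commented commands at the end use the file names from this exercise and need the course data and samtools.

```shell
# Build a tiny two-read gzipped FASTQ so the idiom can be tried anywhere.
demo_fq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip -c > "$demo_fq"

# Each FASTQ record is 4 lines, so reads = lines / 4.
# count_fastq_reads is a hypothetical helper wrapping the idiom.
count_fastq_reads() {
    zcat < "$1" | wc -l | awk '{ print $1 / 4 }'
}

count_fastq_reads "$demo_fq"   # prints 2

# On the course data (requires the files and modules from the setup above):
#   count_fastq_reads fastq/human_rnaseq.fastq.gz
#   samtools flagstat human_rnaseq.sort.bam
```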
Tip |
---|
Be aware that some downstream tools (for example the Picard suite, often used before SNP calling) do not like it when a read name appears more than once in the SAM file. Such reads can be filtered, but only if they can be identified as secondary by specifying the bwa mem -M option as we did above. This option reports the longest alignment normally but marks additional alignments for the read as secondary (the 0x100 BAM flag). This designation also allows you to easily filter out the secondary alignments with samtools view -F 0x104 if desired (0x104 excludes both secondary alignments, 0x100, and unmapped reads, 0x4). |
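To see what -F 0x104 does, the sketch below applies the same bit test to a few example flag values. The passes_filter function is a hypothetical helper, and the output file name in the final commented command is a placeholder.

```shell
# 0x104 combines two flag bits: 0x100 (secondary) and 0x4 (unmapped).
printf '0x%x\n' $(( 0x100 | 0x4 ))   # prints 0x104

# Keep a record only if none of the excluded bits are set (the -F rule).
# passes_filter is a hypothetical helper written for this illustration.
passes_filter() {
    local flag=$1 exclude=$2
    [ $(( flag & exclude )) -eq 0 ]
}

# 0 = mapped forward, 16 = mapped reverse, 4 = unmapped,
# 256 = secondary, 272 = secondary on the reverse strand
for flag in 0 16 4 256 272; do
    if passes_filter "$flag" 0x104; then
        echo "flag $flag: kept"
    else
        echo "flag $flag: filtered out"
    fi
done

# On the real data (the output file name is a placeholder):
#   samtools view -b -F 0x104 human_rnaseq.sort.bam > human_rnaseq.primary.bam
```

Only the records with flags 0 and 16 pass, which is the behavior you want when a downstream tool expects at most one mapped record per read.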