...
Expand | |||||||
---|---|---|---|---|---|---|---|
| |||||||
Reports these alignment statistics:
Interestingly, the local alignment rate here is lower than we saw with the gloabl alignment. |
Exercise #5: BWA-MEM - Human mRNA-seq
After bowtie2 came out with a local alignment option, it wasn't long before bwa developed its own local alignment algorithm called BWA-MEM (for Maximal Exact Matches), implemented by the bwa mem command. bwa mem has the following advantages:
- It incorporates a lot of the simplicity of using bwa with the complexities of local alignment, enabling straightforward alignment of datasets like the mirbase data we just examined
- It can align different portions of a read to different locations on the genome
- In a total RNA-seq experiment, reads will (at some frequency) span a splice junction themselves
- or a pair of reads in a paired-end library will fall on either side of a splice junction.
- We want to be able to align these splice-adjacent reads for many reasons, from accurate transcript quantification to novel fusion transcript discovery.
- In a total RNA-seq experiment, reads will (at some frequency) span a splice junction themselves
Thus, our last exercise will be the alignment of a human total RNA-seq dataset composed (by design) almost exclusively of reads that cross splice junctions.
bwa mem was is made available by loading the bwa BioContainers module
we loaded the bwa module, so take a look at its usage information. The most important parameters, similar to those we've manipulated in the past two sections, are the following:
...
There are many more parameters to control the scoring scheme and other details, but these are the most essential ones to use to get anything of value at all.
The test file we will be working with is just the R1 file from a paired-end total RNA-seq experiment, meaning it is (for our purposes) single-end. Go ahead and take a look at it, and find out how many reads are in the file.
Expand | |||||
---|---|---|---|---|---|
| |||||
|
A word about real splice-aware aligners
Using BWA mem for RNA-seq alignment is sort of a "poor man's" RNA-seq alignment method. Real splice-aware aligners like tophat2 or star have more complex algorithms (as shown below) – and take a lot more time!
RNA-seq alignment with bwa aln
Now, try aligning it with bwa aln like we did in Example #1, but first link to the hg19 bwa index directory. In this case, due to the size of the hg19 index, we are linking to Anna's scratch area INSTEAD of our own work area containing indexes that we built ourselves.
Code Block | ||||
---|---|---|---|---|
| ||||
cd $SCRATCH/core_ngs/alignment
ln -s -f /scratch/01063/abattenh/ref_genome/bwa/bwtsw/hg19
ls hg19 |
You should see a set of files analogous to the yeast files we created earlier, except that their universal prefix is hg19.fa.
Go ahead and try to do a single-end alignment of the file to the human genome using bwa aln like we did in Exercise #1, saving intermediate files with the prefix human_rnaseq_bwa. Go ahead and just execute on the command line.
Code Block | ||||
---|---|---|---|---|
| ||||
bwa aln hg19/hg19.fa fastq/human_rnaseq.fastq.gz > human_rnaseq_bwa.sai
bwa samse hg19/hg19.fa human_rnaseq_bwa.sai fastq/human_rnaseq.fastq.gz > human_rnaseq_bwa.sam |
Once this is complete use less to take a look at the contents of the SAM file, using the space bar to leaf through them. You'll notice a lot of alignments look basically like this:
Code Block |
---|
HWI-ST1097:228:C21WMACXX:8:1316:10989:88190 4 * 0 0 * * 0 0
AAATTGCTTCCTGTCCTCATCCTTCCTGTCAGCCATCTTCCTTCGTTTGATCTCAGGGAAGTTCAGGTCTTCCAGCCGCTCTTTGCCACTGATCTCCAGCT
CCCFFFFFHHHHHIJJJJIJJJJIJJJJHJJJJJJJJJJJJJJIIIJJJIGHHIJIJIJIJHBHIJJIIHIEGHIIHGFFDDEEEDDCDDD@CDEDDDCDD |
Notice that the contig name (field 3) is just an asterisk ( * ) and the alignment flags value is a 4 (field 2), meaning the read did not align (decimal 4 = hex 0x4 = read did not map).
Essentially, nothing (with a few exceptions) aligned. Why?
Expand | ||
---|---|---|
| ||
Because this file was generated exclusively from reads in a larger dataset that cross at least one splice junction. The sequences as they exists in most of the reads do not correspond to a single location in the genome. However subsections of each read do exist somewhere in the genome. So, we need an aligner that is capable aligning different parts of the read to different genomic loci. |
RNA-seq alignment with bwa mem
Exercise: use bwa mem to align the same data
Based on the following syntax and the above reference path, use bwa mem to align the same file, saving output files with the prefix human_rnaseq_mem. Go ahead and just execute on the command line.
Code Block |
---|
bwa mem <ref.fa> <reads.fq> > outfile.sam |
Expand | |||||
---|---|---|---|---|---|
| |||||
|
Check the length of the SAM file you generated with wc -l. Since there is one alignment per line, there must be 586266 alignments (minus no more than 100 header lines), which is more than the number of sequences in the FASTQ file. This is bwa mem can report multiple alignment records for the same read, hopefully on either side of a splice junction. These alignments can still be tied together because they have the same read ID.
Expand | |||||
---|---|---|---|---|---|
| |||||
To get an idea of how often each read aligned, and what the 'real' alignment rate is, use the following commands:
This alignment rate is pretty good, but it could get better by playing around with the finer details of bwa mem. |
Tip |
---|
Be aware that some downstream tools (for example the Picard suite, often used before SNP calling) do not like it when a read name appears more than once in the SAM file. To mark the extra alignment records as secondary, specify the bwa mem -M option. This option leaves the best (longest) alignment for a read as -is but marks additional alignments for the read as secondary (the 0x100 BAM flag). This designation also allows you to easily filter the secondary reads with samtools if desired. |
BWA-MEM vs Tophat
Another approach to aligning long RNA-seq data is to use an aligner that is more explicitly concerned with sensitivity to splice sites, namely a program like Tophat. Tophat uses either bowtie (tophat) or bowtie2 (tophat2) as the actual aligner, but performs the following steps:
- aligns reads to the genome
- reads that do not align to the genome are aligned against a transcriptome, if provided
- if they align, the transcriptome coordinates are converted back to genomic coordinates, with gaps represented in the CIGAR string, for example as 196N
- reads that do not align to the transcriptome are split into smaller pieces, each of which Tophat attempts to map to the genome
Note that Tophat also reports secondary alignments, but they have a different meaning. Tophat always reports spliced alignments as one alignment records with the N CIGAR string operator indicating the gaps. Secondary alignments for Tophat (marked with the 0x100 BAM flag) represent alternate places in the genome where a read (spliced or not) may have mapped.
As you can imagine from this series of steps, Tophat is very computationally intensive and takes much longer than bwa mem – very large alignments (hundreds of millions of reads) may not complete in stampede's 48 hour maximum job time!