...
Mapping qualities are a measure of how likely a given sequence alignment to its reported location is correct. If a read's mapping quality is low (especially if it is zero, or mapQ 0 for short) the read maps to multiple locations on the genome (they are multi-hit or multi-mapping reads), and we can't be sure whether the reported location is the correct one.
...
- Alternate locations for a mapped read are are will be flagged as secondary (flag 0x100)
- While they often provide valuable information, secondary reads must be filtered for some downstream applications
- e.g., ChIP-seq peak calling and variant analysis with GATK.
- When secondary reads are reported, the total number of alignment records in the BAM file is greater than the number of reads in the input FASTQ files!
- this affects how the true mapping rate must be calculated
- true mapping rate = (pirmary mapped reads) / (total BAM file sequences - secondary mapped reads)
...
- bwa aln (global alignment) and bowtie2 with default parameters (both --local default end-to-end mode) report at most one location for a read that maps
- this will be the location with the best mapping quality and alignment
- if a given read maps equally well to multiple locations, these aligners pick one location at random
- bwa aln will always report a 0 mapping quality for these multi-hit reads
- bowtie2 will report a low mapping quality (< 10), based on the complexity (information content) of the sequence
- bwa mem (local alignment) can always report more than one location for a mapped read
- its definition of a secondary alignment is different (and a bit non-standard)
- if one part of a read maps to one location and another part maps somewhere else (e.g. because of RNA splicing), the longer alignment is marked as primary and the shorter as secondary.
- there is no way to disable reporting of secondary alignments with bwa mem.
- but they can be filtered from the sorted BAM with -F 0x100 (secondary alignment flag = 0).
- its definition of a secondary alignment is different (and a bit non-standard)
- bowtie2 can be configured to report more than one location for a mapped read
- the -k N option says to report up to N alignments for a read
- most transcriptome-aware RNA-seq aligners by default report more than one location for a mapped read
- e.g. hisat2, star, tophat2.
- when reads are quantified (counted with respect to genes), multiply-mapped reads can be counted fractionally
- e.g. if a read maps to 5 genes, it can be counted as 1/5 for each of the genes
Filtering for high-quality reads
Using our yeast_pe.sort.bam file, let's do some some quality filtering.
...
Expand | |||||||
---|---|---|---|---|---|---|---|
| |||||||
Note that we use the -b flag to tell samtools view to output BAM format (not the SAM format text we've been looking at).
|
...
Expand | |||||||
---|---|---|---|---|---|---|---|
| |||||||
There were 1,184,360 alignment records in the original BAM, and only 456,890 in the quality-filtered BAM, around 38% of our starting reads. Since we have only properly paired reads, the filtered BAM will contain equal numbers of both R1s and R2s. So the number of read pairs is 456890/2 or 228451. |
...