Evaluating mapping results
Once you've received your NGS data, it is prudent to check/validate the data before investing time in analysis. Over time, you can develop quite extensive QC measures; they may be based on tools like the popular fastqc tool - it can report quality value distributions, look for erroneously abundant sequences, etc. Keep in mind that appropriate thresholds are dependent on the application.
Output files of mapping
Mappers spit out SAM/BAM files as output files. BAM is just a binary format of SAM. Here is a specification of SAM format SAM specification.
Commonly, SAM files are processed in this order:
1. SAM files are converted into BAM files
2. BAM files are sorted by reference coordinates
3. Sorted BAM files are indexed
Each step above can be done with commands below
samtools view -bS <sam file> > <bamfile> samtools sort <bamfile> <prefix of sorted bamfile> samtools index <sorted bamfile>
Basic mapping stats
samtools flagstat <bamfile>
prints out a host of useful stats about your mapping results, such as reads mapped. It can print a lot more information like % properly paired, # of duplicates, but it's simply relying on the information encoded in the second field of the SAM file - the bitwise flag field.
If you're using a mapper that doesn't generate a SAM file, stop what you are doing; go back and get a mapper that makes a SAM file.
Watch out for mapper-specific data
Not all mappers actually write data to that bitfield, or may not populate a field if you don't instruct the mapper right; for example, if you have paired-end data but tell the mapper to map both reads as single-end reads, you won't get pairing stats.
Not all mappers write the same information to a BAM file - most won't write unmapped reads unless you ask them to; some will list reads that map to multiple sites in the genome multiple times, others only once, others never.
A dirty secret is that these are often not published because they can raise questions that are uncomfortable or not easy to answer and often irrelevant to the subject of the publication. Because they are often not published, it can be difficult to use these numbers as QC statistics without some context. You should keep track of these as your experiments progress; they can be telltales of contamination (e.g. non-reference sequences), poor isolation (e.g. varying amounts of ribosomal RNA), or other library prep issues (e.g. adaptor dimers).
samtools idxstats <indexedbamfile>
reports on stats related to the chromosome-based indexing done by samtools index
. For each sequence of the reference, it provides:
- Sequence name (usually "chr1", etc.)
- BP in that sequence
- Reads mapping to that sequence
- Reads not mapping to that sequence
Additional SAMtools tricks
Extract/print sub alignments in BAM format.
If no region is specified in samtools view command, all the alignments will be printed; otherwise only alignments overlapping the specified regions will be output. A region can be presented, for example, in the following format: ‘chr2’ (the whole chr2), ‘chr2:1000000’ (region starting from 1,000,000bp) or ‘chr2:1,000,000-2,000,000’
Exercise 1
Count the number of mapped reads overlapping between chromosome3 123456 and chromosome3 124456
Extract/print mapped sub alignments in BAM format
As mentioned above, a bam/sam file includes or does not include unmapped reads depending on mappers or options on mappers. If you use bwa with default options, the output bam includes unmapped reads. In order to extract mapped reads from a bam file, use -F option in samtools view command. -F INT means "Skip alignments with bits present in INT". In other words, -F INT filters reads that have the INT in their flag. Please take a look at page 4 on SAM specification. 0X4 flag is for segment unmapped.
Exercise 2
Count the number of reads mapped on chromosome 3
Extract/print reversely mapped sub alignments in BAM format
If you have a strand-specific RNA-seq data, a mapping direction of a read is critical. For example, you may want to count only reversely-mapped reads on a (-)-direction gene. Directionality of mapped read is also recorded on flag.
Exercise 3
Count the number of reversely-mapped reads overlapping between chromosome 3: 123456 and chromosome 3: 124456
Hint: flag 0x10 = SEQ being reverse complemented
Instead of using samtools, you can do the same (or more sophisticated) task by using linux command tools such as sed and awk.
More advanced info from a sam file
Since the sam file is tab-delimited text, it's easy to use linux command line tools to tell you other useful information quickly. For example, if you'd like to get a quick histogram of the CIGAR scores in your sam file, you could do this:
head -100000 genomeMap.sam | awk 'BEGIN {FS="\t"} {print $6}' | sort | uniq -c | sort -n -r | head
Instead of the CIGAR score (field 6, "$6" to awk), a histogram field 9 might be more interesting to the person who made your library. If your mapper wrote to field 9 right.
Keep in mind you have to do this on an UNSORTED BAM file - where the reads are not sorted by genome location. If you try to look at at sorted BAM file, the "head" command will show you the first part of the first reference sequence rather than a random sample of your reads.
For sparse mapping data like ChIP-Seq or RNA-Seq data, you can quickly find the most heavily covered spot in your genome with something like this:
head -100000 genomeMap.sam | awk 'BEGIN {FS="\t"} {print $3"\t"$4}' | sort | uniq -c | sort -n -r | head
If you're looking for the most heavily covered 1 kb region, let awk
do the (crude) math for you:
head -100000 genomeMap.sam | awk 'BEGIN {FS="\t"} {print $3"\t"int($4/1000)}' | sort | uniq -c | sort -n -r | head
Finally, note that the SAM file specification says that all fields after field 11 are custom, in the format TAG:VTYPE:VALUE. This can be very confusing because multiple mappers may use the same two letter code for TAG but mean something different - know your mapper!
SAM files act as repositories
Although most mappers assume fastq input files, the SAM file concept is intended to be a working repository of sequences at any stage of analysis. It is general enough to hold the raw data from multiple different samples within one SAM file so that, for example, a Bayesian genotyping tool can formulate a stronger association with a putative alternate allele when it scans across an entire family rather than separately through individuals. This information is encoded in the RG field in the SAM file header and on each raw read.
After mapping, other programs can be used to correct for false positive SNP calls near indels, or adjust base quality values by assessing actual sequencing errors, etc. and then store that data back in the BAM file.
More Exercises
Explore the sam files created by various mappers:
- Compare similarities and differences in mapping stats.
- Parse a SAM file to count the number of unique start sites in your mapping data; again, compare mappers.
- Look at the different optional fields (field 12) produced by different mappers
- If time permits, try different options on some mappers and use
samtools flagstat
orawk
to look at the differences.