Overview
Once you've mapped your NGS data, it is prudent to check/validate the data before investing time in analysis. Over time, you can develop quite extensive QC measures; they may be based on tools like the popular fastqc and samstat tools - they can report quality value distributions, look for erroneously abundant sequences, etc. Keep in mind that appropriate thresholds are dependent on the application.
Learning Objectives
In this tutorial, you will:
- Learn about SAM/BAM format and how to index the output from mapping for further analysis.
- Extract information about how reads were mapped from a SAM/BAM file.
Theory
SAM/BAM files act as repositories
...
Although . Although most mappers assume FASTQ input files and output SAM files, the SAM file concept is intended to be a working repository/database of sequences that can be used at any stage of analysis. It is general enough to hold the alignments from multiple different samples within one BAM file so that, for example, a Bayesian genotyping tool can formulate a stronger association with a putative alternate allele when it scans across an entire family rather than separately through individuals. This information is encoded in the RG field in the SAM file header and on each raw read.
After mapping, other programs can be used to correct for false positive SNP calls near indels, or adjust base quality values by assessing actual sequencing errors, etc. and then store that data back in the BAM file.
Table of Contents
Table of Contents |
---|
Warning | ||
---|---|---|
When following along here, please start an idev session for running any example commands:
|
Output files of mapping
Mappers spit out SAM/BAM files as output files. BAM is just a binary format of SAM.
Converting SAM files to BAM files
Here is a specification of SAM format SAM specification.
...
Code Block | ||
---|---|---|
| ||
cds mkdir samtools_exercise cd samtools_exercise cp /corral-repl/utexas/BioITeam$BI/web/yeast_stuff/yeast_chip.sam . |
...
Expand | ||
---|---|---|
| ||
|
Mapping stats with samstat
You can quickly profile the alignments in a BAM file using the samstats
command (which we previously used to evaluate raw FASTQ read files).
Running samstat on BAM files
Code Block | ||
---|---|---|
| ||
# setup cds mkdir samstat_test2 cd samstat_test2 cp $BI/web/yeast_stuff/yeast_chip_sort.bam . # run the program $BI/bin/samstat yeast_chip_sort.bam |
...
Note that by default, samstat only considers mapped reads for BAM files, although this behavior can be changed by piping the subset of reads you want analyzed to samstat from a samtools view
command.
Mapping stats with SAMtools
Code Block |
---|
samtools flagstat <bamfile> |
...
Expand | ||
---|---|---|
| ||
|
Watch out for mapper-specific data
Not all mappers actually write data to that bitfield, or may not populate a field if you don't instruct the mapper right; for example, if you have paired-end data but tell the mapper to map both reads as single-end reads, you won't get pairing stats.
...
Expand | ||
---|---|---|
| ||
|
Additional SAMtools tricks
Extract/print sub alignments in BAM format.
If no region is specified in samtools view command, all the alignments will be printed; otherwise only alignments overlapping the specified regions will be output. A region can be presented, for example, in the following format: ‘chr2’ (the whole chr2), ‘chr2:1000000’ (region starting from 1,000,000bp) or ‘chr2:1,000,000-2,000,000’
...
Expand | ||
---|---|---|
| ||
|
Extract/print mapped sub alignments in BAM format
As mentioned above, a bam/sam file includes or does not include unmapped reads depending on mappers or options on mappers. If you use bwa with default options, the output bam includes unmapped reads. In order to extract mapped reads from a bam file, use -F option in samtools view command. -F INT means "Skip alignments with bits present in INT". In other words, -F INT filters reads that have the INT in their flag. Please take a look at page 4 on SAM specification. 0X4 flag is for segment unmapped.
...
Expand | ||
---|---|---|
| ||
|
Extract/print reversely mapped sub alignments in BAM format
If you have a strand-specific RNA-seq data, a mapping direction of a read is critical. For example, you may want to count only reversely-mapped reads on a (-)-direction gene. Directionality of mapped read is also recorded on flag.
...
Instead of using samtools, you can do the same (or more sophisticated) task by using linux command tools such as sed and awk.
More advanced info from a SAM file
Since the sam file is tab-delimited text, it's easy to use linux command line tools to tell you other useful information quickly. For example, if you'd like to get a quick histogram of the CIGAR scores in your sam file, you could do this:
...
Finally, note that the SAM file specification says that all fields after field 11 are custom, in the format TAG:VTYPE:VALUE. This can be very confusing because multiple mappers may use the same two letter code for TAG but mean something different - know your mapper!
More Exercises
Explore the SAM files created by various mappers:
...