SAM format and samtools
Output files of mapping
Mappers spit out SAM/BAM files as output files. BAM is just a binary format of SAM.
Here is a specification of SAM format SAM specification.
Commonly, SAM files are processed in this order:
- SAM files are converted into BAM files (samstools view)
- BAM files are sorted by reference coordinates (samtools sort)
- Sorted BAM files are indexed (samtools index)
Each step above can be done with commands below
samtools view -bS <samfile> > <bamfile> samtools sort <bamfile> <prefix of sorted bamfile> samtools index <sorted bamfile>
cds mkdir samtools_exercise cd samtools_exercise cp /corral-repl/utexas/BioITeam/core_ngs_tools/yeast_stuff/yeast_chip.sam .
Exercise
Generate "yeast_chip.bam", a sorted bam file named "yeast_chip_sort.bam", and an index file of the sorted bam file
Basic mapping stats
samtools flagstat <bamfile>
prints out a host of useful stats about your mapping results, such as reads mapped. It can print a lot more information like % properly paired, # of duplicates, but it's simply relying on the information encoded in the second field of the SAM file - the bitwise flag field.
If you're using a mapper that doesn't generate a SAM file, stop what you are doing; go back and get a mapper that makes a SAM file.
Exercise
Print out basic stats about yeast_chip.bam
samtools idxstats <indexed bam file>
reports on stats related to the chromosome-based indexing done by samtools index
. For each sequence of the reference, it provides:
- Sequence name (usually "chr1", etc.)
- BP in that sequence
- Reads mapping to that sequence
- Reads not mapping to that sequence
In miRNA alignment, reference sequence name (3rd field in a sam file) is name of miRNA. Therefore, idxstats
gives the number of mapped read on each miRNA.
Watch out for mapper-specific data
Not all mappers actually write data to that bitfield, or may not populate a field if you don't instruct the mapper right; for example, if you have paired-end data but tell the mapper to map both reads as single-end reads, you won't get pairing stats.
Not all mappers write the same information to a BAM file - most won't write unmapped reads unless you ask them to; some will list reads that map to multiple sites in the genome multiple times, others only once, others never.
A dirty secret is that these are often not published because they can raise questions that are uncomfortable or not easy to answer and often irrelevant to the subject of the publication. Because they are often not published, it can be difficult to use these numbers as QC statistics without some context. You should keep track of these as your experiments progress; they can be telltales of contamination (e.g. non-reference sequences), poor isolation (e.g. varying amounts of ribosomal RNA), or other library prep issues (e.g. adaptor dimers).
Additional SAMtools tricks
Extract/print sub alignments in BAM format.
If no region is specified in samtools view command, all the alignments will be printed; otherwise only alignments overlapping the specified regions will be output. A region can be presented, for example, in the following format: ‘chr2’ (the whole chr2), ‘chr2:1000000’ (region starting from 1,000,000bp) or ‘chr2:1,000,000-2,000,000’
Exercise
Count the number of mapped reads overlapping between chromosome III 123456 and chromosome III 124456
Extract/print mapped sub alignments in BAM format
As mentioned above, a bam/sam file includes or does not include unmapped reads depending on mappers or options on mappers. If you use bwa with default options, the output bam includes unmapped reads. In order to extract mapped reads from a bam file, use -F option in samtools view command. -F INT means "Skip alignments with bits present in INT". In other words, -F INT filters reads that have the INT in their flag. Please take a look at page 4 on SAM specification. 0X4 flag is for segment unmapped.
Exercise
Count the number of reads mapped on chromosome III
Extract/print reversely mapped sub alignments in BAM format
If you have a strand-specific RNA-seq data, a mapping direction of a read is critical. For example, you may want to count only reversely-mapped reads on a (-)-direction gene. Directionality of mapped read is also recorded on flag.
Exercise 3
Count the number of reversely-mapped reads overlapping between chromosome III: 123456 and chromosome III: 124456
Hint: flag 0x10 = SEQ being reverse complemented
Instead of using samtools, you can do the same (or more sophisticated) task by using linux command tools such as sed and awk.
Exercise 4
Count the total number of mappped reads in 'yeast_chip.sam' (There are at least three ways)
Before moving on the next step, what on earth is awk? awk is a program language that processes line by line in order to manipulate or extract text. There are a lot of built-in functions in awk, and you can use for-loop, conditional, array, and so on in awk to do complicated tasks. However, learning just three basic concepts can make your analysis a lot easier: i) FS (Field Separator), OFS (Output Field Separator) ii) $integer (field number) iii) arithmetic
Guess the output of a following command:
cat yeast_chip.sam | grep -v '^@' | awk 'BEGIN{OFS=":"} {print $1,$2+10000,$3}' | head
More advanced info from a sam file
Since the sam file is tab-delimited text, it's easy to use linux command line tools to tell you other useful information quickly. For example, if you'd like to get a quick histogram of the CIGAR scores in your sam file, you could do this:
head -100000 yeast_chip.sam | awk 'BEGIN {FS="\t"} {print $6}' | sort | uniq -c | sort -n -r | head
Instead of the CIGAR score (field 6, "$6" to awk), a histogram field 9 might be more interesting to the person who made your library. If your mapper wrote to field 9 right.
Keep in mind you have to do this on an UNSORTED BAM file - where the reads are not sorted by genome location. If you try to look at at sorted BAM file, the "head" command will show you the first part of the first reference sequence rather than a random sample of your reads.
For sparse mapping data like ChIP-Seq or RNA-Seq data, you can quickly find the most heavily covered spot in your genome with something like this:
head -100000 yeast_chip.sam | awk 'BEGIN {FS="\t"} {print $3"\t"$4}' | sort | uniq -c | sort -n -r | head
If you're looking for the most heavily covered 1 kb region, let awk
do the (crude) math for you:
head -100000 yeast_chip.sam | awk 'BEGIN {FS="\t"} {print $3"\t"int($4/1000)}' | sort | uniq -c | sort -n -r | head
Finally, note that the SAM file specification says that all fields after field 11 are custom, in the format TAG:VTYPE:VALUE. This can be very confusing because multiple mappers may use the same two letter code for TAG but mean something different - know your mapper!
SAM files act as repositories
Although most mappers assume fastq input files, the SAM file concept is intended to be a working repository of sequences at any stage of analysis. It is general enough to hold the raw data from multiple different samples within one SAM file so that, for example, a Bayesian genotyping tool can formulate a stronger association with a putative alternate allele when it scans across an entire family rather than separately through individuals. This information is encoded in the RG field in the SAM file header and on each raw read.
After mapping, other programs can be used to correct for false positive SNP calls near indels, or adjust base quality values by assessing actual sequencing errors, etc. and then store that data back in the BAM file.
More Exercises
Explore the sam files created by various mappers:
- Compare similarities and differences in mapping stats.
- Parse a SAM file to count the number of unique start sites in your mapping data; again, compare mappers.
- Look at the different optional fields (field 12) produced by different mappers
- If time permits, try different options on some mappers and use
samtools flagstat
orawk
to look at the differences.
Welcome to the University Wiki Service! Please use your IID (yourEID@eid.utexas.edu) when prompted for your email address during login or click here to enter your EID. If you are experiencing any issues loading content on pages, please try these steps to clear your browser cache.