Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 5.3

Overview

Once you've mapped your NGS data, it is prudent to check/validate the data before investing time in analysis. Over time, you can develop quite extensive QC measures; they may be based on tools like the popular fastqc and samstat tools - they can report quality value distributions, look for erroneously abundant sequences, etc. Keep in mind that appropriate thresholds are dependent on the application.

Learning Objectives

In this tutorial, you will:

  • Learn about SAM/BAM format and how to index the output from mapping for further analysis.
  • Extract information about how reads were mapped from a SAM/BAM file.

Theory

SAM/BAM files act as repositories

...

Although . Although most mappers assume FASTQ input files and output SAM files, the SAM file concept is intended to be a working repository/database of sequences that can be used at any stage of analysis. It is general enough to hold the alignments from multiple different samples within one BAM file so that, for example, a Bayesian genotyping tool can formulate a stronger association with a putative alternate allele when it scans across an entire family rather than separately through individuals. This information is encoded in the RG field in the SAM file header and on each raw read.

After mapping, other programs can be used to correct for false positive SNP calls near indels, or adjust base quality values by assessing actual sequencing errors, etc. and then store that data back in the BAM file.

Table of Contents

Table of Contents

Warning

When following along here, please start an idev session for running any example commands:

Code Block
idev -m 60 -q development

Output files of mapping

Mappers spit out SAM/BAM files as output files. BAM is just a binary format of SAM.

Converting SAM files to BAM files

Here is a specification of SAM format SAM specification.

...

Code Block
titleCopy a exercise SAM file to your SCRATCH directory
cds
mkdir samtools_exercise
cd samtools_exercise
cp /corral-repl/utexas/BioITeam$BI/web/yeast_stuff/yeast_chip.sam .

...

Expand
titlesolution
Code Block
samtools view -bS yeast_chip.sam > yeast_chip.bam
samtools sort yeast_chip.bam yeast_chip_sort
samtools index yeast_chip_sort.bam

Mapping stats with samstat

You can quickly profile the alignments in a BAM file using the samstats command (which we previously used to evaluate raw FASTQ read files).

Running samstat on BAM files

Code Block
titleRunning samstat on BAM example
# setup
cds
mkdir samstat_test2
cd samstat_test2
cp $BI/web/yeast_stuff/yeast_chip_sort.bam .

# run the program
$BI/bin/samstat yeast_chip_sort.bam

...

Note that by default, samstat only considers mapped reads for BAM files, although this behavior can be changed by piping the subset of reads you want analyzed to samstat from a samtools view command.

Mapping stats with SAMtools

Code Block
samtools flagstat <bamfile>

...

Expand
titlesolution
Code Block
samtools flagstat yeast_chip.bam

Watch out for mapper-specific data

Not all mappers actually write data to that bitfield, or may not populate a field if you don't instruct the mapper right; for example, if you have paired-end data but tell the mapper to map both reads as single-end reads, you won't get pairing stats.

...

Expand
titlesolution
Code Block
samtools idxstats yeast_chip_sort.bam

Additional SAMtools tricks

Extract/print sub alignments in BAM format.

If no region is specified in samtools view command, all the alignments will be printed; otherwise only alignments overlapping the specified regions will be output. A region can be presented, for example, in the following format: ‘chr2’ (the whole chr2), ‘chr2:1000000’ (region starting from 1,000,000bp) or ‘chr2:1,000,000-2,000,000’

...

Expand
titlesolution
Code Block
samtools view <bamfile> chrIII:123456-124456 | wc -l

Extract/print mapped sub alignments in BAM format

As mentioned above, a bam/sam file includes or does not include unmapped reads depending on mappers or options on mappers. If you use bwa with default options, the output bam includes unmapped reads. In order to extract mapped reads from a bam file, use -F option in samtools view command. -F INT means "Skip alignments with bits present in INT". In other words, -F INT filters reads that have the INT in their flag. Please take a look at page 4 on SAM specification. 0X4 flag is for segment unmapped.

...

Expand
titlesolution
Code Block
samtools view -F 0X04 <bamfile> chrIII | wc -l

Extract/print reversely mapped sub alignments in BAM format

If you have a strand-specific RNA-seq data, a mapping direction of a read is critical. For example, you may want to count only reversely-mapped reads on a (-)-direction gene. Directionality of mapped read is also recorded on flag.

...

Instead of using samtools, you can do the same (or more sophisticated) task by using linux command tools such as sed and awk.

More advanced info from a SAM file

Since the sam file is tab-delimited text, it's easy to use linux command line tools to tell you other useful information quickly. For example, if you'd like to get a quick histogram of the CIGAR scores in your sam file, you could do this:

...

Finally, note that the SAM file specification says that all fields after field 11 are custom, in the format TAG:VTYPE:VALUE. This can be very confusing because multiple mappers may use the same two letter code for TAG but mean something different - know your mapper!

More Exercises

Explore the SAM files created by various mappers:

...