Table of Contents |
---|
Introduction to Samtools - manipulating and filtering bam files
As Nathan showed you yesterday, the main type of output from aligning reads to a databases is a binary alignment file, or BAM file. These files are compressed, so they can't be viewed using standard unix file viewers such as more, less and head. Samtools allows you to manipulate the .bam files - they can be converted into a non-binary format (SAM format specification here) and can also be ordered and sorted based on the quality of the alignment. This is a good way to remove low quality reads, or make a BAM file restricted to a single chromosome.
...
Code Block | ||||
---|---|---|---|---|
| ||||
ssh user@login8.stampede.tacc.utexas.edu cds mkdir samtools cd samtools cp /scratch/02423/nsabell/core_ngs/alignment/bam/yeast_pairedend.bam . |
Sorting and Indexing a bam file: samtools index, sort
Now that we have a BAM file, we need to index it. All BAM files need an index, as they tend to be large and the index allows us to perform computationally complex operations on these files without it taking days to complete.
...
Expand | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Use ls -lah to see what files you made and how large they are. This is what mine look like:
Note how small the index file is! |
Samtools flags and mapping rate: calculating the proportion of mapped reads in an aligned bam file
We have a sorted, indexed BAM file. Now we can use other samtools functionality to filter this file and count mapped vs unmapped reads in a given region. samtools allows you to sort based on certain flags that are specified on page 4 on the sam format specification. We'll focus on a couple, below.
...
Expand | |||||||
---|---|---|---|---|---|---|---|
| |||||||
I've put my output for each line in the comment area.
So the total proportion of reads that were unmapped on chromosome III is 4611/5070 or 90.9%, which is really high! Only 10% of reads on this chromosome were able to be mapped to the genome. |
Filtering bam files based on mapped status and mapping quality using samtools view
Mapping qualities are a measure of how likely a given sequence alignment to a location is correct. The lowest score is a mapping quality of zero, or mq0 for short. The reads map to multiple places on the genome, and we can't be sure of where the reads originated. To improve the quality of our data, we can remove these low quality reads from our sorted and indexed file.
...