Diploid genomes
The initial steps in calling variants for diploid or multi-ploid organisms with NGS data are the same as what we've already seen:
- Map the raw data to a suitable reference
- Look for SNPs and small-scale indels
Expectations of output are quite different however, which can add statistical power to uncovering variation in populations or organisms with more than two expected variants at the same location.
If you happen to be working with a model organism with extensive external data (ESPECIALLY HUMAN), then there are even more sophisticated tools like the Broad Institute's GATK that can improve both sensitivity and specificity of your variant calls.
Example: The CEU Trio from the 1000 Genomes Project
Many example datasets are available from the 1000 genomes project specifically for method evaluation and training. We'll explore a trio (mom, dad, child). Their accession numbers are NA12892, NA12891, and NA12878 respectively. To make the exercise run more quickly, we'll focus on data only from chromosome 20.
All the data we'll use is located here:
$BI/ngs_course/human_variation
This directory contains raw data (the .fastq files), mapped data (the .bam files) and variant calls (the .vcf files). It also contains the subdirectory ref
with special references.
The 1000 Genomes project is really oriented to producing .vcf files; the file "ceu20.vcf" contains all the latest genotypes from this trio based on abundant data from the project.
.bam files containing a subset of mapped human whole exome data are also available on these three; those are the three files "NA*.bam".
We've pre-run samtools and GATK on each sample individually - those are the *GATK.vcf and *samtools.vcf files.
We've also pre-run samtools and GATK on the trio, resulting in GATK.all.vcf and samtools.all.vcf.
We'll now show the steps for:
- Mapping
- Single-sample variant calling with samtools
- Multiple-sample variant calling with samtools
We'll return to this example data shortly to demonstrate a much more involved tool, GATK, to do the same steps.
Optional: Mapping Exercise
Let's use Anna's shell script align_bwa.sh
to align the fastq files to the hg19 version of the genome, and the python program launcher_creator.py
to create the job submission script.
Don't forget that for this to work, you need to have appended $BI/bin
to your path.
Move into your scratch directory and then try to figure out how to create and qsub
the align_bwa.sh
command to align the data file $BI/ngs_course/human_variation/allseqs_R1.fastq
against the hg19
reference. Call the output "test".
Note that the input is paired-end data.
Single-sample variant calling with samtools
We would normally use the BAM file from the previous mapping step to call variants in this raw data. However, for the purposes of this course we will use the actual BAM file provided by the 1000 Genomes Project (from which the .fastq
file above was derived, leading to some oddities in it).$BI/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam
With samtools, this is a two-step process:
samtools mpileup
command transposes the mapped data in a sorted BAM file fully to genome-centric coordinates. It starts at the first base on the first chromosome for which there is coverage and prints out one line per base. Each line has information on every base observed in the raw data at that base position along with a lot of auxiliary information depending on which flags are set. It calculates the Bayseian prior probability given by the data, but does not estimate a real genotype.bcftools
with a few options added uses the prior probability distribution and the data to calculate a genotype for the variants detected.
Here are the commands, piped together:
samtools mpileup -uf $BI/ref_genome/fasta/ucsc/ucsc.hg19.fasta \ $BI/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ | bcftools view -vcg - > test.raw.bcf
Note that the reference chosen for mpileup must be exactly the same as the reference used to create the bam file. The 1000 genomes project has created it's own reference and so the command listed above won't work - we have to use the 1000 genomes reference which is $BI/ngs_course/human_variation/ref/hs37d5.fa
. We could have chosen another mapper if we'd wanted to though.
Exercise:
Write a modified version of the command above to use the proper reference into a commands
file, create the job submission script with launcher_creator.py
and submit the job.
After your job completes
You can examine some of the variant calls with:
more test.raw.vcf
Note the extensive header information, followed by an ordered list of variants.
A bcftools
auxiliary perl script called vcfutils.pl
can provide some useful stats. It's safe to run this on the head node since it's quick. This is not part of a standard TACC module but it's in our common $BI/bin directory.
vcfutils.pl qstats test.raw.vcf
Here is an honest write-up about the Ts/Tv metric, referencing this 2002 paper by Ebersberger. Bottom line: between about 2.1 and 2.8 is OK for Ts/Tv (also called Ti/Tv).
You can also get some quick stats with some linux one-liners on this page; there are more thorough analysis programs built to work with vcf's.
Keep in mind that variant files only record variation that can be seen with the data provided. Where ever sample sequence exactly matches the reference (i.e. is homozygous wildtype relative to the reference) there will be no data. Which looks the same as if you had no data in those regions; this leads us to our next topic.
Multiple-sample variant calling with samtools
This is all fine, but if you're actually trying to study human (or other organism) genetics, you must discriminate homozygous WT from a lack of data. This is done by providing many samples to the variant caller simultaneously. This concept extends further to populations; calling variants across a large and diverse population provides a stronger Bayesian prior probability distribution for more sensitive detection.
To instruct samtools to call variants across many samples, you must simply give it mapped data with each sample tagged separately. Samtools allows two methods to do this:
By providing separate bam files for each sample, like this:
samtools multi-sample variants: separate bam filessamtools mpileup -uf hs37d5.fa \ NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ | bcftools view -vcg - > all.samtools.vcf
By providing one or more bam files, each containing mapped reads from multiple samples tagged with unique samtools
@RG
tags.samtools multi-sample variants: one or more bam files using @RGsamtools mpileup -uf hs37d5.fa all.bam | bcftools view -vcg - > all.raw.vcf
The output file from the first option is in $BI/ngs_course/human_variation
CAVEAT: If you intend to use the second option, you must make sure you tag your reads with the right RG tag; this can easily be done during the samse
or sampe
stage of mapping with bwa with the -r
option, using samtools merge
, with picard tools AddOrReplaceReadGroup command, or with your own perl/python/bash commands. Note that our align_bwa.sh
script takes care of this detail for us.
Other notes on bcftools
bcftools has many other sub-commands such as performing association tests and estimating allele frequency spectrums.
You can't get gene-context information from bcftools - you have to shift to variant annotation tools to do that.
Side-note on samtools mpileup
If you don't need a variant caller that uses a full Bayesian inference engine with robust filtering, protections, etc. but just want to look for weird base-artifacts within your mapped data, check out samtools mpileup
output directly.
GATK: A much more sophisticated option
http://gatkforums.broadinstitute.org/discussion/44/base-quality-score-recalibration-bqsr