Annotating Variants
Annotating Variants: Introduction
As we've already seen, determining the presence or absence of a variant from NGS data is not trivial. It is software dependent and has inherent trade-offs between sensitivity and specificity. Inevitably, the number of putative variants in a real data set is very large; for example, the first samples from the 1000 Genomes project typically differed from the reference at about 0.1% of positions (roughly 3 million variants), approximately 10% of which had never been previously observed (about 300,000 novel variants per individual). False discovery rates are also typically very high at this stage.
Auxiliary data is often used to reduce the number of putative variants without compromising sensitivity. Examples of auxiliary data include other samples within a cohort, existing SNP databases, gene or other feature annotations, and sample-specific information such as pedigree:
- Comparing genotypes across a set of samples and defining one as "reference" (or "wild type") enables the other samples to be properly genotyped (i.e. 0/0 for hom. WT, 0/1 for het, 1/1 for hom. alt).
- Existing SNP databases such as dbSNP or the vcf files from the 1000 Genomes project may be used to reject "common" variants under the assumption that "common" means "non-disease causing".
- Gene annotations allow for codon analysis to determine whether mutations are synonymous or non-synonymous and, if non-synonymous, whether they are missense or nonsense (creating an early stop codon).
- Pedigree information is particularly effective in Mendelian autosomal recessive diseases; filtering for variants that are heterozygous in the parents and unaffected siblings but homozygous in the proband usually yields a very small set of candidate variants.
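The "common variant" filter from the list above can be sketched with standard tools. This is only a minimal sketch under stated assumptions, not how any particular annotator does it: the file names and all records are invented for illustration, and a variant counts as "known" only if chromosome, position, REF, and ALT all match.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical toy inputs, invented for illustration only.
workdir=$(mktemp -d)

# Known variants (e.g. extracted from a SNP database): CHROM POS REF ALT
printf '20\t1000\tA\tG\n20\t2000\tC\tT\n' > "$workdir/known.tsv"

# Sample calls in VCF column order:
# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
{
  printf '##fileformat=VCFv4.1\n'
  printf '20\t1000\t.\tA\tG\t50\t.\t.\tGT\t0/1\n'
  printf '20\t1500\t.\tG\tA\t60\t.\t.\tGT\t1/1\n'
  printf '20\t2000\t.\tC\tG\t40\t.\t.\tGT\t0/1\n'
} > "$workdir/sample.vcf"

# First file (NR==FNR): load known variants into a lookup table.
# Second file: skip header lines, print records absent from the table.
novel=$(awk 'BEGIN {FS="\t"}
  NR==FNR {known[$1 FS $2 FS $3 FS $4] = 1; next}
  /^#/    {next}
  {
    key = $1 FS $2 FS $4 FS $5
    if (!(key in known)) print $1 ":" $2 " " $4 ">" $5
  }
' "$workdir/known.tsv" "$workdir/sample.vcf")

echo "$novel"
rm -rf "$workdir"
```

In this toy data the call at position 2000 is kept because its ALT allele differs from the known one; whether that is the desired behaviour depends on how strictly "known" is defined.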
Variant annotation tools perform the function of combining the raw putative variant calls with auxiliary data to add meaning ("annotation") to the variants. In many cases, the variant detection tool itself will add certain elements of annotation, such as a definition of the variant, a genotype call, a measure of likelihood, a haplotype score, and other measures of the raw data useful to reduce false positives. In other cases, the annotator will only require a vcf file combined with other auxiliary data.
Because these tools draw in information from many disparate sources, they can be very difficult to install, configure, use, and maintain. For example, the vcf files from the 1000 Genomes project are arranged in a deep ftp tree by date of data generation. Large genome centers spend significant resources managing these tools. Our objective
Pre-packaged programs
Annovar - one of the simpler variant annotators available
Annovar is a variant annotator. Given a vcf file from an unknown sample and a host of existing data about genes, other known SNPs, gene variants, etc., Annovar will place the discovered variants in context.
Annovar comes pre-packaged with human auxiliary data which is updated by the authors on a regular basis. It is a well-constructed package in that there is one core program, annotate_variation.pl, which can perform a variety of different types of annotation and download the reference databases required.
The authors have also included a wrapper script, summarize_annovar.pl, which runs a fairly comprehensive set of annotations automatically.
This next exercise will give you some idea of how Annovar works; we've taken the liberty of writing the bash script annovar_pipe.sh around the existing summarize_annovar.pl wrapper (a wrapper within a wrapper - a common trick) to even further simplify the process for this course.
Exercise:
First, look at the code for our annovar_pipe.sh command. Here is an easy one-liner to cat the contents of a script (note ` is a back-tick, not an apostrophe):
cat `which annovar_pipe.sh`
This script simply does a format conversion and then calls summarize_annovar.pl. Now let's run it on all the vcf files - you could simply edit the commands file and type in the 6 lines, or you can use this fancier command line that calls Perl to custom-create the 6 command lines needed and put them straight into commands:
ls $BI/ngs_course/human_variation/N*.vcf | \
  perl -n -e 'chomp; $_=~/(NA\d+).*(sam|GATK)/; print "annovar_pipe.sh $_ >$1.$2.log 2>&1\n";' \
  > commands
launcher_creator.py -l annovar.sge -n annovar -t 00:30:00 -j commands
qsub annovar.sge
While Annovar is running, have a look at the code of annovar_pipe.sh and summarize_annovar.pl.
Other variant annotators:
- http://www.yandell-lab.org/software/vaast.html
- http://www.broadinstitute.org/gatk/gatkdocs/ VariantAnnotator annotations
- http://www.bioconductor.org/help/workflows/variants/
- http://vat.gersteinlab.org/
- http://code.google.com/p/mu2a/
Linux utilities
cat NA12878.raw.vcf | awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}' \
  | sort -n | grep "0/1" | sort > NA12878.raw.vcf.simple.het
cat NA12878.raw.vcf | awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}' \
  | sort -n | grep "1/1" | sort > NA12878.raw.vcf.simple.hom
cat NA12891.raw.vcf | awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}' \
  | sort -n | grep "0/1" | sort > NA12891.raw.vcf.simple.het
cat NA12892.raw.vcf | awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}' \
  | sort -n | grep "0/1" | sort > NA12892.raw.vcf.simple.het
cat NA12891.raw.vcf | awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}' \
  | sort -n | grep "1/1" | sort > NA12891.raw.vcf.simple.hom
cat NA12892.raw.vcf | awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}' \
  | sort -n | grep "1/1" | sort > NA12892.raw.vcf.simple.hom
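The awk expression in these commands keeps the position, the first three characters of the sample column (the genotype), and the REF and ALT alleles. A minimal sketch on a single made-up VCF record shows the effect; the record's values are invented for illustration, and it assumes, as the commands above do, that GT is the first field of column 10.

```shell
# One invented VCF data line:
# CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
record=$(printf '20\t61098\t.\tC\tT\t99.0\t.\tDP=30\tGT:AD:DP\t0/1:14,16:30')

# Same awk program as in the commands above.
simple=$(printf '%s\n' "$record" |
  awk 'BEGIN {FS="\t"} {print $2 "\t" substr($10,1,3) "\t" $4 "\t" $5}')

# Prints tab-separated: position, genotype, REF, ALT
printf '%s\n' "$simple"
```

Because the genotype is taken by character position rather than by parsing the FORMAT column, this shortcut would need adjusting for files where GT is not first or genotypes are not three characters long.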
- Now count how many GT are het in both of the second two (parents) but hom in the first (child):
(would you have expected this result?)
join NA12892.raw.vcf.simple.het NA12891.raw.vcf.simple.het > both.het
join both.het NA12878.raw.vcf.simple.hom | wc -l
- Now count how many GT are hom in both of the second two (parents) but het in the first (child):
join NA12892.raw.vcf.simple.hom NA12891.raw.vcf.simple.hom > both.hom
join both.hom NA12878.raw.vcf.simple.het | wc -l
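How the two join steps narrow down the candidates can be seen on toy data. The positions and genotypes below are invented and the file names are hypothetical. Note that join expects both inputs to be sorted lexically on the join field, which is why plain sort (not sort -n) is applied last.

```shell
set -euo pipefail
tmp=$(mktemp -d)

# Invented "position<TAB>genotype" lists, one per family member.
printf '100\t0/1\n200\t0/1\n300\t0/1\n' | sort > "$tmp/mother.het"
printf '100\t0/1\n300\t0/1\n400\t0/1\n' | sort > "$tmp/father.het"
printf '300\t1/1\n400\t1/1\n'           | sort > "$tmp/child.hom"

# Step 1: positions that are het in both parents (100 and 300).
join "$tmp/mother.het" "$tmp/father.het" > "$tmp/both.het"

# Step 2: of those, positions that are also hom-alt in the child (300).
candidates=$(join "$tmp/both.het" "$tmp/child.hom")

echo "$candidates"
rm -rf "$tmp"
```

Only position 300 survives both joins, which is the pattern expected for a recessive variant inherited from two carrier parents.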
Compare samtools to GATK on exome 20.
Diversions
These are oddities:
join NA12892.raw.vcf.simple NA12891.raw.vcf.simple | awk '{if ($3!=$6 || $4!=$7) {print}}'
Notes
Variants consist of single base changes, insertions and deletions, and larger scale structural changes. "Larger scale" is usually defined relative to the capabilities of the technology; for example, a "small indel" usually means "detectable within a single sequence read". In 2009, sequence reads were about 50 bp, but by 2011 they were 100 bp.