Annovar 2023
Annotating Variants: Introduction
As we've already seen, determining the presence or absence of a variant from NGS data is not trivial: it is software dependent and involves inherent trade-offs between sensitivity and specificity. Inevitably, the number of putative variants in a real data set is very large; for example, the first samples from the 1000 Genomes Project typically showed variants at about 0.1% of genome positions (roughly 3 million variants), approximately 10% of which had never been observed before (300,000 novel variants per individual). False positive discovery rates are also typically very high at this stage.
Auxiliary data is often used to reduce the number of putative variants without compromising sensitivity. Examples of auxiliary data include other samples within a cohort, existing SNP databases, gene or other feature annotations, and sample-specific information such as pedigree:
- Comparing genotypes across a set of samples and defining one as "reference" (or "wild type") enables the other samples to be properly genotyped (i.e. 0/0 for hom. WT, 0/1 for het, 1/1 for hom. alt).
- Existing SNP databases such as dbSNP, or the vcf files from the 1000 Genomes project, may be used to reject "common" variants under the assertion that "common" means "non-disease-causing".
- Gene annotations allow codon-level analysis to determine whether a mutation is synonymous, missense (non-synonymous), or nonsense (creating a premature stop codon).
- Pedigree information is particularly effective for Mendelian autosomal recessive diseases: filtering for variants that are heterozygous in the parents and unaffected siblings but homozygous in the proband usually yields a very small set of candidate variants (see the sketch after this list).
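That pedigree filter can be expressed directly as a genotype query. Here is a minimal sketch assuming bcftools is installed and that a hypothetical multi-sample file trio.vcf has its samples ordered father, mother, proband (the file name and sample order are assumptions, not course data):

# keep sites where both parents are heterozygous and the proband is homozygous-alt
bcftools view -i 'GT[0]="het" && GT[1]="het" && GT[2]="AA"' trio.vcf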
Variant annotation tools perform the function of combining the raw putative variant calls with auxiliary data to add meaning ("annotation") to the variants. In many cases, the variant detection tool itself will add certain elements of annotation, such as a definition of the variant, a genotype call, a measure of likelihood, a haplotype score, and other measures of the raw data useful to reduce false positives. In other cases, the annotator will only require a vcf file combined with other auxiliary data.
Because these tools draw in information from many disparate sources, they can be very difficult to install, configure, use, and maintain. For example, the vcf files from the 1000 Genomes project are arranged in a deep ftp tree by date of data generation. Large genome centers spend significant resources managing these tools.
Annovar - one of the most powerful yet simple-to-run variant annotators available
Annovar is a variant annotator. Given a vcf file from an unknown sample and a host of existing data about genes, other known SNPs, gene variants, etc., Annovar will place the discovered variants in context.
Annovar comes pre-packaged with human auxiliary data which is updated by the authors on a regular basis. It is a well-constructed package in that there is one core program, annotate_variation.pl, which can perform a variety of different types of annotation AND download the required reference databases.
The authors have also included a wrapper script, summarize_annovar.pl, which runs a fairly comprehensive set of annotations automatically. You may be asking yourself "where can I find this awesome program?", but hopefully by now your assumption is that it is either on conda, a TACC module, or in the BioITeam folder. Generally speaking, "programs" that consist of a series of scripts without many complex dependencies can easily be installed in the $BI folders, while the most popular programs eventually get turned into modules. Despite its power, you can find this program in the $BI folders.
This next exercise will give you some idea of how Annovar works; we've taken the liberty of writing the bash script annovarPipe.sh around the existing summarize_annovar.pl wrapper (a wrapper within a wrapper - a common trick) to simplify the process even further for this course.
Running Annovar
Get some data:
Possible errors on idev nodes
As mentioned yesterday, you cannot copy from the BioITeam directories (because they are on corral-repl) while on an idev node. Log out of your idev session first, then copy the files.
First we want to move to a new location on $SCRATCH
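A minimal sketch of that setup follows; the directory name GVA_annovar is only a suggestion, and the NA128*.vcf glob assumes it matches the six course vcf files used later in this tutorial:

cd $SCRATCH
mkdir -p GVA_annovar
cd GVA_annovar
cp $BI/ngs_course/human_variation/NA128*.vcf .    # 3 individuals x 2 callers = 6 vcf files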
Get access to annovar
Unfortunately, we have finally found a program that conda won't install for us. Relatedly, if you look at the Annovar page itself, access to the newest version is behind a registration wall, which is uncommon though not without precedent. We will therefore work with an older version of Annovar here, though if you use it in your own work you are strongly encouraged to register for and work with the newest version.
The final complication is that while the BioITeam folders contain everything required to run Annovar, stampede2 compute nodes can't access them. This means we need to copy a database of annotations, and several scripts, to our scratch directory in order to run the analysis.
cp -r $BI/ref_genome/annovar/hg19_annovar_db .
This folder is very large, so expect the copy to take several minutes.
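Once the copy finishes, a quick sanity check never hurts (these two commands are just suggestions, not part of the original course steps):

du -sh hg19_annovar_db       # confirm the copy completed and see how large the database is
ls hg19_annovar_db | head    # peek at the annotation files inside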
cp /corral-repl/utexas/BioITeam/bin/annotate_variation.pl .
cp /corral-repl/utexas/BioITeam/bin/convert2annovar.pl .
cp /corral-repl/utexas/BioITeam/bin/summarize_annovar.pl .
cp /corral-repl/utexas/BioITeam/bin/annovarPipe.sh .
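Because the commands we generate below call annovarPipe.sh by name, it is worth confirming the scripts arrived intact and are callable; this is a suggested safeguard rather than part of the original course commands:

ls -l annotate_variation.pl convert2annovar.pl summarize_annovar.pl annovarPipe.sh
chmod +x *.pl annovarPipe.sh    # in case execute permissions were not preserved by cp
export PATH=$PWD:$PATH          # lets annovarPipe.sh be invoked without a ./ prefix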
Setting up the commands
The BioITeam has set up two helpful wrapper scripts for running Annovar. The first (summarize_annovar.pl) calls Annovar and summarizes the results. The second (annovarPipe.sh) does some file conversions to prepare the input and then calls the summarize_annovar.pl script, as sketched below.
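Conceptually, the wrapper chain does something like the following. This is an illustrative sketch of the assumed flow, not the literal contents of annovarPipe.sh, though the convert2annovar.pl and summarize_annovar.pl invocations shown are standard Annovar usage:

convert2annovar.pl -format vcf4 sample.vcf > sample.avinput            # vcf -> annovar input format
summarize_annovar.pl --buildver hg19 sample.avinput hg19_annovar_db    # run the comprehensive annotation set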
Now let's run it on the .vcf files from the 3 individuals (NA12878, NA12891, and NA12892) from both the samtools and GATK output in the $BI/ngs_course/human_variation/ directory. (You may recognize these as the same individuals we worked with in the Trios tutorial.) Throughout the class we've been teaching you to create a commands file using nano, but here we provide a more complex example of how a commands file can be generated with perl. As you become more proficient with the command line, you will likely use piping techniques like these to generate commands files. The following calls perl to custom-create the 6 command lines needed and put them straight into a commands file:
ls *.vcf | \
    perl -n -e 'chomp; $_=~/(NA\d+).*(sam|GATK)/; print "annovarPipe.sh $_ hg19_annovar_db >$1.$2.log 2>&1 \n";' > annovar_commands
Note how the print statement includes redirections for generating log files, and the entire output is redirected to a file named annovar_commands.
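You can sanity-check the generated file with cat:

cat annovar_commands

Expect six lines, two per individual, similar to these (the exact names depend on the vcf files present):

annovarPipe.sh NA12878.chrom20.GATK.vcf hg19_annovar_db >NA12878.GATK.log 2>&1
annovarPipe.sh NA12878.chrom20.samtools.vcf hg19_annovar_db >NA12878.sam.log 2>&1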
Submitting the job
As you may have guessed when we started creating a commands file, this analysis is headed for the job queue system. As we have done elsewhere: copy the launcher file, then make the edits relevant to the analysis we are attempting to perform.
cp /corral-repl/utexas/BioITeam/gva_course/GVA.launcher.slurm annovar.slurm
nano annovar.slurm
sbatch annovar.slurm
Note that the above block does not include how to make the edits, nor the saving and closing of the slurm file. The needed edits are:
Line number | As is | To be |
---|---|---|
16 | #SBATCH -J jobName | #SBATCH -J annovar |
17 | #SBATCH -n 1 | #SBATCH -n 6 |
22 | ##SBATCH --mail-user=ADD | #SBATCH --mail-user=<YourEmailAddress> |
23 | ##SBATCH --mail-type=all | #SBATCH --mail-type=all |
29 | export LAUNCHER_JOB_FILE=commands | export LAUNCHER_JOB_FILE=annovar_commands |
The changes to lines 22 and 23 are optional, but they will give you an idea of what types of email you can expect from TACC if you choose to use these options. Just be sure that after editing, these 2 lines each start with a single # symbol (not two).
A 12 hour run is requested because, while I have been able to verify the analysis works with a subset of the data, I have not been able to get the full analysis to complete. I am assuming it will finish in less than 12 hours and will update this page when I have verified this.
Again, use ctrl-o and ctrl-x to save the file and exit nano.
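While the job runs, you can monitor it from the login node with standard SLURM and shell commands:

squeue -u $USER    # is the job still queued, or running?
tail *.log         # watch the per-sample annovar logs grow once the job starts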
Analyzing the results
Accessing pre-computed results
ANNOVAR output
Annovar does a ton of work in assessing variants for us (though if you were going for clinical interpretation, you would still have a long way to go - compare this to RUNES or CarpeNovo). It produces all these output files:
NA12878.chrom20.samtools.vcf.exome_summary.csv
NA12878.chrom20.samtools.vcf.exonic_variant_function
NA12878.chrom20.samtools.vcf.genome_summary.csv
NA12878.chrom20.samtools.vcf.hg19_ALL.sites.2010_11_dropped
NA12878.chrom20.samtools.vcf.hg19_ALL.sites.2010_11_filtered
NA12878.chrom20.samtools.vcf.hg19_avsift_dropped
NA12878.chrom20.samtools.vcf.hg19_avsift_filtered
NA12878.chrom20.samtools.vcf.hg19_esp5400_all_dropped
NA12878.chrom20.samtools.vcf.hg19_esp5400_all_filtered
NA12878.chrom20.samtools.vcf.hg19_genomicSuperDups
NA12878.chrom20.samtools.vcf.hg19_ljb_all_dropped
NA12878.chrom20.samtools.vcf.hg19_ljb_all_filtered
NA12878.chrom20.samtools.vcf.hg19_phastConsElements46way
NA12878.chrom20.samtools.vcf.hg19_snp132_dropped
NA12878.chrom20.samtools.vcf.hg19_snp132_filtered
NA12878.chrom20.samtools.vcf.log
NA12878.chrom20.samtools.vcf.variant_function
The exome_summary.csv is probably the most useful file because it brings together nearly all the useful information. Here are the fields in that file (see these docs for more information, or the Annovar filter descriptions page):
Field | Description |
---|---|
Func | exonic, splicing, ncRNA, UTR5, UTR3, intronic, upstream, downstream, or intergenic |
Gene | The common gene name |
ExonicFunc | frameshift insertion/deletion/block substitution, stopgain, stoploss, nonframeshift insertion/deletion/block substitution, nonsynonymous SNV, synonymous SNV, or unknown |
AAChange | The amino acid change (in gene coordinates) |
Conserved | Whether the SNP is in a conserved region, based on the UCSC 46-way conservation model |
SegDup | Whether the SNP is in a segmental duplication region |
ESP5400_ALL | Alternate allele frequency in 3510 NHLBI ESP European American samples |
1000g2010nov_ALL | Alternate allele frequency in the 1000 Genomes Nov 2010 release (the minor allele could be the reference or the alternate allele) |
dbSNP132 | The id# in dbSNP 132, if it exists |
AVSIFT | The AVSIFT score of how deleterious the variant might be |
LJB_PhyloP | Conservation score provided by dbNSFP, re-scaled from the original PhyloP score. The new score ranges from 0 to 1, with larger scores signifying higher conservation. A recommended cutoff is 0.95: if the score is > 0.95 the prediction is "conservative"; if it is < 0.95 the prediction is "non-conservative" |
LJB_PhyloP_Pred | The categorical prediction corresponding to the score above |
LJB_SIFT | SIFT takes a query sequence and uses multiple-alignment information to predict tolerated and deleterious substitutions at every position of the query. Positions with normalized probabilities less than 0.05 are predicted to be deleterious; those greater than or equal to 0.05 are predicted to be tolerated |
LJB_SIFT_Pred | The categorical prediction corresponding to the score above |
LJB_PolyPhen2 | Functional prediction score for non-synonymous variants from PolyPhen2, provided by dbNSFP (higher scores represent functionally more deleterious variants). A score greater than 0.85 corresponds to a prediction of "probably damaging"; the prediction is "possibly damaging" for scores between 0.15 and 0.85, and "benign" below 0.15 |
LJB_PolyPhen2_Pred | The categorical prediction corresponding to the score above |
LJB_LRT | Functional prediction score for non-synonymous variants from LRT, provided by dbNSFP (higher scores represent functionally more deleterious variants). It ranges from 0 to 1 and needs to be combined with other prediction information; if a threshold must be chosen, 0.995 can be used as a starting point |
LJB_LRT_Pred | The categorical prediction corresponding to the score above |
LJB_MutationTaster | Functional prediction score for non-synonymous variants from MutationTaster, provided by dbNSFP (higher scores represent functionally more deleterious variants). The score ranges from 0 to 1. As with LRT, the prediction does not depend entirely on the score alone, but if a threshold must be chosen, 0.5 is recommended as a starting point |
LJB_MutationTaster_Pred | The categorical prediction corresponding to the score above |
LJB_GERP++ | Higher scores are more deleterious |
Chr | Chromosome |
Start | Start position |
End | End position |
Ref | Reference base |
Obs | Observed base-pair or variant |
SNP quality value | |
Filter information | |
(ALL the VCF info is here!!) | |
GT:PL:GQ for each sample! | |
Everything after the "LJB_GERP++" field in exome_summary came from the original VCF file, so this file REALLY contains everything you need to go on to functional analysis! This is one of the many reasons I like Annovar.
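A couple of quick command-line checks on that summary are shown below; the file name assumes the NA12878 samtools run listed above, and the grep is a simple substring match rather than a strict CSV parse:

head -3 NA12878.chrom20.samtools.vcf.exome_summary.csv                         # header plus the first two variants
grep -c "nonsynonymous SNV" NA12878.chrom20.samtools.vcf.exome_summary.csv    # rough count of nonsynonymous calls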
Scavenger hunts and command line building
Other variant annotators:
- http://www.yandell-lab.org/software/vaast.html
- http://www.broadinstitute.org/gatk/gatkdocs/ (VariantAnnotator annotations)
- http://www.bioconductor.org/help/workflows/variants/
- http://vat.gersteinlab.org/
- http://code.google.com/p/mu2a/
Return to GVA2023 course page.