Filtering
On any variant caller you use, there will be an inherent trade-off between sensitivity and specificity. Typically, you would carry forward as much data as practical at each processing step, deferring final judgement until later so you have more information. For example, you might not want to exclude low coverage regions in one sample from your raw data because you may be able to infer a genotype based on family information later.
This typically leads to NGS pipelines that maximize sensitivity, leading to a substantial number of false positives. Filtering after variant calling can be useful to eliminate false positives based on all the data available after numerous analyses.
In the samtools/bcftools world, the vcfutils.pl
script provides a means to filter SNPs on many criteria.
Exercises
Explore some filter settings for vcfutils.pl varFilter
to see how many SNPs get filtered out, using the linux tool xargs
to do a parameter sweep.
# First - create a tiny shell script to run vcfutils and take a single parameter: echo "#\!/bin/bash" > tiny.sh echo "echo \"Sweeping with vcfutils.pl, min read depth of: \$1\" " >> tiny.sh echo "vcfutils.pl varFilter -Q 20 -d \$1 test.raw.vcf | grep -v '^#' | wc -l " >> tiny.sh # Now make it executable chmod +x tiny.sh # Use xargs to do a lovely sweep echo 2 3 4 5 6 7 | xargs -n 1 tiny.sh
Exercise
Try modifying tiny.sh on your own to sweep through other filter parameters.
Exercise
Use bedtools and some linux built-in commands to compare the single-sample vcf file(s) to the multiple-sample vcf file (you might need module load bedtools
unless you've installed that in your .profile
). Do the following:
- Count the total number of variants called in
test.raw.vcf
andall.samtools.vcf
- Count how many of the SNPS in these two files are common
- Calculate the average and maximum quality values for the SNPs that are in test.raw.vcf but NOT in all.samtools.vcf
- Calculate the average and maximum quality values for the SNPs that are in test.raw.vcf
This data shows that most of the variants that were called on sample NA12878 but are not in the multi-sample variant calls have lower quality.
For major bonus points and a great THANK YOU from Scott, compute the mean and standard deviation of the intersected and subtracted SNPs from NA12878 vs all and then perform a t-test to make sure the differences are statistically significant using only linux command line tools (probably in a shell script). Yes, it's probably easier in Python, Perl, or R.
Virmid - an advanced auto-screener
GATK also has a host of VCF related screeners/joiners/overlapers etc.