...
This Linux one-liner should give you a snapshot of the data that is sufficient to figure it out:
Look at the differences between single- and multiple-sample SNP calls, and between Samtools/Bcftools SNP calls and GATK SNP calls
How many SNPs were called in each case?
for file in *.vcf; do echo "File: $file $(grep -vc '^#' "$file")"; done
What's the overlap between all the single-sample SNP calls aggregated together and the multi-sample SNP calls?
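One way to approach this is to reduce every VCF to its list of variant sites and compare the lists. The sketch below is a minimal illustration under stated assumptions, not the course's own solution: the function names `vcf_sites` and `site_overlap` are hypothetical, the VCFs are assumed to be uncompressed and tab-delimited, and two calls are treated as "the same" when they share a chromosome and position (alleles are not compared).

```shell
# Print the unique "CHROM:POS" keys of all variant lines in the given VCF files.
# Header and comment lines (starting with '#') are skipped.
vcf_sites() {
  awk -F'\t' '!/^#/ { print $1 ":" $2 }' "$@" | LC_ALL=C sort -u
}

# Usage: site_overlap MULTI.vcf SINGLE1.vcf [SINGLE2.vcf ...]
# Aggregates the single-sample files, then counts shared and unique sites.
site_overlap() {
  multi=$1
  shift
  tmpdir=$(mktemp -d)
  vcf_sites "$multi" > "$tmpdir/multi.sites"
  vcf_sites "$@" > "$tmpdir/single.sites"
  # comm needs both inputs sorted in the same collation, hence LC_ALL=C throughout.
  shared=$(LC_ALL=C comm -12 "$tmpdir/multi.sites" "$tmpdir/single.sites" | awk 'END { print NR }')
  multi_only=$(LC_ALL=C comm -23 "$tmpdir/multi.sites" "$tmpdir/single.sites" | awk 'END { print NR }')
  single_only=$(LC_ALL=C comm -13 "$tmpdir/multi.sites" "$tmpdir/single.sites" | awk 'END { print NR }')
  rm -rf "$tmpdir"
  echo "shared=$shared multi_only=$multi_only single_only=$single_only"
}
```

Called as `site_overlap multi.vcf sample1.vcf sample2.vcf` (file names are placeholders), it prints one summary line with the three counts.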
In theory, GATK does a much better job ruling out false positives. But are there some SNPs GATK calls with high confidence that Samtools doesn't call at all?
What's going on here:
1) Extract all the variant calls (header and comment lines start with '#') and add a string of "AAAAAAA" to them so that "sort" orders them in a way that "join" will later accept.
2) Join the two files using their chromosome position as the join field, but also include any lines from the GATK file that DON'T match the samtools file. Use "awk" to figure out which ones came only from GATK (they are missing a bunch of fields from the samtools variant calls), keep only those that GATK has labeled "PASS", and write them to a file.
3) Sort the resulting file on the variant quality value and take the top 10 lines.
You will note that many of these are complex variants, particularly insertions, so it's not too surprising that GATK does better. But here's a SNP that GATK does much better on: chr21:34278313. It has an interesting quantitative signature though... you might want to look at it in IGV.
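A minimal sketch of these three steps follows, under several assumptions. The function names `keyed_variants` and `gatk_only_pass` are hypothetical, the inputs are assumed to be uncompressed, tab-delimited VCFs, the join key is a plain "CHROM:POS" string rather than the "AAAAAAA" padding, and `join -v 1` is used to list GATK-only lines directly instead of the awk field-count test described above.

```shell
# Prefix every variant line with a "CHROM:POS" key and sort on that key,
# so the output is ready for join. Lines starting with '#' are skipped.
keyed_variants() {
  awk 'BEGIN { FS = OFS = "\t" } !/^#/ { print $1 ":" $2, $0 }' "$1" |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1
}

# Usage: gatk_only_pass GATK.vcf SAMTOOLS.vcf [N]
# Prints the N (default 10) highest-QUAL PASS calls present only in the GATK file.
gatk_only_pass() {
  tab=$(printf '\t')
  tmpdir=$(mktemp -d)
  keyed_variants "$1" > "$tmpdir/gatk.keyed"
  keyed_variants "$2" > "$tmpdir/samtools.keyed"
  # -v 1 prints only lines of the first file with no partner in the second.
  # cut drops the temporary key, restoring the original VCF columns:
  # column 6 is QUAL and column 7 is FILTER.
  LC_ALL=C join -t "$tab" -v 1 "$tmpdir/gatk.keyed" "$tmpdir/samtools.keyed" |
    cut -f2- |
    awk -F'\t' '$7 == "PASS"' |
    sort -t "$tab" -k6,6nr |
    head -n "${3:-10}"
  rm -rf "$tmpdir"
}
```

For example, `gatk_only_pass gatk.vcf samtools.vcf 10 > gatk_only_top10.txt` (file names are placeholders) writes the ten highest-quality GATK-only calls to a file for inspection.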
Other notes on bcftools
bcftools has many other sub-commands, for tasks such as performing association tests and estimating allele frequency spectra.
...