...
Expand |
---|
|
Code Block |
---|
title | Comparison of single- and multiple-sample vcf files using linux and bedtools |
---|
|
# This command just counts the # of called variants in test.raw.vcf (from individual NA12878)
grep -c -v '^#' test.raw.vcf
# This command just counts the # of called variants in all 3 individuals
grep -c -v '^#' $BI/ngs_course/human_variation/all.samtools.vcf
# Found out how many are common between the two
intersectBed -a test.raw.vcf -b $BI/ngs_course/human_variation/all.samtools.vcf | wc -l
# Take all those that are not in all.samtools.vcf and examine their quality (in field 6 of the vcf file)
subtractBed -a test.raw.vcf -b $BI/ngs_course/human_variation/all.samtools.vcf | \
awk 'BEGIN {max=0} {sum+=$6; if ($6>max) {max=$6}} END {print "Average qual: "sum/NR "\tMax qual: " max}'
# Look at all the qualities from the NA12878 variants
grep -v '^#' test.raw.vcf | awk 'BEGIN {max=0} {sum+=$6; if ($6>max) {max=$6}} END {print "Average qual: "sum/NR "\tMax qual: " max}'
|
|
For major bonus points and a great THANK YOU from Scott, compute the mean and standard deviation of the intersected and subtracted SNPs from NA12878 vs all and then perform a t-test to make sure the differences are statistically significant using only linux command line tools (probably in a shell script). Yes, it's probably easier in Python, Perl, or R.
Expand |
---|
|
Code Block |
---|
title | Output of comparison of single- and multiple-sample vcf files |
---|
|
login2$ # This command just counts the # of called variants in test.raw.vcf (from individual NA12878)
login2$ grep -c -v '^#' test.raw.vcf
58049
login2$
login2$ # This command just counts the # of called variants in all 3 individuals
login2$ grep -c -v '^#' $BI/ngs_course/human_variation/all.samtools.vcf
80972
login2$
login2$ # Found out how many are common between the two
login2$ intersectBed -a test.raw.vcf -b $BI/ngs_course/human_variation/all.samtools.vcf | wc -l
52039
login2$
login2$ # Take all those that are not in all.samtools.vcf and examine their quality (in field 6 of the vcf file)
login2$ subtractBed -a test.raw.vcf -b $BI/ngs_course/human_variation/all.samtools.vcf | \
> awk 'BEGIN {max=0} {sum+=$6; if ($6>max) {max=$6}} END {print "Average qual: "sum/NR "\tMax qual: " max}'
Average qual: 8.70595 Max qual: 212
login2$
login2$ # Look at all the qualities from the NA12878 variants
login2$ grep -v '^#' test.raw.vcf | awk 'BEGIN {max=0} {sum+=$6; if ($6>max) {max=$6}} END {print "Average qual: "sum/NR "\tMax qual: " max}'
Average qual: 43.97 Max qual: 225
| So you can see |
This data shows that most of the variants that were called on sample NA12878 but are not in the multi-sample variant calls
...
have lower quality.
For major bonus points and a great THANK YOU from Scott, compute the mean and standard deviation of the intersected and subtracted SNPs from NA12878 vs all and then perform a t-test to make sure the differences are statistically significant using only linux command line tools (probably in a shell script). Yes, it's probably easier in Python, Perl, or R.
Other notes on bcftools
bcftools has many other sub-commands such as performing association tests and estimating allele frequency spectrums.
...