Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Expand
Solution
Solution
Code Block
titleComparison of single- and multiple-sample vcf files using linux and bedtools
# This command just counts the # of called variants in test.raw.vcf (from individual NA12878)
grep -c -v '^#' test.raw.vcf

# This command just counts the # of called variants in all 3 individuals
grep -c -v '^#' $BI/ngs_course/human_variation/all.samtools.vcf

# Found out how many are common between the two
intersectBed -a test.raw.vcf -b $BI/ngs_course/human_variation/all.samtools.vcf | wc -l

# Take all those that are not in all.samtools.vcf and examine their quality (in field 6 of the vcf file)
subtractBed -a test.raw.vcf -b $BI/ngs_course/human_variation/all.samtools.vcf | \
   awk 'BEGIN {max=0} {sum+=$6; if ($6>max) {max=$6}} END {print "Average qual: "sum/NR "\tMax qual: " max}'

# Look at all the qualities from the NA12878 variants
grep -v '^#' test.raw.vcf | awk 'BEGIN {max=0} {sum+=$6; if ($6>max) {max=$6}} END {print "Average qual: "sum/NR "\tMax qual: " max}'

For major bonus points and a great THANK YOU from Scott, compute the mean and standard deviation of the intersected and subtracted SNPs from NA12878 vs all and then perform a t-test to make sure the differences are statistically significant using only linux command line tools (probably in a shell script). Yes, it's probably easier in Python, Perl, or R.

Expand
Output
Output
Code Block
titleOutput of comparison of single- and multiple-sample vcf files
login2$ # This command just counts the # of called variants in test.raw.vcf (from individual NA12878)
login2$ grep -c -v '^#' test.raw.vcf
58049
login2$
login2$ # This command just counts the # of called variants in all 3 individuals
login2$ grep -c -v '^#' $BI/ngs_course/human_variation/all.samtools.vcf
80972
login2$
login2$ # Found out how many are common between the two
login2$ intersectBed -a test.raw.vcf -b $BI/ngs_course/human_variation/all.samtools.vcf | wc -l
52039
login2$
login2$ # Take all those that are not in all.samtools.vcf and examine their quality (in field 6 of the vcf file)
login2$ subtractBed -a test.raw.vcf -b $BI/ngs_course/human_variation/all.samtools.vcf | \
>    awk 'BEGIN {max=0} {sum+=$6; if ($6>max) {max=$6}} END {print "Average qual: "sum/NR "\tMax qual: " max}'
Average qual: 8.70595	Max qual: 212
login2$
login2$ # Look at all the qualities from the NA12878 variants
login2$ grep -v '^#' test.raw.vcf | awk 'BEGIN {max=0} {sum+=$6; if ($6>max) {max=$6}} END {print "Average qual: "sum/NR "\tMax qual: " max}'
Average qual: 43.97	Max qual: 225
So you can see

This data shows that most of the variants that were called on sample NA12878 but are not in the multi-sample variant calls

...

have lower quality.

For major bonus points and a great THANK YOU from Scott, compute the mean and standard deviation of the intersected and subtracted SNPs from NA12878 vs all and then perform a t-test to make sure the differences are statistically significant using only linux command line tools (probably in a shell script). Yes, it's probably easier in Python, Perl, or R.

Other notes on bcftools

bcftools has many other sub-commands such as performing association tests and estimating allele frequency spectrums.

...