...

  1. Count the total number of variants called in test.raw.vcf and in all.samtools.vcf.
  2. Count how many of the SNPs are common to both files.
  3. Calculate the average and maximum quality values for the SNPs that are in test.raw.vcf but NOT in all.samtools.vcf.
  4. Calculate the average and maximum quality values for all the SNPs in test.raw.vcf.
Hints
  1. Investigate grep -c
  2. Try intersectBed and wc
  3. Try subtractBed and some awk
    Note that for intersections and subtractions on structured data like this, you can also use the Linux join command.
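
As a rough illustration of the join hint, the sketch below compares two VCF files on a "CHROM:POS" key instead of using bedtools. The function names (vcf_keys, count_common, count_only_in_first) are made up for this example and are not part of the course materials. It assumes that matching on the exact start position is good enough, so counts may differ slightly from intersectBed/subtractBed, which compare intervals.

```shell
# Hypothetical helper: print sorted, unique "CHROM:POS" keys for every
# non-header record in the VCF file given as $1.
vcf_keys() {
    awk -F'\t' '!/^#/ {print $1 ":" $2}' "$1" | LC_ALL=C sort -u
}

# Count positions present in both VCF files ($1 and $2).
count_common() {
    tmp=$(mktemp)
    vcf_keys "$1" > "$tmp"
    # join prints one line per key found in both sorted inputs ("-" is stdin)
    vcf_keys "$2" | LC_ALL=C join "$tmp" - | wc -l | tr -d ' '
    rm -f "$tmp"
}

# Count positions present in the first VCF file but not in the second.
count_only_in_first() {
    tmp=$(mktemp)
    vcf_keys "$1" > "$tmp"
    # -v 1 prints only the keys from the first input that have no match
    vcf_keys "$2" | LC_ALL=C join -v 1 "$tmp" - | wc -l | tr -d ' '
    rm -f "$tmp"
}
```

For example, count_common test.raw.vcf $BI/ngs_course/human_variation/all.samtools.vcf prints the number of shared positions. Both inputs to join are sorted under LC_ALL=C because join needs them in the same collation order.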
Solution
Comparison of single- and multiple-sample VCF files using Linux and bedtools
# Count the called variants (non-header lines) in test.raw.vcf (from individual NA12878)
grep -c -v '^#' test.raw.vcf

# Count the called variants in the file for all 3 individuals
grep -c -v '^#' $BI/ngs_course/human_variation/all.samtools.vcf

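# Find out how many are common to the two files
# (-u reports each test.raw.vcf record at most once, even if it overlaps several records in the other file)
intersectBed -u -a test.raw.vcf -b $BI/ngs_course/human_variation/all.samtools.vcf | wc -l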

# Take the variants that are not in all.samtools.vcf and examine their quality (QUAL, field 6 of the VCF file)
subtractBed -a test.raw.vcf -b $BI/ngs_course/human_variation/all.samtools.vcf | \
   awk 'BEGIN {max=0} {sum+=$6; if ($6>max) {max=$6}} END {print "Average qual: "sum/NR "\tMax qual: " max}'

# Summarize the qualities of all the NA12878 variants
grep -v '^#' test.raw.vcf | awk 'BEGIN {max=0} {sum+=$6; if ($6>max) {max=$6}} END {print "Average qual: "sum/NR "\tMax qual: " max}'
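
The two awk commands above do the same calculation, so it could be wrapped in a small shell function. The sketch below is one possible way to do that. The name qual_summary is made up for this example. Unlike the one-liners, it skips header lines and records whose QUAL is "." (missing), and it avoids dividing by zero when no records remain.

```shell
# Hypothetical helper: print the average and maximum QUAL (field 6) of the
# VCF records read from the named files, or from stdin if none are given.
qual_summary() {
    awk -F'\t' '
        /^#/ || $6 == "." { next }    # skip header lines and missing QUAL values
        {
            n++
            sum += $6
            if (n == 1 || $6 + 0 > max) max = $6 + 0
        }
        END {
            if (n == 0) { print "No records with a QUAL value"; exit }
            printf "Average qual: %.2f\tMax qual: %s\n", sum / n, max
        }' "$@"
}
```

It can be used at the end of a pipe (subtractBed -a ... -b ... | qual_summary) or directly on a file (qual_summary test.raw.vcf).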

...