Code Block |
---|
tacc:/scratch/01057/sphsmith/human_variation$ cat all.samtools.vcf | head -10000 | awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | grep "0/0" | sed s/':'/' '/g | awk '{print $2"\t"$5"\t"$8}' | tail -100 | sort | uniq -c
12 0/0 0/1 0/0
5 0/0 0/1 0/1
3 0/1 0/0 0/0
4 0/1 0/0 0/1
8 0/1 0/0 1/1
43 0/1 0/1 0/0
24 0/1 1/1 0/0
1 1/1 0/1 0/0 |
Here are the steps going into this command:
1) Dump the contents of all.samtools.vcf.
2) Take the first 10,000 lines.
3) If the variant quality score (field 6) is greater than 500, print fields 2 (SNP position), 10, 11, and 12 (the 3 genotypes).
4) Keep only lines in which at least one sample is homozygous for the reference allele, 0/0 (exercise to the reader to understand why...).
5) Break the genotype call apart from other information about depth: "sed" turns the colons into spaces so that awk can print just the genotype fields (2, 5, and 8 after the split).
6) Take the last 100 lines, sort them, then count the unique lines.
Here is my interpretation of the data:
1) This method effectively looks at a very narrow genomic region, probably within a homologous recombination block.
2) The most telling data: when the two parents are homozygous for different alleles (0/0 and 1/1), the child must be heterozygous (0/1).
3) So all this data is consistent with column 1 (NA12878) being the child:
12 0/0 0/1 0/0
5 0/0 0/1 0/1
4 0/1 0/0 0/1
8 0/1 0/0 1/1
43 0/1 0/1 0/0
24 0/1 1/1 0/0
"Outlier" data, which do not fit column 1 as the child, are:
3 0/1 0/0 0/0
1 1/1 0/1 0/0
This is, in fact, the correct assessment - NA12878 is the child. |
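The interpretation above can also be checked mechanically. What follows is only a rough sketch, not part of the original analysis: it takes the eight count lines from the output above, treats each column in turn as the child, and adds up the sites where the child genotype could not have been built from one allele of each of the other two samples. The helper name `mendel_violations` is my own invention, and the check assumes diploid, biallelic, unphased genotypes; it makes no attempt to model genotyping errors or de novo mutations.

```shell
#!/bin/sh
# Genotype counts copied from the uniq -c output above:
# count, then the genotypes of columns 1, 2 and 3.
counts='12 0/0 0/1 0/0
5 0/0 0/1 0/1
3 0/1 0/0 0/0
4 0/1 0/0 0/1
8 0/1 0/0 1/1
43 0/1 0/1 0/0
24 0/1 1/1 0/0
1 1/1 0/1 0/0'

# mendel_violations (hypothetical helper name): reads "count g1 g2 g3" lines
# on stdin and, for each column taken in turn as the child, sums the counts
# of sites whose child genotype cannot be made from one allele per parent.
mendel_violations() {
    awk '
    # True if parent genotype p (e.g. "0/1") carries allele a.
    function has(p, a,    al) {
        split(p, al, "/")
        return (al[1] == a || al[2] == a)
    }
    # True if child genotype c can inherit one allele from each of p1 and p2.
    function ok(c, p1, p2,    al) {
        split(c, al, "/")
        return (has(p1, al[1]) && has(p2, al[2])) || (has(p1, al[2]) && has(p2, al[1]))
    }
    {
        if (!ok($2, $3, $4)) bad[1] += $1   # column 1 as child
        if (!ok($3, $2, $4)) bad[2] += $1   # column 2 as child
        if (!ok($4, $2, $3)) bad[3] += $1   # column 3 as child
    }
    END {
        for (i = 1; i <= 3; i++)
            printf "column %d as child: %d inconsistent sites\n", i, bad[i]
    }'
}

printf '%s\n' "$counts" | mendel_violations
```

On the counts above this reports 4 inconsistent sites with column 1 as the child (the two "outlier" lines, 3 + 1), against 44 for column 2 and 33 for column 3, which agrees with the assessment that column 1 is the child.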