Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 5.3

...

If you happen to be working with a model organism with extensive external data (ESPECIALLY HUMAN), then there are even more sophisticated tools like the Broad Institute's GATK that can improve both sensitivity and specificity of your variant calls.

Example: The CEU Trio from the 1000 Genomes Project

Many example datasets are available from the 1000 genomes project specifically for method evaluation and training. We'll explore a trio (mom, dad, child). Their accession numbers are NA12892, NA12891, and NA12878 respectively. To make the exercise run more quickly, we'll focus on data only from chromosome 20.

All the data we'll use is located here:

Code Block
titleDiploid genome (human) example files
$BI/ngs_course/human_variation

This directory contains raw data (the .fastq files), mapped data (the .bam files) and variant calls (the .vcf files). It also contains the subdirectory ref with special references.

The 1000 Genomes project is really oriented to producing .vcf filesKeep in mind that this type of trio (or familial) analysis has been exceptionally powerful for identifying rare childhood diseases.  The most prominent publication in this area is this first example of whole exome sequencing saving a life.  There are many other publications since and some review articles such as this one.  Familial analysis is critical for rare, autosomal dominant diseases because, almost by definition, the mutations may be "private" to each individual so we can't look across big populations to find one single causative mutation.  But within families, we can identify bad private mutations in single genes or pathways and then look across populations to find commonality at the gene or pathway level to explain a phenotype.

Example: The CEU Trio from the 1000 Genomes Project

Many example datasets are available from the 1000 genomes project specifically for method evaluation and training. We'll explore a trio (mom, dad, child). Their accession numbers are NA12892, NA12891, and NA12878 respectively. To make the exercise run more quickly, we'll focus on data only from chromosome 20.

All the data we'll use is located here:

Code Block
titleDiploid genome (human) example files
$BI/ngs_course/human_variation

This directory contains raw data (the .fastq files), mapped data (the .bam files) and variant calls (the .vcf files). It also contains the subdirectory ref with special references.

The 1000 Genomes project is really oriented to producing .vcf files; the file "ceu20.vcf" contains all the latest genotypes from this trio based on abundant data from the project.

...

We'll return to this example data shortly to demonstrate a much more involved tool, GATK, to do the same steps.

Note if you're trying this on Stampede: The $BI directory is not accessible from compute nodes on Stampede so you will need to make a copy of your data on $SCRATCH and update file locations accordingly to get this demo to run.

...

Expand
titleExpand here if you'd like to try this on your own...

Let's use Anna Battenhouse's shell script align_bwa.sh to align the fastq files to the hg19 version of the genome, and the python program launcher_creator.py to create the job submission script.

Don't forget that for this to work, you need to have appended $BI/bin to your path.

Expand
Show me how to modify my path...
Show me how to modify my path...

BI="/corral-repl/utexas/BioITeam"
PATH=$PATH:$BI/bin
export PATH

Move into your scratch directory and then try to figure out how to create and qsub the align_bwa.sh command to align the data file $BI/ngs_course/human_variation/allseqs_R1.fastq against the hg19 reference. Call the output "test".

 

Expand
Solution
Solution
Code Block
titleMake job submission script for mapping & variant calling
echo "align_bwa.sh $BI/ngs_course/human_variation/allseqs_R1.fastq test hg19 1 > aln.test.log 2>&1" > commands
launcher_creator.py -l map_call.sge -n map_call -t 01:00:00 -j commands
module load bwa
module load samtools
qsub map_call.sge

Caution: If you are using this example outside an SSC course, you must use the -a option to specify a valid allocation (e.g. BME_2012)


Expand
Output
Output
Code Block
login1$ echo "align_bwa.sh $BI/ngs_course/human_variation/allseqs_R1.fastq test hg19 1" > commands
launcher_creator.py -l map_call.sge -n map_call -t 01:00:00 -j commands
Job file has 1 lines.
Using 12 cores.
Launcher successfully created. Type "qsub map_call.sge" to queue your job.
login1$ qsub map_call.sge
-----------------------------------------------------------------
-- Welcome to the Lonestar4 Westmere/QDR IB Linux Cluster --
-----------------------------------------------------------------
--> Checking that you specified -V...
--> Checking that you specified a time limit...
--> Checking that you specified a queue...
--> Setting project...
--> Checking that you specified a parallel environment...
--> Checking that you specified a valid parallel environment name...
--> Checking that the number of PEs requested is valid...
--> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
--> Requesting valid memory configuration (23.4G)...
--> Verifying HOME file-system availability...
--> Verifying WORK file-system availability...
--> Verifying SCRATCH file-system availability...
--> Checking ssh setup...
--> Checking that you didn't request more cores than the maximum...
--> Checking that you don't already have the maximum number of jobs...
--> Checking that you don't already have the maximum number of jobs in queue development...
--> Checking that your time limit isn't over the maximum...
--> Checking available allocation...
--> Submitting job...

Your job 586249 ("map_call") has been submitted

Note that the input is paired-end data.

Expand
Output
Output

Your directory should have content like this when done (from ls -lt):

Code Block
titleMapping output
-rw-r--r-- 1 sphsmith G-801020      5732 May 20 23:01 map_call.o586338
-rw------- 1 sphsmith G-801020       392 May 20 23:01 test.flagstat.txt
-rw------- 1 sphsmith G-801020   2175952 May 20 23:01 test.sorted.bam.bai
-rw------- 1 sphsmith G-801020 334782188 May 20 23:01 test.sorted.bam
-rw-r--r-- 1 sphsmith G-801020     13135 May 20 23:00 map_call.e586338
-rw------- 1 sphsmith G-801020 411244396 May 20 23:00 test.bam
-rw------- 1 sphsmith G-801020  45695040 May 20 22:49 test.read2.sai
-rw------- 1 sphsmith G-801020  45372400 May 20 22:39 test.read1.sai
-rw-r--r-- 1 sphsmith G-801020         0 May 20 22:26 map_call.pe586338

and samtools flagstat test.sorted.bam should yield:

Code Block
titlesamtools flagstat results
Running flagstat...
4546280 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
3992274 + 0 mapped (87.81%:nan%)
4546280 + 0 paired in sequencing
2273140 + 0 read1
2273140 + 0 read2
40290 + 0 properly paired (0.89%:nan%)
3636946 + 0 with itself and mate mapped
355328 + 0 singletons (7.82%:nan%)
44128 + 0 with mate mapped to a different chr
15634 + 0 with mate mapped to a different chr (mapQ>=5)

...

Here are the commands, piped together :(ONLY run this directly if you are in an idev session - NOT on a head node!):

Code Block
titleCalling variants using samtools and bcftools
samtools mpileup -uf $BI/ref_genome/fasta/ucsc/ucsc.hg19.fasta/hs37d5.fa \
  $BI/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
  | bcftools view -vcg - > test.raw.vcf

...

 

or via qsub:

Code Block
launcher_creator.py -n samtools_test -b "samtools mpileup -uf ref/hs37d5.fa

...

Exercise:

Write a modified version of the command above to use the proper reference into a commands file, create the job submission script with launcher_creator.py and submit the job.

login1$
 NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | bcftools view -vcg - > test.raw.vcf" -t 01:00:00 -q development -a CCBB -m samtools
qsub samtools_test.sge
Expand
SolutionSolution

As we did for mapping, we need to submit these to the Lonestar queue:

Code Block
titleSubmission of variant calling commands
echo "samtools mpileup -uf $BI/ngs_course/human_variation/ref/hs37d5.fa \
   $BI/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
   | bcftools view -vcg - > test.raw.vcf" > commands
launcher_creator.py -l call_vars.sge -n call_vars -t 01:00:00 -j commands
qsub call_vars.sge
Expand
OutputOutput
Code Block

 

Note that the reference chosen for mpileup must be exactly the same as the reference used to create the bam file. The 1000 genomes project has created it's own reference and so the command listed above won't work - we have to use the 1000 genomes reference which is $BI/ngs_course/human_variation/ref/hs37d5.fa. We could have chosen another mapper if we'd wanted to though.

Exercise:

Write a modified version of the command above to use the proper reference into a commands file, create the job submission script with launcher_creator.py and submit the job.

Expand
Solution
Solution

As we did for mapping, we need to submit these to the Lonestar queue:

Code Block
titleSubmission of variant calling commands
echo "samtools mpileup -uf /corral-repl/utexas/BioITeam$BI/ngs_course/human_variation/ref/hs37d5.fa \
>   /corral-repl/utexas/BioITeam$BI/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
>   | bcftools view -vcg - > test.raw.vcf" > commands
login1$
launcher_creator.py -l call_vars.sge -n call_vars -t 01:00:00 -j commands
Job file has 1 lines.
Using 12 cores.
Launcher successfully created. Type "qsub call_vars.sge" to queue your job.
login1$ qsub call_vars.sge
-----------------------------------------------------------------
-- Welcome to the Lonestar4 Westmere/QDR IB Linux Cluster --
-----------------------------------------------------------------
--> Checking that you specified -V...
--> Checking that you specified a time limit...
--> Checking that you specified a queue...
--> Setting project...
--> Checking that you specified a parallel environment...
--> Checking that you specified a valid parallel environment name...
--> Checking that the number of PEs requested is valid...
--> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
--> Requesting valid memory configuration (23.4G)...
--> Verifying HOME file-system availability...
--> Verifying WORK file-system availability...
--> Verifying SCRATCH file-system availability...
--> Checking ssh setup...
--> Checking that you didn't request more cores than the maximum...
--> Checking that you don't already have the maximum number of jobs
Expand
Output
Output
Code Block
login1$ echo "samtools mpileup -uf /corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa \
>   /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
>   | bcftools view -vcg - > test.raw.vcf" > commands
login1$ launcher_creator.py -l call_vars.sge -n call_vars -t 01:00:00 -j commands
Job file has 1 lines.
Using 12 cores.
Launcher successfully created. Type "qsub call_vars.sge" to queue your job.
login1$ qsub call_vars.sge
-----------------------------------------------------------------
-- Welcome to the Lonestar4 Westmere/QDR IB Linux Cluster --
-----------------------------------------------------------------
--> Checking that you specified -V...
--> Checking that you specified a time limit...
--> Checking that you don'tspecified already have the maximum number of jobs in queue developmenta queue...
--> Setting project...
--> Checking that youryou timespecified limita isn't over the maximumparallel environment...
--> Checking available allocationthat you specified a valid parallel environment name...
--> Submitting job...

Your job 586369 ("call_vars") has been submitted

After your single-sample variant calling job completes

...

If you don't want to wait, CHANGE TO A NEW DIRECTORY and do this:
ln -s $BI/ngs_course/human_variation/NA12878.chrom20.samtools.vcf test.raw.vcf
and continue with the rest of these exercises. Remember to return to your job directory if you want to see your actual data.

You can examine some of the variant calls with:

Code Block
titleScanning a vcf file
more test.raw.vcf

Note the extensive header information, followed by an ordered list of variants.

bcftools auxiliary perl script called vcfutils.pl can provide some useful stats. It's safe to run this on the head node since it's quick. This is not part of a standard TACC module but it's in our common $BI/bin directory.

Code Block
titleRunning vcfutils.pl
vcfutils.pl qstats test.raw.vcf
Expand
OutputOutput
Code Block
login1$ vcfutils.pl qstats test.raw.vcf
QUAL #non-indel #SNPs #transitions #joint ts/tv #joint/#ref #joint/#non-indel
186	1011	1011	757	0	2.9803	0.0000	0.0000	2.9803
142	2036	2036	1441	0	2.4218	0.0000	0.0000	2.0059
122	3091	3091	2156	0	2.3059	0.0000	0.0000	2.1029
109	4177	4177	2925	0	2.3363	0.0000	0.0000	2.4259
99.5	5237	5237	3647	0	2.2937	0.0000	0.0000	2.1361
92	6273	6273	4354	0	2.2689	0.0000	0.0000	2.1489
85.5	7328	7328	5105	0	2.2964	0.0000	0.0000	2.4704
79	8352	8352	5811	0	2.2869	0.0000	0.0000	2.2201
74	9369	9369	6497	0	2.2622	0.0000	0.0000	2.0725
69	10553	10553	7311	0	2.2551Checking that the number of PEs requested is valid...
--> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
--> Requesting valid memory configuration (23.4G)...
--> Verifying HOME file-system availability...
--> Verifying WORK file-system availability...
--> Verifying SCRATCH file-system availability...
--> Checking ssh setup...
--> Checking that you didn't request more cores than the maximum...
--> Checking that you don't already have the maximum number of jobs...
--> Checking that you don't already have the maximum number of jobs in queue development...
--> Checking that your time limit isn't over the maximum...
--> Checking available allocation...
--> Submitting job...

Your job 586369 ("call_vars") has been submitted

After your single-sample variant calling job completes

Expand
If you don't want to wait, click here...
If you don't want to wait, click here...

If you don't want to wait, CHANGE TO A NEW DIRECTORY and do this:
ln -s $BI/ngs_course/human_variation/NA12878.chrom20.samtools.vcf test.raw.vcf
and continue with the rest of these exercises. Remember to return to your job directory if you want to see your actual data.

You can examine some of the variant calls with:

Code Block
titleScanning a vcf file
more test.raw.vcf

Note the extensive header information, followed by an ordered list of variants.

bcftools auxiliary perl script called vcfutils.pl can provide some useful stats. It's safe to run this on the head node since it's quick. This is not part of a standard TACC module but it's in our common $BI/bin directory.

Code Block
titleRunning vcfutils.pl
vcfutils.pl qstats test.raw.vcf
Expand
Output
Output
Code Block
login1$ vcfutils.pl qstats test.raw.vcf
QUAL #non-indel #SNPs #transitions #joint ts/tv #joint/#ref #joint/#non-indel
186	1011	1011	757	0	2.9803	0.0000	0.0000	2.20009803
65142	117822036	117822036	81761441	0	2.26734218	0.0000	0.0000	2.37640059
61122	130593091	130593091	90902156	0	2.29023059	0.0000	0.0000	2.51791029
57109	142804177	142804177	99452925	0	2.29413363	0.0000	0.0000	2.33614259
5399.5	153015237	153015237	106563647	0	2.29412937	0.0000	0.0000	2.29351361
48.892	163236273	163236273	113474354	0	2.28032689	0.0000	0.0000	2.08761489
4585.5	174607328	174607328	121225105	0	2.27092964	0.0000	0.0000	2.14094704
41.879	186398352	186398352	128995811	0	2.24722869	0.0000	0.0000	12.93282201
39.874	196609369	196609369	135726497	0	2.22932622	0.0000	0.0000	12.93390725
37.869	2101310553	2101310553	144967311	0	2.22432551	0.0000	0.0000	2.15382000
35.865	2230911782	2230911782	153808176	0	2.21972673	0.0000	0.0000	2.14563764
33.861	2356813059	2356813059	162519090	0	2.22102902	0.0000	0.0000	2.24485179
31.857	2484614280	2484614280	171019945	0	2.20802941	0.0000	0.0000	12.98603361
29.853	2617115301	2617115301	1797510656	0	2.19312941	0.0000	0.0000	12.93792935
2848.8	2730816323	2730816323	1868811347	0	2.16802803	0.0000	0.0000	12.68160876
2645	2865417460	2865417460	1955112122	0	2.14782709	0.0000	0.0000	12.78671409
2441.8	2994718639	2994718639	2037112899	0	2.12732472	0.0000	0.0000	1.73369328
2239.8	3105019660	3105019660	2106513572	0	2.10972293	0.0000	0.0000	1.69689339
2037.8	3208121013	3208121013	2171914496	0	2.09602243	0.0000	0.0000	12.73471538
1735.18	3325122309	3325122309	2244615380	0	2.07742197	0.0000	0.0000	12.64111456
1333.28	3444723568	3444723568	2323816251	0	2.07322210	0.0000	0.0000	12.96042448
1031.48	3575124846	3575124846	2406717101	0	2.05982080	0.0000	0.0000	1.74539860
929.538	3677226171	3677226171	2472417975	0	2.05211931	0.0000	0.0000	1.80499379
8.6528	3802327308	3802327308	2553918688	0	2.04571680	0.0000	0.0000	1.86936816
7.826	3930128654	3930128654	2636019551	0	2.03691478	0.0000	0.0000	1.79657867
6.9824	4066529947	4066529947	2719620371	0	2.01921273	0.0000	0.0000	1.58337336
6.222	4239431050	4239431050	2824321065	0	12.99581097	0.0000	0.0000	1.53526968
5.4620	4401832081	4401832081	2921121719	0	12.97280960	0.0000	0.0000	1.47567347
417.771	4555333251	4555333251	3016822446	0	12.96090774	0.0000	0.0000	1.65576411
413.132	4702034447	4702034447	3108123238	0	12.95000732	0.0000	0.0000	1.64809604
310.544	4857235751	4857235751	3206324067	0	12.94220598	0.0000	0.0000	1.72287453
39.0153	5046336772	5046336772	3313924724	0	12.91290521	0.0000	0.0000	1.3202

Here is an honest write-up about the Ts/Tv metric, referencing this 2002 paper by Ebersberger.  Bottom line: between about 2.1 and 2.8 is OK for Ts/Tv (also called Ti/Tv).

You can also get some quick stats with some linux one-liners on this page; there are more thorough analysis programs built to work with vcf's.

...

Code Block
titleSummary stats - indels and het vs. hom. alleles
login1$ grep -c INDEL test.raw.vcf
3432
login1$ cat test.raw.vcf | awk 'BEGIN {FS=";"} {for (i=1;i<=NF;i++) {if (index($i,"AF1")!=0) {print $i} }}' | \
awk 'BEGIN {FS="="} {print int($2*10)/10}' | sort | uniq -c | sort -n -r | head
  36757 1
  19404 0.5
   1819 0.4
     36 0.6
     17 0.8
     16 0.7
      1 0

Keep in mind that variant files only record variation that can be seen with the data provided. Where ever sample sequence exactly matches the reference (i.e. is homozygous wildtype relative to the reference) there will be no data. Which looks the same as if you had no data in those regions; this leads us to our next topic.

Multiple-sample variant calling with samtools

This is all fine, but if you're actually trying to study human (or other organism) genetics, you must discriminate homozygous WT from a lack of data. This is done by providing many samples to the variant caller simultaneously. This concept extends further to populations; calling variants across a large and diverse population provides a stronger Bayesian prior probability distribution for more sensitive detection.

To instruct samtools to call variants across many samples, you must simply give it mapped data with each sample tagged separately. Samtools allows two methods to do this:

  1. By providing separate bam files for each sample, like this:

    Code Block
    titlesamtools multi-sample variants: separate bam files
    samtools mpileup -uf hs37d5.fa \
      NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
      NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
      NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
        | bcftools view -vcg - > all.samtools.vcf
    
  2. By providing one or more bam files, each containing mapped reads from multiple samples tagged with unique samtools @RG tags.

    Code Block
    titlesamtools multi-sample variants: one or more bam files using @RG
    samtools mpileup -uf hs37d5.fa all.bam | bcftools view -vcg - > all.raw.vcf
    

The output file from the first option is in $BI/ngs_course/human_variation

...

8049
8.65	38023	38023	25539	0	2.0457	0.0000	0.0000	1.8693
7.8	39301	39301	26360	0	2.0369	0.0000	0.0000	1.7965
6.98	40665	40665	27196	0	2.0192	0.0000	0.0000	1.5833
6.2	42394	42394	28243	0	1.9958	0.0000	0.0000	1.5352
5.46	44018	44018	29211	0	1.9728	0.0000	0.0000	1.4756
4.77	45553	45553	30168	0	1.9609	0.0000	0.0000	1.6557
4.13	47020	47020	31081	0	1.9500	0.0000	0.0000	1.6480
3.54	48572	48572	32063	0	1.9422	0.0000	0.0000	1.7228
3.01	50463	50463	33139	0	1.9129	0.0000	0.0000	1.3202

Here is an honest write-up about the Ts/Tv metric, referencing this 2002 paper by Ebersberger.  Bottom line: between about 2.1 and 2.8 is OK for Ts/Tv (also called Ti/Tv).

You can also get some quick stats with some linux one-liners on this page; there are more thorough analysis programs built to work with vcf's.

Expand
Linux one-liner stats for this case
Linux one-liner stats for this case
Code Block
titleSummary stats - indels and het vs. hom. alleles
login1$ grep -c INDEL test.raw.vcf
3432
login1$ cat test.raw.vcf | awk 'BEGIN {FS=";"} {for (i=1;i<=NF;i++) {if (index($i,"AF1")!=0) {print $i} }}' | \
awk 'BEGIN {FS="="} {print int($2*10)/10}' | sort | uniq -c | sort -n -r | head
  36757 1
  19404 0.5
   1819 0.4
     36 0.6
     17 0.8
     16 0.7
      1 0

Keep in mind that variant files only record variation that can be seen with the data provided. Where ever sample sequence exactly matches the reference (i.e. is homozygous wildtype relative to the reference) there will be no data. Which looks the same as if you had no data in those regions; this leads us to our next topic.

Multiple-sample variant calling with samtools

This is all fine, but if you're actually trying to study human (or other organism) genetics, you must discriminate homozygous WT from a lack of data. This is done by providing many samples to the variant caller simultaneously. This concept extends further to populations; calling variants across a large and diverse population provides a stronger Bayesian prior probability distribution for more sensitive detection.

To instruct samtools to call variants across many samples, you must simply give it mapped data with each sample tagged separately. Samtools allows two methods to do this:

  1. By providing separate bam files for each sample, like this:

    Code Block
    titlesamtools multi-sample variants: separate bam files
    samtools mpileup -uf ref/hs37d5.fa \
      NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
      NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
      NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
        | bcftools view -vcg - > all.samtools.vcf
    
  2. By providing one or more bam files, each containing mapped reads from multiple samples tagged with unique samtools @RG tags.

    Code Block
    titlesamtools multi-sample variants: one or more bam files using @RG
    samtools mpileup -uf hs37d5.fa all.bam | bcftools view -vcg - > all.raw.vcf
    

The output file from the first option is in $BI/ngs_course/human_variation

CAVEAT: If you intend to use the second option, you must make sure you tag your reads with the right RG tag; this can easily be done during the samse or sampe stage of mapping with bwa with the -r option, using samtools merge, with picard tools AddOrReplaceReadGroup command, or with your own perl/python/bash commands. Note that our align_bwa.sh script takes care of this detail for us.

Identify the lineage

If genetics works, you should be able to identify the child based strictly on the genotypes.  Can you do it?

Expand
titleHint...

You're trying to find the genotypes in the all.samtools.vcf and all.GATK.vcf files, and then use your knowledge of Mendelian inheritance to figure out which of the three samples is the only one that could be a child of the other two. 

Expand
titleOne solution...

This linux one-liner should give you a snapshot of data sufficient to figure it out:

Code Block
cat all.samtools.vcf | head -10000 | awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | grep "0/0" | sed s/':'/' '/g | awk '{print $2"\t"$5"\t"$8}' | tail -100 | sort | uniq -c
Expand
titleHere's the output and discussion
Code Block
tacc:/scratch/01057/sphsmith/human_variation$ cat all.samtools.vcf | head -10000 | awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | grep "0/0" | sed s/':'/' '/g | awk '{print $2"\t"$5"\t"$8}' | tail -100 | sort | uniq -c 
     12 0/0	0/1	0/0
      5 0/0	0/1	0/1
      3 0/1	0/0	0/0
      4 0/1	0/0	0/1
      8 0/1	0/0	1/1
     43 0/1	0/1	0/0
     24 0/1	1/1	0/0
      1 1/1	0/1	0/0

 

Here are the steps going into this command:

1) Dump the contents of all.samtools.vcf

2) Take the first 10,000 lines

3) If the variant quality score is greater than 500, then print fields 2 (SNP position), 10, 11, and 12 (the 3 genotypes).

4) Filter for only lines that have at least one homozygous SNP (exercise to the reader to understand why...)

5) Break the genotype call apart from other information about depth: "sed" turns the colons into spaces so that awk can just print the genotype fields.

6) Take the last 100 lines, sort them, then count the unique lines

 

Here is my interpretation of the data:

1) This method effectively looks at a very narrow genomic region, probably within a homologous recombination block.

2) The most telling data: the child will have heterozygous SNPs from two homozygous parents.

3) So all this data is consistent with column 1 (NA12878) being the child:

	 12 0/0	0/1	0/0
5 0/0 0/1 0/1
4 0/1 0/0 0/1
8 0/1 0/0 1/1
43 0/1 0/1 0/0
24 0/1 1/1 0/0

"Outlier" data are:

      3 0/1	0/0	0/0
      1 1/1	0/1	0/0
 

This is, in fact, the correct assessment - NA12878 is the child.

 

 

GATK

GATK deserves it's own page which is here, but we've already run it and will now look at some differences in the SNP calls.

Look at the differences between single- and multiple-sample SNP calls, and between Samtools/Bcftools SNP calls and GATK SNP calls

How many SNPs were called in each case?

Expand
titleTry it your own way, or try this one-line command...
 for file in `ls *.vcf`; do echo "File: $file `cat $file | grep -v '^#' | wc -l`"; done

 

What's the overlap between all the single-sample SNP calls aggregated together and the multi-sample SNP calls?

Expand
Code Block
cat NA128*samtools.vcf | grep -v '^#' | awk '{print $2}' | sort | uniq > all.single.samtools.snps
cat all.samtools.vcf | grep -v '^#' | awk '{print $2}' | sort | uniq > all.mutiple.samtools.snps
comm all.single.samtools.snps all.mutiple.samtools.snps | awk 'BEGIN {print "Only in single\tOnly in multiple\tIn both"; FS="\t"} {i[NF]++} END {print i[1]"\t"i[2]"\t"i[3]}'



In theory, GATK does a much better job ruling out false positives.  But are there some SNPs GATK calls with high confidence that Samtools doesn't call at all?

Expand
titleHere are four commands to the answer...
Code Block
grep -v '^#' all.GATK.vcf | awk '{$2=$2"AAAAAAA"; print}' | sort -k 2 > g.v
grep -v '^#' all.samtools.vcf | awk '{$2=$2"AAAAAAA"; print}' | sort -k 2 > s.v
join -a 1 -1 2 -2 2 g.v s.v | awk '{if (NF==12) {print}}' | grep PASS > g.v.pass.only
cat g.v.pass.only | sort -k 6 -n -r | head

What's going on here:

1) Yank out all the variant calls (comments start with '#') and add a string of "AAAAAAA" to them to make "sort" do it's job in a way that "join" will later like

2) Join the two files using their chromosome position as the join field, but also include any lines from the GATK file that DON'T match the samtools file. Use "awk" to figure out which ones came only from GATK (they are missing a bunch of fields from the samtools variant calls), look only for those that GATK has labeled "PASS" and write them to a file.

3) Sort the resultant file on the variant quality value - take the top 10 lines.


You will note that many of these are complex variants, particularly insertions, so it's not too surprising that GATK does better. But here's a SNP that GATK does much better on:

chr21:34278313

It has an interesting quantitative signature though... you might want to look at it in IGV.


Other notes on bcftools

bcftools has many other sub-commands such as performing association tests and estimating allele frequency spectrums.

You can't get gene-context information from bcftools - you have to shift to variant annotation tools to do that.

Side-note on samtools mpileup

...