/
Calling variants in diploid or multiploid genomes

Calling variants in diploid or multiploid genomes

Diploid genomes

The initial steps in calling variants for diploid or multi-ploid organisms with NGS data are the same as what we've already seen:

  1. Map the raw data to a suitable reference
  2. Look for SNPs and small-scale indels

Expectations of output are quite different however, which can add statistical power to uncovering variation in populations or organisms with more than two expected variants at the same location.

Example: The CEU Trio from the 1000 Genomes Project

Many example datasets are available from the 1000 genomes project specifically for method evaluation and training. We'll explore a trio (mom, dad, child). Their accession numbers are NA12892, NA12891, and NA12878 respectively. To make the exercise run more quickly, we'll focus on data only from chromosome 20.

All the data we'll use is located here:

Diploid genome (human) example files
$BI/ngs_course/human_variation

This directory contains raw data (the .fastq files), mapped data (the .bam files) and variant calls (the .vcf files). It also contains the subdirectory ref with special references.

The 1000 Genomes project is really oriented to producing .vcf files; the file "ceu20.vcf" contains all the latest genotypes from this trio based on abundant data from the project.

.bam files containing a subset of mapped human whole exome data are also available on these three; those are the three files "NA*.bam".

We've pre-run samtools and GATK on each sample individually - those are the *GATK.vcf and *samtools.vcf files.

We've also pre-run samtools and GATK on the trio, resulting in GATK.all.vcf and samtools.all.vcf.

We'll now show the steps for:

  1. Mapping
  2. Single-sample variant calling with samtools
  3. Multiple-sample variant calling with samtools

We'll return to this example data shortly to demonstrate a much more involved tool, GATK, to do the same steps.

Mapping Exercise

Let's use Anna's shell script align_bwa.sh to align the fastq files to the hg19 version of the genome, and the python program launcher_creator.py to create the job submission script.

Don't forget that for this to work, you need to have appended $BI/bin to your path.

 Show me how to modify my path...

BI="/corral-repl/utexas/BioITeam"
PATH=$PATH:$BI/bin
export PATH

Move into your scratch directory and then try to figure out how to create and qsub the align_bwa.sh command to align the data file $BI/ngs_course/human_variation/allseqs_R1.fastq against the hg19 reference. Call the output "test".

 Solution
Make job submission script for mapping & variant calling
echo "align_bwa.sh $BI/ngs_course/human_variation/allseqs_R1.fastq test hg19 1 > aln.test.log 2>&1" > commands
launcher_creator.py -l map_call.sge -n map_call -t 01:00:00 -j commands
module load bwa
module load samtools
qsub map_call.sge

Caution: If you are using this example outside an SSC course, you must use the -a option to specify a valid allocation (e.g. BME_2012)

 Output
login1$ echo "align_bwa.sh $BI/ngs_course/human_variation/allseqs_R1.fastq test hg19 1" > commands
launcher_creator.py -l map_call.sge -n map_call -t 01:00:00 -j commands
Job file has 1 lines.
Using 12 cores.
Launcher successfully created. Type "qsub map_call.sge" to queue your job.
login1$ qsub map_call.sge
-----------------------------------------------------------------
-- Welcome to the Lonestar4 Westmere/QDR IB Linux Cluster --
-----------------------------------------------------------------
--> Checking that you specified -V...
--> Checking that you specified a time limit...
--> Checking that you specified a queue...
--> Setting project...
--> Checking that you specified a parallel environment...
--> Checking that you specified a valid parallel environment name...
--> Checking that the number of PEs requested is valid...
--> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
--> Requesting valid memory configuration (23.4G)...
--> Verifying HOME file-system availability...
--> Verifying WORK file-system availability...
--> Verifying SCRATCH file-system availability...
--> Checking ssh setup...
--> Checking that you didn't request more cores than the maximum...
--> Checking that you don't already have the maximum number of jobs...
--> Checking that you don't already have the maximum number of jobs in queue development...
--> Checking that your time limit isn't over the maximum...
--> Checking available allocation...
--> Submitting job...

Your job 586249 ("map_call") has been submitted

Note that the input is paired-end data.

 Output

Your directory should have content like this when done (from ls -lt):

Mapping output
-rw-r--r-- 1 sphsmith G-801020      5732 May 20 23:01 map_call.o586338
-rw------- 1 sphsmith G-801020       392 May 20 23:01 test.flagstat.txt
-rw------- 1 sphsmith G-801020   2175952 May 20 23:01 test.sorted.bam.bai
-rw------- 1 sphsmith G-801020 334782188 May 20 23:01 test.sorted.bam
-rw-r--r-- 1 sphsmith G-801020     13135 May 20 23:00 map_call.e586338
-rw------- 1 sphsmith G-801020 411244396 May 20 23:00 test.bam
-rw------- 1 sphsmith G-801020  45695040 May 20 22:49 test.read2.sai
-rw------- 1 sphsmith G-801020  45372400 May 20 22:39 test.read1.sai
-rw-r--r-- 1 sphsmith G-801020         0 May 20 22:26 map_call.pe586338

and samtools flagstat test.sorted.bam should yield:

samtools flagstat results
Running flagstat...
4546280 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
3992274 + 0 mapped (87.81%:nan%)
4546280 + 0 paired in sequencing
2273140 + 0 read1
2273140 + 0 read2
40290 + 0 properly paired (0.89%:nan%)
3636946 + 0 with itself and mate mapped
355328 + 0 singletons (7.82%:nan%)
44128 + 0 with mate mapped to a different chr
15634 + 0 with mate mapped to a different chr (mapQ>=5)

Single-sample variant calling with samtools

We would normally use the BAM file from the previous mapping step to call variants in this raw data. However, for the purposes of this course we will use the actual BAM file provided by the 1000 Genomes Project (from which the .fastq file above was derived, leading to some oddities in it).
$BI/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam

With samtools, this is a two-step process:

  1. samtools mpileup command transposes the mapped data in a sorted BAM file fully to genome-centric coordinates. It starts at the first base on the first chromosome for which there is coverage and prints out one line per base. Each line has information on every base observed in the raw data at that base position along with a lot of auxiliary information depending on which flags are set. It calculates the Bayseian prior probability given by the data, but does not estimate a real genotype.
  2. bcftools with a few options added uses the prior probability distribution and the data to calculate a genotype for the variants detected.

Here are the commands, piped together:

Calling variants using samtools and bcftools
samtools mpileup -uf $BI/ref_genome/fasta/ucsc/ucsc.hg19.fasta \
  $BI/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
  | bcftools view -vcg - > test.raw.bcf

Note that the reference chosen for mpileup must be exactly the same as the reference used to create the bam file. The 1000 genomes project has created it's own reference and so the command listed above won't work - we have to use the 1000 genomes reference which is $BI/ngs_course/human_variation/ref/hs37d5.fa. We could have chosen another mapper if we'd wanted to though.

Exercise:

Write a modified version of the command above to use the proper reference into a commands file, create the job submission script with launcher_creator.py and submit the job.

 Solution

As we did for mapping, we need to submit these to the Lonestar queue:

Submission of variant calling commands
echo "samtools mpileup -uf $BI/ngs_course/human_variation/ref/hs37d5.fa \
   $BI/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
   | bcftools view -vcg - > test.raw.vcf" > commands
launcher_creator.py -l call_vars.sge -n call_vars -t 01:00:00 -j commands
qsub call_vars.sge
 Output
login1$ echo "samtools mpileup -uf /corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa \
>   /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
>   | bcftools view -vcg - > test.raw.vcf" > commands
login1$ launcher_creator.py -l call_vars.sge -n call_vars -t 01:00:00 -j commands
Job file has 1 lines.
Using 12 cores.
Launcher successfully created. Type "qsub call_vars.sge" to queue your job.
login1$ qsub call_vars.sge
-----------------------------------------------------------------
-- Welcome to the Lonestar4 Westmere/QDR IB Linux Cluster --
-----------------------------------------------------------------
--> Checking that you specified -V...
--> Checking that you specified a time limit...
--> Checking that you specified a queue...
--> Setting project...
--> Checking that you specified a parallel environment...
--> Checking that you specified a valid parallel environment name...
--> Checking that the number of PEs requested is valid...
--> Ensuring absence of dubious h_vmem,h_data,s_vmem,s_data limits...
--> Requesting valid memory configuration (23.4G)...
--> Verifying HOME file-system availability...
--> Verifying WORK file-system availability...
--> Verifying SCRATCH file-system availability...
--> Checking ssh setup...
--> Checking that you didn't request more cores than the maximum...
--> Checking that you don't already have the maximum number of jobs...
--> Checking that you don't already have the maximum number of jobs in queue development...
--> Checking that your time limit isn't over the maximum...
--> Checking available allocation...
--> Submitting job...

Your job 586369 ("call_vars") has been submitted

After your job completes

 If you don't want to wait, click here...

If you don't want to wait, CHANGE TO A NEW DIRECTORY and do this:
ln -s $BI/ngs_course/human_variation/NA12878.chrom20.samtools.vcf test.raw.vcf
and continue with the rest of these exercises. Remember to return to your job directory if you want to see your actual data.

You can examine some of the variant calls with: