Human Trios -- GVA2020

Overview

Trio (or familial) analysis has been exceptionally powerful for identifying rare childhood diseases.  The most prominent publication in this area is this first example of whole exome sequencing saving a life.  There are many other publications since and some review articles such as this one.  Familial analysis is critical for rare, autosomal dominant diseases because, almost by definition, the mutations may be "private" to each individual so we can't look across big populations to find one single causative mutation.  But within families, we can identify bad private mutations in single genes or pathways and then look across populations to find commonality at the gene or pathway level to explain a phenotype.

Learning Objectives

  1. Review differences in calling mutations on higher level organisms
  2. Call SNVs from multiple samples simultaneously
  3. Determine relationship between the individuals based on the results

Diploid genomes

The initial steps in calling variants for diploid or multi-ploid organisms with NGS data are the same as what we've already seen:

  1. Map the raw data to a suitable reference
  2. Look for SNVs.

Expectations of output are quite different however, which can add statistical power to uncovering variation in populations or organisms with more than two expected variants at the same location. If you happen to be working with a model organism with extensive external data (ESPECIALLY HUMAN), then there are even more sophisticated tools like the Broad Institute's GATK that can improve both sensitivity and specificity of your variant calls.

Get some data

Many example datasets are available from the 1000 Genomes Project specifically for method evaluation and training. We'll explore a trio (mom, dad, child). Known as the CEU Trio from the 1000 genomes project  their accession numbers are NA12892, NA12891, and NA12878 respectively. To make the exercise run FAR more quickly, we'll focus on data only from chromosome 20.

All the data we'll use is located here:

Diploid genome (human) example files. This directory is very large and will take a few minutes to copy. Read ahead while the copy command runs
mkdir $SCRATCH/GVA_Human_trios
cd $SCRATCH/GVA_Human_trios
cp -r $BI/ngs_course/human_variation raw_files
ls raw_files

This directory contains the following:

  1. compressed raw data (the .fastq.gz files)
  2. mapped data (the .bam files)
  3. variant calls (the .vcf files)
  4. the subdirectory ref with special references
  5. .bam files containing a subset of mapped human whole exome data are also available on these three; those are the three files "NA*.bam".
  6. We've pre-run samtools and GATK on each sample individually - those are the *GATK.vcf and *samtools.vcf files.
  7. We've also pre-run samtools and GATK on the trio, resulting in GATK.all.vcf and samtools.all.vcf.
  8.  The 1000 Genomes project is really oriented to producing .vcf files; the file "ceu20.vcf" contains all the latest genotypes from this trio based on abundant data from the project. 

Single-sample variant calling with samtools

We would normally use the BAM file from a previous mapping step to call variants in this raw data. However, for the purposes of this course we will use the actual BAM file provided by the 1000 Genomes Project (from which the .fastq file above was derived, leading to some oddities in it). As a bonus tutorial, you could map the data yourself and using what you learned in the bowtie2 tutorial and then use the resultant .bam files.

For now, the bam file we want to focus on is:

$SCRATCH/GVA_Human_trios/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam

With samtools, this is a two-step process:

  1. samtools mpileup command transposes the mapped data in a sorted BAM file fully to genome-centric coordinates. It starts at the first base on the first chromosome for which there is coverage and prints out one line per base. Each line has information on every base observed in the raw data at that base position along with a lot of auxiliary information depending on which flags are set. It calculates the Bayseian prior probability given by the data, but does not estimate a real genotype.
  2. bcftools with a few options added uses the prior probability distribution and the data to calculate a genotype for the variants detected.


Remember to make sure you are on an idev done

It is unlikely that you are currently on an idev node as copying the files while on an idev node seems to be having problems as discussed. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes.

If you need more information or help re-launching a new idev node, please see this tutorial.

DO NOT run these commands on the head node

Calling variants using samtools and bcftools. Note the last command is quite long and may wrap around 2 lines your monitor or extend to the right of what you can see without scrolling over
mkdir samtools_example
cd samtools_example
module load samtools
samtools mpileup -u -f $SCRATCH/GVA_Human_trios/raw_files/ref/hs37d5.fa $SCRATCH/GVA_Human_trios/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | bcftools call -v -c - > trios_tutorial.raw.vcf

The above command is expected to take ~5 minutes but will not produce many lines of output leading students to worry their terminal has locked up. More than likely this is not the case but get my attention on zoom if you think its been running for more than 5 minutes with no change in output.

One potential issue with this type of approach is that vcf files only record variation that can be seen with the data provided. Where ever sample sequence exactly matches the reference (i.e. is homozygous wildtype relative to the reference) there will be no data. Which looks the same as if you had no data in those regions; this leads us to our next topic. 

Multiple-sample variant calling with samtools

Not being able to tell between no data and wildtype is not the end of the world for a single sample, but if you're actually trying to study human (or other organism) genetics, you must discriminate homozygous WT from a lack of data. This is done by providing many samples to the variant caller simultaneously. This concept extends further to populations; calling variants across a large and diverse population provides a stronger Bayesian prior probability distribution for more sensitive detection.

To instruct samtools to call variants across many samples, you must simply give it mapped data with each sample tagged separately. Samtools allows two methods to do this:

  1. By providing separate bam files for each sample. Note that in the following code block, he trailing \ symbols tells the command line that you are not done entering the command and not to execute any commands until you hit return without a preceding \ mark.

    Warning, DO NOT run this command yet.

    samtools multi-sample variants: separate bam files
    #samtools mpileup -uf $SCRATCH/GVA_Human_trios/raw_files/ref/hs37d5.fa \
      $SCRATCH/GVA_Human_trios/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
      $SCRATCH/GVA_Human_trios/raw_files/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
      $SCRATCH/GVA_Human_trios/raw_files/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
        | bcftools call -v -c - > multi_sample/trios_tutorial.all.samtools.vcf
    


    The output file from this option is in 
    $BI/ngs_course/human_variation as all.samtools.vcf


  2. By providing one or more bam files, each containing mapped reads from multiple samples tagged with unique samtools @RG tags.

    Do not run this command

    This command comes with a sizable caveat: If you intend to use this option, you must make sure you tag your reads with the right RG tag; this can easily be done during the samse or sampe stage of mapping with bwa with the -r option, using samtools merge, with picard tools AddOrReplaceReadGroup command, or with your own perl/python/bash commands. The problem is that it requires prior knowledge of all the samples you are going to group together ahead of time, and individual sample input files have to be merged together. There are obviously times that this will be the desired and easier, but for the purpose of this course, we believe it makes more sense to keep all input files separate and only have a single merged output file. As part of the tutorial we do provide the appropriate files necessary for the following command to run.

    samtools multi-sample variants: one or more bam files using @RG
    samtools mpileup -u -f $SCRATCH/GVA_Human_trios/raw_files/ref/hs37d5.fa all.bam | bcftools call -v -c - > trios_tutorial.all.raw.vcf
    

Based on the discussion above, we are selecting the first solution and providing details of how this command should be run.

samtools multi-sample variants: separate bam files
mkdir $SCRATCH/GVA_Human_trios/multi_sample
cd $SCRATCH/GVA_Human_trios
samtools mpileup -uf $SCRATCH/GVA_Human_trios/raw_files/ref/hs37d5.fa \
  $SCRATCH/GVA_Human_trios/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
  $SCRATCH/GVA_Human_trios/raw_files/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
  $SCRATCH/GVA_Human_trios/raw_files/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
    | bcftools call -v -c - > multi_sample/trios_tutorial.all.samtools.vcf

This command again will generate very little output and take ~10 minutes to complete.

Identify the lineage

If genetics works, you should be able to identify the child based strictly on the genotypes.  Can you do it?

 Hint...

You're trying to find the genotypes in the trios_tutorial.all.samtools.vcf file, and then use your knowledge of Mendelian inheritance to figure out which of the three samples is the only one that could be a child of the other two. 

This linux one-liner should give you a snapshot of data sufficient to figure it out:
cat trios_tutorial.all.samtools.vcf | tail -10000 | awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | grep "0/0" | sed s/':'/'\t'/g | awk '{print $2"\t"$4"\t"$6}' | tail -100 | sort | uniq -c | sort -n -r
 Explanation of command

Here are the steps going into this command:

  1. cat trios_tutorial.all.samtools.vcf |
    1. Dump the contents of trios_tutorial.all.samtools.vcf and pipe it to the next command
  2. tail -10000 |
    1. Take the last 10,000 lines and pipe it to the next command. As the top of the file has header information, the last lines are all data
  3. awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | 
    1. If the variant quality score (the 6th column or $6) is greater than 500, then print the following fields 2 (SNP position), 10, 11, and 12 (the 3 genotypes). and pipe to next command
  4. grep "0/0" |
    1. Filter for only lines that have at least one homozygous SNP and pipe them to the next command
    2.  Think about genetics and why this is important. If you aren't sure ask.
  5. sed s/':'/'\t'/g | awk '{print $2"\t"$4"\t"$6}' |
    1. Break the genotype call apart from other information about depth: "sed" turns the colons into tabs so that awk can just print the genotype fields. and pipe to next output
  6. tail -100 | sort | uniq -c | sort -n -r
    1. Take the last 100 lines. 100 is used to ensure we get some good informative counts, but not so many that noise becomes a significant problem.
    2. sort them
    3. then count the unique lines
    4. sort them again, in numeric order, and print them in reversed order
example output of sample solution
     34 0/1	0/1	0/0
     20 0/1	0/0	0/1
     20 0/0	0/1	0/0
     14 0/1	1/1	0/0
      6 0/0	0/1	0/1
      4 1/1	1/1	0/0
      1 1/1	0/1	0/0
      1 0/1	0/0	0/0

Given this information can you make any determination about family structure for these 3 individuals? The first column is the number of occurrences generated by the uniq -c command in our large 1 liner, with the following 3 columns being the different individual samples. So consider that while some lines may not show a viable mendelian inheritance pattern, you should weight things according to how many times each scenario occurred as our filtering was fairly limited.

 Discussion of the output

Overall this data is consistent with column 1 (NA12878) being the child. Lines marked with an * are inconsistant:

     34 0/1	0/1	0/0
     20 0/1	0/0	0/1
     20 0/0	0/1	0/0
     14 0/1	1/1	0/0	# middle can't be child
      6 0/0	0/1	0/1
      4 1/1	1/1	0/0	* No mendelian combination exists
      1 1/1	0/1	0/0	*
      1 0/1	0/0	0/0	*
 

This is, in fact, the correct assessment - NA12878 is the child.

Going further

Can you modify the large 1 liner command to be more strict, or to include more examples such that you eliminate the non-mendelian inheritance situations and consider a larger number of loci?

This same type of analysis can be done on much larger cohorts, say cohorts of 100 or even 1000s of individuals with known disease state to attempt to identify associations between allelic state and disease state.


Return to GVA2020 page.