Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. compressed raw data (the .fastq.gz files)
  2. mapped data (the .bam files)
  3. variant calls (the .vcf files)
  4. the subdirectory ref with special references
  5. .bam files containing a subset of mapped human whole exome data are also available on these three; those are the three files "NA*.bam".
  6. We've pre-run samtools and GATK on each sample individually - those are the *GATK.vcf and *samtools.vcf files.
  7. We've also pre-run samtools and GATK on the trio, resulting in GATK.all.vcf and samtools.all.vcf.
  8.  The 1000 Genomes project is really oriented to producing .vcf files; the file "ceu20.vcf" contains all the latest genotypes from this trio based on abundant data from the project. 

 

 

Here we show the steps for:

  1. Mapping
  2. Single-sample variant calling with samtools
  3. Multiple-sample variant calling with samtools

We'll return to this example data later to demonstrate a much more involved tool, GATK, to do the same steps in another tutorial.

Single-sample variant calling with samtools

...

Warning
titleMake sure you are on an idev node, or submit as job

This command will take quite a bit of time to complete. While it is running on an idev node, see if you can figure out what each of the options in the mpileup and bcftools commands are doing

Code Block
titleCalling variants using samtools and bcftools
cds 
cd BDIB_Human_tutorial
mkdir samtools_example
cd samtools_example
module unloadload samtools
samtools mpileup -u -uff $SCRATCH/BDIB_Human_tutorial/raw_files/ref/hs37d5.fa $SCRATCH/BDIB_Human_tutorial/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | bcftools viewcall -v -vcgc - > trios_tutorial.raw.vcf

...

  1. By providing separate bam files for each sample:. Note that in the following code block, he trailing \ symbols tells the command line that you are not done entering the command and not to execute any commands until you hit return without a preceding \ mark.

    Wait to run this command until you have after the explanation of the 2 methods and creation of new directories
    Warning
    titleDo not Warning, DO NOT run this command yet
    .
    Code Block
    titlesamtools multi-sample variants: separate bam files
    samtools mpileup -uf $SCRATCH/BDIB_Human_tutorial/raw_files/ref/hs37d5.fa \
      $SCRATCH/BDIB_Human_tutorial/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
      $SCRATCH/BDIB_Human_tutorial/raw_files/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
      $SCRATCH/BDIB_Human_tutorial/raw_files/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
        | bcftools viewcall -v -vcgc - > multi_sample/trios_tutorial.all.samtools.vcf
    

    Note that the trailing \ symbols tells the command line that you are not done entering the command and not to execute any commands until you hit return without a preceding \ mark.


    The output file from this option is in 
    $BI/ngs_course/human_variation as all.samtools.vcf

     

     

  2. By providing one or more bam files, each containing mapped reads from multiple samples tagged with unique samtools @RG tags.

    Warning
    titleDo not run this command

    This command comes with a sizable caveat: If you intend to use this option, you must make sure you tag your reads with the right RG tag; this can easily be done during the samse or sampe stage of mapping with bwa with the -r option, using samtools merge, with picard tools AddOrReplaceReadGroup command, or with your own perl/python/bash commands. The problem is that it requires prior knowledge of all the samples you are going to group together ahead of time, and individual sample input files have to be merged together. There are obviously times that this will be the desired and easier, but for the purpose of this course, we believe it makes more sense to keep all input files separate and only have a single merged output file. As part of the tutorial we do provide the appropriate files necessary for the following command to run.

    Code Block
    titlesamtools multi-sample variants: one or more bam files using @RG
    samtools mpileup -ufu -f $SCRATCH/BDIB_Human_tutorial/raw_files/ref/hs37d5.fa all.bam | bcftools viewcall -v -vcgc - > trios_tutorial.all.raw.vcf
    

...

Code Block
titlesamtools multi-sample variants: separate bam files
cds
cd BDIB_Human_tutorial
mkdir multi_sample
samtools mpileup -uf $SCRATCH/BDIB_Human_tutorial/raw_files/ref/hs37d5.fa \
  $SCRATCH/BDIB_Human_tutorial/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
  $SCRATCH/BDIB_Human_tutorial/raw_files/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
  $SCRATCH/BDIB_Human_tutorial/raw_files/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
    | bcftools call view-v -vcgc - > multi_sample/trios_tutorial.all.samtools.vcf

...

Code Block
languagebash
titleThis linux one-liner should give you a snapshot of data sufficient to figure it out:
collapsetrue
cat trios_tutorial.all.samtools.vcf | headtail -10000 | awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | grep "0/0" | sed s/':'/' \t'/g | awk '{print $2"\t"$5$4"\t"$8$6}' | tail -100 | sort | uniq -c | sort -n -r
Expand
titleExplanation of command

Here are the steps going into this command:

  1. cat trios_tutorial.all.samtools.vcf |
    1. Dump the contents of trios_tutorial.all.samtools.vcf and pipe it to the next command
  2. headtail -10000 |
    1. Take the firstlast 10,000 lines and pipe it to the next command. As the top of the file has header information, the last lines are all data
  3. awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | 
    1. If the variant quality score (the 6th column or $6) is greater than 500, then print the following fields 2 (SNP position), 10, 11, and 12 (the 3 genotypes). and pipe to next command
  4. grep "0/0" |
    1. Filter for only lines that have at least one homozygous SNP and pipe them to the next command
    2.  Think about genetics and why this is important. If you aren't sure ask us.
  5. sed s/':'/' \t'/g | awk '{print $2"\t"$5$4"\t"$8$6}' |
    1. Break the genotype call apart from other information about depth: "sed" turns the colons into spaces tabs so that awk can just print the genotype fields. and pipe to next output
  6. tail -100 | sort | uniq -c | sort -n -r
    1. Take the last 100 lines. 100 is used to ensure we get some good informative counts, but not so many that noise becomes a significant problem.
    2. sort them, 
    3. then count the unique lines
    4. sort them again, in numeric order, and print them in reversed order
No Format
titleexample output of sample solution
collapsetrue
     1234 0/01	0/1	0/0
     20 5 0/01	0/10	0/1
      320 0/10	0/01	0/0
     14 4 0/1	01/01	0/10
      86 0/10	0/1	0	1/1
     43 4 01/1	01/1	0/0
      241 01/1	10/1	0/0
      1 10/1	0/10	0/0
Expand
titleDiscussion of the output

Here is my interpretation of the data:

1) This method effectively looks at a very narrow genomic region, probably within a homologous recombination block.

2) The most telling data: the child will have heterozygous SNPs from two homozygous parents.

3) So all Overall this data is consistent with column 1 (NA12878) being the child. Lines marked with an * are inconsistant:

	 12     34 0/01	0/1	0/0
520 0/01 0/10 0/1
420 0/10 0/01 0/1
0 814 0/1 1/1 0/0 1/1
# middle can't be child 436 0/10 0/1 0/0
1 244 01/1 1/1 0/0

"Outlier" data are:

	* No mendelian combination exists
      31 01/1	0/01	0/0	*
      1 10/1	0/10	0/0	*
 

This is, in fact, the correct assessment - NA12878 is the child.

...

This same type of analysis can be done on much larger cohorts, say cohorts of 100 or even 1000s of individuals with known disease state to attempt to identify associations between allelic state and disease state.

Return to GVA2017 page.