compressed raw data (the .fastq.gz files)
mapped data (the .bam files)
variant calls (the .vcf files)
the subdirectory ref with special references
.bam files containing a subset of mapped human whole exome data are also available on these three; those are the three files "NA*.bam".
We've pre-run samtools and GATK on each sample individually - those are the *GATK.vcf and *samtools.vcf files.
We've also pre-run samtools and GATK on the trio, resulting in GATK.all.vcf and samtools.all.vcf. (these files are from old versions)
The 1000 Genomes project is really oriented to producing .vcf files; the file "ceu20.vcf" contains all the latest genotypes from this trio based on abundant data from the project.

Single-sample variant calling with

...

bcftools

We would normally use the BAM file from a previous mapping step to call variants in this raw data. However, for the purposes of this course we will use the actual BAM file provided by the 1000 Genomes Project (from which the .fastq file above was derived, leading to some oddities in it). As a bonus tutorial, you could map the data yourself and using what you learned in the bowtie2 tutorial and then use the resultant .bam files.

...

$SCRATCH/GVA_Human_trios/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam

With samtoolsbcftools, this is a two-step process:

samtools bcftools mpileup command transposes the mapped data in a sorted BAM file fully to genome-centric coordinates. It starts at the first base on the first chromosome for which there is coverage and prints out one line per base. Each line has information on every base observed in the raw data at that base position along with a lot of auxiliary information depending on which flags are set. It calculates the Bayseian prior probability given by the data, but does not estimate a real genotype.
bcftools call with a few options added uses the prior probability distribution and the data to calculate a genotype for the variants detected.

...

Warning

title	Remember to make sure you are on an idev done

It is unlikely that you are currently on an idev node as copying the files while on an idev node causes problems as discussed. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes.

If you need more information or help re-launching a new idev node, please see this tutorial.

DO NOT run these commands on the head nodeYou should request at least 60 minutes on the idev session to make sure the commands have time to finish running.

Recall that we installed samtools on our GVA2021 conda installationand bcftools in our GVA-SNV conda environment. Make sure you have activated your GVA2021 GVA-SNV environment and you have access to samtools and bcftools version 1.915.1

Expand

title	Click here for commands and expected output if you are not sure

Code Block

language	bash
title	activate conda envrionment

conda activate GVA2021GVA-SNV

Code Block

language	bash
title	Make sure you have the expected versions of samtools and bcftools installed

samtools --version
bcftools --version

samtools --version output:

No Format
samtools 1.15.91 Using htslib 1.15.91 Copyright (C) 20182022 Genome Research Ltd. # Followed by a bunch of compilation details.

bcftools --version output:

No Format

bcftools 1.915.1
Using htslib 1.15.91
Copyright (C) 20182022 Genome Research Ltd.
License ExpatGPLv3+: The MIT/Expat license GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

...

Expand

title	What to do if you do not get version 1.15.9 1 for both samtools and bcftools in the above version checks?

The #1 most likely issue here will be that you see samtools version 1.7 rather than 1.9 and will suggest that somewhere in other modifications of the GVA2021 environment, we introduced a change to either samtools or opensssl. If you experience this error, let me know, I'm trying to track down if this is something that a tutorial specifically caused, or if its an error of installing some other tool into this environment mistakenly.
The simplest solution will be to create a new conda environment, and install just samtools, bcftools, and openssl verison 1.0 in it, as we did when we originally installed them into the GVA2021 environment.

Code Block

language	bash

conda create --name GVA-samtools -c bioconda samtools bcftools openssl=1.0
conda activate GVA-samtools

If you need to do this, make sure you are in the new GVA-samtools conda environment before using any of the samtools commands listed here, or switch to it when you get the error message about libcriptoIf you are not seeing the correct versions, there is either a problem activating or creating your environment. Either try to activate the environment again, go back to the SNV tutorial, or ask for help before continuing.

Code Block

language	bash
title	Calling variants using samtools and bcftools. Note the last bcftools command is quite long and may wrap around 2 lines your monitor or extend to the right of what you can see without scrolling over
linenumbers	true

mkdir samtools_example
cd samtools_example
samtools mpileup -cd $SCRATCH/GVA_Human_trios 
bcftools mpileup --threads 64 -O u -f $SCRATCH/GVA_Human_trios/raw_files/ref/hs37d5.fa $SCRATCH/GVA_Human_trios/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | bcftools call --threads 64 -v -c - > trios_tutorial.raw.vcf

The above command may take >30 ~20 minutes to run based on some student's experience and will not produce many lines of output leading students to worry their terminal has locked up. More than likely this is not the case, but you can try hitting the return key 1 or 2 times to see if your terminal adds a blank line to verify. As this command is taking so long it is recomended recommended that you read ahead and/or switch to reading another tutorial in another terminal window while it runs, just be aware that you should not start 2 idev sessions in 2 different windowswhile waiting for this command to finish.

One potential issue with this type of approach is that vcf files only record variation that can be seen with the data provided. When all reads mapping to a given location exactly match the reference (i.e. is homozygous wildtype relative to the reference) there will be no data. Which looks the same as if you had no data in those regions; this leads us to our next topic.

Multiple-sample variant calling with

...

bcftools

Not being able to tell between no data and wildtype is not the end of the world for a single sample, but if you're actually trying to study human (or other organism) genetics, you must discriminate homozygous WT from a lack of data. This is done by providing many samples to the variant caller simultaneously. This concept extends further to populations; calling variants across a large and diverse population provides a stronger Bayesian prior probability distribution for more sensitive detection.

To instruct samtools bcftools to call variants across many samples, you must simply give it mapped data with each sample tagged separately. Samtools bcftools allows two methods to do this:

By providing separate bam files for each sample. Note that in the following code block, he trailing \ symbols tells the command line that you are not done entering the command and not to execute any commands until you hit return without a preceding \ mark.

Note

title	Warning, DO NOT run this command yet.

Code Block

language	bash
title	samtools bcftools multi-sample variants: separate bam files

#samtools#bcftools mpileup --threads 64 -O u -uf $SCRATCH/GVA_Human_trios/f raw_files/ref/hs37d5.fa \
  $SCRATCH/GVA_Human_trios/
  raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
  $SCRATCH/GVA_Human_trios/raw_files/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
  $SCRATCH/GVA_Human_trios/raw_files/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
    | bcftools call -v -c - > multi_sample/trios_tutorial.all.samtoolsmulti-sample.vcf

The output file from this option is $BI/gva_course/GVA2021.all.samtools.vcf.samtools.vcf GVA.multi-sample.vcf if you want to work with it without having to wait on the analysis to run personally.

By providing one or more bam files, each containing mapped reads from multiple samples tagged with unique samtools @RG tags.

Note

title	Do not run this command

This command comes with a sizable caveat: If you intend to use this option, you must make sure you tag your reads with the right RG tag; this can easily be done during the samse or sampe stage of mapping with bwa with the -r option, using samtools merge, with picard tools AddOrReplaceReadGroup command, or with your own perl/python/bash commands. The problem is that it requires prior knowledge of all the samples you are going to group together ahead of time, and individual sample input files have to be merged together. There are obviously times that this will be the desired and easier, but for the purpose of this course, we believe it makes more sense to keep all input files separate and only have a single merged output file. As part of the tutorial we do provide the appropriate files necessary for the following command to run.

Code Block

title	samtools multi-sample variants: one or more bam files using @RG

samtoolsbcftools mpileup --threads 64 -O u -f $SCRATCH/GVA_Human_trios/raw_files/ref/hs37d5.fa all<all_with@RG.bambam> | bcftools call --threads 68 -v -c - > trios_tutorial.all.raw.vcf

Tip

title	An observation for the individuals interested in phylogenies

I have a hunch that the use of read tags could be useful for your analysis IF you have a finite set of samples that you intend to analyze, have all the data available at once, and are willing to repeat your analysis from near scratch in the event of either of these factors changing. This is an excellent example of the importance of finding papers that have done similar analyses, and mimicking as much of what they have done as possible/reasonable.

Based on the discussion above, we are selecting the first solution and providing details of how this command should be run. As As this command will generate very little output and take ~45 ~30 minutes to complete, you are once again reminded that the output file from this option is available $BI/gva_course/GVA2021.all.samtoolsGVA.multi-sample.vcf if you want to work with it without having to wait on the analysis to run personally.

Code Block

title	samtools multi-sample variants: separate bam files

mkdircd $SCRATCH/GVA_Human_trios/multi_sample
cd $SCRATCH/GVA_Human_trios
samtools mpileup -uf $SCRATCH/GVA_Human_trios/
bcftools mpileup --threads 64 -O u -f raw_files/ref/hs37d5.fa \
  $SCRATCH/GVA_Human_trios/
  raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
  $SCRATCH/GVA_Human_trios/raw_files/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
  $SCRATCH/GVA_Human_trios/raw_files/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
    | bcftools call -v -c - > multi_sample/trios_tutorial.all.samtoolsmulti-sample.vcf

The observant students who start this command might notice that this time mpileup does give a message that it is working with 3 samples, and 3 input files (while all previous analysis have used 1 file and 1 sample).

...

Expand

title	Hint...

You're trying to find the genotypes in the trios_tutorial.all.samtools.vcf filemulti-sample.vcf file, and then use your knowledge of Mendelian inheritance to figure out which of the three samples is the only one that could be a child of the other two.

...

Tip

title	If you are working with provided data rather than generating the data yourself

Notice that if you are working with data provided rather than data generated you may have 2 different file names "trios_tutorial.all.samtoolsmulti-sample.vcf" if you analyzed the data yourself and "GVA2021.all.samtoolsGVA.multi-sample.vcf" if you copied the data from the BioITeam directory.

In the next code block you will either need to change the name of the file that you print to the screen with the cat command, or you need to change the name of the file using the move command.

...

Code Block

language	bash
title	This linux one-liner should give you a snapshot of data sufficient to figure it out:
collapse	true

cat trios_tutorial.all.samtoolsmulti-sample.vcf | tail -10000 | awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | grep "0/0" | sed s/':'/'\t'/g | awk '{print $2"\t"$4"\t"$6}' | tail -100 | sort | uniq -c | sort -n -r

...

Expand

title	Explanation of command

Here are the steps going into this command:

```
cat trios_tutorial.all.samtoolsmulti-sample.vcf |
```
1. Dump the contents of trios_tutorial.allmulti-sample.samtools.vcf and pipe it to the next command

tail -10000 |

Take the last 10,000 lines and pipe it to the next command. As the top of the file has header information, the last lines are all data

```
awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | 
```
1. If the variant quality score (the 6th column or $6) is greater than 500, then print the following fields 2 (SNP position), 10, 11, and 12 (the 3 genotypes). and pipe to next command
```
grep "0/0" |
```
1. Filter for only lines that have at least one homozygous SNP and pipe them to the next command
2. Think about genetics and why this is important. If you aren't sure ask.
```
sed s/':'/'\t'/g | awk '{print $2"\t"$4"\t"$6}' |
```
1. Break the genotype call apart from other information about depth: "sed" turns the colons into tabs so that awk can just print the genotype fields. and pipe to next output
```
tail -100 | sort | uniq -c | sort -n -r
```
1. Take the last 100 lines. 100 is used to ensure we get some good informative counts, but not so many that noise becomes a significant problem.
2. sort them
3. then count the unique lines
4. sort them again, in numeric order, and print them in reversed order

For more information about the individual commands and their options https://explainshell.com/ is a good jumping off point.

...

Expand

title	Discussion of the output

Overall this data is consistent with column 1 (NA12878) being the child. Lines marked with an * are inconsistent:

     35 0/1	0/1	0/0
     20 0/0 0/1 0/0
     18 0/1	0/0	0/1
     14 0/1	1/1	0/0	# middle can't be child
      6 0/0	0/1	0/1
      5 1/1	1/1	0/0	* No mendelian combination exists
      1 1/1	0/1	0/0	*
      1 0/1	0/0	0/0	*

This is, in fact, the correct assessment - NA12878 is the child.

Going further

Refining your analysis

...

0/0	*

This is, in fact, the correct assessment - NA12878 is the child.

Going further

Refining your analysis

Can you modify the large 1 liner command to be more strict, or to include more examples such that you eliminate the non-mendelian inheritance situations and consider a larger number of loci?

Scope and usefulness

This same type of analysis can be done on much larger cohorts, say cohorts of 100 or even 1000s of individuals with known disease state to attempt to identify associations between allelic state and disease state.
This is the first step of building a phylogeny of 3 related individuals. Expanding this could be of use for genetic counseling or larger phylogenetic analysis.

Return to GVA2021 GVA2022 page.

Versions Compared

Old Version 1

New Version Current

Key

Single-sample variant calling with

bcftools

Multiple-sample variant calling with

bcftools

Going further

Refining your analysis

Going further

Refining your analysis

Scope and usefulness

Page Comparison

Versions Compared

Old Version 1

New Version Current

Key

Single-sample variant calling with

bcftools

Multiple-sample variant calling with

bcftools

Going further

Refining your analysis

Going further

Refining your analysis

Scope and usefulness