...
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
mkdir samtools_example cd samtools_example samtools mpileup -cd $SCRATCH/GVA_Human_trios bcftools mpileup --threads 64 -O u -f $SCRATCH/GVA_Human_trios/raw_files/ref/hs37d5.fa $SCRATCH/GVA_Human_trios/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | bcftools call --threads 64 -v -c - > trios_tutorial.raw.vcf |
The above command may take >30 ~20 minutes to run based on some student's experience and will not produce many lines of output leading students to worry their terminal has locked up. More than likely this is not the case, but you can try hitting the return key 1 or 2 times to see if your terminal adds a blank line to verify. As this command is taking so long it is recommended that you read ahead and/or switch to reading another tutorial in another terminal window while it runs, just be aware that you should not start 2 idev sessions in 2 different windowswhile waiting for this command to finish.
One potential issue with this type of approach is that vcf files only record variation that can be seen with the data provided. When all reads mapping to a given location exactly match the reference (i.e. is homozygous wildtype relative to the reference) there will be no data. Which looks the same as if you had no data in those regions; this leads us to our next topic.
...
Not being able to tell between no data and wildtype is not the end of the world for a single sample, but if you're actually trying to study human (or other organism) genetics, you must discriminate homozygous WT from a lack of data. This is done by providing many samples to the variant caller simultaneously. This concept extends further to populations; calling variants across a large and diverse population provides a stronger Bayesian prior probability distribution for more sensitive detection.
To instruct samtools bcftools to call variants across many samples, you must simply give it mapped data with each sample tagged separately. Samtools bcftools allows two methods to do this:
By providing separate bam files for each sample. Note that in the following code block, he trailing \ symbols tells the command line that you are not done entering the command and not to execute any commands until you hit return without a preceding \ mark.
Note title Warning, DO NOT run this command yet. Code Block language bash title samtools bcftools multi-sample variants: separate bam files #samtools#bcftools mpileup -uf $SCRATCH/GVA_Human_trios/--threads 64 -O u -f raw_files/ref/hs37d5.fa \ $SCRATCH/GVA_Human_trios/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ $SCRATCH/GVA_Human_trios/raw_files/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ $SCRATCH/GVA_Human_trios/raw_files/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ | bcftools call -v -c - > multi_sample/trios_tutorial.allmulti-sample.samtools.vcf
The output file from this option is$BI/gva_course/GVA2021.all.samtools.vcf.samtools.vcf GVA.multi-sample.vcf
if you want to work with it without having to wait on the analysis to run personally.By providing one or more bam files, each containing mapped reads from multiple samples tagged with unique samtools
@RG
tags.Note title Do not run this command This command comes with a sizable caveat: If you intend to use this option, you must make sure you tag your reads with the right RG tag; this can easily be done during the
samse
orsampe
stage of mapping with bwa with the-r
option, usingsamtools merge
, with picard tools AddOrReplaceReadGroup command, or with your own perl/python/bash commands. The problem is that it requires prior knowledge of all the samples you are going to group together ahead of time, and individual sample input files have to be merged together. There are obviously times that this will be the desired and easier, but for the purpose of this course, we believe it makes more sense to keep all input files separate and only have a single merged output file. As part of the tutorial we do provide the appropriate files necessary for the following command to run.Code Block title samtools multi-sample variants: one or more bam files using @RG samtoolsbcftools mpileup --threads 64 -O u -f $SCRATCH/GVA_Human_trios/raw_files/ref/hs37d5.fa all<all_with@RG.bambam> | bcftools call --threads 68 -v -c - > trios_tutorial.all.raw.vcf
Tip title An observation for the individuals interested in phylogenies I have a hunch that the use of read tags could be useful for your analysis IF you have a finite set of samples that you intend to analyze, have all the data available at once, and are willing to repeat your analysis from near scratch in the event of either of these factors changing. This is an excellent example of the importance of finding papers that have done similar analyses, and mimicking as much of what they have done as possible/reasonable.
Based on the discussion above, we are selecting the first solution and providing details of how this command should be run. As this command will generate very little output and take ~45 minutes to complete, you are once again reminded that the output file from this option is available $BI/gva_course/GVA2021GVA.allmulti-sample.samtools.vcf
if you want to work with it without having to wait on the analysis to run personally.
Code Block | ||
---|---|---|
| ||
mkdircd $SCRATCH/GVA_Human_trios/multi_sample cd $SCRATCH/GVA_Human_trios samtools mpileup -uf $SCRATCH/GVA_Human_trios/ bcftools mpileup --threads 64 -O u -f raw_files/ref/hs37d5.fa \ $SCRATCH/GVA_Human_trios/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ $SCRATCH/GVA_Human_trios/raw_files/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ $SCRATCH/GVA_Human_trios/raw_files/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ | bcftools call -v -c - > multi_sample/trios_tutorial.allmulti-sample.samtools.vcf |
The observant students who start this command might notice that this time mpileup does give a message that it is working with 3 samples, and 3 input files (while all previous analysis have used 1 file and 1 sample).
...
Expand | ||
---|---|---|
| ||
You're trying to find the genotypes in the trios_tutorial. |
...
Tip | ||
---|---|---|
| ||
Notice that if you are working with data provided rather than data generated you may have 2 different file names "trios_tutorial.all.samtoolsmulti-sample.vcf" if you analyzed the data yourself and " |
...
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
cat trios_tutorial.all.samtoolsmulti-sample.vcf | tail -10000 | awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | grep "0/0" | sed s/':'/'\t'/g | awk '{print $2"\t"$4"\t"$6}' | tail -100 | sort | uniq -c | sort -n -r |
...
Expand | ||
---|---|---|
| ||
Here are the steps going into this command:
For more information about the individual commands and their options https://explainshell.com/ is a good jumping off point. |
...
Going further
Refining your analysis
Can you modify the large 1 liner command to be more strict, or to include more examples such that you eliminate the non-mendelian inheritance situations and consider a larger number of loci?
...
Scope and usefulness
...
Scope and usefulness
- This same type of analysis can be done on much larger cohorts, say cohorts of 100 or even 1000s of individuals with known disease state to attempt to identify associations between allelic state and disease state.
- This is the first step of building a phylogeny of 3 related individuals. Expanding this could be of use for genetic counseling or larger phylogenetic analysis.
Return to GVA2021 GVA2022 page.