Using FreeBayes and deepSNV to call variants in mixed populations
The program FreeBayes can be used to call variants in genomes of any ploidy, pooled samples, or mixed populations.
Installation for FreeBayes
This tutorial assumes that you have created the paths $WORK/src
and $HOME/local/bin
and added $HOME/local/bin
to your $PATH
. FreeBayes uses a git repository and requires the cmake build system to compile. You can install it with these commands:
...
This won't work on Lonestar! (You aren't an admin.) However, the make
command created the executables inside of the source tree, so we find and move them to our standard $HOME/local/bin
directory with the last command.
Installation for deepSNV
The newest R module on lonestar is version 2.14, but deep SNV requires R 2.15.
...
Code Block | ||
---|---|---|
| ||
login1$ R ... > source("http://bioconductor.org/biocLite.R") > biocLite("deepSNV") |
Example 1: Mixed E. coli population
A mixed population of E. coli from an evolution experiment was sequenced at several different time points (PMID: 19838166 , PMID:19776167). At generation 0 it consisted of a clone (cells grown from a colony with essentially no genetic variation), then additional samples were taken at 20K and 40K generations after which mutations arose and swept through the (asexual) population.
Data
The data files for this example are in the path:
...
We renamed the FASTA sequence from "gi|254160123|ref|NC_012967.1|" to "NC_012967" by changing the first line of the NC_012967.1.fasta sequence using a text editor.
Map Reads
Choose an appropriate program and map the reads. As for other variant callers, convert the mapped reads to BAM format, then sort and index the BAM file.
Additional exercises
- Determine the approximate depth of mapped read coverage for each sequencing data set.
- Try using different mappers or changing the default alignment settings to find more variants.
Run FreeBayes
FreeBayes can be used to treat the sample as a mixture of pooled samples. (In our case it is actually a mixture of >1 million bacteria, but we have nowhere near that level of coverage, so we give an arbitrary mixed ploidy of 100, which means we use a statistical model that predicts variants only with frequencies of 1%, 2%, 3%, ... 98%, 99%, 100%). This command runs pretty fast, so you can do it in interactive mode.
Code Block | ||
---|---|---|
| ||
login1$ freebayes --min-alternate-total 3 --ploidy 100 --pooled --vcf SRR032374.vcf \ --fasta-reference NC_012967.1.fasta SRR032374.sorted.bam |
Additional exercises
- Write a script or use a linux command to filter the output files to only contain variants that are predicted to have frequencies > 0.10 and < 0.90 or scores > 100.
- What does the
--min-alternate-total
option mean. Experiment with using other options and examine how they affect speed versus what is predicted. Beware that some choices of options can cause crashes. - Compare the variants predicted in samples SRR032374 and SRR032376.
Run deepSNV
deepSNV runs more slowly, so we will only look at a small region of the genome initially in interactive mode. (Why is it slower? Probably in part due to using a more sophisticated statistical model and in part because it is implemented in R instead of pure C.)
...
Code Block | ||
---|---|---|
| ||
login1$ R ... > library("deepSNV") > regions <- data.frame(chr="NC_012967", start = 1, stop=100000) > result = deepSNV(test = "SRR032376.sorted.bam", control = "SRR030252.sorted.bam", regions=regions) > sig_result <- summary(result, sig.level=0.05, adjust.method="BH") > pdf("SRR032374.pdf") > plot(mix) > dev.off() > write.csv(sig_result, "SRR032374") |
Additional exercises
- Create an R script to run from the command line to execute these commands, and try running the entire E. coli genome on TACC.
- Compare the variants predicted in samples SRR032374 and SRR032376.