Using FreeBayes and deepSNV to call variants in mixed populations

The program FreeBayes can be used to call variants in genomes of any ploidy, pooled samples, or mixed populations.

Installation for FreeBayes

This tutorial assumes that you have created the paths $WORK/src and $HOME/local/bin and added $HOME/local/bin to your $PATH.

login1$ module load git
login1$ mkdir -p $WORK/src && cd $WORK/src
login1$ git clone --recursive git://github.com/ekg/freebayes.git
login1$ cd freebayes
login1$ module load cmake
login1$ module load gcc
login1$ make
login1$ mv bin/* $HOME/local/bin

This command from the instructions attempts to install to a system-wide location as super-user:

sudo make install

This won't work on Lonestar! (You aren't an admin.) However, the make command created the executables inside of the source tree, so we find and move them to our standard $HOME/local/bin directory with the last command.

Installation for deepSNV

Requires R 2.15.

You can install your own version of R using the instructions below, but this takes a while to compile, so you can also just add this location to your path by adding this line to your ~/.profile_user file.

export PATH="/corral-repl/utexas/BioITeam/ngs_course/local/bin"

Once you have access to R 2.15, you can install deepSNV using these commands.

login1$ R
...
> source("http://bioconductor.org/biocLite.R")
> biocLite("deepSNV")

Example 1: Mixed E. coli population

A mixed population of E. coli from an evolution experiment was sequenced at several different time points (PMID:19776167). At generation 0 it consisted of a clone (cells grown from a colony with essentially no genetic variation), then additional samples were taken at 20K and 40K generations after which mutations arose and swept through the (asexual) population.

Data

The data for this example are in the path:

/corral-repl/utexas/BioITeam/ngs_course/FreeBayes/ecoli_mixed

File Name	Description
`SRR032374.fastq.gz`	Illumina reads, 0K generation individual clone from population
`SRR032374.fastq.gz`	Illumina reads, 20K generation mixed population
`SRR032376.fastq.gz`	Illumina reads, 40K generation mixed population
`NC_012967.1.fasta.gz`	E. coli B str. REL606 genome

The read files were downloaded from the ENA SRA study.

The reference genome file was downloaded from the NCBI Genomes page

Map Reads

Choose an appropriate program and map the reads. As for other variant callers, convert the mapped reads to BAM format, then sort and index the BAM file.

What is the approximate read-depth coverage for each file?

Run FreeBayes

FreeBayes can be used to treat the sample as a mixture of pooled samples. (In our case it is actually a mixture of >1 million bacteria, but we have nowhere near that coverage, so we give an arbitrary mixed ploidy of 100, which means we use a model that predicts variants only with frequencies of 1%, 2%, 3%, ... 98%, 99%, 100%).

login1$ freebayes --min-alternate-count 3 --ploidy 100 --pooled --vcf SRR032374.vcf \
        --fasta-reference NC_012967.1.fasta SRR032374.sorted.bam

Run deepSNV

deepSNV runs more slowly, so we will only look at a small region of the genome initially. (Probably in part due to differences in the statistical modeling and in part because it is implemented in R instead of C.)

login$ R
...
> regions <- data.frame(chr="gi|254160123|ref|NC_012967.1|", start = 1, stop=100000)
> mix = deepSNV(test = "SRR032374.sorted.bam", control = "SRR032376.sorted.bam", regions=regions)
> SNVs <- summary(mix, sig.level=0.05, adjust.method="BH")
> pdf("output_pdf")
> plot(mix)
> dev.off()
> write.csv(SNVs"")

As an exercise, create an R script to run all of these commands, and try running the entire chromosome on TACC.

Calling variants in multiploid genomes and mixed populations (FreeBayes)