...

Code Block
export PATH="/corral-repl/utexas/BioITeam/ngs_course/local/bin:$PATH"

If you want to go through installing R 2.15 and deepSNV for yourself, here's how:

...

Once you have access to R 2.15, you can install deepSNV using these commands (which work for any Bioconductor package):

Code Block

title	Installing Bioconductor package deepSNV

login1$ R
...
> source("http://bioconductor.org/biocLite.R")
> biocLite("deepSNV")

...

A mixed population of E. coli from an evolution experiment was sequenced at several different time points (PMID: 19838166, PMID: 19776167). At generation 0 it consisted of a clone (cells grown from a colony with essentially no genetic variation). Additional samples were taken at 20K and 40K generations, by which time mutations had arisen and swept through the (asexual) population.

Data

The data files for this example are in the path:

Code Block
/corral-repl/utexas/BioITeam/ngs_course/ecoli_mixed

File Name	Description
`SRR030252.fastq.gz`	Illumina reads, 0K generation individual clone from population
`SRR032374.fastq.gz`	Illumina reads, 20K generation mixed population
`SRR032376.fastq.gz`	Illumina reads, 40K generation mixed population
`NC_012967.1.fasta.gz`	E. coli B str. REL606 genome

...

The reference genome file was downloaded from the NCBI Genomes page.

Map Reads

Choose an appropriate program and map the reads. As for other variant callers, convert the mapped reads to BAM format, then sort and index the BAM file.

Additional exercises

Determine the approximate depth of mapped read coverage for each sequencing data set.
Try using different mappers or changing the default alignment settings to find more variants.
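
As a rough starting point for the coverage exercise, the sketch below estimates average fold coverage as total read bases divided by genome length. It works from the raw FASTQ and FASTA files, so it assumes every read maps and gives an upper bound on the mapped depth; for the true mapped depth you would query the sorted BAM file with your mapper's companion tools instead. The function names are made up for this illustration.

```shell
# Sketch: estimate average coverage as (total read bases) / (genome length).
# This is an upper bound on mapped depth because it assumes every read maps.

# Count sequence bases in a gzipped FASTQ (sequence is line 2 of each 4-line record).
fastq_bases() {
    gzip -dc "$1" | awk 'NR % 4 == 2 { total += length($0) } END { printf "%.0f\n", total }'
}

# Count bases in a gzipped FASTA, skipping header lines that start with ">".
fasta_bases() {
    gzip -dc "$1" | awk '!/^>/ { total += length($0) } END { printf "%.0f\n", total }'
}

# Print the estimated fold coverage for one read file against one reference.
estimate_coverage() {
    reads=$(fastq_bases "$1")
    genome=$(fasta_bases "$2")
    awk -v r="$reads" -v g="$genome" 'BEGIN { printf "%.1f\n", r / g }'
}

# Example, using file names from the data table:
# estimate_coverage SRR032374.fastq.gz NC_012967.1.fasta.gz
```

Comparing this number with the depth you measure after mapping also tells you roughly what fraction of the reads the mapper was able to place.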

Run FreeBayes

FreeBayes can treat the sample as a mixture of pooled samples. In our case the sample is actually a mixture of more than 1 million bacteria, but we have nowhere near that level of coverage, so we give an arbitrary mixed ploidy of 100. This means the statistical model predicts variants only at frequencies of 1%, 2%, 3%, ... 98%, 99%, 100%. This command runs pretty fast, so you can do it in interactive mode.

Code Block

title	Example command for running FreeBayes

login1$ freebayes --min-alternate-count 3 --ploidy 100 --pooled --vcf SRR032374.vcf \
        --fasta-reference NC_012967.1.fasta SRR032374.sorted.bam
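
To make the ploidy setting more concrete, here is a small sketch of the arithmetic behind it. It lists the allele frequencies k/ploidy that a given ploidy can represent, and gives a back-of-the-envelope depth at which a variant at frequency k/ploidy would be expected to produce a given number of alternate reads (for example, the 3 required by `--min-alternate-count 3`). Both function names are hypothetical, and the depth calculation is a simple expectation (depth times frequency), not a description of how FreeBayes itself decides.

```shell
# Print the allele frequencies representable with a given ploidy: k/ploidy for k = 1..ploidy.
ploidy_frequencies() {
    awk -v p="$1" 'BEGIN { for (k = 1; k <= p; k++) printf "%.4f\n", k / p }'
}

# Smallest depth at which a variant at frequency k/ploidy is expected to yield
# at least "count" alternate reads: ceil(count * ploidy / k), in integer arithmetic.
# Arguments: count ploidy k
min_depth_for_count() {
    awk -v c="$1" -v p="$2" -v k="$3" 'BEGIN { print int((c * p + k - 1) / k) }'
}

# Examples:
# ploidy_frequencies 100 | head -n 3
# min_depth_for_count 3 100 1
```

Under this simple expectation, a 1% variant needs several hundred reads of depth before it is likely to be seen three times, which is one reason low-frequency variants are hard to call at modest coverage.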

Additional exercises

Write a script or use a Linux command to filter the output files so that they contain only variants predicted to have frequencies > 0.05 or scores > 1000.
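
One possible approach to this filtering exercise is an awk one-liner wrapped in a shell function. The sketch assumes the output is a standard tab-delimited VCF, that the score is the QUAL value in column 6, and that the predicted frequency appears as an `AF=` entry in the INFO column (column 8). Check the header of your own output file to confirm these before relying on it. The function name `filter_vcf` is made up for this example.

```shell
# Keep header lines plus records with AF > 0.05 or QUAL > 1000.
# Assumes a tab-delimited VCF with QUAL in column 6 and an "AF=" key in INFO (column 8).
filter_vcf() {
    awk -F '\t' -v min_af=0.05 -v min_qual=1000 '
        /^#/ { print; next }                      # pass header lines through unchanged
        {
            af = 0
            n = split($8, info, ";")              # INFO is a list of key=value pairs
            for (i = 1; i <= n; i++) {
                if (info[i] ~ /^AF=/) {
                    af = substr(info[i], 4) + 0   # numeric prefix; first value if comma-separated
                }
            }
            if (af > min_af || $6 + 0 > min_qual) print
        }
    ' "$1"
}

# Example (the output file name is arbitrary):
# filter_vcf SRR032374.vcf > SRR032374.filtered.vcf
```

For sites with more than one alternate allele, this sketch looks only at the first listed frequency, which may or may not be what you want.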

Run deepSNV

deepSNV runs more slowly, so we will only look at a small region of the genome initially in interactive mode. (Why is it slower? Probably in part due to using a more sophisticated statistical model and in part because it is implemented in R instead of C.)

Useful Links

deepSNV website
deepSNV paper (note that the UT library does not have access to this journal)
News article about deepSNV
deepSNV R module vignette

Code Block

title	Example deepSNV commands

login1$ R
...
> # Restrict the analysis to the first 100 kb of the reference sequence
> regions <- data.frame(chr="gi|254160123|ref|NC_012967.1|", start = 1, stop=100000)
> # Test sample: 40K mixed population; control: 0K clone
> mixresult <- deepSNV(test = "SRR032376.sorted.bam", control = "SRR030252.sorted.bam", regions=regions)
> # Table of significant variants after Benjamini-Hochberg correction
> sig_result <- summary(mixresult, sig.level=0.05, adjust.method="BH")
> pdf("output_SRR032376.pdf")
> plot(mixresult)
> dev.off()
> write.csv(sig_result, "SRR032376")

Additional exercises

...

Create an R script to run from the command line to execute these commands, and try running the entire E. coli genome on TACC.
Compare the variants predicted in samples SRR032374 and SRR032376.
