Identifying mutations in microbial genomes (breseq)
Introduction
breseq is a tool developed by the Barrick lab for analyzing genome re-sequencing data from bacteria. It is primarily used to analyze laboratory evolution experiments with microbes. In these experiments, there is usually a high-quality reference genome for the ancestral strain, and one is interested in exhaustively finding all of the mutations that occurred during the evolution experiment. One might then want to construct a phylogenetic tree of individual samples from a single population, or determine whether the same gene is mutated in many independent evolution experiments in the same environment.
Input data / expectations:
- Haploid reference genome
- Relatively small (<20 Mb) reference genome
- Input FASTQ reads can be from any sequencing technology
- Average genomic coverage > 20-fold
- Less than ~1,000 mutations expected
- Detects SNVs and structural variants from single-end reads
- Produces annotated HTML output
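The coverage expectation in the list above can be checked roughly before a run. The following is a minimal sketch, assuming a standard four-line-per-record FASTQ file; the function name estimate_coverage is hypothetical and is not part of breseq. It divides the total number of read bases by the genome size, so it is an upper bound on the mapped coverage, since not every read will align.

```shell
# Estimate average coverage as (total bases in reads) / (genome size in bp).
# Assumes a plain 4-line-per-record FASTQ: the sequence is line 2 of each record.
estimate_coverage() {
    fastq="$1"        # path to the FASTQ file
    genome_size="$2"  # reference genome length in bp
    awk -v g="$genome_size" '
        NR % 4 == 2 { bases += length($0) }    # sum the sequence lengths
        END         { printf "%.1f\n", bases / g }
    ' "$fastq"
}
```

For example, estimate_coverage reads.fastq 48000 would report the approximate fold-coverage of a 48,000 bp genome, where reads.fastq stands in for your own read file.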
You can learn a great deal more about breseq by reading the Online Documentation.
Here is a rough outline of the workflow in breseq with proposed additions.
Install breseq
Download breseq from Google Code
See if you can install breseq and get it running from the installation instructions.
You will need Bowtie version 2.0.0-beta7 or later to run breseq. The version currently available on TACC through module load is older than this.
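Before running breseq, it may help to confirm which Bowtie 2 is on your PATH. This is a small sketch, not an official installation check; the helper name have_tool is hypothetical. It only prints the version banner and leaves the comparison against 2.0.0-beta7 to you.

```shell
# Return success if the named program can be found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool bowtie2; then
    # Show the first line of the version banner
    bowtie2 --version | head -n 1
else
    echo "bowtie2 not found on PATH; install it or adjust PATH before running breseq"
fi
```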
Example 1: Bacteriophage lambda data set
First, we'll run breseq on a small data set to be sure that it is installed correctly, and to get a taste of what the output looks like. This sample is a mixed population of bacteriophage lambda that was co-evolved in the lab with its E. coli hosts.
Data
The data files for this example are in the path:
$BI/ngs_course/lambda_mixed_pop/data
Copy this directory to your $SCRATCH space. Name it something other than data, and cd into it.
File Name | Description | Sample |
---|---|---|
lambda_mixed_population.fastq | Single-end Illumina 36-bp reads | Evolved lambda bacteriophage mixed population genome sequencing |
lambda.gbk | Reference Genome | Bacteriophage lambda |
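The copy step described above can be sketched as a small shell function. The function name copy_example_data and the destination name in the usage note are hypothetical; $BI and $SCRATCH are assumed to be set in your TACC environment.

```shell
# Copy the example data directory under a new name and move into the copy.
# If the destination already exists, cp -r would nest the copy inside it,
# so pick a destination name that is not in use yet.
copy_example_data() {
    src="$1"    # directory to copy from
    dest="$2"   # new directory to create
    cp -r "$src" "$dest" && cd "$dest"
}
```

For this example it could be called as copy_example_data "$BI/ngs_course/lambda_mixed_pop/data" "$SCRATCH/lambda_breseq", after which ls should list the two input files.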
Running breseq
Because this data set is relatively small (roughly 100x coverage of a 48,000 bp genome), a breseq run will take < 5 minutes. Submit this command to the TACC development queue.
breseq -r lambda.gbk lambda_mixed_population.fastq > log.txt
A series of progress messages will stream by during the breseq run. They detail the steps of a pipeline that combines read mapping (using Bowtie 2), variant calling, mutation annotation, and so on. You can examine them by peeking in the log.txt file as your job runs using tail -f. The -f option means to "follow" the file and keep giving you output from it as it gets bigger. You will need to wait for your job to start running before you can tail -f log.txt.
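Because log.txt does not exist until the job starts, one option is to poll for it before following it. This is a minimal sketch; the function name wait_for_file and the default timeout of 600 seconds are assumptions made for illustration.

```shell
# Wait until a file exists, checking once per second.
# Returns 0 when the file appears, 1 if the timeout (in seconds) is reached first.
wait_for_file() {
    file="$1"
    timeout="${2:-600}"
    waited=0
    while [ ! -e "$file" ]; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

It can then be combined with the command from the text as wait_for_file log.txt && tail -f log.txt. Press Ctrl-C to stop following the file.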
Looking at breseq predictions
breseq will produce a series of numbered directories, beginning with 01_sequence_conversion, 02_reference_alignment, and so on. Each of these contains intermediate files that can be deleted when the run completes, or explored if you are interested in the inner workings of the pipeline.
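If you want to remove the intermediate files once a run has finished, something like the following could be used. It is a sketch that assumes every intermediate directory name starts with two digits and an underscore, as in the two names shown above; the function name clean_intermediates is hypothetical. Anything that does not match that pattern is left untouched.

```shell
# Delete numbered intermediate directories (01_..., 02_..., ...) in a run directory.
# Assumption: intermediates always match the pattern [0-9][0-9]_*.
clean_intermediates() {
    run_dir="${1:-.}"   # defaults to the current directory
    for d in "$run_dir"/[0-9][0-9]_*/; do
        [ -d "$d" ] || continue   # skip when the pattern matched nothing
        rm -rf "$d"
    done
}
```

Run it only after confirming the run completed, for example clean_intermediates . from inside the run directory.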
breseq will also produce two directories called data and output.
First, copy the output directory back to your desktop computer.
Inside the output directory is a file called index.html. Open this in a web browser on your desktop and click around to take a look at the mutation predictions and summary information.
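One convenient way to move the output directory is to bundle it into a single archive first. This is only a sketch of one possible approach; the function name bundle_output and the default archive name are hypothetical, and the transfer itself (for example with scp) depends on your own account and host names, so it is not shown.

```shell
# Bundle the output directory of a breseq run into one compressed archive.
bundle_output() {
    run_dir="${1:-.}"                       # directory containing output/
    archive="${2:-breseq_output.tar.gz}"    # hypothetical default archive name
    tar -czf "$archive" -C "$run_dir" output
}
```

After transferring the archive to your desktop, tar -xzf breseq_output.tar.gz recreates the output directory, and output/index.html can be opened in a browser.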
Optional Exercise: Running breseq in mixed population mode
The data set you are examining is actually of a mixed population of many different phage lambda genotypes descended from a clonal ancestor. You have run breseq in a mode wh