Breseq Tutorial

Introduction

breseq is a tool developed by the Barrick lab intended for analyzing genome re-sequencing data for bacteria. It is primarily used to analyze laboratory evolution experiments with microbes. In these experiments, there is usually a high-quality reference genome for the ancestral strain, and one is interested in exhaustively finding all of the mutations that occurred during the evolution experiment. Then one might want to construct a phylogenetic tree of individuals samples from a single population or determine whether the same gene is mutated in many independent evolution experiments in an environment.

Input data / expectations:

  • Haploid reference genome
  • Relatively small (<20 Mb) reference genome
  • Input FASTQ reads can be from any sequencing technology
  • Average genomic coverage > 30-fold
  • Less than ~1,000 mutations expected
  • Detects SNVs and SVs from single-end reads (does not use paired-end distance information)
  • Produces annotated HTML output

You can learn a great deal more about breseq by reading the Online Documentation.

Here is a rough outline of the workflow in breseq with proposed additions.

This tutorial was reformatted from the most recent version found here. Our thanks to the previous instructors.

Objectives:

  • Use a very self contained/automated pipeline to identify mutations.
  • Explain the types of mutations found in a complete manner before using methods better suited for higher order organisms.

 

Example 1: Bacteriophage lambda data set

First, we'll run breseq on a small data set to be sure that it is installed correctly, and to get a taste for what the output looks like. This sample is a mixed population of bacteriophage lambda that was co-evolved in lab with its E. coli hosts.

Environment

To set your profile up to run breseq, we need to add "module load bowtie/2.1.0" to your profile.

Adding bowtie to your profile
cdh  #move to your home directory
echo "module load bowtie/2.1.0" >> .profile  #this command updates your profile to automatically load the bowtie module

After you've completed these commands, exit lonestar and re log in to re run your profile.  

Data

The data files for this example are in the path:

$BI/ngs_course/lambda_mixed_pop/data

Copy this directory to a new directory called BDIB_breseq in your $SCRATCH space and cd into it.

Click here for the solution
cds
mkdir BDIB_breseq_lambda
cp $BI/ngs_course/lambda_mixed_pop/data/* BDIB_breseq_lambda
cd BDIB_breseq_lambda
ls 

If the copy worked correctly you should see  the following 2 files:

File Name

Description

Sample

lambda_mixed_population.fastq

Single-end Illumina 36-bp reads

Evolved lambda bacteriophage mixed population genome sequencing

lambda.gbk

Reference Genome

Bacteriophage lambda

Running breseq

Because this data set is relatively small (roughly 100x coverage of a 48,000 bp genome), a breseq run will take < 5 minutes. Submit this command to the TACC development queue or run on an idev node.

breseq prep and commands
idev  #idev starts an "interactive development" mode which allows you to run computationally intensive tasks
 
breseq -j 12 -r lambda.gbk lambda_mixed_population.fastq > log.txt

A bunch of progress messages will stream by during the breseq run which would be lost on the compute node if not for the redirection to the log.txt file. The output text details several steps in a pipeline that combines the steps of mapping (using SSAHA2), variant calling, annotating mutations, etc. You can examine them by peeking in the log.txt file as your job runs using tail -f. The -f option means to "follow" the file and keep giving you output from it as it gets bigger. You will need to wait for your job to start running before you can tail -f log.txt.

The command that we used contains several parts with the following explanations:

partpuprose
-j 12Use 12 processors (the max available on lonestar nodes)
-r lambda.gbkUse the lambda.gbk file as the reference to identify specific mutations
lambda_mixed_population.fastqbreseq assumes any argument not preceded by a - option to be an input fastq file to be used for mapping
> log.txtredirect the output the log.txt file

 

Looking at breseq predictions

breseq will produce a lot of directories beginning 01_sequence_conversion02_reference_alignment, ... Each of these contains intermediate files that can be deleted when the run completes, or explored if you are interested in the inner guts of what is going on. More importantly, breseq will also produce two directories called: data and output which contain files used to create .html output files and .html output files respectively. The most interesting files are the .html files which can't be viewed directly on lonestar. Therefore we first need to copy the output directory back to your desktop computer.

 We have previously covered using scp to transfer files, but here we present another detailed example. Click to expand.

To use scp you will need to run it in a terminal that is on your desktop and not on the remote TACC system. It can be tricky to figure out where the files are on the remote TACC system, because your desktop won't understand what $HOME, $WORK, $SCRATCH mean (they are only defined on TACC).

To figure out the full path to your file, you can use the pwd command in your terminal on TACC in the window that you ran breseq in (it should contain an "output" folder). Rather than copying the entire contents of the folder which can be rather large, we are going to add a twist of compressing the entire folder into a single compressed archive using the tar command so that the size will be smaller and it will transfer faster:

Command to type in TACC
tar -czvf output.tar.gz output  # the czvf options in order mean Create, Zip, Verbose, Force
pwd

Then you can then copy paste that information (in the correct position) into the scp command on the desktop's command line:

Command to type in the desktop's terminal window
scp -r <username>@lonestar.tacc.utexas.edu:<the_directory_returned_by_pwd>/output.tar.gz .
tar -xvzf output.tar.gz  # the new "x" option at the front means eXtract 

Navigate to the output directory in the finder and open the a file called index.html. This will open the results in a web browser window that you can click through different mutations and other information and see the evidence supporting it. The summary page provides useful information about the percent of reads mapping to the genome as well as the overall coverage of the genome. The Mutation Predictions page is where most of the analysis time is spent in determining which mutations are important (and more rarely inaccurate).

Feel free to click around through the different mutations and examine their evidence when you have time, but first start the next breseq run so that it can be in the queue and completing while you look at the data. We will go over the different types of mutations and the evidence for them as a group towards the end of class today, but additional information on analyzing the output can be found at the following reference:

  • Deatherage, D.E.Barrick, J.E.. (2014) Identification of mutations in laboratory-evolved microbes from next-generation sequencing data using breseqMethods Mol. Biol. 1151:165-188. «PubMed»

 

Example 2: E. coli data sets

Now we'll try running breseq on some Escherichia coli genomes from an evolution experiment. These files are larger. You don't want to run them in interactive mode. We'll submit them to the TACC queue all at once.

Data

The data files for this example are in the following path. Go ahead and copy them to a new folder in your $SCRATCH directory called BDIB_breseq_coli_clones:

location of data files
$BI/ngs_course/ecoli_clones/data

File Name

Description

Sample

SRR030252_1.fastq SRR030252_2.fastq

Paired-end Illumina 36-bp reads

0K generation evolved E. coli strain

SRR030253_1.fastq SRR030253_2.fastq

Paired-end Illumina 36-bp reads

2K generation evolved E. coli strain

SRR030254_1.fastq SRR030254_2.fastq

Paired-end Illumina 36-bp reads

5K generation evolved E. coli strain

SRR030255_1.fastq SRR030255_2.fastq

Paired-end Illumina 36-bp reads

10K generation evolved E. coli strain

SRR030256_1.fastq SRR030256_2.fastq

Paired-end Illumina 36-bp reads

15K generation evolved E. coli strain

SRR030257_1.fastq SRR030257_2.fastq

Paired-end Illumina 36-bp reads

20K generation evolved E. coli strain

SRR030258_1.fastq SRR030258_2.fastq

Paired-end Illumina 36-bp reads

40K generation evolved E. coli strain

NC_012967.1.fasta

Reference Genome

E. coli B str. REL606

Command to copy data files to new folder
cds
mkdir BDIB_breseq_coli_clones
cp -v $BI/ngs_course/ecoli_clones/data/* BDIB_breseq_coli_clones 

cd BDIB_breseq_coli_clones

Running breseq on TACC

breseq may take an hour to run on these sequences, so you should submit to the normal queue instead of the development queue on TACC and should give a run time of 3 hours as a conservative estimate. Since we have multiple data sets, this example will also give us an opportunity to run several commands as part of a single job on TACC, and use multiple cores on a single processor. You'll want each command (line) in the commands file to look something like this:

breseq -j 12 -r NC_012967.1.gbk -o output_<XX>K SRR030252_1.fastq SRR030252_2.fastq &> <XX>K.log.txt

Notice we've added some additional options:

partpuprose
&> <XX>00K.log.txtRedirect both the standard output and the standard error streams to a file called <XX>00k.log.txt. It is important that you replace the <XX> to send it to different files, but KEEP the &> as those are telling the command line to send the streams to that file.
-o output_<xx>00kall of those output directories should be put in the specified directory, instead of the current directory. If we don't include this (and chande the <XX>), then we will end up writing the output from all of the runs on top of one other. The program will undoubtedly get confused, possibly crash, and generally it will be a mess.

checking command before submitting to queues

It is often a good idea to try running a command that you are about to submit to the TACC queue yourself, just to be sure you have all the options and paths correct. Otherwise you will have to wait until it starts running on TACC in order to find out that it it failed immediately, which can be frustrating. Try running the command above on the terminal before using launcher_creator.py. If you include the &> option at the end, you will see nothing happen as all of the output is being directed to a new location. Count to ten slowly and then use control-c to cancel the command and use ls to make sure the output file is created and use tail or cat to make sure that the program is running rather than crashing.
 Click here for commands file example and launcher_creator.py generator
Example commands file
breseq -j 12 -r NC_012967.1.gbk -o output_00K SRR030252_1.fastq SRR030252_2.fastq &> 00K.log.txt
breseq -j 12 -r NC_012967.1.gbk -o output_02K SRR030253_1.fastq SRR030253_2.fastq &> 02K.log.txt
breseq -j 12 -r NC_012967.1.gbk -o output_05K SRR030254_1.fastq SRR030254_2.fastq &> 05K.log.txt
breseq -j 12 -r NC_012967.1.gbk -o output_10K SRR030255_1.fastq SRR030255_2.fastq &> 10K.log.txt
breseq -j 12 -r NC_012967.1.gbk -o output_15K SRR030256_1.fastq SRR030256_2.fastq &> 15K.log.txt
breseq -j 12 -r NC_012967.1.gbk -o output_20K SRR030257_1.fastq SRR030257_2.fastq &> 20K.log.txt
breseq -j 12 -r NC_012967.1.gbk -o output_40K SRR030258_1.fastq SRR030258_2.fastq &> 40K.log.txt
launcher_creator.py -q normal -t 3:00:00 -n launcher -a "UT-2015-05-18" -m "module load bowtie/2.1.0"
qsub launcher.sge

This will likely sit for some time in the launcher que, making it a good opportunity to work through the interrogating launcher queue portion of our linux tutorial if you didn't get the opportunity to earlier.

 

Examining breseq results

Exercise: Can you figure out how to archive all of the output directories and copy only those files (and not all of the very large intermediate files) back to your machine? - without deleting any files?

 Click here for a hint without the answer

You will want to use the tar command again, but you will need to use a wildcard to specify what goes into the compressed file, and only the output directories within each of the wildcard-matched directories.

click here to check your solution, or get the answer
tar -cvzf output.tgz output_*/output
 Here are the commands we showed you for the previous example (with the trick of getting a single compressed output directory you just learned) to transfer so you don't have to scroll back and forth. See if you can remember how to do it without going back over the lesson.

To use scp you will need to run it in a terminal that is on your desktop and not on the remote TACC system. It can be tricky to figure out where the files are on the remote TACC system, because your desktop won't understand what $HOME, $WORK, $SCRATCH mean (they are only defined on TACC).

To figure out the full path to your file, you can use the pwd command in your terminal on TACC in the window that you ran breseq in (it should contain an "output" folder). Rather than copying the entire contents of the folder which can be rather large, we are going to add a twist of compressing the entire folder into a single compressed archive using the tar command so that the size will be smaller and it will transfer faster:

Command to type in TACC
tar -czvf output.tar.gz output_*/output  # the czvf options in order mean Create, Zip, Verbose, Force
pwd

Then you can then copy paste that information (in the correct position) into the scp command on the desktop's command line:

Command to type in the desktop's terminal window
scp -r <username>@lonestar.tacc.utexas.edu:<the_directory_returned_by_pwd>/output.tar.gz .
tar -xvzf output.tar.gz  # the new "x" option at the front means eXtract 

 

Click around in the results.

Optional: breseq utility commands

breseq includes a few utility commands that can be used on any BAM/FASTA set of files to draw an HTML read pileup or a plot of the coverage across a region.

It's easiest to run these commands from inside the main output directory (e.g., output_20K) of a breseq run. They use information in the data directory.

breseq bam2aln NC_012967:237462-237462
breseq bam2cov NC_012967:2300000-2320000

Additionally, the files in the data directory can be loaded in IGV if you copy them back to your desktop.

Optional: Install breseq

We have already installed breseq in $BI/bin for the purpose of this tutorial. You are welcome to continue using it for your own work, or use the installation options we present here to install or update to newer versions as needed.

Download breseq from github

See if you can install breseq and get it running from the installation instructions.

Info for installing breseq on TACC

You do not need to install a compiler (GCC/ICC), bowtie2, or R on TACC as they are available via the module system.

module load R
module load GCC
module load bowtie/2.1.0

You could add these commands to your  $HOME/.profile_user if you wanted them available by default.

 I need help...

Hint: We have some optional info on Installing Linux tools that should help you get breseq installed.