Introduction

breseq is a tool developed by the Barrick lab for analyzing genome re-sequencing data from bacteria. It is primarily used to analyze laboratory evolution experiments with microbes. In these experiments, there is usually a high-quality reference genome for the ancestral strain, and one is interested in exhaustively finding all of the mutations that occurred during the evolution experiment. Then one might want to construct a phylogenetic tree of individual samples from a single population or determine whether the same gene is mutated in many independent evolution experiments in a given environment.

Input data / expectations:

Haploid reference genome
Relatively small (<20 Mb) reference genome
Input FASTQ reads can be from any sequencing technology
Average genomic coverage > 30-fold
Less than ~1,000 mutations expected
Detects SNVs and SVs from single-end reads (does not use paired-end distance information)
Produces annotated HTML output

You can learn a great deal more about breseq by reading the Online Documentation.

Here is a rough outline of the workflow in breseq with proposed additions.

This tutorial was reformatted from the most recent version found here. Our thanks to the previous instructors.

Objectives:

Use a very self-contained, automated pipeline to identify mutations.
Explain the types of mutations found in a complete manner before using methods better suited for higher order organisms.

Example 1: Bacteriophage lambda data set

First, we'll run breseq on a small data set to be sure that it is installed correctly, and to get a taste for what the output looks like. This sample is a mixed population of bacteriophage lambda that was co-evolved in lab with its E. coli hosts.

Environment

In order to run breseq, we need to add "module load bowtie/2.2.6" to your .bashrc file.

Adding bowtie to your profile

cdh  # move to your home directory
nano .bashrc  # open your .bashrc file for editing
# scroll down to section 1 where modules are being loaded and add the following command making sure to include the leading spaces to match the formatting.
  module load bowtie/2.2.6
ctrl-o  # write out (save) the file; press Enter to confirm the file name
ctrl-x  # exit nano
source .bashrc  # this is the equivalent of logging out of lonestar and logging back in
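If you would rather not edit the file by hand, the same change can be sketched non-interactively. This is only an illustration: the helper name add_line_once is hypothetical, and unlike the instructions above it appends to the end of the file rather than placing the line in section 1. It adds the line only when an identical line is not already present, and the demo runs against a scratch file rather than your real .bashrc.

```shell
# Append a line to a file only if that exact line is not already there.
# add_line_once is a hypothetical helper name used for illustration.
add_line_once() {
    line=$1
    file=$2
    touch "$file"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Try it on a scratch file first rather than your real ~/.bashrc.
demo_rc=$(mktemp)
add_line_once "  module load bowtie/2.2.6" "$demo_rc"
add_line_once "  module load bowtie/2.2.6" "$demo_rc"   # second call changes nothing
cat "$demo_rc"
```

Once it behaves as expected, the second argument could be "$HOME/.bashrc", followed by the source .bashrc step above.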

breseq will now run using the breseq command

Data

The data files for all the class tutorials are located in the following directory:

/corral-repl/utexas/BioITeam/ngs_course

Copy this directory to a new directory called GVA2016 in your work directory.

Solution:

cp -r /corral-repl/utexas/BioITeam/ngs_course $WORK/GVA2016  # both the source and the destination are given as absolute paths, so this command can be run from any directory. Absolute paths start with a /; relative paths start with a . or with the name of a folder/file (which is treated as if it started with ./)

Now that we have all of the data that we want in our work directory, let's copy the lambda phage data to scratch so we can begin using it. The following 2 files are in the lambda_mixed_pop/data directory inside of your newly created GVA2016 directory:

File Name	Description	Sample
`lambda_mixed_population.fastq`	Single-end Illumina 36-bp reads	Evolved lambda bacteriophage mixed population genome sequencing
`lambda.gbk`	Reference Genome	Bacteriophage lambda

See if you can figure out how to copy them to a new directory on $SCRATCH called BDIB_breseq_tutorial_1.

Solution:

mkdir $SCRATCH/BDIB_breseq_tutorial_1
cp $WORK/GVA2016/lambda_mixed_pop/data/lambda_mixed_population.fastq $SCRATCH/BDIB_breseq_tutorial_1
cp $WORK/GVA2016/lambda_mixed_pop/data/lambda.gbk $SCRATCH/BDIB_breseq_tutorial_1
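A quick sanity check on the copied reads is to estimate the fold coverage as (number of reads x read length) / genome size. The sketch below assumes a plain FASTQ file with exactly 4 lines per read and a single fixed read length; the function name estimate_coverage and the tiny demo file are made up for illustration.

```shell
# Estimate fold coverage as (number of reads x read length) / genome size.
# Assumes a plain 4-lines-per-record FASTQ file and one fixed read length.
estimate_coverage() {
    fastq=$1
    read_len=$2
    genome_size=$3
    lines=$(wc -l < "$fastq")
    reads=$((lines / 4))
    echo $((reads * read_len / genome_size))   # integer division, rounds down
}

# Tiny made-up FASTQ with two 4-bp reads, just to show the call.
demo_fq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n' > "$demo_fq"
estimate_coverage "$demo_fq" 4 2   # 2 reads x 4 bp over a 2-bp "genome" -> 4
```

For the real data the call would look like estimate_coverage lambda_mixed_population.fastq 36 48000, using the read length from the table above and an approximate genome size for lambda.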

Running breseq

Because this data set is relatively small (roughly 100x coverage of a 48,000 bp genome), a breseq run will take < 5 minutes, but it is computationally intense enough that it should not be run on the head node. In our setup/linux tutorial earlier today we showed you how to use launcher_creator.py to generate the slurm file necessary to run on the compute nodes, but jobs submitted that way can take some time to reach the front of the queue and actually start running. Instead, we are going to use a priority-access reservation set up specially for the BDIB summer school; you would not normally have access to it, but it should guarantee that your idev session starts immediately. Copy and paste the following commands:

breseq prep and commands

idev  -m 240 -r CCBB_Bio_Summer_School_2016_Day1 -A UT-2015-05-18 -N 1
# wait for your command prompt to come back
cd $SCRATCH/BDIB_breseq_tutorial_1
breseq -j 24 -r lambda.gbk lambda_mixed_population.fastq > log.txt &

A bunch of progress messages will stream by during the breseq run, which would clutter the screen if not for the redirection to the log.txt file. The & at the end of the line tells the system to run the previous command in the background, which lets you keep typing and executing other commands while breseq runs. The output text details several steps in a pipeline that combines mapping (using bowtie2, which is why we loaded the bowtie module above), variant calling, annotating mutations, etc. You can examine them by peeking in the log.txt file as your job runs using tail -f log.txt. The -f option means to "follow" the file and keep giving you output from it as it gets bigger. To stop the tail command hit ctrl-c, which is the keyboard interrupt signal. While breseq is running, let's look at what the different parts of the command are actually doing:

part	purpose
-j 24	Use 24 processors (the max available on lonestar5 nodes)
-r lambda.gbk	Use the lambda.gbk file as the reference to identify specific mutations
lambda_mixed_population.fastq	breseq assumes any argument not preceded by a - option to be an input fastq file to be used for mapping
> log.txt	redirect the output to the log.txt file
&	run the preceding command in the background
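The redirection and background pieces of that command can be tried on their own without breseq. In this sketch, slow_job is a hypothetical stand-in for a long-running program, and wait is the standard shell builtin that blocks until the background jobs started from the current shell have finished.

```shell
# slow_job is a hypothetical stand-in for a long-running program such as breseq.
slow_job() {
    echo "step 1 of 2"
    sleep 1
    echo "step 2 of 2"
}

demo_dir=$(mktemp -d)
slow_job > "$demo_dir/log.txt" &     # stdout goes to the file; & returns the prompt immediately
echo "prompt is free while the job runs"
wait                                 # block until the background job has finished
cat "$demo_dir/log.txt"
```

In an interactive session you would normally skip the wait and use tail -f on the log instead, as described above.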

Looking at breseq predictions

breseq will produce a lot of directories with names beginning 01_sequence_conversion, 02_reference_alignment, ... Each of these contains intermediate files that can be deleted when the run completes, or explored if you are interested in the inner workings of what is going on. More importantly, breseq will also produce two directories called data and output, which contain the files used to create the .html output files and the .html output files themselves, respectively. The most interesting files are the .html files, which can't be viewed directly on lonestar. Therefore we first need to copy the output directory back to your desktop computer.

Before proceeding you will need to let breseq finish running. It is very likely that it will be done by the time you have finished reading the above, but make sure that it is complete by using the tail command. If it is not done yet, it will be very shortly.

We have previously covered using scp to transfer files, but here we present another detailed example.

To use scp you will need to run it in a terminal that is on your desktop and not on the remote TACC system. It can be tricky to figure out where the files are on the remote TACC system, because your desktop won't understand what $HOME, $WORK, $SCRATCH mean (they are only defined on TACC).

To figure out the full path to your file, you can use the pwd command in your terminal on TACC in the window that you ran breseq in (it should contain an "output" folder). Rather than copying the entire contents of the folder which can be rather large, we are going to add a twist of compressing the entire folder into a single compressed archive using the tar command so that the size will be smaller and it will transfer faster:

Command to type in TACC

tar -czvf output.tar.gz output  # the czvf options in order mean Create, gZip, Verbose, File (the archive name follows)
pwd

You can then copy and paste that information (in the correct position) into the scp command on the desktop's command line:

Command to type in the desktop's terminal window

scp -r <username>@ls5.tacc.utexas.edu:<the_directory_returned_by_pwd>/output.tar.gz .
tar -xvzf output.tar.gz  # the new "x" option at the front means eXtract
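If you want to rehearse the archive, transfer, and extract steps before touching real results, the round trip can be sketched entirely on one machine. Everything here is a stand-in: the remote and desktop folders are temporary directories, the index.html is a dummy file, and a local cp takes the place of scp.

```shell
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/remote/output" "$demo_dir/desktop"
echo "<html>demo</html>" > "$demo_dir/remote/output/index.html"

# "TACC side": create the compressed archive from inside the run directory.
( cd "$demo_dir/remote" && tar -czf output.tar.gz output )

# Stand-in for the scp step: a local copy instead of a network transfer.
cp "$demo_dir/remote/output.tar.gz" "$demo_dir/desktop/"

# "Desktop side": extract the archive, which recreates the output folder.
( cd "$demo_dir/desktop" && tar -xzf output.tar.gz )
ls "$demo_dir/desktop/output"
```

The v option is left out here only to keep the demo quiet; it does not change what is archived.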

Navigate to the output directory in your file browser (e.g. Finder) and open the file called index.html. This will open the results in a web browser window where you can click through the different mutations and other information and see the evidence supporting each one. The summary page provides useful information about the percent of reads mapping to the genome as well as the overall coverage of the genome. The Mutation Predictions page is where most of the analysis time is spent, determining which mutations are important (and, more rarely, inaccurate).

Feel free to click around through the different mutations and examine their evidence when you have time, but first start the next breseq run so that it can finish running while you look at the data. We will go over the different types of mutations and the evidence for them as a group towards the end of class today, but additional information on analyzing the output can be found at the following reference:

Deatherage, D.E., Barrick, J.E. (2014) Identification of mutations in laboratory-evolved microbes from next-generation sequencing data using breseq. Methods Mol. Biol. 1151:165-188.

Example 2: E. coli data sets

Now we'll try running breseq on some Escherichia coli genomes from an evolution experiment. These files are larger.

Data

The data files for this example are in the following path. Go ahead and copy them to a new folder in your $SCRATCH directory called BDIB_breseq_tutorial_2:

location of data files

$WORK/GVA2016/ecoli_clones/data

Solution:

mkdir $SCRATCH/BDIB_breseq_tutorial_2
cp $WORK/GVA2016/ecoli_clones/data/* $SCRATCH/BDIB_breseq_tutorial_2
cd $SCRATCH/BDIB_breseq_tutorial_2

If everything worked correctly, you should see the following files. We've provided a bit more context about what those files actually are:

File Name	Description	Sample
`SRR030252_1.fastq SRR030252_2.fastq`	Paired-end Illumina 36-bp reads	0K generation evolved E. coli strain
`SRR030253_1.fastq SRR030253_2.fastq`	Paired-end Illumina 36-bp reads	2K generation evolved E. coli strain
`SRR030254_1.fastq SRR030254_2.fastq`	Paired-end Illumina 36-bp reads	5K generation evolved E. coli strain
`SRR030255_1.fastq SRR030255_2.fastq`	Paired-end Illumina 36-bp reads	10K generation evolved E. coli strain
`SRR030256_1.fastq SRR030256_2.fastq`	Paired-end Illumina 36-bp reads	15K generation evolved E. coli strain
`SRR030257_1.fastq SRR030257_2.fastq`	Paired-end Illumina 36-bp reads	20K generation evolved E. coli strain
`SRR030258_1.fastq SRR030258_2.fastq`	Paired-end Illumina 36-bp reads	40K generation evolved E. coli strain
`NC_012967.1.fasta`	Reference Genome	E. coli B str. REL606

Running breseq on TACC

breseq will take a little longer to run on these sequences, so this gives us an opportunity to run several commands at the same time, making use of the multiple cores on a single compute node. You'll want each command (line) in the commands file to look something like this:

breseq -j 3 -r NC_012967.1.gbk -o output_<XX>K SRR030252_1.fastq SRR030252_2.fastq &> <XX>K.log.txt

Notice we've added some additional options:

part	purpose
&> <XX>K.log.txt	Redirect both the standard output and the standard error streams to a file called <XX>K.log.txt. It is important that you replace the <XX> so that each run writes to a different file, but KEEP the &> as that is what tells the command line to send both streams to that file.
-o output_<XX>K	Put all of breseq's output directories in the specified directory instead of the current directory. If we don't include this (and change the <XX>), all of the runs will write their output on top of one another. The program will undoubtedly get confused, possibly crash, and generally it will be a mess.
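The difference between redirecting one stream and both can be seen with a toy command. In this sketch noisy is a hypothetical function that writes one line to each stream; the portable spelling > file 2>&1 is used because it behaves the same as bash's &> file shorthand.

```shell
# noisy is a hypothetical stand-in that writes to both output streams.
noisy() {
    echo "progress message"            # standard output
    echo "warning message" >&2         # standard error
}

demo_dir=$(mktemp -d)
# > captures only standard output; standard error is discarded here to keep the screen clean.
noisy > "$demo_dir/stdout_only.txt" 2> /dev/null
# Both streams end up in the file; in bash this is the same as: noisy &> both.txt
noisy > "$demo_dir/both.txt" 2>&1
grep -c message "$demo_dir/stdout_only.txt"
grep -c message "$demo_dir/both.txt"
```

Capturing standard error as well is what keeps warnings and error messages from several simultaneous runs from mixing together on your screen.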

Commands solution:

Put the following commands into a new file called "commands" using nano.

Example commands file

breseq -j 3 -r NC_012967.1.gbk -o output_00K SRR030252_1.fastq SRR030252_2.fastq &> 00K.log.txt &
breseq -j 3 -r NC_012967.1.gbk -o output_02K SRR030253_1.fastq SRR030253_2.fastq &> 02K.log.txt &
breseq -j 3 -r NC_012967.1.gbk -o output_05K SRR030254_1.fastq SRR030254_2.fastq &> 05K.log.txt &
breseq -j 3 -r NC_012967.1.gbk -o output_10K SRR030255_1.fastq SRR030255_2.fastq &> 10K.log.txt &
breseq -j 3 -r NC_012967.1.gbk -o output_15K SRR030256_1.fastq SRR030256_2.fastq &> 15K.log.txt &
breseq -j 3 -r NC_012967.1.gbk -o output_20K SRR030257_1.fastq SRR030257_2.fastq &> 20K.log.txt &
breseq -j 3 -r NC_012967.1.gbk -o output_40K SRR030258_1.fastq SRR030258_2.fastq &> 40K.log.txt &
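Typing seven nearly identical lines invites typos, so one alternative is to generate the file from the accession/generation pairs in the table of input files. This is a sketch, not part of the original exercise: the function name make_commands is hypothetical, and the demo writes to a temporary file so nothing in your working directory is overwritten.

```shell
# Write one breseq command per sample to the file named in $1.
# The accession/label pairs are taken from the table of input files.
make_commands() {
    out=$1
    : > "$out"    # start with an empty file
    while read -r acc label; do
        printf 'breseq -j 3 -r NC_012967.1.gbk -o output_%s %s_1.fastq %s_2.fastq &> %s.log.txt &\n' \
            "$label" "$acc" "$acc" "$label" >> "$out"
    done <<'EOF'
SRR030252 00K
SRR030253 02K
SRR030254 05K
SRR030255 10K
SRR030256 15K
SRR030257 20K
SRR030258 40K
EOF
}

demo_commands=$(mktemp)
make_commands "$demo_commands"
cat "$demo_commands"
```

Running make_commands commands in the tutorial directory would produce the same file as the hand-written version above.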

how to execute all the commands at once

chmod +x commands  # makes the commands file executable
./commands  # executes the commands file
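Because every line in the commands file ends with &, the script returns to the prompt right away while the runs continue in the background. One way to see which runs have finished is to look for each run's output/index.html. The sketch below rests on an assumption that is not confirmed here, namely that index.html only appears once a run has completed; the function name report_runs and the demo directories are made up for illustration.

```shell
# List each output_* directory and whether its output/index.html exists yet.
# Assumption: index.html appears only once a run has finished.
report_runs() {
    base=$1
    for d in "$base"/output_*; do
        [ -d "$d" ] || continue
        if [ -f "$d/output/index.html" ]; then
            echo "$(basename "$d") done"
        else
            echo "$(basename "$d") not finished"
        fi
    done
}

# Demo on made-up directories: one "finished" run and one still in progress.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/output_00K/output" "$demo_dir/output_02K"
touch "$demo_dir/output_00K/output/index.html"
report_runs "$demo_dir"
```

In the tutorial directory the call would be report_runs . and the matching log file is the place to look if a run never reaches the done state.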

This will take several minutes to finish, making it a good opportunity to go back through some of the more detailed information in the ls5 and linux tutorial from earlier, and interrogate the output from the lambda phage data.

Examining breseq results

Exercise: Can you figure out how to archive all of the output directories and copy only those files (and not all of the very large intermediate files) back to your machine? - without deleting any files?

Hint (without the answer):

You will want to use the tar command again, but you will need to use a wildcard to specify what goes into the compressed file, and only the output directories within each of the wildcard-matched directories.

Solution:

tar -cvzf output.tgz output_*/output
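To convince yourself that the wildcard picks up only the output folders, you can list an archive's contents with tar -t instead of extracting it. The sketch below builds a made-up directory layout with two runs; the file name reads.sam is a hypothetical placeholder for a large intermediate file.

```shell
demo_dir=$(mktemp -d)
# Mimic two runs, each with a small output folder and an intermediate folder.
for label in 00K 02K; do
    mkdir -p "$demo_dir/output_$label/output" "$demo_dir/output_$label/02_reference_alignment"
    echo "report" > "$demo_dir/output_$label/output/index.html"
    echo "placeholder for a large intermediate file" > "$demo_dir/output_$label/02_reference_alignment/reads.sam"
done

# The wildcard expands to output_00K/output output_02K/output, so only those are archived.
( cd "$demo_dir" && tar -czf output.tgz output_*/output )

# List what went in, without extracting anything.
tar -tzf "$demo_dir/output.tgz"
```

Nothing is deleted by either command, which satisfies the last condition of the exercise.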

Here are the commands we showed you for the previous example (with the trick of getting a single compressed output directory you just learned) to transfer so you don't have to scroll back and forth. See if you can remember how to do it without going back over the lesson.

To use scp you will need to run it in a terminal that is on your desktop and not on the remote TACC system. It can be tricky to figure out where the files are on the remote TACC system, because your desktop won't understand what $HOME, $WORK, $SCRATCH mean (they are only defined on TACC).

To figure out the full path to your files, you can use the pwd command in your terminal on TACC in the window that you ran breseq in (for this example it should contain the output_* folders). Rather than copying the entire contents of those folders, which can be rather large, we again compress just the output folders into a single archive using the tar command so that the size will be smaller and it will transfer faster:

Command to type in TACC

tar -czvf output.tar.gz output_*/output  # the czvf options in order mean Create, gZip, Verbose, File (the archive name follows)
pwd

You can then copy and paste that information (in the correct position) into the scp command on the desktop's command line:

Command to type in the desktop's terminal window

scp -r <username>@ls5.tacc.utexas.edu:<the_directory_returned_by_pwd>/output.tar.gz .
tar -xvzf output.tar.gz  # the new "x" option at the front means eXtract

Click around in the results and see the different types of mutations you can detect.

Lonestar5 Breseq Tutorial
