Breseq Basics

Introduction

On Monday we introduced breseq to you but were limited by time and setting up our .bashrc files and other environmental things. Here we present an extremely quick breseq example so you can familiarize yourself with the breseq command before moving on to more advanced uses of breseq. breseq is one of (if not the absolute) best and most complete programs for bacterial variant identification not just in our opinion but by multiple people at conferences and those that have used it during this class. More information can be found in the following reference:

  • Deatherage, D.E.Barrick, J.E.. (2014) Identification of mutations in laboratory-evolved microbes from next-generation sequencing data using breseqMethods Mol. Biol. 1151:165-188. «PubMed»

Objectives

  1. Run a simple breseq analysis on a set of lambda phage sequencing data.
  2. Transfer output files off TACC to interrogate output and see mutation visualizations

Running breseq

First, we'll run breseq on a small data set to be sure that it is installed correctly, and to get a taste for what the output looks like. To do that we'll want to copy our data off of corral and onto scratch. The following 2 files are in the $BI/ngs_course/lambda_mixed_pop/data :

File Name

Description

Sample

lambda_mixed_population.fastq

Single-end Illumina 36-bp reads

Evolved lambda bacteriophage mixed population genome sequencing

lambda.gbk

Reference Genome

Bacteriophage lambda

See if you can figure out how to copy them to a new directory on $SCRATCH called BDIB_breseq_basic.

Click here for the solution
mkdir $SCRATCH/BDIB_breseq_basic
cp $BI/ngs_course/lambda_mixed_pop/data/* $SCRATCH/BDIB_breseq_basic

idev idev idev

By now hopefully you are guessing that you need to be on an idev node to be running a computationally intensive job like breseq. You can check this via the showq -u command.

idev command
idev  -m 120 -r CCBB_Bio_Summer_School_2016_Day3 -A UT-2015-05-18 -N 2 -n 8
basic breseq command
module load intel
module load Rstats
breseq -j 48 -r lambda.gbk lambda_mixed_population.fastq

The output text details several steps in a pipeline that combines the steps of mapping (using SSAHA2), variant calling, annotating mutations, etc. This should finish relatively quickly but while it runs lets look at what we put in the command line: 

partpuprose
-j 48Use 48 processors (the max available on lonestar5 nodes)
-r lambda.gbkUse the lambda.gbk file as the reference to identify specific mutations
lambda_mixed_population.fastqbreseq assumes any argument not preceded by a "-" or "--" option to be an input fastq file to be used for mapping

Once you see a line that reads "Creating index HTML table..." breseq has finished running

Looking at breseq predictions

breseq produced a lot of directories beginning 01_sequence_conversion02_reference_alignment, ... Each of these contains intermediate files that can be deleted when the run completes, or explored if you are interested in the inner guts of what is going on. More importantly, breseq will also produce two directories called: data and output which contain files used to create .html output files and .html output files respectively. The most interesting files are the .html files which can't be viewed directly on lonestar. Therefore we first need to copy the output directory back to your desktop computer. 

 We have previously covered using scp to transfer files, but here we present another detailed example. Click to expand.

To use scp you will need to run it in a terminal that is on your desktop and not on the remote TACC system. It can be tricky to figure out where the files are on the remote TACC system, because your desktop won't understand what $HOME, $WORK, $SCRATCH mean (they are only defined on TACC).

To figure out the full path to your file, you can use the pwd command in your terminal on TACC in the window that you ran breseq in (it should contain an "output" folder). Rather than copying the entire contents of the folder which can be rather large, we are going to add a twist of compressing the entire folder into a single compressed archive using the tar command so that the size will be smaller and it will transfer faster:

Command to type in TACC
cd $SCRATCH/BDIB_breseq_basic
tar -czvf output.tar.gz output  # the czvf options in order mean Create, Zip, Verbose, Force
pwd

Then you can then copy paste that information (in the correct position) into the scp command on the desktop's command line:

Command to type in the desktop's terminal window
cd Desktop
scp <username>@ls5.tacc.utexas.edu:<the_directory_returned_by_pwd>/output.tar.gz .
# Enter your password and wait for the file transfer to complete
tar -xvzf output.tar.gz  # the new "x" option at the front means eXtract 

Navigate to the output directory in the finder and open the a file called index.html. This will open the results in a web browser window that you can click through different mutations and other information and see the evidence supporting it. The summary page provides useful information about the percent of reads mapping to the genome as well as the overall coverage of the genome. The Mutation Predictions page is where most of the analysis time is spent in determining which mutations are important (and more rarely inaccurate).

 Can you identify the indel locations?
  1. 138 1bp deletion
  2. 14,266 G addition
  3. 20,832 C addition
  4. 37,815 1bp deletion
  5. 47,977 2bp deletion
 Can you identify the only structural variant in this data set?

There is a 5,996bp deletion starting at 21,738

While it is a somewhat unfair comparison, consider how long it took to complete this analysis compared to the SNV and SV tutorials and IGV tutorials from yesterday and today. You have access to better formatted data in a shorter amount of time using breseq which is what makes it so powerful

return to 2016 GVA course