Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagebash
titlebreseq prep and commands
idev  -m 240 -r CCBB_Bio_Summer_School_2016_Day1 -A UT-2015-05-18 -N 12 -n 8
 
# waitThis should forreturn yourthe commandfollowing:
prompt# to comeWe backfound cd $SCRATCH/BDIB_breseq_tutorial_1
breseq -j 24 -r lambda.gbk lambda_mixed_population.fastq > log.txt &

A bunch of progress messages will stream by during the breseq run which would clutter the screen if not for the redirection to the log.txt file. The & at the end of the line tells the system to run the previous command in the background which will enable you to still type and execute other commands while breseq runs. The output text details several steps in a pipeline that combines the steps of mapping (using SSAHA2), variant calling, annotating mutations, etc. You can examine them by peeking in the log.txt file as your job runs using tail -f log.txt. The -f option means to "follow" the file and keep giving you output from it as it gets bigger. To stop the tailing command hit ctrl-c which is the keyboard interrupt signal. While breseq is running lets look at what the different parts of the command are actually doing:

partpuprose
-j 24Use 24 processors (the max available on lonestar5 nodes)
-r lambda.gbkUse the lambda.gbk file as the reference to identify specific mutations
lambda_mixed_population.fastqbreseq assumes any argument not preceded by a - option to be an input fastq file to be used for mapping
> log.txtredirect the output the log.txt file
&run the preceding command in the background

 

Looking at breseq predictions

breseq will produce a lot of directories beginning 01_sequence_conversion02_reference_alignment, ... Each of these contains intermediate files that can be deleted when the run completes, or explored if you are interested in the inner guts of what is going on. More importantly, breseq will also produce two directories called: data and output which contain files used to create .html output files and .html output files respectively. The most interesting files are the .html files which can't be viewed directly on lonestar. Therefore we first need to copy the output directory back to your desktop computer.

Warning

Before preceding you will need to let breseq finish running. It is very likely that it will be done by the time you have finished reading the above, but make sure that it is complete by using the tail command. If it is not done yet, it will be very shortly.

Expand
titleWe have previously covered using scp to transfer files, but here we present another detailed example. Click to expand.

To use scp you will need to run it in a terminal that is on your desktop and not on the remote TACC system. It can be tricky to figure out where the files are on the remote TACC system, because your desktop won't understand what $HOME, $WORK, $SCRATCH mean (they are only defined on TACC).

To figure out the full path to your file, you can use the pwd command in your terminal on TACC in the window that you ran breseq in (it should contain an "output" folder). Rather than copying the entire contents of the folder which can be rather large, we are going to add a twist of compressing the entire folder into a single compressed archive using the tar command so that the size will be smaller and it will transfer faster:

Code Block
languagebash
titleCommand to type in TACC
tar -czvf output.tar.gz output  # the czvf options in order mean Create, Zip, Verbose, Force
pwd

Then you can then copy paste that information (in the correct position) into the scp command on the desktop's command line:

Code Block
languagebash
titleCommand to type in the desktop's terminal window
scp -r <username>@ls5.tacc.utexas.edu:<the_directory_returned_by_pwd>/output.tar.gz .
tar -xvzf output.tar.gz  # the new "x" option at the front means eXtract 

Navigate to the output directory in the finder and open the a file called index.html. This will open the results in a web browser window that you can click through different mutations and other information and see the evidence supporting it. The summary page provides useful information about the percent of reads mapping to the genome as well as the overall coverage of the genome. The Mutation Predictions page is where most of the analysis time is spent in determining which mutations are important (and more rarely inaccurate).

Feel free to click around through the different mutations and examine their evidence when you have time, but first start the next breseq run so that it can finish running while you look at the data. We will go over the different types of mutations and the evidence for them as a group towards the end of class today, but additional information on analyzing the output can be found at the following reference:

  • Deatherage, D.E.Barrick, J.E.. (2014) Identification of mutations in laboratory-evolved microbes from next-generation sequencing data using breseqMethods Mol. Biol. 1151:165-188. «PubMed»

 

Example 2: E. coli data sets

Now we'll try running breseq on some Escherichia coli genomes from an evolution experiment. These files are larger. 

Data

The data files for this example are in the following path. Go ahead and copy them to a new folder in your $SCRATCH directory called BDIB_breseq_tutorial_2:

Code Block
titlelocation of data files
$WORK/GVA2016/ecoli_clones/data
Code Block
languagebash
titleClick here for the solution
collapsetrue
mkdir $SCRATCH/BDIB_breseq_tutorial_2
cp $WORK/GVA2016/ecoli_clones/data/* $SCRATCH/BDIB_breseq_tutorial_2
cd $SCRATCH/BDIB_breseq_tutorial_2

if everything worked correctly, you should see the following files. We've provided a bit more context to what those files actually are:

File Name

Description

Sample

SRR030252_1.fastq SRR030252_2.fastq

Paired-end Illumina 36-bp reads

0K generation evolved E. coli strain

SRR030253_1.fastq SRR030253_2.fastq

Paired-end Illumina 36-bp reads

2K generation evolved E. coli strain

SRR030254_1.fastq SRR030254_2.fastq

Paired-end Illumina 36-bp reads

5K generation evolved E. coli strain

SRR030255_1.fastq SRR030255_2.fastq

Paired-end Illumina 36-bp reads

10K generation evolved E. coli strain

SRR030256_1.fastq SRR030256_2.fastq

Paired-end Illumina 36-bp reads

15K generation evolved E. coli strain

SRR030257_1.fastq SRR030257_2.fastq

Paired-end Illumina 36-bp reads

20K generation evolved E. coli strain

SRR030258_1.fastq SRR030258_2.fastq

Paired-end Illumina 36-bp reads

40K generation evolved E. coli strain

NC_012967.1.fasta

Reference Genome

E. coli B str. REL606

Running breseq on TACC

breseq will take a little longer to run on these sequences, so this give us an opportunity to run several commands at the same time making use of the multiple cores on a single processor. You'll want each command (line) in the commands file to look something like this:

Code Block
breseq -j 3 -r NC_012967.1.gbk -o output_<XX>K SRR030252_1.fastq SRR030252_2.fastq &> <XX>K.log.txt

Notice we've added some additional options:

partpuprose
&> <XX>00K.log.txtRedirect both the standard output and the standard error streams to a file called <XX>00k.log.txt. It is important that you replace the <XX> to send it to different files, but KEEP the &> as those are telling the command line to send the streams to that file.
-o output_<xx>00kall of those output directories should be put in the specified directory, instead of the current directory. If we don't include this (and change the <XX>), then we will end up writing the output from all of the runs on top of one other. The program will undoubtedly get confused, possibly crash, and generally it will be a mess.

 

 

Expand
titleClick here for commands solution

Put the following commands into a new file called "commands" using nano.

Code Block
titleExample commands file
breseq -j 3 -r NC_012967.1.gbk -o output_00K SRR030252_1.fastq SRR030252_2.fastq &> 00K.log.txt &
breseq -j 3 -r NC_012967.1.gbk -o output_02K SRR030253_1.fastq SRR030253_2.fastq &> 02K.log.txt &
breseq -j 3 -r NC_012967.1.gbk -o output_05K SRR030254_1.fastq SRR030254_2.fastq &> 05K.log.txt &
breseq -j 3 -r NC_012967.1.gbk -o output_10K SRR030255_1.fastq SRR030255_2.fastq &> 10K.log.txt &
breseq -j 3 -r NC_012967.1.gbk -o output_15K SRR030256_1.fastq SRR030256_2.fastq &> 15K.log.txt &
breseq -j 3 -r NC_012967.1.gbk -o output_20K SRR030257_1.fastq SRR030257_2.fastq &> 20K.log.txt &
breseq -j 3 -r NC_012967.1.gbk -o output_40K SRR030258_1.fastq SRR030258_2.fastq &> 40K.log.txt &
Code Block
titlehow to execute all the commands at once
chmod +x commands  # makes the commands file executable
./commands  # executes the commands file
This will take several minutes to finish, making it a good opportunity to go back through some of the more detailed information in the ls5 and linux tutorial from earlier, and interrogate the output from the lambda phage data.
an ACTIVE reservation request for you, named CCBB_Bio_Summer_School_2016_Day1.
#  Do you want to use it for your interactive session?
#  Enter y/n [default y]: 


# If for any reason you don't see the above message let us know by raising your hand.
 
# Your answer should be y, which should return the following:
#  Reservation      : --reservation=CCBB_Bio_Summer_School_2016_Day1 (ACTIVE)
 
# Some of you may see a new prompt stating something like the following:
# We need a project to charge for interactive use.
# We will be using a dummy job submission to determine your project(s).
# We will store your (selected) project $HOME/.idevrc file.
# Please select the NUMBER of the project you want to charge.\n
# 1 OTHER_PROJECTS
# 2 UT-2015-05-18
# Please type the NUMBER(default=1) and hit return:
 
# If you see this message, again let us know.
 
# You will then see something similar to the following:
# job status:  PD
# job status:  R
# --> Job is now running on masternode= nid00032...OK
# --> Sleeping for 7 seconds...OK
# --> Checking to make sure your job has initialized an env for you....OK
# --> Creating interactive terminal session (login) on master node nid00032.
 
# If this takes more than 1 minute get our attention, otherwise continue with the following
cd $SCRATCH/BDIB_breseq_tutorial_1
module load intel
module load Rstats
breseq -j 48 -r lambda.gbk lambda_mixed_population.fastq &> log.txt &

A bunch of progress messages will stream by during the breseq run which would clutter the screen if not for the redirection to the log.txt file. The & at the end of the line tells the system to run the previous command in the background which will enable you to still type and execute other commands while breseq runs. The output text details several steps in a pipeline that combines the steps of mapping (using SSAHA2), variant calling, annotating mutations, etc. You can examine them by peeking in the log.txt file as your job runs using tail -f log.txt. The -f option means to "follow" the file and keep giving you output from it as it gets bigger. To stop the tailing command hit ctrl-c which is the keyboard interrupt signal. While breseq is running lets look at what the different parts of the command are actually doing:

partpuprose
-j 24Use 24 processors (the max available on lonestar5 nodes)
-r lambda.gbkUse the lambda.gbk file as the reference to identify specific mutations
lambda_mixed_population.fastqbreseq assumes any argument not preceded by a - option to be an input fastq file to be used for mapping
&> log.txtredirect the output and error to the file log.txt
&run the preceding command in the background

 

This will finish very quickly (likely before you begin reading this) with a final line of "Creating index HTML table...". Before we start looking at what mutations were actually identified, we are going to start the second example so that it can be running while we investigate it making better use of our time.

Example 2: E. coli data sets

Now we'll try running breseq on some Escherichia coli genomes from an evolution experiment. These files are larger. 

Data

The data files for this example are in the following path. Go ahead and copy them to a new folder in your $SCRATCH directory called BDIB_breseq_tutorial_2:

Code Block
titlelocation of data files
$WORK/GVA2016/ecoli_clones/data
Code Block
languagebash
titleClick here for the solution
collapsetrue
mkdir $SCRATCH/BDIB_breseq_tutorial_2
cp $WORK/GVA2016/ecoli_clones/data/* $SCRATCH/BDIB_breseq_tutorial_2
cd $SCRATCH/BDIB_breseq_tutorial_2

if everything worked correctly, you should see the following files. We've provided a bit more context to what those files actually are:

File Name

Description

Sample

SRR030252_1.fastq SRR030252_2.fastq

Paired-end Illumina 36-bp reads

0K generation evolved E. coli strain

SRR030253_1.fastq SRR030253_2.fastq

Paired-end Illumina 36-bp reads

2K generation evolved E. coli strain

SRR030254_1.fastq SRR030254_2.fastq

Paired-end Illumina 36-bp reads

5K generation evolved E. coli strain

SRR030255_1.fastq SRR030255_2.fastq

Paired-end Illumina 36-bp reads

10K generation evolved E. coli strain

SRR030256_1.fastq SRR030256_2.fastq

Paired-end Illumina 36-bp reads

15K generation evolved E. coli strain

SRR030257_1.fastq SRR030257_2.fastq

Paired-end Illumina 36-bp reads

20K generation evolved E. coli strain

SRR030258_1.fastq SRR030258_2.fastq

Paired-end Illumina 36-bp reads

40K generation evolved E. coli strain

NC_012967.1.fasta

Reference Genome

E. coli B str. REL606

Running breseq on TACC

breseq will take a little longer to run on these sequences, so this give us an opportunity to run several commands at the same time making use of the multiple cores on a single processor. You'll want each command (line) in the commands file to look something like this:

Code Block
breseq -j 12 -r NC_012967.1.gbk -o output_<XX>K SRR030252_1.fastq SRR030252_2.fastq &> <XX>K.log.txt

Notice we've added some additional options:

partpuprose
&> <XX>00K.log.txtRedirect both the standard output and the standard error streams to a file called <XX>00k.log.txt. It is important that you replace the <XX> to send it to different files, but KEEP the &> as those are telling the command line to send the streams to that file.
-o output_<xx>00kall of those output directories should be put in the specified directory, instead of the current directory. If we don't include this (and change the <XX>), then we will end up writing the output from all of the runs on top of one other. The program will undoubtedly get confused, possibly crash, and generally it will be a mess.
Expand
titleClick here for commands solution

Put the following commands into a new file called "commands" using nano.

Code Block
titleExample commands file
breseq -j 12 -r NC_012967.1.gbk -o output_00K SRR030252_1.fastq SRR030252_2.fastq &> 00K.log.txt &
breseq -j 12 -r NC_012967.1.gbk -o output_02K SRR030253_1.fastq SRR030253_2.fastq &> 02K.log.txt &
breseq -j 12 -r NC_012967.1.gbk -o output_05K SRR030254_1.fastq SRR030254_2.fastq &> 05K.log.txt &
breseq -j 12 -r NC_012967.1.gbk -o output_10K SRR030255_1.fastq SRR030255_2.fastq &> 10K.log.txt &
breseq -j 12 -r NC_012967.1.gbk -o output_15K SRR030256_1.fastq SRR030256_2.fastq &> 15K.log.txt &
breseq -j 12 -r NC_012967.1.gbk -o output_20K SRR030257_1.fastq SRR030257_2.fastq &> 20K.log.txt &
breseq -j 12 -r NC_012967.1.gbk -o output_40K SRR030258_1.fastq SRR030258_2.fastq &> 40K.log.txt &
Expand
titleBonus question ... why did we use -j 12?
  1. Each lonestar5 compute node has 48 processors available.
  2. We requested 2 nodes (-N 2) and 8 tasks (-n 8)
  3. We have 7 samples to run, so by requesting 12 processors, we allow all 7 samples to start at the same time leaving us with 12 unused processors
  4. If we had requested 13 processors for each sample in theory there would have only been 5 unused processors, BUT in actuality, only 6 samples would have started because processors can't be shared across nodes.

 

Code Block
titlehow to execute all the commands at once
chmod +x commands  # makes the commands file executable
./commands  # executes the commands file

This will take several minutes to finish, making it a good opportunity to interrogate the output from the lambda phage data.

Looking at breseq predictions

breseq produced a lot of directories beginning 01_sequence_conversion02_reference_alignment, ... Each of these contains intermediate files that can be deleted when the run completes, or explored if you are interested in the inner guts of what is going on. More importantly, breseq will also produce two directories called: data and output which contain files used to create .html output files and .html output files respectively. The most interesting files are the .html files which can't be viewed directly on lonestar. Therefore we first need to copy the output directory back to your desktop computer. Go back to the first tutorial (BDIB_breseq_tutorial_1) and transfer the contents of the output directory back to your local computer.

 

Expand
titleWe have previously covered using scp to transfer files, but here we present another detailed example. Click to expand.

To use scp you will need to run it in a terminal that is on your desktop and not on the remote TACC system. It can be tricky to figure out where the files are on the remote TACC system, because your desktop won't understand what $HOME, $WORK, $SCRATCH mean (they are only defined on TACC).

To figure out the full path to your file, you can use the pwd command in your terminal on TACC in the window that you ran breseq in (it should contain an "output" folder). Rather than copying the entire contents of the folder which can be rather large, we are going to add a twist of compressing the entire folder into a single compressed archive using the tar command so that the size will be smaller and it will transfer faster:

Code Block
languagebash
titleCommand to type in TACC
cd $SCRATCH/BDIB_breseq_tutorial_1
tar -czvf output.tar.gz output  # the czvf options in order mean Create, Zip, Verbose, Force
pwd

Then you can then copy paste that information (in the correct position) into the scp command on the desktop's command line:

Code Block
languagebash
titleCommand to type in the desktop's terminal window
scp -r <username>@ls5.tacc.utexas.edu:<the_directory_returned_by_pwd>/output.tar.gz .
 
# Enter your password and wait for the file transfer to complete
 
tar -xvzf output.tar.gz  # the new "x" option at the front means eXtract 

Navigate to the output directory in the finder and open the a file called index.html. This will open the results in a web browser window that you can click through different mutations and other information and see the evidence supporting it. The summary page provides useful information about the percent of reads mapping to the genome as well as the overall coverage of the genome. The Mutation Predictions page is where most of the analysis time is spent in determining which mutations are important (and more rarely inaccurate).

Click around through the different mutations and examine their evidence to see what kinds of mutations you can identify. Interact with your instructors, and show us what different types of mutations you are able to identify, or ask us what mutations you don't understand. Periodically check the 2nd example and when it is done, transfer that data back and begin to look at the data as well. Additional information on analyzing the output can be found at the following reference:

  • Deatherage, D.E.Barrick, J.E.. (2014) Identification of mutations in laboratory-evolved microbes from next-generation sequencing data using breseqMethods Mol. Biol. 1151:165-188. «PubMed»

 

Examining breseq results

...