Advanced breseq GVA2020

Overview

We analyzed a single lambda phage evolution experiment as both an introduction to variant visualization in the first part of Wednesday's class, but breseq is designed to be able to with "small" genomes, not just "tiny" genomes. It also has more advanced tools for visualizing mutations across multiple genomes and more complete identification of all mutations.

Prerequisite required

In order for this tutorial to work, you must have installed your own version of bowtie2 in this tutorial. If you have not the commands below will not run.

To verify these commands will work, use the 'which bowtie2' command and verify that bowtie2 resides in your $WORK directory not /opt


Learning Objectives

  1. Run breseq on 7 different E. coli clones evolved for 40,000 generations
  2. Use the packaged gdtools commands to generate a comparison table of all mutations from all 7 samples
  3. Learn shortcuts for compressing, transferring, and extracting multiple folders/files with single commands
  4. Visualize the data on your local computer

Get some data

The data files for this example are in the following path: $BI/ngs_course/ecoli_clones/data/ go ahead and copy them to a new folder in your $SCRATCH directory called GVA_breseq_multi-sample:

 By this point in the class you should have a real good idea of how to do this, but click here if you need more help or to check your work
Suggested command
mkdir $SCRATCH/GVA_breseq_multi-sample
cd $SCRATCH/GVA_breseq_multi-sample
cp $BI/ngs_course/ecoli_clones/data/* .

if everything worked correctly, you should see the following files. We've provided a bit more context to what those files actually are:

File Name

Description

Sample

SRR030252_1.fastq SRR030252_2.fastq

Paired-end Illumina 36-bp reads

0K generation evolved E. coli strain

SRR030253_1.fastq SRR030253_2.fastq

Paired-end Illumina 36-bp reads

2K generation evolved E. coli strain

SRR030254_1.fastq SRR030254_2.fastq

Paired-end Illumina 36-bp reads

5K generation evolved E. coli strain

SRR030255_1.fastq SRR030255_2.fastq

Paired-end Illumina 36-bp reads

10K generation evolved E. coli strain

SRR030256_1.fastq SRR030256_2.fastq

Paired-end Illumina 36-bp reads

15K generation evolved E. coli strain

SRR030257_1.fastq SRR030257_2.fastq

Paired-end Illumina 36-bp reads

20K generation evolved E. coli strain

SRR030258_1.fastq SRR030258_2.fastq

Paired-end Illumina 36-bp reads

40K generation evolved E. coli strain

NC_012967.1.fasta

Reference Genome

E. coli B str. REL606


Setting up the run

As we begin working with multiple samples it becomes more important or at least helpful to organize our output in a more meaningful way. Typically this means storing input files in 1 location, output folders in another, and information about the analysis in a 3rd location.

Suggested command
mkdir run_output logs

Note in the above command we are making deliberate use of having a space on the command line to create 2 different directories as was mentioned in the introduction tutorial.

Normally, we would use all of our data and use both R1 and R2 data, but to speed things up for the class, we will only use R1 data. Put the following commands into a new file called "commands" using nano. As the run will take ~40 minutes to complete, don't worry about not understanding what each part of the commands mean, after the run is started, there will be a section explaining what was done.

Example commands file
breseq -j 6 -r NC_012967.1.gbk -o run_output/00K SRR030252_1.fastq &> logs/00K.log.txt &
breseq -j 6 -r NC_012967.1.gbk -o run_output/02K SRR030253_1.fastq &> logs/02K.log.txt &
breseq -j 6 -r NC_012967.1.gbk -o run_output/05K SRR030254_1.fastq &> logs/05K.log.txt &
breseq -j 6 -r NC_012967.1.gbk -o run_output/10K SRR030255_1.fastq &> logs/10K.log.txt &
breseq -j 6 -r NC_012967.1.gbk -o run_output/15K SRR030256_1.fastq &> logs/15K.log.txt &
breseq -j 6 -r NC_012967.1.gbk -o run_output/20K SRR030257_1.fastq &> logs/20K.log.txt &
breseq -j 6 -r NC_012967.1.gbk -o run_output/40K SRR030258_1.fastq &> logs/40K.log.txt &

Be sure to write (ctrl + o) before you exit (ctrl +x) nano.

If you are especially interested in seeing a 'full' data set, you can use the following commands, but the runs will take longer as they have more reads associated with them, and likely better submitted as a job to the queue and allowed to run overnight for your evaluation in tomorrow's class
breseq -j 6 -r NC_012967.1.gbk -o run_output/00K SRR030252_1.fastq SRR030252_2.fastq &> logs/00K.log.txt &
breseq -j 6 -r NC_012967.1.gbk -o run_output/02K SRR030253_1.fastq SRR030253_2.fastq &> logs/02K.log.txt &
breseq -j 6 -r NC_012967.1.gbk -o run_output/05K SRR030254_1.fastq SRR030254_2.fastq &> logs/05K.log.txt &
breseq -j 6 -r NC_012967.1.gbk -o run_output/10K SRR030255_1.fastq SRR030255_2.fastq &> logs/10K.log.txt &
breseq -j 6 -r NC_012967.1.gbk -o run_output/15K SRR030256_1.fastq SRR030256_2.fastq &> logs/15K.log.txt &
breseq -j 6 -r NC_012967.1.gbk -o run_output/20K SRR030257_1.fastq SRR030257_2.fastq &> logs/20K.log.txt &
breseq -j 6 -r NC_012967.1.gbk -o run_output/40K SRR030258_1.fastq SRR030258_2.fastq &> logs/40K.log.txt &

Executing the run

Remember to make sure you are on an idev done

For reasons discussed numerous times throughout the course already, please be sure you are on an idev done. It is unlikely that you are currently on an idev node as copying the files while on an idev node seems to be having problems as discussed. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial.

how to execute all the commands at once
chmod +x commands
./commands

^The chmod command makes the commands file executable.

^ './commands'  # executes the commands file. the './' portion tells the command line to execute a file from the current directory as it is not in your $PATH variable.

Since we started running all of the breseq runs in the background, and are redirecting all of the output to log files (underlined portions will be explained below), there won't actually be anything printed to the screen, and we will get our prompt back. DON'T worry! things are likely working fine despite you not seeing anything happening. 

These 2 commands are good ways to check the commands are running in the background
ps  
tail logs/*.log.txt  

^ the ps command will return information on what processes are currently running. You should see 7 breseq commands initially though as they run they will spawn their own processes with names like perl and bowtie.

^ the tail command will show the last 10 lines the run information about all 7 samples. Errors should be obvious.

You will want to periodically use these commands to check that the files continue to run. Eventually all 7 log files will show "+++   SUCCESSFULLY COMPLETED" as the last line. 

You can also check how many have finished with this command
tail -n 1 logs/*|grep -c "+++   SUCCESSFULLY COMPLETED"


Understanding the breseq command that is being used.

While it is running, we'll look back at what you did and explain some of the new things you are encountering/using. you should move onto reading through the next section learning how you will create comparison tables to easily visualize the data across the different samples. If you have all that figured out you may want to start one of the other tutorials while the above commands finish. Just remember to check back periodically as the 'completion' notifications of background jobs can be easy to miss.

Generic command
breseq -j 6 -r NC_012967.1.gbk -o run_output/<XX>K <ID>_1.fastq <ID>_2.fastq &> logs/<XX>K.log.txt &
partpuprose
-j 6use 6 processors/threads .. this both speeds our run up, and would cause it to fail if we were using the module version of bowtie2
-o run_output/<xx>kdirects all output to the run_output directory, AND creates a new directory with 2 digits (<XX>) followed by a K for individual samples data. If we don't include the second part referencing the individual sample, breseq would write the output from all of the runs on top of one other. The program will undoubtedly get confused, possibly crash, but definitely not be analyzable
<ID>this is just used to denote read1 and or read2 ... note that in our acutal commands they reference the fastq files, and are supplied without an option
&> <XX>00K.log.txtRedirect both the standard output and the standard error streams to a file called <XX>00k.log.txt. and put that file in a directory named logs. The &> are telling the command line to send the streams to that file.
&run the preceding command in the background. This is required so all the commands will run at once

Why did we use -j 6?

  1. Each lonestar5 compute node has 48 processors available.
  2. We have 7 samples to run, so by requesting 6 processors, we allow all 7 samples to start at the same time leaving us with 6 unused processors.
  3. If we had requested 7 processors for each sample, only 6 samples would start initially and the 7th would start after the first finishes.
  4. These are similar to the considerations you have to make with the job submission system we will go over at the end of tomorrow's class.


Using gdtools to compare samples.

Assuming everything is running as it should (you should ask if you are not sure), we need to start exploring the gdtools commands. gdtools is a series of commands designed to function on "genome diff" (.gd) files (which are the main data type produced by breseq which is used by downstream analysis) that come packaged with breseq. As we have seen with numerous other programs throughout the course typing gdtools by itself will return a list of options. In the case of gdtools, it returns the list of subcommands that are available. Each has its own use, but for the purpose of this tutorial we are interested in comparing the output of multiple samples to one another. 

 What sub command do you think will accomplish this?

# from the gdtools output we see that:

ANNOTATE (or COMPARE)  annotate the effects of mutations and compare multiple samples

Since gdtools compare seems to be the command and subcommand that we wanted to run to generate the comparison. Try typing just that to get a list of expected arguments.

 What information do we need to give to this command to generate the output we want (hints inside)?

The example command shows:

gdtools ANNOTATE/COMPARE [-o annotated.html] -r reference.gbk input.1.gd [input.2.gd ... ]

note that things displayed in the [] are optional.

In our command we sent the output of each sample to its own folder using the -o run_output/<XX>K portion of the command line. Among the multiple directories that breseq created within each output directory, you will find one named output which contains a file named output.gd. These are the files that we need to feed into the gdtools compare command (along with the NC_012967.1.gbk reference file) to generate the comparison table we want. In this course we have mentioned that using consistent nomenclature is important which it is, but now we come to a point where the consistency has created a problem. Because all of the files are named "output.gd" we can neither move all of the commands into a single folder (they would overwrite one another as they have the same name) or the resulting table would display sample IDs as "output" "output" "output" ... not the most helpful information to say the least.

To circumvent that we will need to rename each file to something more meaningful (say the name of the directory that the output/output.gd file was found in (aka the sample name)). 

The remaining portion of this tutorial requires the runs to have completed

You can also check how many have finished with this command
cd $SCRATCH/GVA_breseq_multi-sample
tail -n 1 logs/*|grep -c "+++   SUCCESSFULLY COMPLETED"

Until the above returns 7 the following commands will not work correctly.

While you wait consider going back to the intro breseq tutorial to rerun the sample that you took through read QCmappingSNV calling, and visualization. or skip to the bottom of the page for other potentually useful tutorials.

Grouping .gd files together


Handy command to move into the run_output directory, and copy all those .gd file to a new folder with a new name
cd $SCRATCH/GVA_breseq_multi-sample/run_output
for d in *;do cp -i $d/output/output.gd $d.gd;done;mkdir gd_files;mv -i *.gd gd_files
mv gd_files ..

This will copy the .gd file from each run into a new directory and then move that directory up 1 level so that it will be easier to use when making the compare table

Generating a compare table

From the information above we can construct the following command to generate the compare table
cd $SCRATCH/GVA_breseq_multi-sample/
gdtools compare -r NC_012967.1.gbk -o multi-sample-compare-table.html gd_files/*.gd

where:

  • -r ... is the reference
  • -o ... is the output table we want
  • gd_files/*.gd lists all the samples we want to have on our table

As the .html ending suggests, it will be an html output similar to the other breseq output you have seen and thus needs a web browser to be able to view it. Therefore you should use the scp command to transfer the multi-sample-compare-table.html file back to your local computer. Remember assistance with the scp command can be found here.

Evaluating compare table

Once you open the file on your local computer consider the following broad explanations:

  1. Each row represents a unique mutation
  2. The first 2 columns state what the position in the genome is, and what mutation ocurred
  3. Columns 3-9 represent each individual sample
  4. the remainder of the columns provide information from the annotation describing what genes are hit, and what the effect of the mutation is

As columns 3-9 show an increase in evolutionary time (from 2,000 to 20,000 generations), you will notice multiple examples where the first few columns are empty but the last 2,3,4 columns show 100% showing that the mutation was not detected until a certain time point then arose before the next sampling, and remained in the population to the end of our time course.

Can you find any examples of mutations that arose, but were out competed by the end of the evolution experiment? (do you understand what those should look like)

One downside of compare tables is that while it shows what mutations breseq called, you lose the ability to click on individual mutations to understand how/why that mutation was called. Therefore while compare tables are great, they are nearly always used in conjuncture with evaluating the individual breseq output as well. The next section will deal with a helper script to more quickly export multiple samples.


Exporting multiple breseq results using a single .tar.gz file

I have a small shell script for dealing with breseq data that copies, renames, compresses, groups each samples output directory into a single new directory, and compresses that directory so you can easily transfer it. Where can you get ahold of such a time saving script you ask? In the BioITeam directories of course!

Using Dan's most commonly used shell script for breseq data
cd $SCRATCH/GVA_breseq_multi-sample/run_output
ls
ls ..
export-breseq.sh
ls
ls ..

The above code block:

  1. moves to the directory that contains the samples we just ran through breseq
  2. shows you what is in that directory
  3. shows you what is in the parent directory
  4. executes the bash script described above
  5. shows you the new output is in the parent directory not the current directory

Using the SCP tutorial if necessary transfer the compressed archive 05_Output_Export.tar.gz back to your computer.

Inflating compressed archives of breseq results.

Command to type in the local terminal window AFTER the scp command
# scp command ... 05_Output_Export.tar.gz .
tar -xvzf 05_Output_Export.tar.gz
cd 05_Output_Export
ls

Now you will see 7 .tar.gz files in the directory, and you can extract each of them 1 at a time by using the tar -xvzf <FILE> command. This seems acceptable for 7 samples but not when dealing with 100s of samples. The 1 liner below will extract .tar.gz files sequentially:

for f in *.tar.gz;do tar xvzf $f;done

This command will extract all files ending in .tar.gz regardless of their origin. If you find yourself doing this routinely you may want to turn this command into a small bash script as i have done. This is as simple as the following 2 commands and moving the detar.sh file to a location in your $PATH. Remember you can check what directories are in your $PATH by typing 'echo $PATH'

echo "for f in *.tar.gz;do tar xvzf $f;done" > detar.sh
chmod +x detar.sh

This command will work on either your personal computer or TACC the same way, but in my experience, the export-breseq.sh is used on TACC while detar.sh is used on my local computer.

Comparing breseq output to compare table

Now you can click through each individual sample's output to see the different mutations as you could in the intro to breseq tutorial.


Other useful tutorials

Data analysis with breseq (like all other programs) is only as good as the data that goes into it. The MultiQC and Trimmomatic tutorials work well upstream of breseq, while the identification of novel DNA elements may be useful for things such as trying to identify unknown plasmids and makes use of the genome assembly tutorial.

Alternatively, again consider going back to the intro breseq tutorial to rerun the sample that you took through read QC, mapping, SNV calling, and visualization.



Back to the GVA2020 page