...
First, we'll run breseq on a small data set to be sure that it is installed correctly and to get a taste for what the output looks like. This sample is a mixed population of bacteriophage lambda that was co-evolved in the lab with its E. coli hosts.
...
Environment
To set your profile up to run breseq, we need to add "module load bowtie/2.1.0" to your profile.
Code Block | ||||
---|---|---|---|---|
| ||||
cdh #move to your home directory
echo "module load bowtie/2.1.0" >> .profile #this command updates your profile to automatically load the bowtie module |
After you've completed these commands, log out of Lonestar and log back in so that your updated profile is loaded.
Data
The data files for this example are in the path:
Code Block |
---|
$BI/ngs_course/lambda_mixed_pop/data
|
Copy the files in this directory to a new directory called BDIB_breseq_lambda in your $SCRATCH space and cd into it.
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
cds
mkdir BDIB_breseq_lambda
cp $BI/ngs_course/lambda_mixed_pop/data/* BDIB_breseq_lambda
cd BDIB_breseq_lambda
ls |
If the copy worked correctly you should see the following 2 files:
File Name | Description | Sample |
---|---|---|
lambda_mixed_population.fastq | Single-end Illumina 36-bp reads | Evolved lambda bacteriophage mixed population genome sequencing |
lambda.gbk | Reference Genome | Bacteriophage lambda |
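If you would rather check for the files than eyeball the ls output, a small helper can do it. This is only a sketch: check_files is a hypothetical function written for this example, not part of breseq or the TACC tools. It treats a file as present only if it exists and is non-empty.

```shell
# check_files DIR FILE... : report each expected file that is missing or empty in DIR.
# Hypothetical helper for illustration; returns the number of missing files.
check_files() {
    dir=$1
    shift
    missing=0
    for f in "$@"; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f"
            missing=$((missing + 1))
        fi
    done
    return $missing
}

# Check the current directory for the two files used in this example.
check_files . lambda.gbk lambda_mixed_population.fastq && echo "all files present"
```

Run it from inside BDIB_breseq_lambda. If a file is reported missing, repeat the cp command above.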
Running breseq
Because this data set is relatively small (roughly 100x coverage of a 48,000 bp genome), a breseq run will take < 5 minutes. Submit this command to the TACC development queue or run on an idev node.
Code Block | ||||
---|---|---|---|---|
| ||||
idev #idev starts an "interactive development" mode which allows you to run computationally intensive tasks
breseq -j 12 -r lambda.gbk lambda_mixed_population.fastq > log.txt
|
A bunch of progress messages will stream by during the breseq run. They would be lost on the compute node if not for the redirection to the log.txt file. The output text details several steps in a pipeline that combines mapping (using SSAHA2), variant calling, annotating mutations, etc. You can examine them by peeking in the log.txt file as your job runs using tail -f. The -f option means to "follow" the file and keep giving you output from it as it gets bigger. You will need to wait for your job to start running before you can tail -f log.txt.
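To see the difference between a one-off look at a log and following it, here is a minimal sketch that uses a stand-in log file instead of a real breseq run. The "step N done" lines are made up for the demonstration and are not breseq output.

```shell
# Stand-in for a running job: append a few made-up progress lines to a temporary log.
log=$(mktemp)
for step in 1 2 3 4 5; do
    echo "step $step done" >> "$log"
done

# One-off look: print only the last two lines and return to the prompt.
last_lines=$(tail -n 2 "$log")
echo "$last_lines"

# Following: this would keep printing new lines as they are appended,
# until you press control-c. It is commented out so the sketch terminates.
# tail -f "$log"
```

For the real run, replace the temporary file with log.txt in your BDIB_breseq_lambda directory.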
...
Expand | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||||
To figure out the full path to your file, you can use the pwd command.
You can then copy and paste that information (in the correct position) into the scp command on the desktop's command line:
|
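As a rough sketch of how the pieces fit together, the snippet below runs on TACC inside your breseq run directory and prints an scp command that you could paste into a terminal on your desktop. The host name and the fallback user name are assumptions for illustration; substitute the host you actually ssh into and your own login.

```shell
# Run this on TACC, from the directory that contains the breseq "output" directory.
remote_user=${USER:-username}           # assumption: your TACC login name
remote_host=lonestar.tacc.utexas.edu    # assumption: replace with the host you log in to

# Build the command for the desktop side: copy the remote output directory
# into the desktop's current directory (the trailing dot).
scp_cmd="scp -r ${remote_user}@${remote_host}:$(pwd)/output ."
echo "$scp_cmd"
```

The printed line is meant to be run on your desktop, not on TACC.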
Navigate to the output directory in the Finder and open the file called index.html. This will open the results in a web browser window, where you can click through the different mutations and other information and see the evidence supporting each one.
Example 2: E. coli data sets
Now we'll try running breseq on some Escherichia coli genomes from an evolution experiment. These files are larger. You don't want to run them in interactive mode. We'll submit them to the TACC queue all at once.
Data
The data files for this example are in the path:
Code Block |
---|
$BI/ngs_course/ecoli_clones/data
|
File Name | Description | Sample |
---|---|---|
SRR030252_1.fastq SRR030252_2.fastq | Paired-end Illumina 36-bp reads | 0K generation evolved E. coli strain |
SRR030253_1.fastq SRR030253_2.fastq | Paired-end Illumina 36-bp reads | 2K generation evolved E. coli strain |
SRR030254_1.fastq SRR030254_2.fastq | Paired-end Illumina 36-bp reads | 5K generation evolved E. coli strain |
SRR030255_1.fastq SRR030255_2.fastq | Paired-end Illumina 36-bp reads | 10K generation evolved E. coli strain |
SRR030256_1.fastq SRR030256_2.fastq | Paired-end Illumina 36-bp reads | 15K generation evolved E. coli strain |
SRR030257_1.fastq SRR030257_2.fastq | Paired-end Illumina 36-bp reads | 20K generation evolved E. coli strain |
The summary page provides useful information about the percent of reads mapping to the genome as well as the overall coverage of the genome. The Mutation Predictions page is where most of the analysis time is spent, determining which mutations are important (and, more rarely, which are inaccurate).
Feel free to click around through the different mutations and examine their evidence when you have time, but first start the next breseq run so that it can be in the queue and completing while you look at the data. We will go over the different types of mutations and the evidence for them as a group towards the end of class today, but additional information on analyzing the output can be found at the following reference:
- Deatherage, D.E., Barrick, J.E. (2014) Identification of mutations in laboratory-evolved microbes from next-generation sequencing data using breseq. Methods Mol. Biol. 1151:165-188. «PubMed»
Example 2: E. coli data sets
Now we'll try running breseq on some Escherichia coli genomes from an evolution experiment. These files are larger. You don't want to run them in interactive mode. We'll submit them to the TACC queue all at once.
Data
The data files for this example are in the following path. Go ahead and copy them to a new folder in your $SCRATCH directory called BDIB_breseq_coli_clones:
Code Block | ||
---|---|---|
| ||
$BI/ngs_course/ecoli_clones/data
|
File Name | Description | Sample |
---|---|---|
SRR030252_1.fastq SRR030252_2.fastq | Paired-end Illumina 36-bp reads | 0K generation evolved E. coli strain |
NC_012967.1.gbk | Reference Genome | E. coli B str. REL606 |
Running breseq on TACC
breseq may take an hour or two to run on these sequences, so you should submit to the serial queue instead of the development queue on TACC, and give a run time of 4 hours as a conservative estimate.
Since we have multiple data sets, this example will also give us an opportunity to run several commands as part of a single job on TACC, and use multiple cores on a single processor.
You'll want each command to look something like this:
Code Block |
---|
login1$ breseq -r NC_012967.1.gbk -o output_00K SRR030252_1.fastq SRR030252_2.fastq
|
Notice the additional -o option, which specifies that all of the output for a run should be put in the specified directory instead of the current directory. If we don't include this, then we will end up writing the output from all of the runs on top of one another. The program will undoubtedly get confused, possibly crash, and generally it will be a mess.
Hint: It is often a good idea to try running a command yourself before you submit it to the TACC queue, just to be sure you have all the options and paths correct. Otherwise you will have to wait until it starts running on TACC to find out that it failed immediately, which can be frustrating. Try running the command above. You can use control-c to cancel the command after you're sure it started ok.
...
title | Example commands file |
---|
...
SRR030253_1.fastq SRR030253_2.fastq | Paired-end Illumina 36-bp reads | 2K generation evolved E. coli strain |
SRR030254_1.fastq SRR030254_2.fastq | Paired-end Illumina 36-bp reads | 5K generation evolved E. coli strain |
SRR030255_1.fastq SRR030255_2.fastq | Paired-end Illumina 36-bp reads | 10K generation evolved E. coli strain |
SRR030256_1.fastq SRR030256_2.fastq | Paired-end Illumina 36-bp reads | 15K generation evolved E. coli strain |
SRR030257_1.fastq SRR030257_2.fastq | Paired-end Illumina 36-bp reads | 20K generation evolved E. coli strain |
SRR030258_1.fastq SRR030258_2.fastq | Paired-end Illumina 36-bp reads | 40K generation evolved E. coli strain |
NC_012967.1.gbk | Reference Genome | E. coli B str. REL606 |
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
cds
mkdir BDIB_breseq_coli_clones
cp -v $BI/ngs_course/ecoli_clones/data/* BDIB_breseq_coli_clones
cd BDIB_breseq_coli_clones |
Running breseq on TACC
breseq may take an hour to run on these sequences, so you should submit to the normal queue instead of the development queue on TACC, and give a run time of 3 hours as a conservative estimate. Since we have multiple data sets, this example will also give us an opportunity to run several commands as part of a single job on TACC, and use multiple cores on a single processor. You'll want each command (line) in the commands file to look something like this:
Examining breseq results
As before, copy the data back to your computer and examine the HTML output in a web browser.
Exercise: Can you figure out how to archive all of the output directories and copy only those files (and not all of the very large intermediate files) back to your machine, without deleting any files?
...
Code Block |
---|
tar -cvzf output.tgz output_*/output
|
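To convince yourself that the wildcard picks up only the output subdirectories, you can try the same tar pattern on a throwaway directory tree first. The sketch below builds mock run directories in a temporary location. The file names inside them are placeholders, not real breseq output.

```shell
# Build a small mock layout: two run directories, each with output/ and data/.
demo=$(mktemp -d)
cd "$demo"
for gen in 00K 20K; do
    mkdir -p "output_$gen/output" "output_$gen/data"
    echo "<html></html>" > "output_$gen/output/index.html"        # placeholder report
    echo "large intermediate file" > "output_$gen/data/reads.bam" # placeholder data
done

# Archive only the output/ subdirectory of every run directory.
tar -czf output.tgz output_*/output

# List the archive contents without extracting anything.
archived=$(tar -tzf output.tgz)
echo "$archived"
```

The listing should contain the output subdirectories only. The original directories are left untouched, because tar -c copies files into the archive rather than moving them.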
Click around in the results.
Optional: breseq utility commands
breseq includes a few utility commands that can be used on any BAM/FASTA set of files to draw an HTML read pileup or a plot of the coverage across a region.
It's easiest to run these commands from inside the main output directory (e.g., output_20K) of a breseq run. They use information in the data directory.
Code Block |
---|
breseq bam2aln NC_012967:237462-237462
breseq bam2cov NC_012967:2300000-2320000
|
Additionally, the files in the data directory can be loaded in IGV if you copy them back to your desktop.
Optional Exercise: Running breseq in mixed population mode
The phage lambda data set you examined is actually a mixed population of many different phage lambda genotypes descended from a clonal ancestor. You ran breseq in a mode where it predicted consensus mutations in what it thinks is one uniform haploid genome. Actually, some individuals in the population have certain mutations and others do not, so you might have noticed when you looked at some of the alignments that there was a mixture of bases at a position.
We will talk more about analyzing mixed population data to predict rare variants in a later lesson. However, if you're curious you can now experiment with running breseq in a mode where it estimates the frequencies of different mutations in the population. This process is most accurate for single nucleotide variants. Mutations at intermediate frequencies are not (yet) predicted for all classes of mutations, such as large structural variants.
Code Block |
---|
login1$ breseq --polymorphism-prediction --polymorphism-no-indels -r lambda.gbk lambda_mixed_population.fastq
|
The option --polymorphism-prediction turns on these mixed population predictions. The option --polymorphism-no-indels turns off predictions of small insertions and deletions (which don't work as well for reasons too complicated to explain here). You're welcome to also try it without this option.
Copy the resulting output directory back to your computer and examine the HTML output in a web browser. Compare it to the output from before.
...
Code Block |
---|
breseq -r NC_012967.1.gbk -o output_15K SRR030256_1.fastq SRR030256_2.fastq
breseq -r NC_012967.1.gbk -o output_20K SRR030257_1.fastq SRR030257_2.fastq
breseq -r NC_012967.1.gbk -o output_40K SRR030258_1.fastq SRR030258_2.fastq |
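Because the lines of the commands file differ only in the generation label and the read file prefix, one option is to generate the file with a loop instead of typing each line. The following is a sketch under stated assumptions: the label-to-accession pairs come from the file table, the two-digit labels 02K and 05K are a naming choice made here to match output_00K, and the -j 12 and &> redirection follow the command template used in this tutorial.

```shell
# Assumed mapping of generation label to read file prefix, taken from the file table.
samples="00K:SRR030252 02K:SRR030253 05K:SRR030254 10K:SRR030255 15K:SRR030256 20K:SRR030257 40K:SRR030258"

# Start with an empty commands file, then append one breseq line per sample.
: > commands
for pair in $samples; do
    gen=${pair%%:*}   # text before the colon, e.g. 15K
    run=${pair##*:}   # text after the colon, e.g. SRR030256
    echo "breseq -j 12 -r NC_012967.1.gbk -o output_${gen} ${run}_1.fastq ${run}_2.fastq &> ${gen}.log.txt" >> commands
done

# Review the result before building the launcher script.
cat commands
```

Each line gets its own -o directory and its own log file, so the runs do not overwrite one another.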
Once you have your commands file ready, you need to create your launcher.sge script.
...
Code Block |
---|
launcher_creator.py -q serial -t 4:00:00 ... <your other options>
qsub launcher.sge
|
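Code Block |
---|
breseq -j 12 -r NC_012967.1.gbk -o output_<XX>K SRR030252_1.fastq SRR030252_2.fastq &> <XX>K.log.txt
|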
Notice we've added some additional options:
part | purpose |
---|---|
&> <XX>K.log.txt | Redirects both the standard output and the standard error streams to a file called <XX>K.log.txt. It is important that you replace the <XX> so that each command writes to a different file, but KEEP the &> because it tells the command line to send the streams to that file. |
-o output_<XX>K | Specifies that all of the output for a run should be put in the named directory instead of the current directory. If we don't include this (and change the <XX>), then we will end up writing the output from all of the runs on top of one another. The program will undoubtedly get confused, possibly crash, and generally it will be a mess. |
Tip | ||
---|---|---|
| ||
It is often a good idea to try running a command yourself before you submit it to the TACC queue, just to be sure you have all the options and paths correct. Otherwise you will have to wait until it starts running on TACC to find out that it failed immediately, which can be frustrating. Try running the command above on the terminal before using launcher_creator.py. If you include the &> option at the end, you will see nothing happen, because all of the output is being directed to the log file. Count to ten slowly, then use control-c to cancel the command, use ls to make sure the output file was created, and use tail or cat to make sure that the program was running rather than crashing. |
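The "count to ten, cancel, then check the log" routine can also be scripted. The function below is a hypothetical helper written for this example, not something provided by TACC or breseq. It starts a command in the background with its output redirected to a log, waits a few seconds, and reports whether the command was still running or had already exited.

```shell
# smoke_test SECONDS LOGFILE COMMAND [ARGS...]
# Hypothetical helper: returns 0 if COMMAND is still running after SECONDS
# (it is then stopped), or 1 if it had already exited.
smoke_test() {
    secs=$1
    log=$2
    shift 2
    "$@" > "$log" 2>&1 &    # same effect as appending "&> LOGFILE" to the command
    pid=$!
    sleep "$secs"
    if kill -0 "$pid" 2>/dev/null; then
        kill "$pid"                     # scripted equivalent of control-c
        wait "$pid" 2>/dev/null || true
        echo "still running after ${secs}s, so it started ok; last log lines:"
        tail -n 5 "$log"
        return 0
    fi
    status=0
    wait "$pid" || status=$?
    echo "exited within ${secs}s with status ${status}; last log lines:"
    tail -n 5 "$log"
    return 1
}

# Example call (commented out because it needs breseq and the data files):
# smoke_test 10 00K.log.txt breseq -j 12 -r NC_012967.1.gbk -o output_00K SRR030252_1.fastq SRR030252_2.fastq
```

A command that exits within the waiting period is reported as a failure here, which is the right reading for breseq on this data since a real run takes far longer than ten seconds.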
Expand | |||||||
---|---|---|---|---|---|---|---|
| |||||||
This will likely sit for some time in the launcher queue, making it a good opportunity to work through the interrogating launcher queue portion of our Linux tutorial if you didn't get the opportunity to earlier. |
Examining breseq results
Exercise: Can you figure out how to archive all of the output directories and copy only those files (and not all of the very large intermediate files) back to your machine, without deleting any files?
Expand | |||||||||
---|---|---|---|---|---|---|---|---|---|
| |||||||||
You will want to use the tar command again, but you will need to use a wildcard to specify what goes into the compressed file, and only the output directories within each of the wildcard-matched directories.
|
Expand | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||||
To figure out the full path to your file, you can use the pwd command.
You can then copy and paste that information (in the correct position) into the scp command on the desktop's command line:
|
Optional: Install breseq
We have already installed breseq in $BI/bin for the purpose of this tutorial. You are welcome to continue using it for your own work, or use the installation options we present here to install or update to newer versions as needed.
...