Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

First, we'll run breseq on a small data set to be sure that it is installed correctly, and to get a taste for what the output looks like. This sample is a mixed population of bacteriophage lambda that was co-evolved in lab with its E. coli hosts.

Data

The data files for this example are in the path:

Code Block
$BI/ngs_course/lambda_mixed_pop/data

Copy this directory to a new directory called BDIB_breseq in your $SCRATCH space and cd into it.

Code Block
languagebash
titleClick here for the solution
collapsetrue
cds
mkdir BDIB_breseq_lambda
cp $BI/ngs_course/lambda_mixed_pop/data/* BDIB_breseq_lambda
cd BDIB_breseq_lambda
ls 

If the copy worked correctly you should see  the following 2 files:

File Name

Description

Sample

lambda_mixed_population.fastq

Single-end Illumina 36-bp reads

Evolved lambda bacteriophage mixed population genome sequencing

lambda.gbk

Reference Genome

Bacteriophage lambda

Running breseq

Because this data set is relatively small (roughly 100x coverage of a 48,000 bp genome), a breseq run will take < 5 minutes. Submit this command to the TACC development queue or run on an idev node.

Code Block
languagebash
titlebreseq commands for commands file
module load bowtie/2.1.0   #Breseq uses bowtie to map reads, so this module must be loaded before you run breseq
 
breseq -j 12 -r lambda.gbk lambda_mixed_population.fastq > log.txt

...

Environment

To set your profile up to run breseq, we need to add "module load bowtie/2.1.0" to your profile.

Code Block
languagebash
titleAdding bowtie to your profile
cdh  #move to your home directory
echo "module load bowtie/2.1.0" >> .profile  #this command updates your profile to automatically load the bowtie module

After you've completed these commands, exit lonestar and re log in to re run your profile.  

Data

The data files for this example are in the path:

Code Block
$BI/ngs_course/lambda_mixed_pop/data

Copy this directory to a new directory called BDIB_breseq in your $SCRATCH space and cd into it.

Code Block
languagebash
titleClick here for the solution
collapsetrue
cds
mkdir BDIB_breseq_lambda
cp $BI/ngs_course/lambda_mixed_pop/data/* BDIB_breseq_lambda
cd BDIB_breseq_lambda
ls 

If the copy worked correctly you should see  the following 2 files:

File Name

Description

Sample

lambda_mixed_population.fastq

Single-end Illumina 36-bp reads

Evolved lambda bacteriophage mixed population genome sequencing

lambda.gbk

Reference Genome

Bacteriophage lambda

Running breseq

Because this data set is relatively small (roughly 100x coverage of a 48,000 bp genome), a breseq run will take < 5 minutes. Submit this command to the TACC development queue or run on an idev node.

Code Block
languagebash
titlebreseq prep and commands
idev  #idev starts an "interactive development" mode which allows you to run computationally intensive tasks
 
breseq -j 12 -r lambda.gbk lambda_mixed_population.fastq > log.txt

A bunch of progress messages will stream by during the breseq run which would be lost on the compute node if not for the redirection to the log.txt file. The output text details several steps in a pipeline that combines the steps of mapping (using SSAHA2), variant calling, annotating mutations, etc. You can examine them by peeking in the log.txt file as your job runs using tail -f. The -f option means to "follow" the file and keep giving you output from it as it gets bigger. You will need to wait for your job to start running before you can tail -f log.txt.

...

Expand
titleClick here for commands file example and launcher_creator.py generator
Code Block
titleExample commands file
breseq -j 12 -r NC_012967.1.gbk -o output_00K SRR030252_1.fastq SRR030252_2.fastq &> 00K.log.txt
breseq -j 12 -r NC_012967.1.gbk -o output_02K SRR030253_1.fastq SRR030253_2.fastq &> 02K.log.txt
breseq -j 12 -r NC_012967.1.gbk -o output_05K SRR030254_1.fastq SRR030254_2.fastq &> 05K.log.txt
breseq -j 12 -r NC_012967.1.gbk -o output_10K SRR030255_1.fastq SRR030255_2.fastq &> 10K.log.txt
breseq -j 12 -r NC_012967.1.gbk -o output_15K SRR030256_1.fastq SRR030256_2.fastq &> 15K.log.txt
breseq -j 12 -r NC_012967.1.gbk -o output_20K SRR030257_1.fastq SRR030257_2.fastq &> 20K.log.txt
breseq -j 12 -r NC_012967.1.gbk -o output_40K SRR030258_1.fastq SRR030258_2.fastq &> 40K.log.txt
Code Block
launcher_creator.py -q normal -t 3:00:00 -n launcher -a "UT-2015-05-18" -m "module load bowtie/2.1.0"
qsub launcher.sge

 

Examining breseq results

...

This will likely sit for some time in the launcher que, making it a good opportunity to work through the interrogating launcher queue portion of our linux tutorial if you didn't get the opportunity to earlier.

 

Examining breseq results

Exercise: Can you figure out how to archive all of the output directories and copy only those files (and not all of the very large intermediate files) back to your machine? - without deleting any files?

...

Expand
titleHere are the commands we showed you for the previous example (with the trick of getting a single compressed output directory you just learned) to transfer so you don't have to scroll back and forth. See if you can remember how to do it without going back over the lesson.

To use scp you will need to run it in a terminal that is on your desktop and not on the remote TACC system. It can be tricky to figure out where the files are on the remote TACC system, because your desktop won't understand what $HOME, $WORK, $SCRATCH mean (they are only defined on TACC).

To figure out the full path to your file, you can use the pwd command in your terminal on TACC in the window that you ran breseq in (it should contain an "output" folder). Rather than copying the entire contents of the folder which can be rather large, we are going to add a twist of compressing the entire folder into a single compressed archive using the tar command so that the size will be smaller and it will transfer faster:

Code Block
languagebash
titleCommand to type in TACC
tar -czvf output.tar.gz output_*/output  # the czvf options in order mean Create, Zip, Verbose, Force
pwd

Then you can then copy paste that information (in the correct position) into the scp command on the desktop's command line:

Code Block
languagebash
titleCommand to type in the desktop's terminal window
scp -r <username>@lonestar.tacc.utexas.edu:<the_directory_returned_by_pwd>/output.tar.gz .
tar -xvzf output.tar.gz  # the new "x" option at the front means eXtract 

 

Click around in the results.

Optional: breseq utility commands

breseq includes a few utility commands that can be used on any BAM/FASTA set of files to draw an HTML read pileup or a plot of the coverage across a region.

It's easiest to run these commands from inside the main output directory (e.g., output_20K) of a breseq run. They use information in the data directory.

Code Block
breseq bam2aln NC_012967:237462-237462
breseq bam2cov NC_012967:2300000-2320000

Additionally, the files in the data directory can be loaded in IGV if you copy them back to your desktop.

Optional Exercise: Running breseq in mixed population mode

The phage lambda data set you examined is actually a mixed population of many different phage lambda genotypes descended from a clonal ancestor. You ran breseq in a mode where it predicted consensus mutations in what it thinks is one uniform haploid genome. Actually, some individuals in the population have certain mutations and others do not, so you might have noticed when you looked at some of the alignments that there was a mixture of bases at a position.

We will talk more about analyzing mixed population data to predict rare variants in a later lesson. However, if you're curious you can now experimental with running breseq in a mode where it estimates the frequencies of different mutations in the population. This process is most accurate for single nucleotide variants. Mutations at intermediate frequencies are not (yet) predicted for all classes of mutations like large structural variants.

Code Block
login1$ breseq --polymorphism-prediction --polymorphism-no-indels -r lambda.gbk lambda_mixed_population.fastq 

The option --polymorphism-prediction turns on these mixed population predictions. The option --polymorphism-no-indels turns off predictions of small insertions and deletions (which don't work as well for reasons too complicated to explain here). You're welcome to also try it without this option.

Copy the resulting output directory back to your computer and examine the HTML output in a web browser. Compare it to the output from before.

...

faster:

Code Block
languagebash
titleCommand to type in TACC
tar -czvf output.tar.gz output_*/output  # the czvf options in order mean Create, Zip, Verbose, Force
pwd

Then you can then copy paste that information (in the correct position) into the scp command on the desktop's command line:

Code Block
languagebash
titleCommand to type in the desktop's terminal window
scp -r <username>@lonestar.tacc.utexas.edu:<the_directory_returned_by_pwd>/output.tar.gz .
tar -xvzf output.tar.gz  # the new "x" option at the front means eXtract 

 

Click around in the results.

Optional: breseq utility commands

breseq includes a few utility commands that can be used on any BAM/FASTA set of files to draw an HTML read pileup or a plot of the coverage across a region.

It's easiest to run these commands from inside the main output directory (e.g., output_20K) of a breseq run. They use information in the data directory.

Code Block
breseq bam2aln NC_012967:237462-237462
breseq bam2cov NC_012967:2300000-2320000

Additionally, the files in the data directory can be loaded in IGV if you copy them back to your desktop.

Optional: Install breseq

We have already installed breseq in $BI/bin for the purpose of this tutorial. You are welcome to continue using it for your own work, or use the installation options we present here to install or update to newer versions as needed.

...