Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

As a reminder, the read files we were working with in the bowtie2 and SNV tutorials were originally downloaded from the NCBI Sequence Read Archive via the corresponding European Nucleotide Archive record. They are Illumina Genome Analyzer sequencing of a paired-end library from a (haploid) E. coli clone that was isolated from a population of bacteria that had evolved for 20,000 generations in the laboratory as part of a long-term evolution experiment (Barrick et al, 2009). The reference genome is the ancestor of this E. coli population (strain REL606), so we expect the read sample to have differences from this reference that correspond to mutations that arose during the evolution experiment. If that description sounds like what breseq was made for ... breseq was literally developed at least in part to anlyze this data.

data

Like we did yesterday we'll start by downloading our reads and reference into a new folder on scratch:

Code Block
languagebash
titleRemember that to copy an entire folder requires the use of the recursive (-r) option.
collapsetrue
cds
cp -r $BI/gva_course/mapping/data GVA_breseq_comparison_to_bowtieSamtoolsTutorials
cd GVA_breseq_comparison_to_bowtieSamtoolsTutorials
ls

Running breseq

Code Block
languagebash
titlebreseq command
36
breseq -r NC_012967.1.gbk SRR030257_1.fastq SRR030257_2.fastq.gz

As mentioned early in the course, some programs can actually take compressed fastq files in as input and breseq is 1 such program. In the above example, it actually takes 2 fastq files in, 1 as a non-compressed file, the other as a gzipped file. Otherwise this is still the same basic command as we used in the first command that uses the bare minimum of inputs: a reference file, and read file(s).

In the advanced breseq tutorials, we'll start working with more complex options, such as storing input reads in 1 directory and breseq output in a separate directory, installing your own version of breseq on TACC so you aren't reliant on the BioITeam version, enabling multiple processors to speed up all the breseq runs, comparing across multiple samples, and more.

Now that you have this command running (estimated to take ~20 minutes) I suggest:

  1. Going back up to the lambda phage run you transferred back to your computer above and interrogate the results some as they will be fundamentally the same types of mutations and output that you will see for this sample, just on a smaller scale. 
  2. Reading below to anticipate what the results here will look like and how they will compare to the list of 40 variants you identified using samtools, and visualized with IGV.
  3. Reading the next section of the tutorial which deals with rerunning breseq on the lambda phage data in polymorphism mode (and the differences that makes in the results)
  4. Going back to the course home page and deciding what tutorial you'd like to work on next and begin reading through that material. 

...

Note
titleRunning this data set not recommended

Running this data set yourself is not actually recommended, or at least not recommended until you have completed some of the advanced breseq tutorials for 3 reasons:

  1. At best, this data set expected to take ~1 hour to complete.
  2. Version conflicts with bowtie2 between TACC and breseq prevent accessing the 'best' time.
  3. Potential issue with the launcher module.

3 can be worked around fairly simply as can be seen in the multiqc tutorial 2 will be handled in breseq installation tutorial after which the data will still take ~1hr.

Therefore while the information of how to run this data is below, it is recommended that the majority of you focus on just downloading the results of that run when I did it.

Code Block
languagebash
titlepartial scp command to copy to the current directory of your local computer
scp <user>@ls5.tacc.utexas.edu:/corral-repl/utexas/BioITeam/ngs_course/lambda_mixed_pop/breseq_SRR030257_run_output_folder.tar.gz .


data

Like we did yesterday we'll start by downloading our reads and reference into a new folder on scratch:

Code Block
languagebash
titleRemember that to copy an entire folder requires the use of the recursive (-r) option.
collapsetrue
cds
cp -r $BI/gva_course/mapping/data GVA_breseq_comparison_to_bowtieSamtoolsTutorials
cd GVA_breseq_comparison_to_bowtieSamtoolsTutorials
ls

Running breseq

Code Block
languagebash
titlebreseq command
breseq -j 48 -r NC_012967.1.gbk SRR030257_1.fastq SRR030257_2.fastq.gz

As mentioned early in the course, some programs can actually take compressed fastq files in as input and breseq is 1 such program. In the above example, it actually takes 2 fastq files in, 1 as a non-compressed file, the other as a gzipped file. The other thing we have introduced is the -j 48 option which should allow breseq to access 48 processors. Unfortunately if you have not completed the breseq installation tutorial, there is a version conflict between the module version of bowtie2 and breseq and will therefore throw an error when breseq tries to call bowtie2.

As noted above, the run will still take ~1 hour even after you have installed your own version of breseq and therefore it is recommended that you use the launcher_creator.py script to send this job to the queue after you have completed the installation tutorial.

Anchor
scpData
scpData
evaluating output

Here are the comments from the IGV tutorial evaluating the data:

Expand
titleSome interesting example coordinates
  • Expand
    titleCoordinate 161,041. What gene is this in and what is the effect on the protein sequence?

    Gene is pcnB, mutation is a snp

  • Expand
    titleCoordinate 3,248,957. What gene is this in and what is the effect on the protein sequence?

    Gene is infB, mutation is a snp

  • Expand
    titleCoordinate 3,894,997. What type of mutation is this?

    Deletion of the rbsD gene

  • Expand
    titleCheck out the rbsA gene region? What's going on here?

    There was a large deletion. Can you figure out the exact coordinates of the endpoints?

  • Expand
    titleNavigate to coordinate 3,289,962. Compare the results for different alignment programs and settings. Can you explain what's going on here?

    There is a 16 base deletion in the gltB gene reading frame.

  • Expand
    titleWhat is going on in the pykF gene region? You might see red read pairs. What does that mean? Can you guess what type of mutation occurred here?

    The read pairs are discordantly mapped. There was an insertion of a new copy of a mobile genetic element (an IS150 element) that exists at other locations in the reference sequence.

  • See if you can find more interesting locations. Recall in the IGV tutorial we had ~190 mutations, here we have ~40. This highlights much of the statistical testing going on under the hood of breseq to separate the signal from the noise.


Additional tutorials dealing with breseq

...