...

Click through the different mutations and examine their evidence to see what kinds of mutations you can identify. If you can't tell what type of mutation a line represents, or how the images should help you understand the mutation, please don't hesitate to ask over Zoom.

Bacteriophage lambda data set revisited

Did you notice the name of the fastq file we used? lambda_mixed_population.fastq. As the name implies, this file comes from a mixed population of phage, though we did not include any information about that in the breseq command we used. As you clicked through the RA evidence, you may also have noticed that for some mutations a portion of the reads matched the reference rather than the variant. This is similar to our initial SNV tutorial, where variant calls were made in a consensus diploid mode that forced the program to decide between variants being at ~50% or 100%. breseq only works on haploid organisms, so it assumed that any variant that was present must have been present at 100%.

Since we know it was a mixed population, we can rerun the same fastq file against the same reference, add the flag for polymorphism mode (-p), and see how the results differ when we tell breseq that variants may exist at any frequency.
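
To make the "all or nothing" idea concrete, here is a toy sketch. It is not how breseq decides anything (breseq uses a statistical model); the function name call_variant, the 50% cutoff, and the read counts are all made up for illustration. It only shows how a variant's observed frequency gets lost when the only allowed answers are 0% and 100%.

```bash
#!/usr/bin/env bash
# Toy illustration only: the 50% cutoff below is an assumption for the example,
# not breseq's actual consensus-mode model.
# Usage: call_variant VARIANT_READS TOTAL_READS   (TOTAL_READS must be > 0)
call_variant() {
    awk -v v="$1" -v t="$2" 'BEGIN {
        freq = 100 * v / t                 # observed variant frequency in percent
        forced = (freq >= 50) ? 100 : 0    # only two answers are allowed
        printf "observed=%.1f%% forced_call=%d%%\n", freq, forced
    }'
}

call_variant 82 100   # hypothetical site: 82 of 100 reads carry the variant
call_variant 15 100   # hypothetical site: 15 of 100 reads carry the variant
```

In this toy version the first site is reported at 100% and the second disappears entirely, which is the kind of information polymorphism mode is meant to recover.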

Data, and running breseq

Code Block
languagebash
titleCommands to copy the input data from the first breseq run to a new folder and rerun breseq on the same fastq and reference file in polymorphism mode. Because this copy is between 2 scratch locations it is unlikely to cause problems, but remember to restart an idev node if you experience difficulties.
mkdir $SCRATCH/GVA_breseq_lambda_mixed_pop_polymode
cp $SCRATCH/GVA_breseq_lambda_mixed_pop/lambda* $SCRATCH/GVA_breseq_lambda_mixed_pop_polymode
cd $SCRATCH/GVA_breseq_lambda_mixed_pop_polymode
breseq -p -r lambda.gbk lambda_mixed_population.fastq

Evaluating output

Again, you will need to transfer files back to your local computer to visualize the differences. The exact same compression command will work because the output folder name is the same. For that reason, be careful where you put the transferred file on your local computer so that you don't overwrite the previously transferred files; consider adding _polymode to the name of the directory you transfer to, as we did in our command above. Again, help with SCP can be found here.

Code Block
languagebash
titleSuggested compression command to prepare a single compressed directory for transfer. This is similar to what we used for the IGV tutorial
tar -czvf output.tar.gz output  # the czvf options mean Create, Zip (gzip), Verbose, File (the archive name follows)
pwd # helpful for the Remote (Right) terminal window

Code Block
languagebash
titleCommand to type in the desktop's terminal window to decompress the transferred archive after running the scp command
# scp command first ...
tar -xvzf output.tar.gz  # the new "x" option at the front means eXtract
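
One way to avoid overwriting the earlier results is to extract the archive into its own directory with tar's -C option. The sketch below is self-contained and uses a temporary directory with a placeholder file standing in for the real breseq output; the directory name lambda_polymode is only a suggestion, and the scp step is skipped.

```bash
#!/usr/bin/env bash
# Stand-in for the remote breseq "output" folder (placeholder contents only).
workdir=$(mktemp -d)
mkdir -p "$workdir/remote/output"
echo "placeholder report" > "$workdir/remote/output/index.html"

# Remote side: c = create, z = gzip, f = archive file name follows.
# -C changes into the given directory before adding "output".
tar -czf "$workdir/output.tar.gz" -C "$workdir/remote" output

# Local side (after scp): extract into a directory named for this run,
# so an existing "output" folder from the first run is left untouched.
dest="$workdir/local/lambda_polymode"   # hypothetical directory name
mkdir -p "$dest"
tar -xzf "$workdir/output.tar.gz" -C "$dest"

ls "$dest/output"
```

On your own computer you would replace the temporary paths with wherever you saved the transferred output.tar.gz.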

When you look at the summary statistics page, none of the output will have changed until you get quite far down the page, where you will find that this time breseq was run in full polymorphism mode. On the mutation predictions page, you now see more total mutations (most of the new ones at frequencies below 20%), a new column listing the frequency of each mutation (with variants below 100% shown in green), and, if you look closely, some mutations that were previously listed in white at 100% now listed at less than 100% (e.g., 82.0%). Hopefully, from the discussions we've had so far, it makes sense that mutations that are real but at low frequency would be mistaken for 0% rather than 100% when those are the only 2 choices. Again, feel free to get my attention if you have any questions about the output, such as why there are so many mutations at 100% even in a mixed population sample.
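
If you prefer the command line to clicking through the HTML, something like the following sketch can list which predicted mutations fall below 100%. It runs on a small made-up, tab-separated file that is only loosely modeled on breseq's GenomeDiff output; the exact field layout and the frequency= key are assumptions here, so check a real output file and the breseq documentation before relying on it. The function name list_polymorphic is also made up.

```bash
#!/usr/bin/env bash
# Toy file loosely modeled on a GenomeDiff file (assumed layout:
# type, id, evidence ids, sequence id, position, ..., key=value fields).
gd=$(mktemp)
printf 'SNP\t1\t10\tlambda\t1000\tA\tfrequency=1\n'       >  "$gd"
printf 'SNP\t2\t11\tlambda\t2000\tG\tfrequency=0.82\n'    >> "$gd"
printf 'DEL\t3\t12\tlambda\t3000\t5\tfrequency=0.15\n'    >> "$gd"
printf 'RA\t10\t.\tlambda\t1000\t0\tC\tA\tfrequency=1\n'  >> "$gd"

# Print type, position, and percent frequency for mutation lines
# (not evidence lines such as RA) whose frequency is below 1.
list_polymorphic() {
    awk -F'\t' '$1 ~ /^(SNP|SUB|DEL|INS|MOB|AMP|CON|INV)$/ {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^frequency=/) {
                f = substr($i, 11) + 0      # text after "frequency="
                if (f < 1) printf "%s\t%s\t%.1f%%\n", $1, $5, 100 * f
            }
        }
    }' "$1"
}

list_polymorphic "$gd"
```

With the toy file, only the two mutations at 82.0% and 15.0% are listed; the mutation at 100% and the RA evidence line are skipped.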

E. coli data from Mapping, SNV tutorials:

As a reminder, the read files we were working with in the bowtie2 and SNV tutorials were originally downloaded from the NCBI Sequence Read Archive via the corresponding European Nucleotide Archive record. They are Illumina Genome Analyzer sequencing of a paired-end library from a (haploid) E. coli clone that was isolated from a population of bacteria that had evolved for 20,000 generations in the laboratory as part of a long-term evolution experiment (Barrick et al., 2009). The reference genome is the ancestor of this E. coli population (strain REL606), so we expect the read sample to have differences from this reference that correspond to mutations that arose during the evolution experiment. If that description sounds like what breseq was made for ... breseq was literally developed, at least in part, to analyze this data.


Data

As we did yesterday, we'll start by copying our reads and reference into a new folder on scratch:

Code Block
languagebash
titleRemember that copying an entire folder requires the use of the recursive (-r) option.
collapsetrue
cds
cp -r $BI/gva_course/mapping/data GVA_breseq_comparison_to_bowtieSamtoolsTutorials
cd GVA_breseq_comparison_to_bowtieSamtoolsTutorials
ls
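
If you are curious what happens without the recursive option, here is a small self-contained sketch. It uses a temporary directory with made-up folder and file names rather than the course data, so it is safe to run anywhere.

```bash
#!/usr/bin/env bash
# Build a small nested folder to copy (names are placeholders).
src=$(mktemp -d)
mkdir -p "$src/data/subdir"
echo "reads" > "$src/data/subdir/example.fastq"

# Without -r, cp refuses to copy a directory and returns a non-zero status.
if cp "$src/data" "$src/copy_plain" 2>/dev/null; then
    plain_status="copied"
else
    plain_status="refused"
fi

# With -r, the folder and everything inside it are copied.
cp -r "$src/data" "$src/copy_recursive"

echo "plain cp: $plain_status"
find "$src/copy_recursive" -type f
```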

Running breseq

Code Block
languagebash
titlebreseq command
breseq -r NC_012967.1.gbk SRR030257_1.fastq SRR030257_2.fastq.gz

As mentioned early in the course, some programs can take compressed fastq files as input, and breseq is one such program. The command above takes 2 fastq files, one uncompressed and the other gzipped. Otherwise, this is the same basic command as our first breseq command, with the bare minimum of inputs: a reference file and read file(s).
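
If you want to convince yourself that a plain and a gzipped fastq file hold the same reads, a quick check is to count reads in each. The sketch below builds two tiny made-up files (toy_1.fastq and toy_2.fastq.gz are placeholder names) and assumes the common 4-lines-per-read fastq layout; the helper count_reads is hypothetical, not part of breseq.

```bash
#!/usr/bin/env bash
# Two toy reads in standard 4-line fastq records (made-up data).
tmp=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$tmp/toy_1.fastq"

# gzip -c writes the compressed copy to stdout and leaves the original in place.
gzip -c "$tmp/toy_1.fastq" > "$tmp/toy_2.fastq.gz"

# Count reads, decompressing on the fly when the name ends in .gz.
count_reads() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { print NR / 4 }'   # assumes 4 lines per read
}

count_reads "$tmp/toy_1.fastq"
count_reads "$tmp/toy_2.fastq.gz"
```

Both calls should report the same number of reads, since the second file is just a compressed copy of the first.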

In the advanced breseq tutorials, we'll start working with more complex options, such as storing input reads in 1 directory and breseq output in a separate directory, installing your own version of breseq on TACC so you aren't reliant on the BioITeam version, enabling multiple processors to speed up all the breseq runs, comparing across multiple samples, and more.

Now that you have this command running (estimated to take ~20 minutes), I suggest:

  1. Going back up to the lambda phage run you transferred to your computer above and interrogating the results further, as they contain fundamentally the same types of mutations and output that you will see for this sample, just on a smaller scale.
  2. Reading below to anticipate what the results here will look like and how they will compare to the list of 40 variants you identified using samtools and visualized with IGV.
  3. Reading the section of this tutorial that deals with rerunning breseq on the lambda phage data in polymorphism mode (and the differences that makes in the results).
  4. Going back to the course home page, deciding what tutorial you'd like to work on next, and beginning to read through that material.

Evaluating output

Additional tutorials dealing with breseq

...