...
Click around through the different mutations and examine their evidence to see what kinds of mutations you can identify. If you can't tell what type of mutation each line represents, or how the images should help you understand what the mutation is, please don't hesitate to ask over Zoom.
Bacteriophage lambda data set revisited
Did you notice the name of the fastq file we used? lambda_mixed_population.fastq. As the name implies, this file is actually from a mixed population of phage, though we did not include any information about that fact in the breseq command we used. Further, as you clicked around on some of the RA evidence, you may have noticed that some of the mutations also listed reads that aligned to the reference genome. This is similar to our initial SNV tutorial, where the variant calls were made in a consensus diploid mode that forced the program to decide between variants being at ~50% and 100%. breseq only works on haploid organisms, and thus assumed that any variants that were present must have been present at 100%.
Since we know it was a mixed population, we can rerun the same fastq file against the same reference, add the flag for polymorphism mode (-p), and see how the results differ when we tell breseq that variants may exist at any frequency.
Data, and running breseq
Code Block | ||||
---|---|---|---|---|
| ||||
mkdir $SCRATCH/GVA_breseq_lambda_mixed_pop_polymode
cp $SCRATCH/GVA_breseq_lambda_mixed_pop/lambda* $SCRATCH/GVA_breseq_lambda_mixed_pop_polymode
cd $SCRATCH/GVA_breseq_lambda_mixed_pop_polymode
breseq -p -r lambda.gbk lambda_mixed_population.fastq
|
Evaluating output
Again, you will need to transfer files back to your local computer to visualize the differences. The exact same compression command will work, as the folder name is the same. Because of that, be careful where you transfer the file to on your local computer so that you don't overwrite the previously transferred files. Consider adding _polymode to the name of the directory you are transferring to, as we did in our command above. Again, help with SCP can be found here.
Code Block | ||||
---|---|---|---|---|
| ||||
tar -czvf output.tar.gz output # the czvf options in order mean Create, Zip, Verbose, File
pwd # helpful for the Remote (Right) terminal window |
Code Block | ||||
---|---|---|---|---|
| ||||
# scp command first ...
tar -xvzf output.tar.gz # the new "x" option at the front means eXtract |
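One way to follow the advice about not overwriting the earlier results is to unpack the tarball into its own directory. The sketch below is only an illustration: `extract_to` is a hypothetical helper name, and the target directory name in the usage comment is an assumption, not something this tutorial requires.

```shell
# Hypothetical helper: unpack a tarball into a separate directory so that an
# existing "output" folder from an earlier run is left untouched.
extract_to() {
  tarball="$1"      # e.g. output.tar.gz
  target_dir="$2"   # directory to create and unpack into
  mkdir -p "$target_dir"
  # -C tells tar to change into target_dir before extracting
  tar -xzf "$tarball" -C "$target_dir"
}

# Example usage (the directory name is an assumption; choose your own):
# extract_to output.tar.gz lambda_mixed_pop_polymode
```

After running it, the results would sit under the new directory (for example `lambda_mixed_pop_polymode/output`) rather than replacing the first run's `output` folder.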
When you look at the summary statistics page, you will see that none of the output has changed until you get quite far down the page and find that this time breseq was run in full polymorphism mode. When you look at the mutation predictions page, you now see more total mutations (most of the new ones at frequencies below 20%), a new column listing the frequency of each mutation (with variants at less than 100% shown in green), and, if you look closely, some mutations that were previously listed in white and at 100% that are now listed at less than 100% (e.g., 82.0%). Hopefully, from the discussions we've had to this point, it makes sense that mutations that are real but at low frequency would be called as 0% rather than 100% when those are the only 2 choices. Again, feel free to get my attention if you have any questions about the output, such as why there are so many mutations at 100% even in a mixed population sample.
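To make the "only 2 choices" idea concrete, here is a toy calculation. It is not breseq's actual statistical model: the function name `call_variant` and the simple 50% cutoff are assumptions chosen purely for illustration.

```shell
# Toy illustration only (NOT how breseq really decides): compare the observed
# frequency of a variant with the call a two-choice (0% or 100%) mode would make.
# The 50% cutoff below is an assumption for the sake of the example.
call_variant() {
  # $1 = reads supporting the variant, $2 = total reads at that position
  awk -v v="$1" -v t="$2" 'BEGIN {
    f = 100 * v / t
    consensus = (f >= 50) ? "100%" : "0% (not reported)"
    printf "observed %.1f%% -> two-choice call: %s\n", f, consensus
  }'
}

call_variant 82 100   # a mostly-fixed variant gets rounded up
call_variant 15 100   # a low-frequency variant disappears
```

With these made-up read counts, the first call reports an observed 82.0% that a two-choice mode rounds to 100%, while the second reports 15.0% that is rounded to 0%, which is the pattern described above.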
E. coli data from Mapping, SNV tutorials:
As a reminder, the read files we were working with in the bowtie2 and SNV tutorials were originally downloaded from the NCBI Sequence Read Archive via the corresponding European Nucleotide Archive record. They are Illumina Genome Analyzer sequencing of a paired-end library from a (haploid) E. coli clone that was isolated from a population of bacteria that had evolved for 20,000 generations in the laboratory as part of a long-term evolution experiment (Barrick et al., 2009). The reference genome is the ancestor of this E. coli population (strain REL606), so we expect the read sample to have differences from this reference that correspond to mutations that arose during the evolution experiment. If that description sounds like what breseq was made for ... breseq was literally developed, at least in part, to analyze this data.
Data
Like we did yesterday, we'll start by copying our reads and reference into a new folder on scratch:
Code Block | ||||
---|---|---|---|---|
| ||||
cds
cp -r $BI/gva_course/mapping/data GVA_breseq_comparison_to_bowtieSamtoolsTutorials
cd GVA_breseq_comparison_to_bowtieSamtoolsTutorials
ls |
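Before starting a run that takes a while, it can be worth confirming that the files the breseq command expects are really in the current directory. The following is a minimal sketch; `check_inputs` is a hypothetical helper, and the file names are the ones used in the breseq command in this tutorial.

```shell
# Hypothetical helper: report any input file that is missing or empty.
# Returns 0 if every file exists and is non-empty, 1 otherwise.
check_inputs() {
  missing=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# File names taken from the breseq command used in this tutorial.
if check_inputs NC_012967.1.gbk SRR030257_1.fastq SRR030257_2.fastq.gz; then
  echo "all inputs present"
else
  echo "fix the files listed above before running breseq"
fi
```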
Running breseq
Code Block | ||||
---|---|---|---|---|
| ||||
breseq -r NC_012967.1.gbk SRR030257_1.fastq SRR030257_2.fastq.gz |
As mentioned early in the course, some programs can take compressed fastq files as input, and breseq is one such program. In the above example, it takes 2 fastq files: one uncompressed and the other gzipped. Otherwise, this is the same basic command we used in our first breseq run, with the bare minimum of inputs: a reference file and read file(s).
In the advanced breseq tutorials, we'll start working with more complex options, such as storing input reads in one directory and breseq output in a separate directory, installing your own version of breseq on TACC so you aren't reliant on the BioITeam version, enabling multiple processors to speed up all the breseq runs, comparing across multiple samples, and more.
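If you want to convince yourself that the plain and gzipped files hold comparable data, a small sketch like the one below counts reads in either kind of file. `count_reads` is a hypothetical helper, and it assumes standard 4-line FASTQ records.

```shell
# Hypothetical helper: count reads in a FASTQ file, plain or gzipped.
# Assumes each read occupies exactly 4 lines (standard FASTQ layout).
count_reads() {
  case "$1" in
    *.gz) gzip -cd "$1" ;;   # decompress to stdout, leaving the file untouched
    *)    cat "$1" ;;
  esac | awk 'END { print NR / 4 }'
}

# Example usage with the files from the command above:
# count_reads SRR030257_1.fastq
# count_reads SRR030257_2.fastq.gz
```

For a paired-end library you would generally expect the two counts to match.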
Now that you have this command running (estimated to take ~20 minutes) I suggest:
- Going back up to the lambda phage run you transferred back to your computer above and interrogating the results, as they will be fundamentally the same types of mutations and output that you will see for this sample, just on a smaller scale.
- Reading below to anticipate what the results here will look like and how they will compare to the list of 40 variants you identified using samtools and visualized with IGV.
- Reading the next section of the tutorial, which deals with rerunning breseq on the lambda phage data in polymorphism mode (and the differences that makes in the results).
- Going back to the course home page, deciding what tutorial you'd like to work on next, and beginning to read through that material.
Evaluating output
Additional tutorials dealing with breseq
...