Stampede2 Breseq Tutorial GVA2022

Stampede2 Breseq Tutorial GVA2022

Overview

breseq is a tool developed by the Barrick lab (of which your instructor is a member) intended for analyzing genome re-sequencing data for bacteria. It is primarily used to analyze laboratory evolution experiments with microbes. In these experiments, there is usually a high-quality reference genome for the ancestral strain, and one is interested in exhaustively finding all of the mutations that occurred during the evolution experiment. Then one might want to construct a phylogenetic tree of individuals samples from a single population or determine whether the same gene is mutated in many independent evolution experiments in an environment.

Learning objectives:

  • Quick introduction to a self contained/automated pipeline to identify mutations.

  • Explain the types of mutations found in a complete manner before using methods better suited for higher order organisms.

  • Examine the same data used in the Mapping, and SNV tutorials as breseq output.



Input data / expectations:

  • Haploid reference genome

  • Relatively small (<20 Mb) reference genome

  • Average genomic coverage > 30-fold

  • Less than ~1,000 mutations expected

  • Detects SNVs and SVs from split read alignment of reads (does not use paired-end distance information)

  • Produces annotated HTML output

This does mean that breseq is not suited for diploids, and other very large genomes. GATK is a similar pipeline that has many more additional options that is suited for diploids and large genomes. You can learn a great deal more about breseq by reading the Online Documentation.

Here is a rough outline of the workflow in breseq with proposed additions.



breseq access

In order to run breseq, we need to install it. If you think this sounds like a great opportunity to use conda you are right! Using https://anaconda.org/  you find 2 different results for breseq. 1 is in the bioconda channel that we have worked with multiple times and has been downloaded >32,000 times. The other is from a contributor with limited track record, nothing has been updated in more than 4.5 years, and has only been accessed 6 times. Like with apps on your phone the answer to "which is the correct one" should be rather obvious.

As breseq is a "all in 1 tool" that works to map, call variants, sift signal from noise, and provide basic visualization, you may think that breseq is the only program you would use for the analysis of appropriate samples. Do not forget that read qc and read processing actually take place before running breseq. Therefor a full environment might contain (such as the one I use in my own analysis): fastqc, fastp, and breseq. Nicely, all 3 programs are in the bioconda channel. Using what you have already learned so far, you should know how to create a new environment with these 3 programs. Unfortunately there is currently an issue wherein installing breseq does not grab one program that is needed to draw some of the graphs it can normally produce. One way of fixing this is to include "libjpeg-turbo". 

You can name your new environment anything you want, my suggestion would be GVA-breseq.
conda create --name GVA-breseq -c bioconda fastqc fastp breseq libjpeg-turbo -c conda-forge conda activate GVA-breseq

Because we are including multiple different packages, the installation will take a few extra minutes to complete. If you just include breseq and libjpeg-turbo the installation will be faster. This is one of the reasons that having 1 environment with all programs installed in it is not always the best idea.



(GVA-breseq) tacc:/scratch/0004/train402/GVA_samtools_tutorial$ fastqc --version FastQC v0.11.9 (GVA-breseq) tacc:/scratch/0004/train402/GVA_samtools_tutorial$ fastp --version fastp 0.23.2 (GVA-breseq) tacc:/scratch/0004/train402/GVA_samtools_tutorial$ breseq --version breseq 0.36.1



breseq should now run using the breseq command. breseq without any options will show you what the command expectations are while adding .

A consolidated explanation of help files and my experience with them

Not all programs are configured to tell you what it expects just from typing the name of it. Some require the name of the command followed by one of the following: -h or --help or ? while others require preceding the command name with "man" (short for manual). If all that fails google is your friend for all programs not named "R" ... for R google is still your best bet, but it won't be your friend. In the specific case of R, adding the word "stats" somewhere to the search will greatly help things in my experience.

Bacteriophage lambda data set

First, we'll run breseq on a small data set to be sure that it is installed correctly, and to get a taste for what the output looks like. This sample is a mixed population of bacteriophage lambda that was co-evolved in lab with its E. coli hosts.

Data

The data files for this tutorial is located in following location:

$BI/ngs_course/lambda_mixed_pop/data/

Copy the contents of this directory to a new directory called GVA_breseq_lambda_mixed_pop in your scratch directory.

By this point in the course you should not need to expand this box to see the suggested solution. You should continue expanding boxes such as this to make sure you are not drifting too far
mkdir $SCRATCH/GVA_breseq_lambda_mixed_pop cp $BI/ngs_course/lambda_mixed_pop/data/* $SCRATCH/GVA_breseq_lambda_mixed_pop

Possible errors on idev nodes

As mentioned yesterday, you can not copy from the BioITeam (because it is on corral-repl) while on an idev node. Logout of your idev session, copy the files, and be sure to start a new idev session as breseq should not be run on the headnode.

Now use the ls command to see what files were copied. again, you should not need to expand this to get the output listed below
ls $SCRATCH/GVA_breseq_lambda_mixed_pop

File Name

Description

Sample

File Name

Description

Sample

lambda_mixed_population.fastq

Single-end Illumina 36-bp reads

Evolved lambda bacteriophage mixed population genome sequencing

lambda.gbk

Reference Genome

Bacteriophage lambda

Running breseq

Because this data set is relatively small (roughly 100x coverage of a 48,000 bp genome), a breseq run will take < 5 minutes, but it is computationally intense enough that it should not be run on the head node. 

Remember to make sure you are on an idev done

For reasons discussed numerous times throughout the course already, please be sure you are on an idev done. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial.

breseq command
conda activate GVA-breseq cd $SCRATCH/GVA_breseq_lambda_mixed_pop breseq -j 68 -r lambda.gbk lambda_mixed_population.fastq

While breseq is running lets look at what the different parts of the command are actually doing:

part

puprose

part

puprose

-j 68

use 68 processors (recall we saw this was best in the rough testing using of bowtie2

-r lambda.gbk

Use the lambda.gbk file as the reference to identify specific mutations

lambda_mixed_population.fastq

breseq assumes any argument not preceded by a - option to be an input fastq file to be used for mapping