SSCS vs Trimmed Read Variant calls GVA2017

Overview:

Throughout the course we have spoken to you about how errors can pop up in your data and how they can effect confidence in variant calls and knowledge of what variants are actually read. Here we provide you with real data to show the difference between a non-corrected library and a error corrected library.

This tutorial makes use of data generated in the quick Error Correction tutorial. If you have not done that tutorial already you should do it first.

 

Learning Objectives:

  1. Run breseq on fastq files corresponding to error corrected, and non-error corrected data
  2. Gain an understanding of just how powerful error correction can be

 

Running Breseq on SSCS and Trimmed Reads:

The following reference files should be copied from the $BI/gva_course/mixed_population directory into a new folder BDIB_breseq_Error_Correction: 1400flanking.gff3, and REL606.maksed.gff3 while DED110_SSCS.fastq, and DED110_all.trimmed.fastq should be copied from your BDIB_Error_Correction folder. 

 Click here if you need help copying the reference files...
mkdir $SCRATCH/BDIB_breseq_Error_Correction
cp $BI/gva_course/mixed_population/*.gff3 $SCRATCH/BDIB_breseq_Error_Correction
cd $SCRATCH/BDIB_breseq_Error_Correction
 Click here if you need help copying the fastq files you made in the previous tutorial...
cp $SCRATCH/BDIB_Error_Correction/DED110_SSCS.fastq .
cd $SCRATCH/BDIB_Error_Correction/DED110_all.trimmed.fastq .

 

Just like our previous tutorials these commands should be run on an idev shell and they should be run in the background so both comparisons can be run at the same time. 

In this case we are going to be running breseq in both polymorphism mode (-p), and targeted sequencing mode (-t). By default breseq creates a lot of html files as output which ordinarily are very useful to visualize what the characteristics of any mutation are (in a better way than can be done with IGV). For normal runs this is extremely useful. For targeted deep sequencing runs (esspecially ones where we have not done error corrections), this can be incredibly time consuming. Since we are ultimately only interested in comparing the difference in number of mutations detected and their frequencies rather tahn the specific characteristics of those mutations, we also include the --brief-html-output option. We will also be using 2 different reference files, 1400flanking.gff3 should be used as a standard reference while REL606.masked.gff3 should be used as a junction only reference. Also we want to make good use of the 48 processors we have available.

mkdir Logs
module unload samtools
breseq -p -t -j 36 -o SSCS_output/trimmed -r 1400flanking.gff3 -s REL606.masked.gff3 --brief-html-output DED110_all.trimmed.fastq >& Logs/trimmed.log.txt &
breseq -p -t -j 12 -o SSCS_output/SSCS -r 1400flanking.gff3 -s REL606.masked.gff3 --brief-html-output DED110_SSCS.fastq >& Logs/SSCS.log.txt &
# wait a few moments and check that breseq is running using the tail command
tail Logs/*.log.txt
 While the above commands are running, can you think why we used two different values for the -j flag?

From the breseq -h command, we are reminded that the -j argument corresponds to the number of processors used. From our work in the previous error correction tutorial we saw that the error corrected data had ~1 million reads, while the trimmed data had ~6 million reads. The extra number of reads will make the trimmed files take longer to run, and hence why we give it more processors to speed things up.

 

Generate comparison table for the 2 types of samples

Just like the previous tutorial we now want to create a comparison file using "gdtools compare" so we can see the difference of having the trimmed reads vs the SSCS reads. This time you want to create a file "trimmed_vs_SSCS.html".

 If you need a hint without the answer click the triangle...

This was extensively covered in the Advanced Breseq tutorial. You may want to review this section of the tutorial for more detailed explanation while you wait for the 2 runs to complete. Check which commands are still running using the jobs command and the tail command listed above.

gdtools compare -h

the above will display instructions stating that gdtools compare needs a genebank (gbk) reference file, and .gd files to compare. The -o command is used to direct the output to a specific file name and/or location.

 If you are still stuck and want the answer click the triangle...
gdtools compare -o trimmed_vs_SSCS.html -r reference/REL606.masked.gff3 -r reference/1400flanking.gff3 SSCS_output/SSCS/output/output.gd SSCS_output/trimmed/output/output.gd

The comparison table as well as the output for the trimmed and SSCS reads can now be exported to your local computer for viewing.

If you feel you need a hint at this point raise your hand and let us know

By this point in the course you should be able to do this without assistance if you think through it. If you need help thats fine, but its possible that copy pasting the commands has not helped you to learn the principles of doing this. Raise your hand, ask for help, and let us walk you through what you know about scp and cp and how to get the information you need off of TACC

Evaluating the effect of error correction:

Once you have transfered the trimmed_vs_SSCS.html file back to your computer, open it and scroll around to see what the effects of our SSCS error correction were.

Return to GVA2017