SSCS Read Generation (GVA14)

Overview:

This section provides directions for generating SSCS (Single Strand Consensus Sequence) reads and trimming molecular indexes from raw fastq files. 

Learning Objectives:

  1. Use python script to generate SSCS Reads.
  2. Use flexbar to trim molecular indexes from duplex seq libraries.

Tutorial: SSCS Reads

Since this will take longer we will first use First we want to generate SSCS reads where we take advantage of the molecular indexes added during library prep. For the purpose of this tutorial, the paired end sequencing of sample DED110 has been placed in the  $BI/gva_course/mixed_population directory. To do so we will use a "majority rules" python script (named SSCS_DCS.py) which was heavily modified by DED from a script originally created by Mike Schmitt and Scott Kennedy for the original duplex seq paper. This script can be found in the $BI/scripts directory. Invoking the script is as simple as typing SSCS_DCS.py; adding -h will give a list of the available options. The goal of this command is to generate SSCS reads, for any molecular index where we have at least 2 reads present, and to generate a log file which will tell us some information about the data.

 If you need help to copy the fastq files click the triangle...
SSCS_DCS.py -f1 fastq/DED110_CATGGC_L006_R1_001.fastq -f2 fastq/DED110_CATGGC_L006_R2_001.fastq -p DED110 -s -m 2 --log SSCS_Log


 If you need a hint without the answer click the triangle...

The following arguments are the ones that are needed to generate just the SSCS reads

  -f1 FASTQ1, --fastq1 FASTQ1
                        fastq read1 file to check
  -f2 FASTQ2, --fastq2 FASTQ2
                        fastq read2 file to check
  -p PREFIX, --prefix PREFIX
                        prefix for output files
  -s, --SSCS            calculate SSCS sequence, off by default. IF DCS
                        specificed, automatically on
  -m MINIMUM_READS, --minimum_reads MINIMUM_READS
                        minimum number of reads needed to support SSCS reads
  --log LOG             name of output log file
 If you are still stuck and want the answer click the triangle...
SSCS_DCS.py -f1 fastq/DED110_CATGGC_L006_R1_001.fastq -f2 fastq/DED110_CATGGC_L006_R2_001.fastq -p DED110 -s -m 2 --log SSCS_Log

This should take ~15 minutes to complete in an idev shell. While this is completing we will generate .fastq files where the molecular index has been trimmed from the read.

Tutorial (Trimmed Reads):

For the purpose of this tutorial, we will be working with flexbar which like breseq is something that we have installed in the BioITeam as it is not a tacc module. Some additional modules must be loaded in order for it to work correctly, and the LD_LIBRARY_PATH variable must be modified as listed below.

module swap intel gcc
export LD_LIBRARY_PATH=/corral-repl/utexas/BioITeam/flexbar_v2.23_linux64:$LD_LIBRARY_PATH

For this tutorial, it is sufficient to simply type these commands out, if this becomes something you want to do more often, or want to submit as a job, it would be important to add these lines to your .profile so they are loaded each time you log in.

If the above commands are executed properly, typing flexbar -h should display a lengthy list of optional arguments which can be used for a variety of purposes. For the purpose of this tutorial, we will only focus on trimming the first 16 bases off each read as this represents the 12 bases of the molecular index and a 4 base constant region. See if you can figure out what the command is based on the help output pay special attention to the -t option.

 If you need a hint without the answer click the triangle...

The following arguments are the ones that are needed to successfully trim the first 16 bases of the sequence:

    -u, --max-uncalled NUM
          Allowed uncalled bases (N or .) in reads, default: 0
    -x, --pre-trim-left NUM
          Trim specified number of bases on 5' end of reads before alignment
    -t, --target STR
          Prefix for output file names
    -s, --source FILE
          Input file with reads, that may contain barcodes
    -p, --source2 FILE
          Second input file for paired read scenario
    -f, --format STR
          Input format of reads: csfasta, csfastq, fasta, fastq, fastq-sanger, fastq-solexa, fastq-i1.3, fastq-i1.5,
          fastq-i1.8 (illumina 1.8+)
 If you are still stuck and want the answer click the triangle...
flexbar -u 100 -x 16 -t trimmed -s fastq/DED110_CATGGC_L006_R1_001.fastq -p fastq/DED110_CATGGC_L006_R2_001.fastq -f fastq

In an idev shell this should take less than 5 minutes to complete. Once completed there should be 6 new files, all of which begin with "trimmed" if you took the answer from the above help, or whatever string you entered for the -t argument if you did not use the above help. These 6 files represent the trimmed files, the length distribution, and any errors. using the head command, see if you can figure out which file is which. 

 click here for the answer to which file is which

the trimmed_1/2.fastq is the trimmed fastq files

the trimmed_1/2.fastq.lengthdist is the length distribution file (which should have 2 lines: a header line and a line showing that all of the reads are 85 bp long now

the trimmed_1_single.fastq is the error file which should be empty

Next step:

You should now have 3 new .fastq files which we will use to call variants in: DED110_SSCS.fastq, trimmed_1.fastq, and trimmed_2.fastq. We will now run breseq to compare the number and quality of variants called.

LINK TO NEXT TUTORIAL