Molecular Index Error correction

Overview:

This section provides directions for generating SSCS (Single Strand Consensus Sequence) reads and trimming molecular indexes from raw fastq files. 

Learning Objectives:

  1. Use python script to generate SSCS Reads.
  2. Use flexbar to trim molecular indexes from duplex seq libraries.

Tutorial: SSCS Reads

Since this will take longer we will first use First we want to generate SSCS reads where we take advantage of the molecular indexes added during library prep. For the purpose of this tutorial, the paired end sequencing of sample DED110 has been placed in the  $BI/gva_course/mixed_population directory. To do so we will use a "majority rules" python script (named SSCS_DCS.py) which was heavily modified by DED from a script originally created by Mike Schmitt and Scott Kennedy for the original duplex seq paper. This script can be found in the $BI/scripts directory. Invoking the script is as simple as typing SSCS_DCS.py; adding -h will give a list of the available options. The goal of this command is to generate SSCS reads, for any molecular index where we have at least 2 reads present, and to generate a log file which will tell us some information about the data.

Click here for solution of how to copy the DED110 fastq files to a new directorry called BDIB_Error_Correction
cds
mkdir BDIB_Error_Correction
cd BDIB_Error_Correction
cp $BI/gva_course/mixed_population/DED110*.fastq .

 

 

 Interrogate the SSCS_DCS.py script to determine how to invoke it. Click here for hints before the answer

You can often get more information about python scripts by typing the name of the script followed by the -h command.

If you see an error message check here for a probable solution
Traceback (most recent call last):
  File "/corral-repl/utexas/BioITeam/bin/SSCS_DCS.py", line 45, in <module>
    import numpy
  File "/corral-repl/utexas/BioITeam/lib/python2.7/site-packages/numpy/__init__.py", line 170, in <module>
    from . import add_newdocs
  File "/corral-repl/utexas/BioITeam/lib/python2.7/site-packages/numpy/add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "/corral-repl/utexas/BioITeam/lib/python2.7/site-packages/numpy/lib/__init__.py", line 8, in <module>
    from .type_check import *
  File "/corral-repl/utexas/BioITeam/lib/python2.7/site-packages/numpy/lib/type_check.py", line 11, in <module>
    import numpy.core.numeric as _nx
  File "/corral-repl/utexas/BioITeam/lib/python2.7/site-packages/numpy/core/__init__.py", line 46, in <module>
    from numpy.testing import Tester
  File "/corral-repl/utexas/BioITeam/lib/python2.7/site-packages/numpy/testing/__init__.py", line 13, in <module>
    from .utils import *
  File "/corral-repl/utexas/BioITeam/lib/python2.7/site-packages/numpy/testing/utils.py", line 15, in <module>
    from tempfile import mkdtemp
  File "/opt/apps/python/epd/7.3.2/lib/python2.7/tempfile.py", line 34, in <module>
    from random import Random as _Random
  File "/opt/apps/python/epd/7.3.2/lib/python2.7/random.py", line 47, in <module>
    from os import urandom as _urandom
ImportError: cannot import name urandom
 
# this error message seems to be being caused by the default version of python which is loaded on login from your .profile
module swap python/2.7.3-epd-7.3.2 python/2.7.1  # will swap 1 version of python for another. If you start having other problems with other programs you may need to switch back to the other version
The -h command should show you these options as being the key options to use/consider
  -f1 FASTQ1, --fastq1 FASTQ1
                        fastq read1 file to check
  -f2 FASTQ2, --fastq2 FASTQ2
                        fastq read2 file to check
  -p PREFIX, --prefix PREFIX
                        prefix for output files
  -s, --SSCS            calculate SSCS sequence, off by default. IF DCS
                        specificed, automatically on
  -m MINIMUM_READS, --minimum_reads MINIMUM_READS
                        minimum number of reads needed to support SSCS reads
  --log LOG             name of output log file
Using that information, see if you can figure out how to put the command together
SSCS_DCS.py -f1 DED110_CATGGC_L006_R1_001.fastq -f2 DED110_CATGGC_L006_R2_001.fastq -p DED110 -s -m 2 --log SSCS_Log

This should take ~15 minutes to complete in an idev shell. While this is completing we suggest generating .fastq files where the molecular index has been trimmed from the read to familiarize yourself with flexbar and to set up input files for a more advanced optional breseq tutorial. 

Error correction evaluation:

The SSCS_Log is a great place to start. Use the tail command to look at the last 8 lines of the log file to determine how many reads made it from raw reads to error corrected SSCS reads. 

help with the tail command
tail -n 8 SSCS_Log
 Approximately what fraction of raw reads became SSCS reads?

There are approximately 1/6th as many SSCS reads as raw reads:

Total Reads: 6066836

SSCS count: 978142

While this is somewhat misleading as it takes a minimum of 2 reads to generate a single SSCS read, we do have some additional information regarding what happened to the other reads. The first thing is to consider is the "Dual MI Reads" these represent the reads which correctly had the 12bp of degenerate sequence and the 4bp anchor. In this case, more than 1.5 million reads lacked an identifyable molecular index on read 1 and/or read 2. By that regard, we had ~1/4 as many SSCS reads as raw reads.

Perhaps more interesting is the number of errors removed. This is also available in the SSC_Log file, but in the middle of the file and don't have any good handle to grep with. One option is to cat the entire file and scroll around, another is to use tail/head commands you can get the specific lines only:

help with the tail command
tail -n 94 SSCS_Log | head -n 86

The 3 columns are the read posistion, the number of bases changed, and the number of bases not changed. If you copy and paste these 3 columns into excel you can easily calculate the sum of the 2nd column to see that 446,104 bases were changed. The read position is based on the 5-3' sequence, and you should notice that generally the higher the read position, the more errors were corrected. This should make sense based on what we have talked about with decreasing quality scores as read length increases.

Tutorial (Trimmed Reads):

For the purpose of this tutorial, we will be working with flexbar which like breseq is something that we have installed in the BioITeam as it is not a tacc module. Test that flexbar is installed and working correctly using the which command and 'flexbar -h'. If these commands work, note that this is the perfect example of the benefits of the BioITeam. This is has proven to be an extremely difficult command to install and configure and install in the past, leading me to place it in the $BI/bin directory, but it still required additional options (expand the next section to see what those may have been). Yet if it is working it strongly suggests some other member of the BioITeam found it available, and fixed it in some way so it works by default.

 

 If error messages are displayed you may need to try the following as has been necessary in previous years

Some additional modules must be loaded in order for it to work correctly, and the LD_LIBRARY_PATH variable must be modified as listed below.

module swap intel gcc
export LD_LIBRARY_PATH=/corral-repl/utexas/BioITeam/flexbar_v2.23_linux64:$LD_LIBRARY_PATH

For this tutorial, it is sufficient to simply type these commands out, if this becomes something you want to do more often, or want to submit as a job, it would be important to add these lines to your .profile so they are loaded each time you log in, and available by default to the compute nodes.

 

If the above commands are executed properly, typing flexbar -h should display a lengthy list of optional arguments which can be used for a variety of purposes. For the purpose of this tutorial, we will only focus on trimming the first 16 bases off each read as this represents the 12 bases of the molecular index and a 4 base constant region. See if you can figure out what the command is based on the help output pay special attention to the -t option.

 If you need a hint without the answer click the triangle...

The following arguments are the ones that are needed to successfully trim the first 16 bases of the sequence:

    -u, --max-uncalled NUM
          Allowed uncalled bases (N or .) in reads, default: 0
    -x, --pre-trim-left NUM
          Trim specified number of bases on 5' end of reads before alignment
    -t, --target STR
          Prefix for output file names
    -r, --reads FILE
          Input file with reads, that may contain barcodes
    -p, --reads2 FILE
          Second input file for paired read scenario
    -f, --format STR
          Input format of reads: csfasta, csfastq, fasta, fastq, fastq-sanger, fastq-solexa, fastq-i1.3, fastq-i1.5,
          fastq-i1.8 (illumina 1.8+)
    -d, --length-dist
          Write length distribution for read output files
click here for the answer
flexbar -u 100 -x 16 -t trimmed -r DED110_CATGGC_L006_R1_001.fastq -p DED110_CATGGC_L006_R2_001.fastq -f fastq -d

In an idev shell this should take less than 5 minutes to complete. Once completed there should be 4 new files, all of which begin with "trimmed" if you took the answer from the above help, or whatever string you entered for the -t argument if you did not use the above help. These 4 files represent the trimmed fastq files, the length distribution. Using the head command, see if you can figure out which file is which. 

 click here for the answer to which file is which

the trimmed_1/2.fastq is the trimmed fastq files

the trimmed_1/2.fastq.lengthdist is the length distribution file (which should have 2 lines: a header line and a line showing that all of the reads are 85 bp long now

Next step:

You should now have 3 new .fastq files which we will use to call variants in: DED110_SSCS.fastq, trimmed_1.fastq, and trimmed_2.fastq. You could take these files into a more in depth breseq tutorial that was prepared last year for comparisons of the specific mutations that are eliminated using the error correction (SSCS). Link to other optional tutorial.