Overview:

This section provides directions for generating SSCS (Single Strand Consensus Sequence) reads and trimming molecular indexes from raw fastq files.

Learning Objectives:

Use python script to generate SSCS Reads.
Use cutadapt to trim molecular indexes from duplex seq libraries.

Tutorial: SSCS Reads

First we want to generate SSCS reads where we take advantage of the molecular indexes added during library prep. To do so we will use a "majority rules" python script (named SSCS_DCS.py) which was heavily modified by DED from a script originally created by Mike Schmitt and Scott Kennedy for the original duplex seq paper. This script can be found in the $BI/bin directory which means we will need to copy it to our scratch directory to use it on the compute node. For the purpose of this tutorial, the paired end sequencing of sample DED110 (prepared with a molecular index library) has been placed in the $BI/gva_course/mixed_population directory. Invoking the script is as simple as typing SSCS_DCS.py; adding -h will give a list of the available options. The goal of this command is to generate SSCS reads, for any molecular index where we have at least 2 reads present, and to generate a log file which will tell us some information about the data.

Click here for solution of how to copy the DED110 fastq files to a new directorry called BDIB_Error_Correction

mkdir $SCRATCH/GVA_Error_Correction
cd $SCRATCH/GVA_Error_Correction
cp $BI/gva_course/mixed_population/DED110*.fastq.gz .
cp `which SSCS_DCS.py` .

If you have not seen the double backtick command in any of the tutorials yet, the command within the `` is evaluated first and the output placed within the `` marks to be evaluated by the outer command. In this case it means that wherever the SSCS_DCS.py script is in our path it will be copied to our current location. As this was actually one of the first scripts I ever wrote, it is still written in python 2.7 which means that you will need to deactivate your conda environment in order to get the script to run.

Interrogate the SSCS_DCS.py script to determine how to invoke it. Click here for hints before the answer

You can often get more information about python scripts by typing the name of the script followed by the -h command.

The -h command should show you these options as being the key options to use/consider

  -f1 FASTQ1, --fastq1 FASTQ1
                        fastq read1 file to check
  -f2 FASTQ2, --fastq2 FASTQ2
                        fastq read2 file to check
  -p PREFIX, --prefix PREFIX
                        prefix for output files
  -s, --SSCS            calculate SSCS sequence, off by default. IF DCS
                        specificed, automatically on
  -m MINIMUM_READS, --minimum_reads MINIMUM_READS
                        minimum number of reads needed to support SSCS reads
  --log LOG             name of output log file

This script requires uncompressed fastq files. use gunzip before trying to build your python command

gunzip *.gz

Using that information, see if you can figure out how to put the command together

SSCS_DCS.py -f1 DED110_CATGGC_L006_R1_001.fastq -f2 DED110_CATGGC_L006_R2_001.fastq -p DED110 -s -m 2 --log SSCS_Log

This is expected to take more than 30 minutes in an idev shell. You may want run it as a submitted job rather than interatively. If you are unsure how to accomplish this please ask. Suggest looking over the alternative library prep presentation or the duplex sequencing paper itself in the mean time.

Error correction evaluation:

The SSCS_Log is a great place to start. Use the tail command to look at the last 8 lines of the log file to determine how many reads made it from raw reads to error corrected SSCS reads.

help with the tail command

tail -n 8 SSCS_Log

Approximately what fraction of raw reads became SSCS reads?

There are approximately 1/6th as many SSCS reads as raw reads:

Total Reads: 6066836

SSCS count: 978142

While this is somewhat misleading as it takes a minimum of 2 reads to generate a single SSCS read, we do have some additional information regarding what happened to the other reads. The first thing is to consider is the "Dual MI Reads" these represent the reads which correctly had the 12bp of degenerate sequence and the 4bp anchor. In this case, more than 1.5 million reads lacked an identifiable molecular index on read 1 and/or read 2. By that regard, we had ~1/4 as many SSCS reads as raw reads.

Perhaps more interesting is the number of errors removed. This is also available in the SSC_Log file, but in the middle of the file and don't have any good handle to grep with. One option is to cat the entire file and scroll around, another is to use tail/head commands you can get the specific lines only:

help with the tail command

tail -n 94 SSCS_Log | head -n 86

The 3 columns are the read posistion, the number of bases changed, and the number of bases not changed. If you copy and paste these 3 columns into excel you can easily calculate the sum of the 2nd column to see that 446,104 bases were changed. The read position is based on the 5-3' sequence, and you should notice that generally the higher the read position, the more errors were corrected. This should make sense based on what we have talked about with decreasing quality scores as read length increases.

Tutorial (Trimmed Reads with cutadapt):

From our earlier tutorial on read quality control you likely remember that we installed cutadapt into our GVA-fastqc conda environment. Use cutadapt's help files to figure out how to trim the first 16 bases off DED110_CATGGC_L006_R1_001.fastq and DED110_CATGGC_L006_R2_001.fastq

Use what you know about cutadapt and help functions to try to determine how you want to trim the first 16 bases off the R1 and R2 reads. Click here for a hint.

conda activate GVA-fastqc
cutadapt -h

Click here for 2 example commands that will work.

cutadapt -u 17 DED110_CATGGC_L006_R1_001.fastq -o DED110.R1.trimmed.fastq
cutadapt -u 17 DED110_CATGGC_L006_R2_001.fastq -o DED110.R2.trimmed.fastq

Each of these commands will take 1-2 minutes to complete. Think about ways you could have run both commands at the same time. In the tutorials (including some of the optional ones) there are at least 3 ways you may have been shown, and many more we haven't. How many can you come up with?

Possible answers

# 1. use a semicolon to separate the two commands so that the second will start as soon as the first finishes:
cutadapt -u 17 DED110_CATGGC_L006_R1_001.fastq -o DED110.R1.trimmed.fastq; cutadapt -u  17 DED110_CATGGC_L006_R2_001.fastq -o DED110.R2.trimmed.fastq
 
# 2. use a double && between the commands so the second will start as soon as the first finishes, if it finishes without any errors:
cutadapt -u 17 DED110_CATGGC_L006_R1_001.fastq -o DED110.R1.trimmed.fastq && cutadapt -u 17 DED110_CATGGC_L006_R2_001.fastq -o DED110.R2.trimmed.fastq
 
# 3. use a trailing & to have the commands run in the background:
cutadapt -u 17 DED110_CATGGC_L006_R1_001.fastq -o DED110.R1.trimmed.fastq &
cutadapt -u 17 DED110_CATGGC_L006_R2_001.fastq -o DED110.R2.trimmed.fastq &

Of the 3 answers we show above, one of them will actually finish much sooner than the first. Do you know which one and why?

Answer and small discussion

The 3rd solution will finish before the other two because they are actually executed at the same time rather than waiting for one to finish. In many circumstances this is among the best ways to do something like this, and 'simple' read trimming with the cutadapt is one of them. If you are doing something much more computationally intense (say read mapping, variant calling, or genome assembly) trying to complete the tasks at the same time will often leave you with no results at all as you run out of memory even on the compute nodes and the programs error out.

Checking the current contents of the directory will show you we've now made 2 new .trimmed.fastq files in addition to the trio of .fastq files we made in the error correction part of the tutorial. The DED110_SSCS.fastq is the one of most interest to us for the follow up tutorial, while both the .trimmed.fastq files will be of interest. Rather than working with 3 files for 2 samples (error corrected and trimmed), use what you have learned about piping to generate a single file called DED110_all.trimmed.fastq and check your work.

Need a hint?

command	function
cat	prints contents of a file to the screen
>	writes whatever happened on the left side to the right side
>>	appends whatever happened on the left side to the right side
wc -l	how many lines are in whatever is specified next
head	view the top lines of a file
tail	view the bottom lines of a file

Possible solution

cat *.trimmed.fastq > DED110_all.trimmed.fastq  
head DED110_all.trimmed.fastq
tail DED110_all.trimmed.fastq
wc -l *.trimmed.fastq

If your solution was different, feel free to ask me if yours will work the same.

Optional bonus excercise:

Can you figure out how to trim only the first 16 bases off the reads using fastp?

Next step:

You should now have 2 new .fastq files which we will use to call variants in: DED110_SSCS.fastq, and DED110_all.trimmed.fastq. You should take these files into a more in depth breseq tutorial for comparisons of the specific mutations that are eliminated using the error correction (SSCS). Link to other tutorial.

Return to GVA2023 course page.

Bioinformatics Team (BioITeam) at the University of Texas

Molecular Index Error correction GVA2023