...

The 3 columns are the read position, the number of bases changed, and the number of bases not changed. If you copy and paste these 3 columns into Excel, you can easily calculate the sum of the 2nd column to see that 446,104 bases were changed. The read position is based on the 5'-3' sequence, and you should notice that, in general, the higher the read position, the more errors were corrected. This should make sense given what we have discussed about quality scores decreasing as read length increases.
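If you would rather stay on the command line than paste into Excel, the column sums can be sketched with awk. The file name position_counts.txt and the three rows written into it are made up for illustration; substitute the file that actually holds your 3 columns.

```shell
# Hypothetical file name; replace with the file holding your 3-column output.
counts_file="position_counts.txt"

# Tiny made-up example (not real data) so this sketch runs on its own:
# read position, bases changed, bases not changed.
printf '1\t10\t990\n2\t15\t985\n3\t25\t975\n' > "$counts_file"

# Sum column 2 (bases changed) and column 3 (bases not changed).
total_changed=$(awk '{ changed += $2 } END { print changed }' "$counts_file")
total_unchanged=$(awk '{ same += $3 } END { print same }' "$counts_file")

echo "bases changed: $total_changed"
echo "bases not changed: $total_unchanged"
```

Run against the real 3-column output, the first total should match the number you get from Excel.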

Tutorial (Trimmed Reads with fastx_toolkit):

From our earlier tutorial on read quality control, you likely remember that you can load the fastx_toolkit as a module. If you need a hint to do this, pause and think for a minute and try some things. If you still can't get it, raise your hand and talk to us: this is something you should be able to do on your own by now, so we need to explain things differently.

Use what you know about fastx and help functions to try to determine how you want to trim the first 16 bases off the R1 and R2 reads. Hint:
$ fastx<tabx2> # will display the following:

...

Rework for fastx_trimmer ... fastx_trimmer -f 17


fastx_artifacts_filter                       fastx_nucleotide_distribution_graph.sh       fastx_reverse_complement
fastx_barcode_splitter.pl                    fastx_nucleotide_distribution_line_graph.sh  fastx_trimmer
fastx_clipper                                fastx_quality_stats                          fastx_uncollapser
fastx_collapser                              fastx_renamer  
 
# interrogate the commands you think might have the answer to what you are trying to do using the '-h' option to determine how to trim the first 16 bases off DED110_CATGGC_L006_R1_001.fastq and DED110_CATGGC_L006_R2_001.fastq
2 example commands that will work:
fastx_trimmer -f 17 -i DED110_CATGGC_L006_R1_001.fastq -o DED110.R1.trimmed.fastq
fastx_trimmer -f 17 -i DED110_CATGGC_L006_R2_001.fastq -o DED110.R2.trimmed.fastq
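To see what keeping bases from position 17 onward does to a read, here is a minimal sketch that imitates the effect on a made-up, single-read FASTQ file using awk. This is only an illustration of the result, not how fastx_trimmer is implemented, and it assumes every record is exactly 4 lines long. The file names toy.fastq and toy.trimmed.fastq are hypothetical.

```shell
# Made-up 20-base read: 16 bases to remove, then 4 bases to keep.
printf '%s\n' '@read1' 'ACGTACGTACGTACGTGGCC' '+' 'IIIIHHHHGGGGFFFFABCD' > toy.fastq

# In a 4-line FASTQ record, lines 2 and 4 hold the sequence and the qualities.
# Keep character 17 onward on those (even-numbered) lines; print other lines unchanged.
awk 'NR % 2 == 0 { print substr($0, 17); next } { print }' toy.fastq > toy.trimmed.fastq

cat toy.trimmed.fastq
```

The sequence and quality lines are shortened by the same amount, which is what keeps the record valid after trimming.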

Each of these commands will take 1-2 minutes to complete. Think about ways you could have run both commands at the same time. In the tutorials (including some of the optional ones) there are at least 3 ways that we have shown you to do it, and many more we haven't. How many can you come up with?

Possible answers:
# 1. use a semicolon to separate the two commands so that the second will start as soon as the first finishes:
fastx_trimmer -f 17 -i DED110_CATGGC_L006_R1_001.fastq -o DED110.R1.trimmed.fastq; fastx_trimmer -f 17 -i DED110_CATGGC_L006_R2_001.fastq -o DED110.R2.trimmed.fastq
 
# 2. use a double && between the commands so the second will start as soon as the first finishes, if it finishes without any errors:
fastx_trimmer -f 17 -i DED110_CATGGC_L006_R1_001.fastq -o DED110.R1.trimmed.fastq && fastx_trimmer -f 17 -i DED110_CATGGC_L006_R2_001.fastq -o DED110.R2.trimmed.fastq
 
# 3. use a trailing & to have the commands run in the background:
fastx_trimmer -f 17 -i DED110_CATGGC_L006_R1_001.fastq -o DED110.R1.trimmed.fastq &
fastx_trimmer -f 17 -i DED110_CATGGC_L006_R2_001.fastq -o DED110.R2.trimmed.fastq &

Of the 3 answers shown above, one will finish much sooner than the other two. Do you know which one and why?

Answer and small discussion:

The 3rd solution will finish before the other two because both commands are executed at the same time, rather than the second waiting for the first to finish. In many circumstances this is among the best ways to do something like this, and 'simple' read trimming with the fastx toolkit is one of them. If you are doing something much more computationally intense (say read mapping, variant calling, or genome assembly), trying to complete the tasks at the same time will often leave you with no results at all, as you run out of memory even on the compute nodes and the programs error out.
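The timing difference can be sketched without any sequencing data. In the example below, sleep 2 is a stand-in for one slow command such as a single fastx_trimmer run; the wait builtin is an addition not used in the answers above, and it simply pauses until all background jobs have finished.

```shell
# 'sleep 2' stands in for one slow command (for example, one trimming run).
start=$(date +%s)

sleep 2 &    # first job starts in the background
sleep 2 &    # second job starts immediately, alongside the first
wait         # block here until both background jobs have finished

end=$(date +%s)
elapsed=$((end - start))
echo "both jobs finished after about ${elapsed} seconds"
```

Run one after the other, the two stand-in jobs would need about 4 seconds; in the background they finish together in roughly 2. The wait line also matters in practice, because it stops you from using output files before the background jobs have finished writing them.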

Checking the current contents of the directory will show that we have now made 2 new .trimmed.fastq files in addition to the trio of .fastq files we made in the error correction part of the tutorial. DED110_SSCS.fastq is the file of most interest to us for the follow-up tutorial, and both .trimmed.fastq files will also be of interest. Rather than working with 3 files for 2 samples (error corrected and trimmed), use what you have learned about piping and redirection to generate a single file called DED110_all.trimmed.fastq and check your work.

Need a hint?

command   function
cat       prints contents of a file to the screen
>         writes whatever happened on the left side to the right side
>>        appends whatever happened on the left side to the right side
wc -l     how many lines are in whatever is specified next
head      view the top lines of a file
tail      view the bottom lines of a file

Possible solution:
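cat DED110.R1.trimmed.fastq DED110.R2.trimmed.fastq > DED110_all.trimmed.fastq
# Naming both inputs avoids the *.trimmed.fastq wildcard, which would also match DED110_all.trimmed.fastq if the command were run a second time.
# The above could also be done as 2 sequential steps, naming each file separately and using >> on the second line.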
head DED110_all.trimmed.fastq
tail DED110_all.trimmed.fastq
wc -l *.trimmed.fastq
 
# These 4 commands should give you all the information you need to make sure you have a single file with all the information from the first 2. Ask if you aren't sure you have the right solution.
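One way to make the check explicit is to compare line counts: the combined file should have exactly as many lines as the two inputs added together, and a FASTQ file's read count is its line count divided by 4. The sketch below uses small made-up files (the toy.* names are hypothetical) so it can be run anywhere; swap in the DED110 file names to check your real output.

```shell
# Made-up stand-in files: 1 read in the first, 2 reads in the second.
printf '%s\n' '@r1' 'ACGT' '+' 'IIII' > toy.R1.trimmed.fastq
printf '%s\n' '@r2' 'TTGA' '+' 'FFFF' '@r3' 'GGCA' '+' 'FFFF' > toy.R2.trimmed.fastq

cat toy.R1.trimmed.fastq toy.R2.trimmed.fastq > toy_all.trimmed.fastq

# Reading from stdin makes wc print only the number, with no file name.
# The $(( )) wrapper strips any padding some wc versions add.
r1_lines=$(( $(wc -l < toy.R1.trimmed.fastq) ))
r2_lines=$(( $(wc -l < toy.R2.trimmed.fastq) ))
all_lines=$(( $(wc -l < toy_all.trimmed.fastq) ))

if [ "$all_lines" -eq $((r1_lines + r2_lines)) ]; then
    echo "line counts match: $all_lines lines, $((all_lines / 4)) reads"
else
    echo "line counts differ: expected $((r1_lines + r2_lines)), found $all_lines"
fi
```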

Next step:

You should now have 2 new .fastq files in which we will call variants: DED110_SSCS.fastq and DED110_all.trimmed.fastq. You should take these files into a more in-depth breseq tutorial to compare the specific mutations that are eliminated using the error correction (SSCS). Link to other tutorial.

Optional, not recommended tutorial (trimming reads with flexbar):

For another discussion about version control and when it is necessary to update to new tools and versions of programs, take a look at the trimmed reads tutorial from last year, which used flexbar simply because 'it worked before so keep using it'. Compare the simple fastx_trimmer commands used in this tutorial to all the work that went into flexbar last year. So while "well enough can be left alone", sometimes it is still better to use new tools. As the heading suggests, we don't actually suggest that you USE flexbar to trim this data set or any other; it is just worth looking at to see how different programs operate or are invoked to achieve the same goals.

Next step:

You should now have 3 new .fastq files in which we will call variants: DED110_SSCS.fastq, trimmed_1.fastq, and trimmed_2.fastq. You could take these files into a more in-depth breseq tutorial that was prepared last year to compare the specific mutations that are eliminated using the error correction (SSCS). Link to other tutorial.

Return to GVA2017