Page Comparison

...

Part	Purpose	replace with/note
fastp	tell the computer you are using the fastp prgram
-i <READ1>	fastq file read1 you are trying to trim	actual name of fastq file
-I <READ2>	fastq file read2 you are trying to trim	actual name of paired fastq file
-o <TRIM1>	output file of trimmed fastq file of read 1	desired name of trimmed fastq file
-O <TRIM2>	output file of trimmed fastq file of read 2	desired name of paired trimmed fastq file
--threads #	use more processors, make command run faster	number of additional processors (68 48 max on stampede2)
--detect_adapter_for_pe	automatically detect adapter sequence based on paired end reads, and remove them
-j <LOG.json>	json file with information about how the trim was accomplished. can be helpful for looking at multiple samples similar to multiqc analysis	name of json file you want to use
-h <LOG.html>	html file with infomration similar to the json file, but with graphs	name of html file you want to use

...

Expand

title	What are the 4 new files that were generated, and where did they come from?

E1-7_S187_L001_R2_001.trim.fastq.gz and E1-7_S187_L001_R1_001.trim.fastq.gz
- These were created with the -o and -O options, they are in the Trim_Reads folder, and you likely found them using the ls command
fastp.html and fastp.json
- These are log files created by default since we didn't specify their names. This is part of why -j and -h were discussed above with the general command.
- While the json file can be evaluated in the terminal (cat less more head tail), the html file has to be transferred back to your computer to view.

Expand

title	How many reads were left after trimming?

5884 paired end reads

11768 total reads

You likely found this out from using the zgrep command, or from the following blocks that printed as the command ran:

No Format

Read1 after filtering:
total reads: 5884
total bases: 791763
Q20 bases: 782948(98.8867%)
Q30 bases: 765510(96.6842%)

Read2 after filtering:
total reads: 5884
total bases: 791763
Q20 bases: 711414(89.8519%)
Q30 bases: 658164(83.1264%)

Filtering result:
reads passed filter: 11768
reads failed due to low quality: 2014
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 3970
bases trimmed due to adapters: 193972

Expand

title	How big was our fragment size? How is this estimated, what might make it inaccurate?

From the information generated while the command ran we see:
- No Format
  Insert size peak (evaluated by paired-end reads): 171
- This tells us that the average peak size was 171 bases, and that it was estimated by looking at the overlap between the read pairs. It is potentially inaccurate as reads which do not overlap each other can not estimate the size.
If you transferred the .html file back to your laptop, you would see this relevant histogram:
- The general section of the summary at the top of the html tells us that the average insert size was 171, while the histogram tells us that 50% of our data is <18 or >272 bases

Expand

title	Did our sample have any adapter present?

If you only look at the information that printed to the screen, you probably answer "No"
- you likely see the following block and think this is the end of the answer:
- No Format
  Detecting adapter sequence for read1... No adapter detected for read1 Detecting adapter sequence for read2... No adapter detected for read2

A more fuller answer might be "maybe" or "probably" or "I'm not sure" as:

1. Not finding any adapter would be super rare
2. If 45% of our reads have an insert size of 171 bases, and we did 151bp PE sequencing, we should be able to find adapter sequences
3. in the filtering results we see:

No Format

Filtering result:
reads passed filter: 11768
reads failed due to low quality: 2014
reads failed due to too many N: 0
reads failed due to too short: 0
reads with adapter trimmed: 3970
bases trimmed due to adapters: 193972

If you look at the html file you probably answered "yes"
- There is a section for Read1 and Read2 adapters which show a growing stretch of DNA which recreates the illumina adapter sequences.

Expand

title	Why was answering if there were adapters present not straight forward?

Like we saw in our fastqc reports (over represented sequences having "no hit" and adapter content staying at bottom of graph), for something to be classified as an "adapter" in the first section of the printed information, it has to meet certain criteria that in this (and many other instances) is perhaps a bit too stringent.

Expand

title	What other interesting things can you find from this command?

This is pretty open ended, take a look at the html file in particular, see what of it does or doesn't make sense and consider asking a question if you would like to know moresense and consider asking a question if you would like to know more.

Info

Of several things that you may stand out to you is large fraction of reads end with stretches of "G" on the end. There are 2 things to note with this: 1. 2 color sequencing on illumina (detailed information here) reads "no color" as "G", 2. This library is very fragmented and contains adapter dimers meaning that in some cases there are only ~40bp downstream of the sequencing primer location leaving 60 cycles that have no template available. If you look at the help for fastp the following options may stand out to you as a way to deal with this:

-g, --trim_poly_g force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
--poly_g_min_len the minimum length to detect polyG in the read tail. 10 by default. (int [=10])
-G, --disable_trim_poly_g disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data
-x, --trim_poly_x enable polyX trimming in 3' ends.
--poly_x_min_len the minimum length to detect polyX in the read tail. 10 by default. (int [=10])

Consider rerunning the fastp command while adding "-g" to the command line and see how the results differ.

Trim all the samples from the multiqc tutorial

...

Line number	As is	To be
16	#SBATCH -J jobName	#SBATCH -J mutli_fastp
17	#SBATCH -n 1	#SBATCH -n 1712
21	#SBATCH -t 12:00:00	#SBATCH -t 0:20:00
22	##SBATCH --mail-user=ADD	#SBATCH --mail-user=<YourEmailAddress>
23	##SBATCH --mail-type=all	#SBATCH --mail-type=all
29	export LAUNCHER_JOB_FILE=commands	export LAUNCHER_JOB_FILE=fastp.commands

Line 17 being set to -n 17 allows 17 jobs to run at the same time, since our command uses -w 4 (4 threads) this job will use all 68 48 threads available. The changes to lines 22 and 23 are optional but will give you an idea of what types of email you could expect from TACC if you choose to use these options. Just be sure to pay attention to these 2 lines starting with a single # symbol after editing them.

...

Before we jump to making our commands file executable and executing it, we want to change it to be slightly different. Specifically, above we used -w 4 to specify we wanted to use 4 processor for every command. While this worked great when we also were launching 17 12 processes at the same time as it used all 68 48 processes, when executing a commands file from the command line without the help of the queue system, only 1 sample at a time will launch so you likely think we need to increase to 68 48 processors.

Code Block

language	bash
title	2 different ways to increase to 68 threads

#1 change the for loop:
for R1 in Raw_Reads/*_R1_001.fastq.gz; do R2=$(echo $R1| sed 's/_R1_/_R2_/'); name=$(echo $R1|sed 's/_R1_001.fastq.gz//'|sed 's/Raw_Reads\///'); echo "fastp -i $R1 -I $R2 -o Trim_Reads/$name.trim.R1.fastq.gz -O Trim_Reads/$name.trim.R2.fastq.gz -w 6848 --detect_adapter_for_pe -j Trim_Logs/$name.json -h Trim_Logs/$name.html &> Trim_Logs/$name.log.txt";done > fastp.commands

#2 use sed to do an in file replacement (something new you haven't seen before in this class)
sed -i 's/ -w 4 / -w 6848 /g' fastp.commands

Note that if you use the sed command above, you need to be very careful in what you choose to match to. If you just choose "4" and replace with "6848" then the commands file will then change any file name that has a 4 into 68 48 and all those samples will fail. When using sed to do replacements, always make sure you have a unique handle, when you don't, and when you don't need one.

...

Once the command is started continue reading below.

Comparing different run optionsoptions

Note

title	Run times in this section

Run times in this section are based on nodes with 68 processors available rather than 48.

In previous years it has been common to question what the fastest way of getting a large set of samples analyzed is with respect to threads and Nodes and tasks. Here we hav an opportunity to do just that, and have some surprising results. Since we have been working with idev sessions all along we'll start with the following:

...

Based on what we have already seen, it is probably not surprising that using 68 (really 16) threads and only evaluating 1 sample at a time took approximately the same amount of time as it did when running on an idev node as those conditions are functionally equivalent. Whaat What may be surprising is the lack of improvement despite running 4x more samples at the same time. Potential hypotheses:

...

No Format

Duplication rate: 6.5207%

Insert size peak (evaluated by paired-end reads): 80

JSON report: Trim_Logs/E2-1_S189_L001.json
HTML report: rim_Logs/E2-1_S189_L001.html

fastp -i Raw_Reads/E2-1_S189_L001_R1_001.fastq.gz -I Raw_Reads/E2-1_S189_L001_R2_001.fastq.gz -o Trim_Reads/E2-1_S189_L001.trim.R1.fastq.gz -O Trim_Reads/E2-1_S189_L001.trim.R2.fastq.gz -w 68 --detect_adapter_for_pe -j Trim_Logs/E2-1_S189_L001.json -h Trim_Logs/E2-1_S189_L001.html 
fastp v0.23.24, time used: 3 seconds

Typically what you should look for is some kind of anchor that you can pass to grep that is as far down the file as possible. Sometimes you will be lucky and the program will actually print something like "successfully complete". In our case the last line looks promising, "fastp v0.23.24, time used:" seems likely to be printed as the last step in the program.

Code Block

language	bash
title	submit the job to run on the que

ls Trim_Logs/*.log.txt|wc -l
wc -l fastp.commands
tail -n 1 Trim_Logs/*.log.txt|grep -c "^fastp v0.23.24, time used:"

The above 3 commands are all expected to return 272. If so remember I'm on zoom if you need help looking at whats going on.

...

Versions Compared

Old Version 5

New Version Current

Key

Trim all the samples from the multiqc tutorial

Comparing different run optionsoptions