...
Part | Purpose | replace with/note |
---|---|---|
fastp | tell the computer you are using the fastp program | |
-i <READ1> | fastq file read1 you are trying to trim | actual name of fastq file |
-I <READ2> | fastq file read2 you are trying to trim | actual name of paired fastq file |
-o <TRIM1> | output file of trimmed fastq file of read 1 | desired name of trimmed fastq file |
-O <TRIM2> | output file of trimmed fastq file of read 2 | desired name of paired trimmed fastq file |
--threads # | use more processors, make command run faster | number of additional processors (48 max on stampede2) |
--detect_adapter_for_pe | automatically detect adapter sequence based on paired end reads, and remove them | |
-j <LOG.json> | json file with information about how the trim was accomplished. Can be helpful for looking at multiple samples, similar to multiqc analysis | name of json file you want to use |
-h <LOG.html> | html file with information similar to the json file, but with graphs | name of html file you want to use |
...
Line number | As is | To be |
---|---|---|
16 | #SBATCH -J jobName | #SBATCH -J mutli_fastp |
17 | #SBATCH -n 1 | #SBATCH -n 12 |
21 | #SBATCH -t 12:00:00 | #SBATCH -t 0:20:00 |
22 | ##SBATCH --mail-user=ADD | #SBATCH --mail-user=<YourEmailAddress> |
23 | ##SBATCH --mail-type=all | #SBATCH --mail-type=all |
29 | export LAUNCHER_JOB_FILE=commands | export LAUNCHER_JOB_FILE=fastp.commands |
Setting line 17 to -n 12 allows 12 jobs to run at the same time; since our command uses -w 4 (4 threads), this job will use all 48 threads available. The changes to lines 22 and 23 are optional but will give you an idea of what types of email you could expect from TACC if you choose to use these options. Just be sure that these 2 lines start with a single # symbol after editing them.
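The arithmetic behind that choice can be checked directly in the shell. This is only a minimal sketch: the variable names are made up for illustration, and the 48-thread figure is the node size assumed in the paragraph above.

```shell
# Illustrative variable names; adjust the numbers for your node type.
cores_per_node=48     # threads available on one node (assumed, per the text above)
threads_per_task=4    # matches the -w 4 used in each fastp command

# Number of commands that can run side by side without oversubscribing the node.
tasks=$(( cores_per_node / threads_per_task ))
echo "#SBATCH -n $tasks"
```

With these numbers the sketch prints `#SBATCH -n 12`, which is the value entered on line 17.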
...
Before making our commands file executable and running it, we want to change it slightly. Above we used -w 4 to give each command 4 processors. That worked well when we were also launching 12 processes at the same time, since together they used all 48 processors. When a commands file is executed from the command line without the help of the queue system, however, only 1 sample launches at a time, so you likely think we need to increase to 48 processors.
Code Block | ||||
---|---|---|---|---|
| ||||
#1 change the for loop:
for R1 in Raw_Reads/*_R1_001.fastq.gz; do R2=$(echo $R1| sed 's/_R1_/_R2_/'); name=$(echo $R1|sed 's/_R1_001.fastq.gz//'|sed 's/Raw_Reads\///'); echo "fastp -i $R1 -I $R2 -o Trim_Reads/$name.trim.R1.fastq.gz -O Trim_Reads/$name.trim.R2.fastq.gz -w 48 --detect_adapter_for_pe -j Trim_Logs/$name.json -h Trim_Logs/$name.html &> Trim_Logs/$name.log.txt";done > fastp.commands
#2 use sed to do an in-file replacement (something new you haven't seen before in this class)
sed -i 's/ -w 4 / -w 48 /g' fastp.commands
Note that if you use the sed command above, you need to be very careful about what you choose to match. If you just match "4" and replace it with "48", the commands file will change every file name that contains a 4, and all of those samples will fail. When using sed to do replacements, always make sure the text you match is a unique handle that appears only where you want the change.
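To see why the handle matters, the two substitutions can be compared on a single command line before touching the real file. This is a small sketch under stated assumptions: the sample name S184 is hypothetical (chosen because it contains a 4), the fastp command is shortened for readability, and sed is run on an echoed string rather than with -i, so no file is modified.

```shell
# Shortened stand-in for one line of fastp.commands; "S184" is a made-up sample name.
line='fastp -i Raw_Reads/S184_R1_001.fastq.gz -w 4 --detect_adapter_for_pe'

# Too broad: every 4 is replaced, including the one inside the file name.
broad=$(echo "$line" | sed 's/4/48/g')

# Unique handle: only the thread option, surrounding spaces included, is replaced.
careful=$(echo "$line" | sed 's/ -w 4 / -w 48 /g')

echo "$broad"
echo "$careful"
```

The first result points fastp at a file named S1848_R1_001.fastq.gz, which does not exist, while the second changes only the thread count.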
...
Once the command is started continue reading below.
Comparing different run options
Note | ||
---|---|---|
| ||
Run times in this section are based on nodes with 68 processors available rather than 48. |
In previous years a common question has been what the fastest way of getting a large set of samples analyzed is with respect to threads, nodes, and tasks. Here we have an opportunity to test just that, and the results are somewhat surprising. Since we have been working with idev sessions all along, we'll start with the following:
...
Based on what we have already seen, it is probably not surprising that using 68 (really 16) threads and only evaluating 1 sample at a time took approximately the same amount of time as it did when running on an idev node, as those conditions are functionally equivalent. What may be surprising is the lack of improvement despite running 4x more samples at the same time. Potential hypotheses:
...
No Format |
---|
Duplication rate: 6.5207%
Insert size peak (evaluated by paired-end reads): 80
JSON report: Trim_Logs/E2-1_S189_L001.json
HTML report: Trim_Logs/E2-1_S189_L001.html
fastp -i Raw_Reads/E2-1_S189_L001_R1_001.fastq.gz -I Raw_Reads/E2-1_S189_L001_R2_001.fastq.gz -o Trim_Reads/E2-1_S189_L001.trim.R1.fastq.gz -O Trim_Reads/E2-1_S189_L001.trim.R2.fastq.gz -w 68 --detect_adapter_for_pe -j Trim_Logs/E2-1_S189_L001.json -h Trim_Logs/E2-1_S189_L001.html
fastp v0.23.4, time used: 3 seconds
Typically what you should look for is some kind of anchor that you can pass to grep that is as far down the file as possible. Sometimes you will be lucky and the program will actually print something like "successfully complete". In our case the last line looks promising: "fastp v0.23.4, time used:" seems likely to be printed as the last step in the program.
Code Block | ||||
---|---|---|---|---|
| ||||
ls Trim_Logs/*.log.txt|wc -l
wc -l fastp.commands
tail -n 1 Trim_Logs/*.log.txt|grep -c "^fastp v0.23.4, time used:"
The above 3 commands are all expected to return 272.
If they do not, remember I'm on Zoom if you need help looking at what's going on.
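If the third count comes up short, one way to find the affected samples is to ask grep for the log files that do not contain the anchor. This is a sketch rather than part of the original exercise: the function name is made up, it relies on grep's -L option (list files with no matching line), and it matches only on "time used:" so it does not depend on the fastp version number.

```shell
# List log files that never printed fastp's final "time used:" line.
# Takes the log directory as its argument (Trim_Logs in this tutorial).
unfinished_logs() {
    grep -L "time used:" "$1"/*.log.txt
}

# Example call, run only if the directory exists in the current location.
if [ -d Trim_Logs ]; then
    unfinished_logs Trim_Logs
fi
```

Each file name printed corresponds to a sample whose fastp command did not reach its last step, and the end of that log is usually the best place to start looking for the reason.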
...