...
Code Block | ||
---|---|---|
| ||
# Shorten the sample prefix some more...
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
  file=`basename $path`
  pfx=${file%%_R1_001.fastq.gz}
  pfx=$( echo $pfx | perl -pe '~s/_S\d+//' | perl -pe '~s/L00/L/')
  echo "$pfx - $file"
done |
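To see what each step does, here is a minimal trace on a single file name. The file name is hypothetical, and `sed -E` stands in for the perl one-liners. The sketch assumes the intent is to drop the `_S<number>` field and keep a shortened lane tag.

```shell
# Hypothetical file name, used only to trace the steps
path=/some/dir/WT-1_S5_L002_R1_001.fastq.gz

file=$(basename "$path")          # WT-1_S5_L002_R1_001.fastq.gz
pfx=${file%%_R1_001.fastq.gz}     # WT-1_S5_L002

# sed -E is used here in place of perl; the substitutions are the same idea
pfx=$(echo "$pfx" | sed -E 's/_S[0-9]+//' | sed 's/L00/L/')

echo "$pfx - $file"               # WT-1_L2 - WT-1_S5_L002_R1_001.fastq.gz
```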
Now that we have nice sample names, count the number of sequences in each file. To un-compress the gzip'd files "on the fly" (without creating another file), we use zcat (like cat but for gzip'd files) and count the lines, e.g.:
...
language | bash |
---|
...
zcat <path> | wc -l
But FASTQ files have 4 lines for every sequence read. So to count the sequences properly we need to divide this number by 4.
Code Block | ||
---|---|---|
| ||
# Clunky way to do arithmetic in bash -- but bash only does integer arithmetic!
echo $(( `zcat $path | wc -l` / 4 ))
# Better way using awk
zcat $path | wc -l | awk '{print $1/4}' |
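The two approaches can be tried on a tiny test file. This is only a sketch: the two FASTQ records are made up, and the file is read through `zcat <` so the temporary file does not need a `.gz` suffix.

```shell
# Build a tiny gzip'd FASTQ file with 2 reads (made-up data, 8 lines total)
tmp=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' | gzip -c > "$tmp"

# Count reads: number of lines divided by 4
nreads=$(zcat < "$tmp" | wc -l | awk '{print $1/4}')
echo "reads: $nreads"                        # reads: 2

# Integer vs. floating point division on a count that is not a multiple of 4
int_result=$(( 10 / 4 ))
awk_result=$(echo 10 | awk '{print $1/4}')
echo "bash: $int_result, awk: $awk_result"   # bash: 2, awk: 2.5

rm -f "$tmp"
```

For a well-formed FASTQ file the line count is always a multiple of 4, so both methods give the same answer. A fractional result from awk is a useful hint that the file is truncated or malformed.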
...
Code Block | ||
---|---|---|
| ||
cut -f 2 fastq_stats.txt | perl -pe '~s/_L\d+//' | sort | uniq -c
# produces this output:
2 WT-1
2 WT-2 |
What if we want to know the total sequences for each sample rather than for each file? Get a list of all unique sample names, then, for each one, total the reads in the fastq_stats.txt file for that sample only:
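One possible sketch of that idea follows. It runs on a stand-in file, not the real fastq_stats.txt. Only "column 2 holds the sample name with a lane suffix" comes from the commands above. The file name placeholders, the read counts, and the choice of column 3 for the count are all assumptions made for illustration.

```shell
# Hypothetical stand-in for fastq_stats.txt: tab-separated, made-up values.
# Assumed layout: file name, sample name with lane suffix, read count.
stats=$(mktemp)
printf '%s\t%s\t%s\n' \
  fileA WT-1_L1 100 \
  fileB WT-1_L2 150 \
  fileC WT-2_L1 200 \
  fileD WT-2_L2 250 > "$stats"

totals=$(
  # Unique sample names: take column 2, strip the lane suffix, de-duplicate
  for samp in $(cut -f 2 "$stats" | sed -E 's/_L[0-9]+$//' | sort -u); do
    # Sum the assumed count column over rows belonging to this sample only
    awk -F'\t' -v s="$samp" '
      { name = $2; sub(/_L[0-9]+$/, "", name) }
      name == s { tot += $3 }
      END { print s "\t" tot }
    ' "$stats"
  done
)
echo "$totals"

rm -f "$stats"
```

With the made-up counts above this prints one line per sample, WT-1 with 250 and WT-2 with 450. Comparing names exactly, rather than with a pattern match, avoids surprises if a sample name contains regular expression characters.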
...