...
Now that we have nice sample names, count the number of sequences in each file. To un-compress the gzip'd files "on the fly" (without creating another file), we use zcat (like cat but for gzip'd files) and count the lines. Since each FASTQ record occupies exactly 4 lines, the number of reads is the line count divided by 4:
...
Code Block |
---|
|
# Count the sequence reads in each FASTQ file and save info to a file
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
file=`basename $path`
pfx=${file%%_R1_001.fastq.gz}
pfx=$( echo $pfx | sed 's/00//' | perl -pe '~s/_S\d+//')
echo "Counting reads in $file..." 1>&2
nSeq=$( zcat $path | wc -l | awk '{print $1/4}' )
echo -e "$nSeq\t$pfx\t$file\t$path"
done | tee fastq_stats.txt |
Take a look at the statistics we collected:
Code Block |
---|
|
cut -f 1-3 fastq_stats.txt
# produces this output:
1302707 WT-1_L2 WT-1_S5_L002_R1_001.fastq.gz
1409369 WT-1_L1 WT-1_S5_L001_R1_001.fastq.gz
2281687 WT-2_L2 WT-2_S12_L002_R1_001.fastq.gz
2496141 WT-2_L1 WT-2_S12_L001_R1_001.fastq.gz |
Now use awk to total the number of sequences we received:
Code Block |
---|
|
cat fastq_stats.txt | awk '
BEGIN{FS="\t"; tot=0; ct=0}
{tot = tot + $1
ct = ct + 1}
END{print "Total of",tot,"sequences in",ct,"files";
printf("Mean reads per file: %d\n", tot/ct)}'
# produces this output:
Total of 7489904 sequences in 4 files
Mean reads per file: 1872476 |
So the submitter actually provided GSAF with only 2 samples, labeled WT-1 and WT-2, but for various reasons each sample was sequenced on more than one lane of the sequencer's flowcell. To see how many FASTQ files were produced for each sample, we strip off the lane number and count the unique results:
Code Block |
---|
|
cut -f 2 fastq_stats.txt | perl -pe '~s/_L\d+//' | sort | uniq -c
# produces this output:
2 WT-1
2 WT-2 |
What if we want to know the total sequences for each sample rather than for each file? Get a list of all unique sample names, then total the reads recorded in fastq_stats.txt for that sample only:
Code Block |
---|
|
for samp in `cut -f 2 fastq_stats.txt | perl -pe '~s/_L\d+//' | sort | uniq`; do
echo "sample $samp" 1>&2
  cat fastq_stats.txt | grep -P "\t${samp}_L\d" | awk '
    BEGIN{FS="\t"; tot=0}
    {tot = tot + $1; pfx = $2}
    END{print pfx,"has",tot,"total reads"}'
done | tee sample_stats.txt
cat sample_stats.txt
# produces this output:
WT-1_L1 has 2712076 total reads
WT-2_L1 has 4777828 total reads |
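The loop above calls grep and awk once per sample. As an alternative sketch, the same per-sample totals can be computed in a single awk pass by accumulating counts in an associative array keyed on the lane-stripped sample name (this assumes fastq_stats.txt exists as created earlier):

```shell
# One-pass alternative: accumulate read counts per sample in an awk array
awk 'BEGIN{FS="\t"}
  { samp = $2                        # e.g. WT-1_L2
    sub(/_L[0-9]+$/, "", samp)       # strip the lane suffix -> WT-1
    tot[samp] += $1 }                # add this file reads to the sample total
  END{ for (s in tot) print s, "has", tot[s], "total reads" }
' fastq_stats.txt
```

With the fastq_stats.txt contents shown earlier this reports 2712076 total reads for WT-1 and 4777828 for WT-2. Note that awk's array iteration order is unspecified, so pipe the output through sort if a fixed order matters.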