...

Now that we have nice sample names, count the number of sequences in each file. To un-compress the gzip'd files "on the fly" (without creating another file), we use zcat (like cat but for gzip'd files) and count the lines. Since every FASTQ record is exactly 4 lines, the read count is the line count divided by 4, e.g.:

...

Code Block
languagebash
# Count the sequence reads in each FASTQ file and save info to a file
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
  file=`basename $path`
  pfx=${file%%_R1_001.fastq.gz}
  pfx=$( echo $pfx | sed 's/00//' | perl -pe '~s/_S\d+//')
  echo "Counting reads in $file..." 1>&2
  nSeq=$( zcat $path | wc -l | awk '{print $1/4}' )
  echo -e "$nSeq\t$pfx\t$file\t$path"
done | tee fastq_stats.txt

First, take a look at the statistics file we just created:

Code Block
languagebash
cut -f 1-3 fastq_stats.txt
# produces this output:
1302707 WT-1_L2 WT-1_S5_L002_R1_001.fastq.gz
1409369 WT-1_L1 WT-1_S5_L001_R1_001.fastq.gz
2281687 WT-2_L2 WT-2_S12_L002_R1_001.fastq.gz
2496141 WT-2_L1 WT-2_S12_L001_R1_001.fastq.gz

Now use awk to total the number of sequences we received:

Code Block
languagebash
cat fastq_stats.txt | awk '
  BEGIN{FS="\t"; tot=0; ct=0}
  {tot = tot + $1
   ct = ct + 1}
  END{print "Total of",tot,"sequences in",ct,"files";
      printf("Mean reads per file: %d\n", tot/ct)}'


# produces this output:
Total of 7489904 sequences in 4 files
Mean reads per file: 1872476
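As an aside, awk maintains its own record counter, NR, so the manual ct variable above is optional. A minimal sketch using sample data copied from the listing above (the stats_demo.txt file name is just for this illustration):

```shell
# Recreate a small fastq_stats.txt-style table (values from the listing above)
printf '1302707\tWT-1_L2\n1409369\tWT-1_L1\n2281687\tWT-2_L2\n2496141\tWT-2_L1\n' > stats_demo.txt

# NR holds the number of records read, so no manual counter is needed
awk 'BEGIN{FS="\t"} {tot += $1} END{print "Total of",tot,"sequences in",NR,"files"}' stats_demo.txt
# Total of 7489904 sequences in 4 files
```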

So the submitter actually provided GSAF with only 2 samples, labeled WT-1 and WT-2, but for various reasons each sample was sequenced on more than one lane of the sequencer's flowcell. To see how many FASTQ files were produced for each sample, we strip off the lane number and count the unique results:

Code Block
languagebash
cut -f 2 fastq_stats.txt | perl -pe '~s/_L\d+//' | sort | uniq -c


# produces this output:
      2 WT-1
      2 WT-2
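Note that uniq -c only collapses adjacent duplicate lines, which is why the sort step must come first. A quick illustration with inline data:

```shell
# Without sort, interleaved duplicates would be counted separately;
# sorting groups identical names so uniq -c counts each one once
printf 'WT-1\nWT-2\nWT-1\nWT-2\n' | sort | uniq -c
```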

What if we want to know the total sequences for each sample rather than for each file? Get a list of all unique sample names, then total the reads in the fastq_stats.txt lines for that sample only:

Code Block
languagebash
for samp in `cut -f 2 fastq_stats.txt | perl -pe '~s/_L\d+//' | sort | uniq`; do
  echo "sample $samp" 1>&2
  cat fastq_stats.txt | grep -P "\t${samp}_L\d" | awk '
    BEGIN{FS="\t"; tot=0}
    {tot = tot + $1; pfx = $2}
    END{print pfx,"has",tot,"total reads"}'
done | tee sample_stats.txt


cat sample_stats.txt
# produces this output:
WT-1_L1 has 2712076 total reads
WT-2_L1 has 4777828 total reads