...

  • -n tells perl to feed the input to the script one line at a time
  • -e introduces the perl script
    • always enclose a command-line perl script in single quotes to protect it from shell evaluation
  • $_ is a built-in variable holding the current line
  • =~ is the perl pattern matching operator (=~ says the pattern must match; !~ says the pattern must not match)
  • the forward slashes ("/  /") enclose the regex pattern.

sed pattern substitution

The sed command can be used to edit text using pattern substitution. While it is very powerful, the regex syntax for some of its more advanced features is quite different from "standard" grep or perl regular expressions. As a result, I tend to use it only for very simple substitutions, usually as a component of a multi-pipe expression.
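
To give a feel for the kind of "very simple substitution" meant here, a short sketch follows. The file name and strings are hypothetical, chosen only to show the s/pattern/replacement/ form:

```bash
# Hypothetical file name, invented for illustration
name="WT-1_L2.fastq.gz"

# s/pattern/replacement/ replaces the first match on each line
renamed=$(echo "$name" | sed 's/fastq/fq/')
echo "$renamed"    # WT-1_L2.fq.gz

# a trailing g replaces every match on the line
dashes=$(echo "a_b_c" | sed 's/_/-/g')
echo "$dashes"     # a-b-c
```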

...

If sed pattern substitution is not working as I expect (which happens frequently!), I again turn to perl. Here's how to invoke regex pattern substitution from a command line:

...

Code Block
languagebash
cat fastq_stats.txt | awk '
  BEGIN{FS="\t"; tot=0; ct=0}
  {tot = tot + $1
   ct = ct + 1}
  END{print "Total of",tot,"sequences in",ct,"files";
      printf("Mean reads per file: %d\n", tot/ct)}'

# produces this output:
Total of 7489904 sequences in 4 files
Mean reads per file: 1872476

Note that the submitter actually provided GSAF with only 2 samples, labeled WT-1 and WT-2, but for various reasons each sample was sequenced on more than one lane of the sequencer's flowcell. To see how many FASTQ files were produced for each sample, we strip off the lane number and count the unique results:

...

Code Block
languagebash
for samp in `cut -f 2 fastq_stats.txt | perl -pe 's/_L\d+//' | sort | uniq`; do
  echo "sample $samp" 1>&2
  # keep only this sample's lanes, then sum the read counts in column 1
  cat fastq_stats.txt | grep -P "\t${samp}_L\d" | awk -v sample="$samp" '
    BEGIN{FS="\t"; tot=0}{tot = tot + $1}
    END{print sample,"has",tot,"total reads"}'
done | tee sample_stats.txt

cat sample_stats.txt
# produces this output:
WT-1 has 2712076 total reads
WT-2 has 4777828 total reads