...
- -n tells perl to feed the input to the script one line at a time
- -e introduces the perl script
- always enclose a command-line perl script in single quotes to protect it from shell evaluation
- $_ is a built-in variable holding the current line
- =~ is the perl pattern matching operator (=~ says the pattern must match; !~ says the pattern must not match)
- the forward slashes ("/ /") enclose the regex pattern.
sed pattern substitution
The sed command can be used to edit text using pattern substitution. While it is very powerful, the regex syntax for some of its more advanced features is quite different from "standard" grep or perl regular expressions. As a result, I tend to use it only for very simple substitutions, usually as a component of a multi-pipe expression.
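
As a minimal sketch of the kind of simple substitution meant here, the following uses made-up input strings; the variable names (line, first, all) are assumptions for illustration. Note that the pattern uses [0-9] rather than \d, since sed's basic regex syntax does not treat \d as a digit class.

```shell
# Hypothetical input, for illustration only.
line='WT-1_L1.fastq.gz'

# s/pattern/replacement/ replaces the first match on each line.
first=$(echo "$line" | sed 's/_L[0-9]*//')

# A trailing g replaces every match on the line, not just the first.
all=$(echo 'a-b-c' | sed 's/-/_/g')

echo "$first"
echo "$all"
```

Because sed reads standard input and writes standard output, either command can sit in the middle of a longer pipe.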
...
If sed pattern substitution is not working as I expect (which happens frequently!), I again turn to perl. Here's how to invoke regex pattern substitution from a command line:
...

    cat fastq_stats.txt | awk '
      BEGIN{FS="\t"; tot=0; ct=0}
      {tot = tot + $1
       ct = ct + 1}
      END{print "Total of",tot,"sequences in",ct,"files";
          printf("Mean reads per file: %d\n", tot/ct)}'

    # produces this output:
    # Total of 7489904 sequences in 4 files
    # Mean reads per file: 1872476

Note that the submitter actually provided GSAF with only 2 samples, labeled WT-1 and WT-2, but for various reasons each sample was sequenced on more than one lane of the sequencer's flowcell. To see how many FASTQ files were produced for each sample, we strip off the lane number and count the unique results:
...

    for samp in $(cut -f 2 fastq_stats.txt | perl -pe 's/_L\d+//' | sort | uniq); do
      echo "sample $samp" 1>&2
      cat fastq_stats.txt | grep -P "\t${samp}_L\d" | awk -v sample="$samp" '
        BEGIN{FS="\t"; tot=0}
        {tot = tot + $1}
        END{print sample,"has",tot,"total reads"}'
    done | tee sample_stats.txt

    cat sample_stats.txt
    # produces this output:
    # WT-1 has 2712076 total reads
    # WT-2 has 4777828 total reads

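
The same per-sample totals could also be computed in a single awk pass using associative arrays instead of a shell loop. This is only a sketch of an alternative, not the approach used above: the temporary file stands in for fastq_stats.txt, its read counts are made up, and the two-column layout (count, then a name such as WT-1_L1, tab-separated) is an assumption.

```shell
# Made-up stand-in for fastq_stats.txt (count <TAB> name); values are illustrative.
stats=$(mktemp)
printf '100\tWT-1_L1\n200\tWT-1_L2\n300\tWT-2_L1\n400\tWT-2_L2\n' > "$stats"

# One awk pass: strip the lane suffix from column 2, then accumulate
# read totals and file counts per sample in associative arrays.
per_sample=$(awk 'BEGIN { FS = "\t" }
  { samp = $2; sub(/_L[0-9]+/, "", samp); tot[samp] += $1; n[samp]++ }
  END { for (s in tot) print s, "has", tot[s], "total reads in", n[s], "files" }' "$stats" | sort)

echo "$per_sample"
rm -f "$stats"
```

The final sort is there because awk does not guarantee any particular order when looping over array keys with for (s in tot).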