When dealing with large data files, sometimes scattered across many directories, it is often convenient to create symbolic links (symlinks) to those files in the directory where you plan to work with them. A common way to make a symbolic link is with ln -s, e.g.:
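
A minimal sketch of the basic form, run in a scratch directory so it can be tried anywhere. The directory and file names (data, work, reads.fastq) are invented for illustration.

```shell
# Hypothetical layout: a data file in one directory, a working directory elsewhere.
demo=$(mktemp -d)
mkdir -p "$demo/data" "$demo/work"
echo "some data" > "$demo/data/reads.fastq"

# ln -s <target> <link name>: create a link in work/ that points at the data file.
ln -s "$demo/data/reads.fastq" "$demo/work/reads.fastq"

# ls -l shows the link and the path it points to.
ls -l "$demo/work"
```

Reading the link (for example with cat) reads the original file; removing the link leaves the original untouched.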


Multiple files can be linked by providing multiple file name arguments along with the -t (target) option, which specifies the directory where links to all the files will be created.
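
As a sketch of the multi-file form (this assumes GNU ln, which provides -t; the file and directory names are invented):

```shell
# Hypothetical data files scattered in two directories.
demo2=$(mktemp -d)
mkdir -p "$demo2/runA" "$demo2/runB" "$demo2/work"
echo "a" > "$demo2/runA/sampleA.fastq"
echo "b" > "$demo2/runB/sampleB.fastq"

# -t names the directory that receives the links; every remaining
# argument is a file to link, and each link keeps its file's base name.
ln -s -t "$demo2/work" "$demo2/runA/sampleA.fastq" "$demo2/runB/sampleB.fastq"

ls -l "$demo2/work"
```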


What about the case where the files you want are scattered in sub-directories? Consider a typical GSAF project directory structure, where FASTQ files are nested in subdirectories:


for <arg_name> in <list of whitespace-separated words>; do
  <commands>
done


Here's a simple example using the seq command, which returns a list of numbers.

Code Block
for num in `seq 5`; do
  echo $num
done
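
Combining a for loop with find is one way to link files that are nested in subdirectories. This is only a sketch: the project layout below is invented, and it assumes file names without spaces, since the loop splits the output of find on whitespace.

```shell
# Invented project layout with FASTQ files nested one level down.
proj=$(mktemp -d)
mkdir -p "$proj/Sample_1" "$proj/Sample_2" "$proj/links"
touch "$proj/Sample_1/s1_R1.fastq.gz" "$proj/Sample_2/s2_R1.fastq.gz"

# find runs once, before the loop starts; each path it reports is linked into links/.
for path in $(find "$proj" -name "*.fastq.gz"); do
  ln -s "$path" "$proj/links/$(basename "$path")"
done

ls "$proj/links"
```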

Quotes matter


Here's a simple awk script that computes the mean (average) of the numbers passed to it:
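
One possible version, offered as a sketch rather than the original script. It assumes one number per input line, and the variable names sum and n are arbitrary.

```shell
# Feed the numbers 1..10 to awk, one per line.
# The main block runs once per input line; END runs after the last line.
mean=$(seq 10 | awk '{ sum += $1; n++ } END { if (n > 0) print sum / n }')
echo "mean: $mean"
```

The n > 0 check avoids a division by zero when no input arrives.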


Once a line has been read, it can be parsed, for example using cut, as shown below. Other notes:

  • The double quotes around "$line" are important to preserve special characters inside the original line (here tab characters).
    • Without the double quotes, the line fields would be separated by spaces, and the cut field delimiter would need to be changed.
  • Some lines have an empty job name field; we replace job and sample names in this case.
  • We assign file descriptor 4 to the file data being read (4< sampleinfo.txt after the done keyword), and read from it explicitly (read line <&4 in the while line).
    • This avoids conflict with any global redirection of standard output (e.g. from automatic logging).
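
The notes above can be sketched as follows. The file contents are invented stand-ins for sampleinfo.txt (assumed here to hold a tab-separated job name and sample name), and no_job is a hypothetical placeholder for an empty job name field.

```shell
# Invented stand-in for sampleinfo.txt; the second line has an empty job name field.
info=$(mktemp)
printf 'JA100\tSA200\n\tSA201\n' > "$info"

summary=""
# IFS= keeps leading whitespace (the tab on line 2); -r keeps backslashes literal.
while IFS= read -r line <&4; do
  # Quoting "$line" preserves the tabs, so cut can split on its default delimiter.
  job=$(echo "$line" | cut -f 1)
  sample=$(echo "$line" | cut -f 2)
  # Hypothetical placeholder when the job name field is empty.
  if [ -z "$job" ]; then job="no_job"; fi
  summary="$summary$job/$sample "
  echo "job: $job  sample: $sample"
done 4< "$info"
```

Because the loop reads from file descriptor 4, commands inside the loop body are free to use standard input and output.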


  • cut
    • default delimiter: Tab
    • how to change: -d or --delimiter option
    • example: cut -d ':' -f 1 /etc/passwd
  • sort
    • default delimiter: whitespace (one or more spaces or Tabs)
    • how to change: -t or --field-separator option
    • example: sort -t ':' -k1,1 /etc/passwd
  • awk
    • default delimiter: whitespace (one or more spaces or Tabs). Note: some older versions of awk do not treat Tabs as field separators.
    • how to change: FS (input field separator) and/or OFS (output field separator) variable in BEGIN{ } block, or the -F or --field-separator option
    • example: cat sampleinfo.txt | awk 'BEGIN{ FS=OFS="\t" } {print $1,$3}'
    • example: cat /etc/passwd | awk -F ":" '{print $1}'
  • join
    • default delimiter: one or more spaces
    • how to change: -t option
    • example: join -t $'\t' -j 2 file1 file2
  • perl
    • default delimiter: whitespace (one or more spaces or Tabs) when auto-splitting input with -a
    • how to change: -F'/<pattern>/' option
    • example: cat sampleinfo.txt | perl -F'/\t/' -a -n -e 'print "$F[0]\t$F[2]\n";'
  • read (bash builtin)
    • default delimiter: whitespace (one or more spaces or Tabs)
    • how to change: IFS= option
    • example: see example above
    • Note that a bare IFS= removes any field separator, so whole lines are read each loop iteration.

Viewing special characters in text
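
Two common ways to make invisible characters visible are od -c and, with GNU coreutils, cat -A. The sketch below uses an invented one-line file containing a Tab and a Windows-style line ending.

```shell
# A line containing a Tab and a carriage return before the newline.
special=$(mktemp)
printf 'a\tb\r\n' > "$special"

# od -c prints each byte, showing Tab as \t, carriage return as \r and newline as \n.
od -c "$special"

# GNU cat -A marks Tabs as ^I, carriage returns as ^M and line ends as $.
cat -A "$special"
```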


  • Default field separators
    • Tab is the default field separator for cut
      • and the field separator can only be a single character
    • whitespace (one or more spaces or Tabs) is the default field separator for awk
      • note that some older versions of awk do not include Tab as a default delimiter
  • Re-ordering
    • cut cannot re-order fields; cut -f 3,2 is the same as cut -f 2,3.
    • awk does reorder fields, based on the order you specify
  • awk is a full-featured programming language while cut is just a single-purpose utility.
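
The re-ordering difference is easy to see on a single invented line with three tab-separated fields:

```shell
# One tab-separated line with three fields.
line=$(printf 'one\ttwo\tthree')

# cut ignores the order of the field list: fields come out in input order.
cut_out=$(printf '%s\n' "$line" | cut -f 3,2)

# awk prints fields in the order given; OFS sets the output separator.
awk_out=$(printf '%s\n' "$line" | awk 'BEGIN{ FS=OFS="\t" } { print $3, $2 }')

printf 'cut: %s\nawk: %s\n' "$cut_out" "$awk_out"
```

Here cut prints "two" then "three", while awk prints "three" then "two".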


If grep pattern matching isn't behaving the way I expect, I turn to perl. Here's how to invoke regex pattern matching from a command line using perl:

For example:

Code Block
echo -e "12\n23\n4\n5" | perl -n -e 'print if $_ =~/\d\d/'

# or, for lines not matching
echo -e "12\n23\n4\n5" | perl -n -e 'print if $_ !~/\d\d/'


  • -n tells perl to feed the input to the script one line at a time
  • -e introduces the perl script
    • Always enclose a command-line perl script in single quotes to protect it from shell evaluation
  • $_ is a built-in variable holding the current line (including any invisible line-ending characters)
  • =~ and !~ are the perl pattern matching operators (=~ says the pattern must match; !~ says the pattern must not match)
  • the forward slashes ("/  /") enclose the regex pattern.


The sed command can be used to edit text using pattern substitution. While it is very powerful, the regex syntax for some of its more advanced features is quite different from "standard" grep or perl regular expressions. As a result, I tend to use it only for very simple substitutions, usually as a component of a multi-pipe expression.
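
A simple substitution of the kind meant here might look like this sketch (the input text is invented):

```shell
# s/<search>/<replacement>/ replaces the first match on each line;
# a trailing g replaces every match on the line.
first=$(echo "chr1 chr2 chr3" | sed 's/chr/chrom/')
all=$(echo "chr1 chr2 chr3" | sed 's/chr/chrom/g')
echo "$first"
echo "$all"
```

The first command changes only chr1; the second changes all three names.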


perl pattern substitution

If I have a more complicated pattern, or if sed pattern substitution is not working as I expect (which happens frequently!), I again turn to perl. Here's how to invoke regex pattern substitution from a command line:

For example:

Code Block
cat joblist.txt | perl -ne 'print if $_ =~/SA18\d\d\d$/' | \
  perl -pe '~s/JA/job /' | perl -pe '~s/SA/run /'

# Or, using parentheses to capture part of the search pattern:
cat joblist.txt | perl -ne 'print if $_ =~/SA18\d\d\d$/' | \
  perl -pe '~s/JA(\d+)\tSA(\d+)/job $1 - run $2/'

Gory details:

  • -p tells perl to print its substitution results
  • -e introduces the perl script (always enclose it in single quotes to protect it from shell evaluation)
  • s is the perl pattern substitution operator, used as s/<search>/<replacement>/ (the leading ~ in these examples is optional)
  • forward slashes ("/  /  /") enclose the regex search pattern and the replacement text
  • parentheses ( ) enclosing a pattern "capture" text matching the pattern in a built-in positional variable
    • $1 for the 1st captured text, $2 for the 2nd captured text, etc.

Handling multiple FASTQ files example


Here's a one-liner that isolates just the unique sample names, where there are 2 files for each sample name:

Code Block
find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" \
  | perl -pe 's|.*/||' | perl -pe 's/_S\d.*//' | sort | uniq

But what if we want to manipulate each of the 4 FASTQ files? For example, count the number of lines in each one. Let's start with a for loop to get their full paths, and just the FASTQ file names without the _R1_001.fastq.gz suffix:


Code Block
# Shorten the sample prefix some more...
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
  file=`basename $path`
  # Strip the _R1_001.fastq.gz suffix, then the _S<number> field, then shorten the lane tag
  pfx=${file%%_R1_001.fastq.gz}
  pfx=$( echo $pfx | perl -pe 's/_S\d+//' | perl -pe 's/L00/L/' )
  echo "$pfx - $file"
done

Now that we have nice sample names, count the number of sequences in each file. To un-compress the gzip'd files "on the fly" (without creating another file), we use zcat (like cat but for gzip'd files) and count the lines, e.g.:




Code Block
zcat <path> | wc -l

But FASTQ files have 4 lines for every sequence read. So to count the sequences properly we need to divide this number by 4.

Code Block
# Clunky way to do arithmetic in bash -- but bash only does integer arithmetic!
# ($path holds the path to a gzipped FASTQ file)
echo $(( `zcat $path | wc -l` / 4 ))

# Better way using awk
zcat $path | wc -l | awk '{print $1/4}'
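
To try the line-count arithmetic without access to the workshop data, the sketch below builds a tiny gzipped FASTQ file first. The two reads are invented.

```shell
# Build a small gzipped FASTQ file with two reads (4 lines per read).
fq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip -c > "$fq"

# Total lines divided by 4 gives the number of reads.
# Reading from standard input avoids any dependence on the file's suffix.
nseq=$(zcat < "$fq" | wc -l | awk '{ print $1 / 4 }')
echo "sequences: $nseq"
```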


Code Block
cut -f 2 fastq_stats.txt | perl -pe '~s/_L\d+//' | sort | uniq -c

# produces this output:
      2 WT-1
      2 WT-2

What if we want to know the total number of sequences for each sample rather than for each file? Get a list of all unique sample names, then total the reads in the fastq_stats.txt file for that sample only:
