...

For some of the discussions below, we'll use a couple of data files from the automated processing that the GSAF (Genome Sequencing and Analysis Facility) uses to deliver sequencing data to customers. These files have information about customer samples (libraries of DNA molecules to sequence on the machine), which are grouped into sets assigned as jobs and sequenced on GSAF's sequencing machines as part of runs.

Here are links to the files if you need to download them after this class is over (you don't need to download them now, since we'll create symbolic links to them).

  • joblist.txt - contains job name/sample name pairs, tab-delimited, no header
  • sampleinfo.txt - contains information about all samples run on a particular run, along with the job each belongs to.
    • columns (tab-delimited) are job_name, job_id, sample_name, sample_id, date_string, which
    • column names are in a header line

...

Code Block
languagebash
# how many jobs does the run with the most jobs have?
numJob=$( cat joblist.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}' )
echo $numJob  # will be 23
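
To make the role of each stage in that pipeline easier to follow, here is a minimal sketch that runs the same chain of commands on a tiny stand-in file. The job and run names are made up for illustration, and the sketch assumes the second column holds the run name, as the pipeline above implies.

```bash
# Hypothetical stand-in for joblist.txt: job name <Tab> run name, no header.
# These names are invented for illustration only.
tmpfile=$(mktemp)
printf 'JA1\tSA100\nJA2\tSA100\nJA3\tSA101\nJA4\tSA100\n' > "$tmpfile"

# 1. cut -f 2          keep only the second (run name) column
# 2. sort              put identical run names next to each other
# 3. uniq -c           collapse adjacent duplicates, prefixing each with its count
# 4. sort -k1,1nr      order by that count, numerically, largest first
# 5. head -1           keep only the top line
# 6. awk '{print $1}'  print just the count, dropping the run name
numJob=$( cat "$tmpfile" | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}' )
echo "$numJob"   # 3 for this made-up data (run SA100 appears three times)

rm -f "$tmpfile"
```

The first sort matters because uniq only collapses adjacent duplicate lines.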

One complication is that, by default, a pipe expression keeps running even if one of its steps encounters an error, and the exit code for the expression as a whole is that of its last command, so it may be 0 (success) even though an earlier step failed. For example:
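
As a minimal sketch of this behavior (the file path is hypothetical and assumed not to exist), the snippet below pipes a failing cat into wc and records the exit code, first with bash's default settings and then with the pipefail option enabled. Whether pipefail suits a given script is a judgment call.

```bash
#!/usr/bin/env bash
# Hypothetical path, assumed not to exist, so that cat fails
missing="/nonexistent_dir_for_demo/no_such_file.txt"

# Default behavior: cat fails, but wc -l still runs and succeeds,
# so the pipeline as a whole reports success.
cat "$missing" 2>/dev/null | wc -l
default_status=$?
echo "default exit code:  $default_status"    # 0

# With pipefail, the pipeline reports a failure if any step fails.
set -o pipefail
cat "$missing" 2>/dev/null | wc -l
pipefail_status=$?
set +o pipefail
echo "pipefail exit code: $pipefail_status"   # non-zero
```

In both cases wc still runs and prints a count of 0; pipefail changes only the exit code that is reported, not which steps execute.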

...

Code Block
languagebash
runs=$( grep 'SA1903.$' joblist.txt | cut -f 2 )
echo "$runs" | wc -l  # preserves linefeeds
echo $runs   | wc -l  # linefeeds converted to spaces; all runs on one line
for run in $runs; do
  echo "Run name is: $run"
done
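
The same quoting effect can be reproduced without the data files. This is a sketch only: the run names below are made up and simply stand in for the output of the grep and cut pipeline above.

```bash
# Hypothetical run names standing in for the output of grep | cut
runs=$( printf 'SA19031\nSA19032\nSA19033\n' )

quoted_lines=$(   echo "$runs" | wc -l )   # quotes preserve the linefeeds: 3 lines
unquoted_lines=$( echo $runs   | wc -l )   # word splitting joins the names: 1 line
echo "quoted: $quoted_lines  unquoted: $unquoted_lines"

# The unquoted expansion is what lets the for loop see one word per run
count=0
for run in $runs; do
  count=$(( count + 1 ))
done
echo "loop iterations: $count"             # 3
```

Quoting "$runs" in the for statement instead would hand the loop a single multi-line word, so its body would run only once.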

...

  • Default field separators
    • Tab is the default field separator for cut
      • and the field separator can only be a single character
    • Whitespace (one or more spaces or Tabs) is the default field separator for awk
      • note that some older versions of awk do not include Tab as a default delimiter
  • Re-ordering
    • cut cannot re-order fields; cut -f 3,2 is the same as cut -f 2,3.
    • awk does reorder fields, based on the order you specify
  • awk is a full-featured programming language while cut is just a single-purpose utility.
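
The differences listed above can be checked with a short sketch. The input values are made up for illustration; the awk call sets both its input and output separators to Tab so that its output is directly comparable to cut's.

```bash
# A made-up tab-delimited line with three fields (illustrative values only)
line=$( printf 'A\tB\tC' )

# Re-ordering: cut keeps the original field order, awk uses the order requested
cut_out=$( echo "$line" | cut -f 3,2 )                               # B<Tab>C
awk_out=$( echo "$line" | awk 'BEGIN{FS=OFS="\t"} {print $3,$2}' )   # C<Tab>B
echo "cut: $cut_out"
echo "awk: $awk_out"

# Separators: cut treats every single delimiter character as a field boundary,
# while awk's default splitting treats a run of whitespace as one separator
cut_field=$( echo 'x  y' | cut -d ' ' -f 2 )   # empty: the field between the two spaces
awk_field=$( echo 'x  y' | awk '{print $2}' )  # y
echo "cut field 2: '$cut_field'"
echo "awk field 2: '$awk_field'"
```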

...