...
For some of the discussions below, we'll use a couple of data files from the automated processing that the GSAF (Genome Sequencing and Analysis Facility) uses to deliver sequencing data to customers. These files have information about customer samples (libraries of DNA molecules to sequence on the machine), grouped into sets assigned as jobs, and sequenced on GSAF's sequencing machines as part of runs.
Here are links to the files if you need to download them after this class is over (you don't need to download them now, since we'll create symbolic links to them).
- joblist.txt - contains job name/run name pairs, tab-delimited, no header
- sampleinfo.txt - contains information about all samples run on a particular run, along with the job each belongs to.
  - columns (tab-delimited) are job_name, job_id, sample_name, sample_id, date_string
  - column names are in a header line
...
```bash
# how many jobs does the run with the most jobs have?
numJob=$( cat joblist.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}' )
echo $numJob  # will be 23
```
One complication is that, by default, pipe expression execution does not stop if one of the steps encounters an error – and the exit code for the expression as a whole may be 0 (success). For example:
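As a small, self-contained illustration of this behavior (a sketch only: `no_such_file.txt` is a hypothetical name for a file that does not exist, and it assumes a bash shell, where `set -o pipefail` is available):

```bash
#!/usr/bin/env bash
# no_such_file.txt is a hypothetical name; the file is assumed NOT to exist

# By default, a pipeline's exit code is the exit code of its last command.
# Here cat fails, but wc succeeds, so the pipeline as a whole reports success.
cat no_such_file.txt 2>/dev/null | wc -l
default_status=$?
echo "default exit code: $default_status"

# With pipefail set, the pipeline reports a non-zero exit code
# if any of its steps fails.
set -o pipefail
cat no_such_file.txt 2>/dev/null | wc -l
pipefail_status=$?
echo "pipefail exit code: $pipefail_status"

# Restore the default behavior
set +o pipefail
```

Note that even with `pipefail` set, the later steps of the pipeline still run; only the reported exit code changes.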
...
```bash
runs=$( grep 'SA1903.$' joblist.txt | cut -f 2 )
echo "$runs" | wc -l   # preserves linefeeds
echo $runs   | wc -l   # linefeeds converted to spaces; all runs on one line
for run in $runs; do
  echo "Run name is: $run"
done
```
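To try the quoting difference without the course files, here is a minimal sketch that uses made-up run names (hypothetical values chosen only to resemble the `SA1903.` pattern) in place of the grep/cut output:

```bash
#!/usr/bin/env bash
# Made-up run names standing in for the output of the grep/cut pipeline
runs=$(printf 'SA19031\nSA19032\nSA19033')

# Quoted: the linefeeds stored in the variable are preserved
quoted_lines=$(echo "$runs" | wc -l)

# Unquoted: the shell splits the value into words and echo joins them with spaces
unquoted_lines=$(echo $runs | wc -l)

echo "quoted: $quoted_lines line(s), unquoted: $unquoted_lines line(s)"

# The same word splitting is what gives the for loop one word per run
count=0
for run in $runs; do
  echo "Run name is: $run"
  count=$((count + 1))
done
echo "loop iterations: $count"
```

The unquoted form is convenient for looping, while the quoted form is the one to use when the line structure matters.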
...
- Default field separators
  - Tab is the default field separator for cut
    - and the field separator can only be a single character
  - whitespace (one or more spaces or Tabs) is the default field separator for awk
    - note that some older versions of awk do not include Tab as a default delimiter
- Re-ordering
  - cut cannot re-order fields; cut -f 3,2 is the same as cut -f 2,3.
  - awk does reorder fields, based on the order you specify
- awk is a full-featured programming language while cut is just a single-purpose utility.
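The differences listed above can be seen in a short sketch. The input line here is made up for illustration (hypothetical values, not taken from the course files), with three Tab-separated fields, the last of which contains a space:

```bash
#!/usr/bin/env bash
# Made-up, tab-delimited line: three fields, the third contains a space
line=$(printf 'jobA\trun1\tsample X')

# cut splits on Tab by default and always emits fields in input order,
# so asking for 3,2 gives the same result as asking for 2,3
cut_23=$(echo "$line" | cut -f 2,3)
cut_32=$(echo "$line" | cut -f 3,2)
echo "$cut_23"
echo "$cut_32"

# awk emits fields in the order requested; -F '\t' makes Tab the only
# input separator, and OFS sets the output separator to Tab as well
awk_32=$(echo "$line" | awk -F '\t' 'BEGIN{OFS="\t"} {print $3, $2}')
echo "$awk_32"

# With awk's default whitespace splitting, "sample X" counts as two fields
awk_nf=$(echo "$line" | awk '{print NF}')
echo "number of fields seen by default awk: $awk_nf"
```

When fields may contain spaces, as in this made-up line, specifying `-F '\t'` keeps awk's view of the fields consistent with cut's.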
...