
...

joblist.txt - contains job name/sample name pairs, tab-delimited, no header
sampleinfo.txt - contains information about all samples run on a particular run, along with the job each belongs to.
- columns (tab-delimited) are job_name, job_id, sample_name, sample_id, date_string; these names appear in a header line

create

...

symbolic links to data files

When dealing with large data files, sometimes scattered across many directories, it is often convenient to create symbolic links to those files in a directory where you plan to work with them. A common way to make a symbolic link is with ln -s, e.g.:

...
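
To illustrate linking files scattered across several directories into one working directory, here is a minimal sketch. The directory and file names (src_a, src_b, work, reads1.txt, reads2.txt) are hypothetical and are created under a temporary directory just for the demonstration.

```bash
# Hypothetical layout: two source directories and one working directory
demo=$(mktemp -d)
mkdir -p "$demo/src_a" "$demo/src_b" "$demo/work"
echo "data A" > "$demo/src_a/reads1.txt"
echo "data B" > "$demo/src_b/reads2.txt"

# Link every data file into the working directory.
# Syntax is: ln -s TARGET LINK_NAME; -f replaces a link that already exists.
for f in "$demo"/src_a/*.txt "$demo"/src_b/*.txt; do
    ln -sf "$f" "$demo/work/$(basename "$f")"
done

ls -l "$demo/work"    # each entry is shown as "name -> target"
```

Because mktemp -d returns an absolute path, the link targets here are absolute, so the links keep working if the working directory is later moved.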

Solution

Add a count to the unique column 2 values, then sort on that count numerically, in reverse order. The first line will then be the column 2 value that appears on the most lines (i.e., is paired with the most jobs).

Code Block

language	bash

cat joblist.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1
# 23 SA13038

multi-pipe expression considerations

Multiple-pipe expressions can be used to great benefit in scripts, either on their own or with their output captured in a variable. For example:

Code Block

language	bash

# how many samples does the job with the most samples have?
numSamp=$( cat joblist.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}' )
echo $numSamp  # will be 23

One complication is that, by default, pipe expression execution does not stop if one of the steps encounters an error – and the exit code for the expression as a whole may be 0 (success). For example:

Code Block

language	bash

cat joblistTypo.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?    # exit code will be 0

To force a non-0 exit code to be returned if any part of a multi-pipe expression fails, enable the pipefail shell option:

Code Block

language	bash

set -o pipefail  # pipeline still runs to completion, but returns the exit code of the last component that failed
cat joblistTypo.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?          # exit code will be 1
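
To see which component of a pipeline failed, bash also records every component's exit code in the PIPESTATUS array. A minimal sketch, assuming joblistTypo.txt does not exist (as in the example above); the array has to be copied immediately, because the next command overwrites it:

```bash
cat joblistTypo.txt 2>/dev/null | cut -f 2 | sort | head -1
codes=( "${PIPESTATUS[@]}" )   # copy right away; any later command resets it
echo "${codes[@]}"             # one exit code per pipe component, left to right
# 1 0 0 0
```

Here only the first component (cat) reports a non-0 code; the downstream commands succeed because they simply receive empty input.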

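But of course there are cases where you don't want a pipeline to report an error. For example, you might want to grab only the first 100 lines of a file with head. Note that head exits with code 0 even if the file has fewer lines than requested. The problem arises in the opposite case: once head has read the lines it needs it exits, and an upstream command that is still writing can be terminated by a SIGPIPE signal and return a non-0 exit code (141 in bash), even though head produced every line you asked for. With pipefail enabled, that non-0 code becomes the exit code of the whole pipeline; disabling pipefail restores the default behavior.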

Code Block

language	bash

set +o pipefail # only the exit code of the last pipe component is returned
cat joblist.txt | head -5000 | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?         # exit code will be 0
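
One way a pipeline that produced the desired output can still report failure under pipefail is when head exits early and an upstream command that is still writing is stopped by SIGPIPE. A minimal sketch, using yes as a stand-in for any command that produces more output than head needs; pipefail is enabled inside a command substitution (a subshell) so the setting does not leak into the rest of the session:

```bash
set +o pipefail   # start from the default behavior

# Default: only the last component's exit code (head's) is reported
first=$( yes | head -1 )
echo "default: $?"         # 0

# With pipefail: the early-terminated 'yes' makes the whole pipeline non-0
status=$( set -o pipefail; yes | head -1 > /dev/null; echo $? )
echo "pipefail: $status"   # typically 141 (128 + signal 13, SIGPIPE)
```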

reading file lines

The read builtin can be used to read input one line at a time. While the full details of read are complicated (see https://unix.stackexchange.com/questions/209123/understanding-ifs-read-r-line), this read-a-line-at-a-time idiom works nicely.
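
As a minimal sketch of reading lines and splitting them into fields, assume a tab-delimited, two-column file like joblist.txt. The temporary file and its two lines are made up for the example. Setting IFS to a tab just for the read command splits each line into the named variables, and -r keeps backslashes literal.

```bash
# Hypothetical two-line, tab-delimited input file
infile=$(mktemp)
printf 'JA11180\tSA12013\nJA11206\tSA12013\n' > "$infile"

# Read one line at a time, splitting on tabs into two variables
while IFS=$'\t' read -r job sample; do
    echo "job=$job sample=$sample"
done < "$infile"
# job=JA11180 sample=SA12013
# job=JA11206 sample=SA12013
```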

...

utility	default delimiter	how to change	example
cut	tab	-d or --delimiter option	`cut -d ':' -f 1 /etc/passwd`
sort	whitespace (one or more spaces or tabs)	-t or --field-separator option	`sort -t ':' -k1,1 /etc/passwd`
awk	whitespace (one or more spaces or tabs) for input; a single space for output	FS (input field separator) and/or OFS (output field separator) variable in BEGIN{ } block; -F or --field-separator option	`cat sampleinfo.txt \| awk 'BEGIN{ FS=OFS="\t" }{print $1,$3}'` `cat sampleinfo.txt \| awk -F "\t" '{ print $1,$3 }'`
join	one or more spaces	-t option	`join -t $'\t' -j 2 file1 file12`
perl	whitespace (one or more spaces or tabs) when auto-splitting input with -a	-F'/<pattern>/' option	`cat sampleinfo.txt \| perl -F'/\t/' -ane 'print "$F[0]\t$F[2]\n";'`
read	whitespace (one or more spaces or tabs)	IFS= variable assignment	see example above
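
The practical effect of these defaults shows up when a field itself contains a space. A small sketch with a made-up line whose first tab-delimited field is "sample A":

```bash
line=$'sample A\t42'

# cut splits on tabs only, so field 1 keeps its embedded space
printf '%s\n' "$line" | cut -f 1                  # sample A

# awk splits on any whitespace by default, so field 1 is just the first word
printf '%s\n' "$line" | awk '{print $1}'          # sample

# Telling awk to use tab as the field separator matches cut's behavior
printf '%s\n' "$line" | awk -F '\t' '{print $1}'  # sample A
```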

viewing special characters in text

When working in a terminal, it is sometimes difficult to determine what special characters (e.g. tabs) – if any – are in a file's data, or what line endings are being used. Your desktop GUI code editor may provide a mode for viewing "raw" file contents (usually as 2-digit hexadecimal codes representing each ASCII character). If not, here's an alias that can be used:

Code Block

language	bash

alias hexdump='od -A x -t x1z -v'
head -5 joblist.txt | hexdump

Output will look like this:

Code Block

000000 4a 41 31 31 31 38 30 09 53 41 31 32 30 31 33 0a  >JA11180.SA12013.<
000010 4a 41 31 31 32 30 36 09 53 41 31 32 30 31 33 0a  >JA11206.SA12013.<
000020 4a 41 31 31 32 30 37 09 53 41 31 32 30 30 31 0a  >JA11207.SA12001.<
000030 4a 41 31 31 32 30 37 09 53 41 31 32 30 30 34 0a  >JA11207.SA12004.<
000040 4a 41 31 31 32 30 38 09 53 41 31 32 30 30 31 0a  >JA11208.SA12001.<
000050

Note that hex 0x09 is a tab, and hex 0x0a is a linefeed (see http://www.asciitable.com/).
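
Another quick check is od -c, which prints tabs, linefeeds, and carriage returns as the escapes \t, \n, and \r instead of hex codes. That makes Windows-style line endings (\r\n) easy to spot. A minimal sketch, using a made-up temporary file that contains one such line:

```bash
# Hypothetical file with a tab and a Windows-style (CRLF) line ending
demo=$(mktemp)
printf 'JA11180\tSA12013\r\n' > "$demo"

od -c "$demo"     # special characters appear as \t, \r and \n

# Count lines that contain a carriage return (0 means Unix line endings)
grep -c $'\r' "$demo"     # 1
```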

cut versus awk

The basic functions of cut and awk are similar – both are field oriented. Here are the main differences:

...
