...
- joblist.txt - contains job name/sample name pairs, tab-delimited, no header
- sampleinfo.txt - contains information about all samples run on a particular run, along with the job each belongs to.
- columns (tab-delimited) are job_name, job_id, sample_name, sample_id, date_string; these names appear in a header line
create
...
symbolic links to data files
When dealing with large data files, sometimes scattered across many directories, it is often convenient to create symbolic links to those files in the directory where you plan to work with them. A common way to make a symbolic link is with ln -s, e.g.:
...
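As a rough sketch of the linking approach described above, the following builds a throwaway directory tree and links every data file into one working directory. The directory and file names (runA, runB, analysis, sampleA.fastq, sampleB.fastq) are hypothetical and exist only for illustration.

```shell
# Hypothetical layout: two "scattered" data directories and one working directory.
work=$(mktemp -d)
mkdir -p "$work/runA" "$work/runB" "$work/analysis"
echo "reads A" > "$work/runA/sampleA.fastq"
echo "reads B" > "$work/runB/sampleB.fastq"

# Link every data file into the working directory.
# mktemp -d returns an absolute path, so each link target is absolute
# and stays valid no matter where the link itself lives.
for f in "$work"/run*/*.fastq; do
    ln -s "$f" "$work/analysis/$(basename "$f")"
done

ls -l "$work/analysis"                        # each entry shows "name -> target"
linked=$(cat "$work/analysis/sampleB.fastq")  # reading a link reads its target
echo "$linked"
```

Using absolute target paths is a design choice here: a relative target is resolved relative to the link's own directory, which is easy to get wrong.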
Add a count to the unique run lines then sort on it numerically, in reverse order. The first line will then be the job with the most lines (jobs).
multi-pipe expression considerations
Multiple-pipe expressions can be used to great benefit in scripts, either on their own or to capture their output. For example:
```shell
# how many jobs does the sample with the most jobs have?
# (column 2 of joblist.txt is the sample name, so each count is lines per sample)
numJobs=$( cat joblist.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}' )
echo "$numJobs"   # will be 23
```
One complication is that, by default, execution of a pipe expression does not stop if one of its steps encounters an error, and the exit code for the expression as a whole is that of the last step, which may be 0 (success). For example:
```shell
cat joblistTypo.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?   # exit code will be 0
```
To force a non-0 exit code to be returned if any part of a multi-pipe expression fails, enable the pipefail shell option:
```shell
set -o pipefail   # the pipe now returns the exit code of the last component that failed, or 0 if none failed
cat joblistTypo.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?   # exit code will be 1
```
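To see which step of a pipe failed, rather than just whether any did, bash records the exit code of every component in its PIPESTATUS array. The sketch below assumes bash; the file name is a hypothetical one chosen so that cat fails.

```shell
# Hypothetical file name that does not exist, so the first component (cat) fails.
missing="no_such_joblist_$$.txt"

cat "$missing" 2>/dev/null | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
# Copy PIPESTATUS immediately: the very next command overwrites it.
codes=( "${PIPESTATUS[@]}" )

echo "per-component exit codes: ${codes[*]}"   # cat's code is non-0, the rest are 0
if [ "${codes[0]}" -ne 0 ]; then
    echo "cat failed with exit code ${codes[0]}" >&2
fi
```

PIPESTATUS works whether or not pipefail is enabled, so it can be used to check only the components you care about.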
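But there are cases where you don't want a pipe to return an error. For example, you might want to grab just the first 100 lines of a large file. head exits as soon as it has read those lines, and a command upstream that is still writing may then be terminated by SIGPIPE and report a non-0 exit code (typically 141), even though head returned all the lines requested. (head itself returns 0 when the file has fewer lines than requested.) In such cases, disable pipefail: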
```shell
set +o pipefail   # only the exit code of the last pipe component is returned
cat joblist.txt | head -5000 | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?   # exit code will be 0
```
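Because pipefail is a shell-wide setting, one way to limit it to a single expression is to set it inside a command substitution or subshell. The sketch below assumes bash and the seq utility; it runs the same pipe with and without pipefail, using a generated number stream in place of a real data file.

```shell
# With pipefail, scoped to the command substitution: seq is still writing
# when head exits, so seq fails and the pipe reports a non-0 exit code.
first3=$( set -o pipefail; seq 1 100000 2>/dev/null | head -3 ) && strict=0 || strict=$?

# Without pipefail: only the exit code of head (the last component) is returned.
first3=$( set +o pipefail; seq 1 100000 2>/dev/null | head -3 ) && relaxed=0 || relaxed=$?

echo "with pipefail: $strict   without pipefail: $relaxed"
echo "$first3"   # the three requested lines are produced either way
```

The setting made inside `$( ... )` does not leak into the calling shell, so the rest of a script keeps whatever pipefail state it already had.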
reading file lines
The read builtin can be used to read input one line at a time. While the full details of read are complicated (see https://unix.stackexchange.com/questions/209123/understanding-ifs-read-r-line), this read-a-line-at-a-time idiom works nicely.
...
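As a variation on the line-at-a-time idiom, read can also split each line into named fields. The sketch below assumes bash and uses a small temporary file of hypothetical job/sample pairs in the style of joblist.txt.

```shell
# Hypothetical two-column, tab-delimited data in the style of joblist.txt.
tmp=$(mktemp)
printf 'JA11180\tSA12013\nJA11207\tSA12001\nJA11207\tSA12004\n' > "$tmp"

count=0
# IFS=$'\t' splits each line on tabs only; -r keeps backslashes literal.
while IFS=$'\t' read -r job sample; do
    echo "job=$job sample=$sample"
    count=$((count + 1))
done < "$tmp"

echo "read $count lines"
rm -f "$tmp"
```

The file is redirected into the loop rather than piped from cat, because a loop on the receiving end of a pipe runs in a subshell and changes to count would be lost when it ends.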
utility | default delimiter | how to change | example |
---|---|---|---|
cut | tab | -d or --delimiter option | cut -d ':' -f 1 /etc/passwd |
sort | whitespace (one or more spaces or tabs) | -t or --field-separator option | sort -t ':' -k1,1 /etc/passwd |
awk | whitespace (one or more spaces or tabs) for input; a single space for output | -F option (input); OFS variable (output) | awk -F "\t" '{ print $1,$3 }' sampleinfo.txt |
join | one or more spaces | -t option | |
perl | whitespace (one or more spaces or tabs) when auto-splitting input with -a | -F'/<pattern>/' option | perl -F'/\t/' -ane 'print "$F[0]\t$F[2]\n";' sampleinfo.txt |
read | whitespace (one or more spaces or tabs) | IFS= assignment | see example above |
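The practical effect of these defaults shows up when a field contains a space. The sketch below runs one hypothetical tab-delimited record (invented for illustration) through cut and awk, assuming bash for the `$'...'` quoting.

```shell
# Hypothetical record: three tab-separated fields, the second containing a space.
line=$'JA11180\tsample one\t2020-01-01'

by_cut=$(printf '%s\n' "$line" | cut -f 2)                        # cut splits on tab by default
by_awk=$(printf '%s\n' "$line" | awk '{ print $2 }')              # awk splits on any whitespace
by_awk_tab=$(printf '%s\n' "$line" | awk -F '\t' '{ print $2 }')  # awk told to split on tab

# -v OFS sets awk's output delimiter, which is otherwise a single space.
tab_out=$(printf '%s\n' "$line" | awk -F '\t' -v OFS='\t' '{ print $1, $3 }')

echo "cut:         $by_cut"
echo "awk:         $by_awk"
echo "awk -F tab:  $by_awk_tab"
```

Here cut and `awk -F '\t'` both return the whole second field, while plain awk stops at the embedded space.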
viewing special characters in text
When working in a terminal, it is sometimes difficult to determine what special characters (e.g. tabs) – if any – are in a file's data, or what line endings are being used. Your desktop GUI code editor may provide a mode for viewing "raw" file contents (usually as 2-digit hexadecimal codes representing each ASCII character). If not, here's an alias that can be used:
```shell
alias hexdump='od -A x -t x1z -v'
head -5 joblist.txt | hexdump
```
Output will look like this:
```
000000 4a 41 31 31 31 38 30 09 53 41 31 32 30 31 33 0a >JA11180.SA12013.<
000010 4a 41 31 31 32 30 36 09 53 41 31 32 30 31 33 0a >JA11206.SA12013.<
000020 4a 41 31 31 32 30 37 09 53 41 31 32 30 30 31 0a >JA11207.SA12001.<
000030 4a 41 31 31 32 30 37 09 53 41 31 32 30 30 34 0a >JA11207.SA12004.<
000040 4a 41 31 31 32 30 38 09 53 41 31 32 30 30 31 0a >JA11208.SA12001.<
000050
```
Note that hex 0x09 is a tab, and hex 0x0a is a linefeed (see http://www.asciitable.com/).
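A related check is whether a file uses Windows-style line endings, where each linefeed is preceded by a carriage return (hex 0x0d). The sketch below assumes bash and builds a small hypothetical file with one line of each kind, then inspects it with od -c and counts the affected lines with grep.

```shell
# Hypothetical sample: one Unix-style line and one Windows-style (CRLF) line.
tmp=$(mktemp)
printf 'JA11180\tSA12013\n'   >  "$tmp"
printf 'JA11206\tSA12013\r\n' >> "$tmp"

od -c "$tmp"   # tabs show as \t, carriage returns as \r, linefeeds as \n

# Count lines that end in a carriage return.
crlf_lines=$(grep -c $'\r$' "$tmp")
echo "lines with CRLF endings: $crlf_lines"
rm -f "$tmp"
```

od -c prints named escapes instead of hex codes, which can be easier to scan when only tabs and line endings are of interest.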
cut versus awk
The basic functions of cut and awk are similar – both are field oriented. Here are the main differences:
...