...

  • joblist.txt - contains job name/sample name pairs, tab-delimited, no header
  • sampleinfo.txt - contains information about all samples run on a particular run, along with the job each belongs to.
    • columns (tab-delimited) are job_name, job_id, sample_name, sample_id, date_string, which names are in a header line

create

...

symbolic links to data files

When dealing with large data files, sometimes scattered across many directories, it is often convenient to create symbolic links to those files in the directory where you plan to work with them. A common way to make a symbolic link is with ln, e.g.:

...

Expand
titleSolution

Add a count to each unique run line with uniq -c, then sort on that count numerically, in reverse order. The first line will then be the run with the most lines (jobs).

Code Block
languagebash
cat joblist.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1
# 23 SA13038
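
To see what each stage contributes, the same pipeline can be run step by step on a small stand-in file. This is only a sketch: the file contents and the names JA1, SA1, etc. are made up for illustration and are not taken from the real joblist.txt.

```bash
# Build a tiny tab-delimited stand-in for joblist.txt (made-up names)
tmpfile=$(mktemp)
printf 'JA1\tSA1\nJA2\tSA1\nJA3\tSA2\nJA4\tSA1\nJA5\tSA3\n' > "$tmpfile"

# Stage 1: keep only the 2nd field
cut -f 2 "$tmpfile"
# Stage 2: sort so identical values are adjacent (uniq only merges adjacent lines)
cut -f 2 "$tmpfile" | sort
# Stage 3: collapse duplicates, prefixing each value with its count
cut -f 2 "$tmpfile" | sort | uniq -c
# Stage 4: sort on the count (field 1), numerically, largest first; keep the top line
top=$(cut -f 2 "$tmpfile" | sort | uniq -c | sort -k1,1nr | head -1)
echo "$top"

rm -f "$tmpfile"
```

Running the stages one at a time like this is also a handy way to debug a long pipeline that gives unexpected results.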

multi-pipe expression considerations

Multiple-pipe expressions can be used to great benefit in scripts, either on their own or with their output captured in a variable. For example:

Code Block
languagebash
# how many samples does the job with the most samples have?
numSamp=$( cat joblist.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}' )
echo $numSamp  # will be 23
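
Once captured, the value can drive script logic. A minimal sketch, assuming a made-up stand-in for joblist.txt and a hypothetical threshold of 2 (neither comes from the original data):

```bash
# Made-up tab-delimited data standing in for joblist.txt
tmpfile=$(mktemp)
printf 'JA1\tSA1\nJA2\tSA1\nJA3\tSA2\n' > "$tmpfile"

# Capture the largest count produced by the pipeline
maxCount=$( cut -f 2 "$tmpfile" | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}' )

# Hypothetical threshold, used here only to show a numeric test
threshold=2
if [ "$maxCount" -ge "$threshold" ]; then
  msg="largest count ($maxCount) meets the threshold ($threshold)"
else
  msg="largest count ($maxCount) is below the threshold ($threshold)"
fi
echo "$msg"

rm -f "$tmpfile"
```

Quoting the variable ("$maxCount") guards against word splitting if the captured output ever turns out to be empty or to contain spaces.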

One complication is that, by default, pipe expression execution does not stop if one of the steps encounters an error, and because the exit code of the expression as a whole is that of its last step, it may be 0 (success). For example:

Code Block
languagebash
cat joblistTypo.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?    # exit code will be 0

To force a non-0 exit code to be returned if any part of a multi-pipe expression fails, enable the pipefail shell option:

Code Block
languagebash
set -o pipefail  # pipe returns the last non-0 exit code of any component (all components still run)
cat joblistTypo.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?          # exit code will be 1
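
The overall exit code says that something failed, but not which component. One option, sketched here, is bash's PIPESTATUS array, which holds the exit code of every component of the most recent pipeline and works whether or not pipefail is set. The file name no_such_file.txt is a placeholder assumed not to exist.

```bash
# cat fails (no such file), yet cut and sort still run on its empty output
cat no_such_file.txt 2> /dev/null | cut -f 2 | sort
# Copy PIPESTATUS right away: the very next command overwrites it
codes=( "${PIPESTATUS[@]}" )

echo "cat=${codes[0]} cut=${codes[1]} sort=${codes[2]}"
```

Here only the first entry (for cat) should be non-0, which pinpoints the missing input file as the cause.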

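But of course there are cases where you don't want a pipe expression to return an error. For example, you might want to grab just the first 100 lines of a large file. Once head has its lines it exits, and the program feeding it may then be terminated with a non-0 exit code (typically 141, from SIGPIPE), even though head still returned all the lines requested. Note that asking head for more lines than the file contains is not an error: head simply returns all the lines there are, with exit code 0. To go back to the default behavior, disable pipefail: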

Code Block
languagebash
set +o pipefail # only the exit code of the last pipe component is returned
cat joblist.txt | head -5000 | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?         # exit code will be 0
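
A related case arises when head stops reading before its input is exhausted: the upstream writer can be cut off, which pipefail reports as a failure. The sketch below uses yes purely as a convenient endless producer (an assumption for illustration), and shows one way to confine pipefail to a subshell so the rest of the script is unaffected. The exact non-0 code depends on the environment; 141 (128 + signal 13, SIGPIPE) is typical.

```bash
# Without pipefail: only the exit code of head (the last component) counts
set +o pipefail
yes | head -1 > /dev/null
plain=$?

# With pipefail, confined to a subshell: yes is cut off when head exits,
# so the pipe expression as a whole reports a non-0 code
strict=0
( set -o pipefail; yes | head -1 > /dev/null ) || strict=$?

echo "without pipefail: $plain; with pipefail: $strict"

# pipefail was enabled only inside the subshell, so it is still off here
if [[ -o pipefail ]]; then state=on; else state=off; fi
echo "pipefail in the parent shell is $state"
```

Scoping the option this way avoids having to remember to turn it back off after each strict pipeline.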

reading file lines

The read builtin can be used to read input one line at a time. While the full details of read are complicated (see https://unix.stackexchange.com/questions/209123/understanding-ifs-read-r-line), this read-a-line-at-a-time idiom works nicely.

...

  • cut
    • default delimiter: tab
    • how to change: -d or --delimiter option
    • example: cut -d ':' -f 1 /etc/passwd
  • sort
    • default delimiter: whitespace (one or more spaces or tabs)
    • how to change: -t or --field-separator option
    • example: sort -t ':' -k1,1 /etc/passwd
  • awk
    • default delimiter: whitespace (one or more spaces or tabs) for input; a single space for output
    • how to change: FS (input field separator) and/or OFS (output field separator) variable in BEGIN{ } block, or the -F or --field-separator option (input only)
    • example: cat sampleinfo.txt | awk 'BEGIN{ FS=OFS="\t" }{print $1,$3}'
    • example: cat sampleinfo.txt | awk -F "\t" '{ print $1,$3 }'
  • join
    • default delimiter: one or more spaces
    • how to change: -t option
    • example: join -t $'\t' -j 2 file1 file2
  • perl
    • default delimiter: whitespace (one or more spaces or tabs) when auto-splitting input with -a
    • how to change: -F'/<pattern>/' option
    • example: cat sampleinfo.txt | perl -F'/\t/' -ane 'print "$F[0]\t$F[2]\n";'
  • read
    • default delimiter: whitespace (one or more spaces or tabs)
    • how to change: IFS= setting
    • example: see example above
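
As a sketch of how these delimiter settings behave, the commands below run several of the utilities on a small colon-delimited file. The records and the field names (name, id, group) are made up for illustration.

```bash
# Made-up colon-delimited records: name:id:group
tmpfile=$(mktemp)
printf 'carol:30:dev\nalice:10:ops\nbob:20:dev\n' > "$tmpfile"

# cut: -d sets the (single-character) delimiter
names=$(cut -d ':' -f 1 "$tmpfile")

# sort: -t sets the field separator; -k1,1 sorts on field 1 only
sorted=$(sort -t ':' -k1,1 "$tmpfile")

# awk: -F sets the input separator; OFS sets the output separator
swapped=$(awk -F ':' 'BEGIN{ OFS="\t" }{ print $3, $1 }' "$tmpfile")

# read: IFS, set only for the read command, controls how the line is split
first=""
while IFS=':' read -r name id group; do
  first="$name has id $id in group $group"
  break
done < "$tmpfile"

printf '%s\n' "$names" "$sorted" "$swapped" "$first"
rm -f "$tmpfile"
```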

viewing special characters in text

When working in a terminal, it is sometimes difficult to determine what special characters (e.g. tabs) – if any – are in a file's data, or what line endings are being used. Your desktop GUI code editor may provide a mode for viewing "raw" file contents (usually as 2-digit hexadecimal codes representing each ASCII character). If not, here's an alias that can be used:

Code Block
languagebash
alias hexdump='od -A x -t x1z -v'
head -5 joblist.txt | hexdump

Output will look like this:

Code Block
000000 4a 41 31 31 31 38 30 09 53 41 31 32 30 31 33 0a  >JA11180.SA12013.<
000010 4a 41 31 31 32 30 36 09 53 41 31 32 30 31 33 0a  >JA11206.SA12013.<
000020 4a 41 31 31 32 30 37 09 53 41 31 32 30 30 31 0a  >JA11207.SA12001.<
000030 4a 41 31 31 32 30 37 09 53 41 31 32 30 30 34 0a  >JA11207.SA12004.<
000040 4a 41 31 31 32 30 38 09 53 41 31 32 30 30 31 0a  >JA11208.SA12001.<
000050

Note that hex 0x09 is a tab, and hex 0x0a is a linefeed (see http://www.asciitable.com/).
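
The same view helps diagnose line endings. As a sketch, assuming a made-up two-line file saved with Windows-style endings, each line ends with 0d 0a (carriage return + linefeed) rather than 0a alone. grep can then count the affected lines, and tr can strip the carriage returns.

```bash
# Made-up file with Windows-style (CRLF) line endings
tmpfile=$(mktemp)
printf 'JA1\tSA1\r\nJA2\tSA2\r\n' > "$tmpfile"

# Same od options as the hexdump alias above, called directly because
# aliases are not expanded in non-interactive scripts by default
od -A x -t x1z -v "$tmpfile"

# Count lines that end in a carriage return (0x0d)
crlfLines=$(grep -c $'\r$' "$tmpfile")
echo "lines ending in CR: $crlfLines"

# Strip the carriage returns and count again
# (grep -c exits non-0 when the count is 0, hence the || true)
fixed=$(mktemp)
tr -d '\r' < "$tmpfile" > "$fixed"
remaining=$(grep -c $'\r$' "$fixed" || true)
echo "after tr -d: $remaining"

rm -f "$tmpfile" "$fixed"
```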

cut versus awk

The basic functions of cut and awk are similar – both are field oriented. Here are the main differences:

...