Tips and tricks

So we've examined a lot of "framework" issues – argument handling, stream handling, error handling – in a systematic way. This section presents various tips and tricks for actually manipulating data, which can be useful both in writing scripts and in command line manipulations.

change shell text colors

If your terminal has a dark background, the default shell colors can be hard to read. Execute this line to display directory names in yellow (and put it in your ~/.profile login script to make the change permanent):

export LS_COLORS=$LS_COLORS:'di=1;33:'

create multiple symbolic links

When dealing with large data files, sometimes scattered in many directories, it is often convenient to create multiple symbolic links to those files in a directory where you plan to work with them. A common way to make a symbolic link is with ln, e.g.:

mkdir ~/test; cd ~/test
ln -s -f ~/workshop/data/sampleinfo.txt
ls -l

Multiple files can be linked by providing multiple file name arguments and using the -t (target) option to specify the directory where the links to all the files will be created.

cd; rm -f test/*.*
ln -s -f -t test ~/workshop/data/*.txt
ls -l test

What about the case where the files you want are scattered in sub-directories? Here's a solution using find and xargs:

find returns a list of matching file paths on its standard output
the paths are piped to the standard input of xargs
xargs takes the data on its standard input and calls the specified command (here ln) with that data as the command's argument list.
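
The three steps above can be sketched as follows. This is a minimal illustration rather than the exact command for the workshop data: it assumes GNU find, xargs and ln (the -t option is a GNU extension), and the scratch directories and file names are hypothetical, created only so the example is self-contained.

```shell
# Build a hypothetical scratch tree with .txt files scattered in sub-directories
src=$(mktemp -d)
dest=$(mktemp -d)
mkdir -p "$src/runA" "$src/runB/nested"
echo "a" > "$src/runA/one.txt"
echo "b" > "$src/runB/nested/two.txt"
echo "c" > "$src/runB/skip.log"

# find writes the matching paths to standard output;
# xargs reads them and passes them to ln as its argument list.
# -print0 and -0 keep paths that contain spaces intact.
find "$src" -name "*.txt" -print0 | xargs -0 ln -s -f -t "$dest"

# Show the links that were created in the destination directory
ls -l "$dest"
```

Because find is given an absolute starting directory, the paths it prints are absolute, so the resulting links still resolve no matter where the destination directory is.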

sort & uniq tricks

The ~/test/joblist.txt file you just symlink'd describes sequencing job/run pairs, tab-separated. We can use sort and uniq to collapse and count entries in the run name field (column 2):

cd ~/test
cut -f 2 joblist.txt | sort | uniq | wc -l
# there are 1244 runs

The -c option to uniq adds a count field showing how many times each line occurs.
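
As a small illustration of counting with uniq -c, independent of joblist.txt, the sketch below tallies repeated values in a few made-up lines. The run names are hypothetical. The input is sorted first because uniq only collapses adjacent duplicate lines.

```shell
# Hypothetical sample data: one run name per line, in no particular order
runs=$(printf 'runB\nrunA\nrunB\nrunC\nrunB\nrunA\n')

# sort groups identical lines together; uniq -c prefixes each distinct line with its count
counts=$(printf '%s\n' "$runs" | sort | uniq -c)

# The width of the count column varies between uniq implementations,
# so normalize the output to "name<TAB>count" with awk before printing
printf '%s\n' "$counts" | awk '{print $2 "\t" $1}'
```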

exercise 1

How many unique job names are in the joblist.txt file?

Solution

cut -f 1 joblist.txt | sort | uniq | wc -l
# there are 3842

Are all the job/run pairs unique?

Solution

Yes. Compare the unique lines of the file to the total lines.

cat joblist.txt | sort | uniq | wc -l
wc -l joblist.txt
# both are 3842
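
Another way to check for duplicates, offered here as an alternative sketch rather than the workshop's own solution, is the -d option to uniq, which prints only the lines that occur more than once. The job/run pairs below are made up for illustration; with a file whose lines are all unique the output would be empty.

```shell
# Hypothetical tab-separated job/run pairs; the pair job2/runA appears twice
pairs=$(printf 'job1\trunA\njob2\trunA\njob3\trunB\njob2\trunA\n')

# uniq -d prints one copy of each line that is repeated in the sorted input
dups=$(printf '%s\n' "$pairs" | sort | uniq -d)

if [ -z "$dups" ]; then
    echo "all pairs are unique"
else
    echo "duplicated pairs:"
    printf '%s\n' "$dups"
fi
```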

Which run has the most jobs?

Solution

Add a count to the unique run lines, then sort on it numerically in reverse order. The first line will then be the run with the most jobs.

cat joblist.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1
# 23 SA13038