...
- joblist.txt - contains job name/sample name pairs, tab-delimited, no header
- sampleinfo.txt - contains information about all samples run on a particular run, along with the job each belongs to.
- columns (tab-delimited) are job_name, job_id, sample_name, sample_id, date_string; these names appear in a header line
...
Creating symbolic links to data files
When dealing with large data files, sometimes scattered across many directories, it is often convenient to create symbolic links to those files in a single directory where you plan to work with them. A common way to make a symbolic link is with ln, e.g.:
...
```shell
cd; rm -f test/*.*
find ~/workshop/ -name "*.txt" | xargs ln -s -f -t ~/test
```
Removing file suffixes
Sometimes you want to take a file path like ~/my_file.something.txt and extract some or all of the parts before the suffix, for example to end up with the text my_file. To do this, first strip off any directories using the basename function. Then use the odd-looking syntax ${<variable-name>%%.<suffix-to-remove>}
```shell
path=~/my_file.something.txt; echo $path
filename=`basename $path`; echo $filename
prefix=${filename%%.something.txt}
echo $prefix
```
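A related distinction worth noting (my own sketch, not from the original text): with a wildcard in the pattern, %% removes the longest matching suffix while a single % removes the shortest.

```shell
filename=my_file.something.txt

# %% strips the LONGEST match of the pattern from the end
echo "${filename%%.*}"   # → my_file

# % strips the SHORTEST match of the pattern from the end
echo "${filename%.*}"    # → my_file.something
```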
Tricks with sort & uniq
The ~/test/joblist.txt file you just symlink'd describes sequencing job/run pairs, tab-separated. We can use sort and uniq to collapse and count entries in the run name field (column 2):
...
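The original example here did not survive conversion; below is a sketch of the idea using a small made-up stand-in file (the contents are hypothetical, not your actual joblist.txt):

```shell
# create a small tab-delimited stand-in for joblist.txt (hypothetical data)
printf 'job1\trun_A\njob2\trun_B\njob3\trun_A\n' > joblist_demo.txt

# extract column 2, group identical values, count each group,
# then sort by count, largest first
cut -f 2 joblist_demo.txt | sort | uniq -c | sort -k1,1nr
```

Note that uniq -c only collapses adjacent duplicate lines, which is why the first sort is required.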
...
Multi-pipe expression considerations
Multiple-pipe expressions can be used to great benefit in scripts, both on their own and to capture their output. For example:
...
```shell
set +o pipefail   # only the exit code of the last pipe component is returned
cat joblist.txt | head -5000 | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?           # exit code will be 0
```
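To capture such a pipeline's output in a script, wrap it in $( ). A minimal sketch (the data file here is made up for illustration):

```shell
# hypothetical two-column, tab-delimited data
printf 'j1\trunX\nj2\trunY\nj3\trunX\n' > jobs_demo.txt

# capture the count of the most common run name in a variable
max_count=$(cut -f 2 jobs_demo.txt | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}')
echo "most common run appears $max_count times"
```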
...
Quotes matter
We've already seen that quoting variable evaluation preserves the caller's argument quoting (see Quoting subtleties). But more specifically, quoting preserves any special characters in the variable value's text (e.g. tab or linefeed characters).
...
```shell
runs=$( grep 'SA1903.$' joblist.txt | cut -f 2 )
echo "$runs"   # preserves linefeeds
echo $runs     # linefeeds converted to spaces
for run in $runs; do
  echo "Run name is: $run"
done
```
...
Reading file lines
The read function can be used to read input one line at a time. While the full details of read are complicated (see https://unix.stackexchange.com/questions/209123/understanding-ifs-read-r-line) this read-a-line-at-a-time idiom works nicely.
...
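The example that belongs here did not survive conversion; this is a minimal sketch of the idiom on a made-up file:

```shell
printf 'first line\n  second line\n' > lines_demo.txt

# IFS= preserves leading/trailing whitespace; -r leaves backslashes alone
while IFS= read -r line; do
  echo "Line: $line"
done < lines_demo.txt
```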
...
Field delimiter issues
As we've already seen, field delimiters are tricky! Be aware of the default field delimiter for the various bash utilities, and how to change them:
utility | default delimiter | how to change | example
---|---|---|---
cut | tab | -d or --delimiter option | cut -d ':' -f 1 /etc/passwd
sort | whitespace (one or more spaces or tabs) | -t or --field-separator option | sort -t ':' -k1,1 /etc/passwd
awk | whitespace (one or more spaces or tabs) for input; a single space for output | -F option for input; the OFS variable for output | cat sampleinfo.txt | awk -F "\t" '{ print $1,$3 }'
join | one or more spaces | -t option | |
perl | whitespace (one or more spaces or tabs) when auto-splitting input with -a | -F'/<pattern>/' option | cat sampleinfo.txt | perl -F'/\t/' -ane 'print "$F[0]\t$F[2]\n";'
read | whitespace (one or more spaces or tabs) | set the IFS environment variable | see example above
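For instance, to make awk both split its input on tabs and write tabs back out, set -F and OFS together. A quick sketch using a made-up stand-in for sampleinfo.txt:

```shell
# hypothetical tab-delimited stand-in for sampleinfo.txt
printf 'jobA\t1\tsamp1\njobB\t2\tsamp2\n' > info_demo.txt

# -F sets the input field separator; OFS sets the output field separator
awk -F '\t' 'BEGIN { OFS = "\t" } { print $1, $3 }' info_demo.txt
```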
...
Viewing special characters in text
When working in a terminal, it is sometimes difficult to determine what special characters (e.g. tabs) – if any – are in a file's data, or what line endings are being used. Your desktop GUI code editor may provide a mode for viewing "raw" file contents (usually as 2-digit hexadecimal codes representing each ASCII character). If not, here's an alias that can be used:
...
Note that hex 0x09 is a tab, and hex 0x0a is a linefeed (see http://www.asciitable.com/).
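Whether or not you set up such an alias, the od utility can display each byte of a file; this is one common way (a sketch, not necessarily the alias referred to above):

```shell
printf 'a\tb\n' > bytes_demo.txt

# -A x: show offsets in hex; -t x1z: one-byte hex values plus printable characters
od -A x -t x1z bytes_demo.txt
```

In the output, the tab shows up as 09 and the linefeed as 0a.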
...
Parsing field-oriented text with cut and awk
The basic functions of cut and awk are similar – both are field oriented. Here are the main differences:
...
When to use these programs is partly a matter of taste. I often use either cut or awk to deal with field-oriented data. Even though awk is a full-featured programming language, I find its pattern matching and text processing facilities awkward (pun intended), and so prefer perl for complicated text manipulation.
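One concrete difference worth illustrating (my own example, not part of the original list): cut always emits fields in their file order, no matter how you list them, while awk prints fields in exactly the order you ask for.

```shell
printf 'one\ttwo\tthree\n' > order_demo.txt

# cut ignores the order you request; fields come out in file order
cut -f 3,1 order_demo.txt

# awk prints fields in the order you specify
awk -F '\t' '{ print $3, $1 }' order_demo.txt
```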
Regular expressions in grep, sed and perl
...
Regular expressions are incredibly powerful and should be in every programmer's toolbox. But every tool seems to implement a slightly different standard! What to do? I'll describe some of my practices below, with the understanding that they represent one of many ways to skin the cat.
...