...

  • joblist.txt - contains job name/sample name pairs, tab-delimited, no header
  • sampleinfo.txt - contains information about all samples run on a particular run, along with the job each belongs to.
    • columns (tab-delimited) are job_name, job_id, sample_name, sample_id, date_string; these column names appear in a header line

...

Creating symbolic links to data files

When dealing with large data files, sometimes scattered across many directories, it is often convenient to create symbolic links to those files in a directory where you plan to work with them. A common way to make a symbolic link uses ln, e.g.:
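For instance, a single symlink can be created and read through like this (the /tmp paths and file contents here are illustrative stand-ins, not files from the workshop):

```shell
# a sketch: make a scratch data file, then symlink it from another location
mkdir -p /tmp/linkdemo
echo "some data" > /tmp/linkdemo/big_file.txt
ln -s -f /tmp/linkdemo/big_file.txt /tmp/big_file_link.txt   # -f replaces an existing link
ls -l /tmp/big_file_link.txt   # the -> arrow shows the link target
cat /tmp/big_file_link.txt     # reads through the link: some data
```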

...

Code Block
languagebash
cd; rm -f test/*.*
find ~/workshop/ -name "*.txt" | xargs ln -s -f -t ~/test

...


Removing file suffixes

Sometimes you want to take a file path like ~/my_file.something.txt and extract some or all of the parts before the suffix, for example, to end up with the text my_file here. To do this, first strip off any directories using the basename function. Then use the odd-looking syntax ${<variable-name>%%.<suffix-to-remove>}

Code Block
languagebash
path=~/my_file.something.txt; echo $path
filename=`basename $path`; echo $filename
prefix=${filename%%.something.txt}
echo $prefix
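Note that %% removes the longest suffix matching the pattern, while a single % removes the shortest. A quick illustration of the difference:

```shell
filename=my_file.something.txt
echo ${filename%.*}    # shortest match removed: my_file.something
echo ${filename%%.*}   # longest match removed: my_file
```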

Tricks with sort & uniq

The ~/test/joblist.txt file you just symlink'd describes sequencing job/run pairs, tab-separated. We can use sort and uniq to collapse and count entries in the run name field (column 2):

...

Expand
titleSolution
Code Block
languagebash
cat joblist.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1
# 23 SA13038

...

Multi-pipe expression considerations

Multi-pipe expressions can be used to great benefit in scripts, both on their own and to capture their output. For example:

...

Code Block
languagebash
set +o pipefail # only the exit code of the last pipe component is returned
cat joblist.txt | head -5000 | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?         # exit code will be 0
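The output of such a multi-pipe expression can also be captured into a variable with $( ). A sketch, using a tiny stand-in for joblist.txt (the file contents below are made-up demo data, not the workshop's):

```shell
# demo stand-in for joblist.txt: tab-delimited job<TAB>run pairs
printf 'JA101\tSA13038\nJA102\tSA13038\nJA103\tSA99999\n' > /tmp/joblist_demo.txt

# capture the count of the most common run name
maxCount=$( cut -f 2 /tmp/joblist_demo.txt | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}' )
echo "$maxCount"   # 2
```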

...

Quotes matter

We've already seen that quoting a variable when it is evaluated preserves the caller's argument quoting (see Quoting subtleties). More specifically, quoting preserves any special characters in the variable value's text (e.g. tab or linefeed characters).

...

Code Block
languagebash
runs=$( grep 'SA1903.$' joblist.txt | cut -f 2 )
echo "$runs"   # preserves linefeeds
echo $runs     # linefeeds converted to spaces

for run in $runs; do
  echo "Run name is: $run"
done

...

Reading file lines

The read builtin can be used to read input one line at a time. While the full details of read are complicated (see https://unix.stackexchange.com/questions/209123/understanding-ifs-read-r-line), this read-a-line-at-a-time idiom works nicely.

...

Expand
titleSolution
Code Block
languagebash
# be sure to use a different file descriptor (here 4)
while IFS= read line <&4; do
  jobName=$( echo "$line" | cut -f 1 )
  sampleName=$( echo "$line" | cut -f 3 )
  if [[ "$jobName" == "" ]]; then continue; fi
  echo "Project_${jobName}/${sampleName}.fastq.gz"
done 4< <(tail -n +2 sampleinfo.txt) | tee pathnames.txt

...

Field delimiter issues

As we've already seen, field delimiters are tricky! Be aware of the default field delimiter for the various bash utilities, and how to change them:

  • cut: default delimiter is tab; change with the -d or --delimiter option, e.g.
    cut -d ':' -f 1 /etc/passwd
  • sort: default delimiter is whitespace (one or more spaces or tabs); change with the -t or --field-separator option, e.g.
    sort -t ':' -k1,1 /etc/passwd
  • awk: default delimiter is spaces (one or more), for both input and output; change with the FS (input field separator) and/or OFS (output field separator) variables in a BEGIN{ } block, or the -F or --field-separator option, e.g.
    cat sampleinfo.txt | awk 'BEGIN{ FS=OFS="\t" }{print $1,$3}'
    cat sampleinfo.txt | awk -F "\t" '{ print $1,$3 }'
  • join: default delimiter is one or more spaces; change with the -t option, e.g.
    join -t $'\t' -j 2 file1 file2
  • perl: default delimiter is whitespace (one or more spaces or tabs), when auto-splitting input with -a; change with the -F'/<pattern>/' option, e.g.
    cat sampleinfo.txt | perl -F'/\t/' -ane 'print "$F[0]\t$F[2]\n";'
  • read: default delimiter is whitespace (one or more spaces or tabs); change by setting IFS (see example above)

...

Viewing special characters in text

When working in a terminal, it is sometimes difficult to determine what special characters (e.g. tabs) – if any – are in a file's data, or what line endings are being used. Your desktop GUI code editor may provide a mode for viewing "raw" file contents (usually as 2-digit hexadecimal codes representing each ASCII character). If not, here's an alias that can be used:

...

Note that hex 0x09 is a tab, and hex 0x0a is a linefeed (see http://www.asciitable.com/).
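If you don't have such an alias handy, one common approach (not necessarily the alias referenced above) is hexdump -C, which shows each byte as 2-digit hex alongside the printable characters:

```shell
# dump bytes as hex: tab appears as 09, linefeed as 0a
printf 'a\tb\n' | hexdump -C
```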

...

Parsing field-oriented text with cut and awk

The basic functions of cut and awk are similar – both are field oriented. Here are the main differences:

...

When to use these programs is partly a matter of taste. I often use either cut or awk to deal with field-oriented data. Even though awk is a full-featured programming language, I find its pattern matching and text processing facilities awkward (pun intended), and so prefer perl for complicated text manipulation.
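One practical difference worth knowing: cut always emits fields in their original file order, while awk can reorder them. A quick sketch:

```shell
# cut ignores the order given in -f and outputs fields in file order
printf 'one\ttwo\tthree\n' | cut -f 3,1                               # one  three
# awk prints fields in whatever order you ask for
printf 'one\ttwo\tthree\n' | awk 'BEGIN{FS=OFS="\t"}{print $3,$1}'    # three  one
```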

Regular expressions in grep, sed and perl

...

Regular expressions are incredibly powerful and should be in every programmer's toolbox. But every tool seems to implement slightly different standards! What to do? I'll describe some of my practices below, with the understanding that they represent one of many ways to skin the cat.

...