...

  • joblist.txt - contains job name/sample name pairs, tab-delimited, no header
  • sampleinfo.txt - contains information about all samples in a particular run, along with the job each belongs to.
    • columns (tab-delimited) are job_name, job_id, sample_name, sample_id, date_string; the file starts with a header line containing these names
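
As a rough sketch of working with this layout, the snippet below builds a tiny sampleinfo.txt and lists the samples belonging to one job. The job names, sample names, ids and dates are all made up for illustration; only the column order and the header line come from the description above.

```shell
# Hypothetical data: the real files will have different names and values.
tmpdir=$(mktemp -d)
printf 'job_name\tjob_id\tsample_name\tsample_id\tdate_string\n' >  "$tmpdir/sampleinfo.txt"
printf 'JA18001\t1\tSA18001\t101\t2018-01-05\n'                  >> "$tmpdir/sampleinfo.txt"
printf 'JA18001\t1\tSA18002\t102\t2018-01-05\n'                  >> "$tmpdir/sampleinfo.txt"
printf 'JA18002\t2\tSA18003\t103\t2018-01-09\n'                  >> "$tmpdir/sampleinfo.txt"

# NR > 1 skips the header line; $1 is job_name and $3 is sample_name.
samples=$(awk -F'\t' 'NR > 1 && $1 == "JA18001" { print $3 }' "$tmpdir/sampleinfo.txt")
echo "$samples"
```

Because joblist.txt has no header, the same command without the NR > 1 test would apply there.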

...

creating symbolic links to data files

When dealing with large data files, sometimes scattered in many directories, it is often convenient to create multiple symbolic links to those files in a directory where you plan to work with them. A common way to make a symbolic link uses ln, e.g.:
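
A minimal sketch of the idea, using a scratch directory so it can be run anywhere. The directory and file names (data/run1, work, sample1.fastq) are invented for illustration and do not refer to any real location.

```shell
# Hypothetical layout: a "data" area holding a file and a separate "work" area.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/data/run1" "$tmpdir/work"
echo "ACGT" > "$tmpdir/data/run1/sample1.fastq"

# ln -s <target> <link name>
# An absolute target path keeps the link valid wherever the link itself lives.
ln -s "$tmpdir/data/run1/sample1.fastq" "$tmpdir/work/sample1.fastq"

ls -l "$tmpdir/work"                              # shows the link and its target
link_content=$(cat "$tmpdir/work/sample1.fastq")  # reads the data through the link
echo "$link_content"
```

Removing the link later with rm deletes only the link, not the file it points to.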

...

When to use these programs is partly a matter of taste. I often use either cut or awk to deal with field-oriented data. Even though awk is a full-featured programming language, I find its pattern matching and text processing facilities awkward (pun intended), and so prefer perl for complicated text manipulation.
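
To make the cut/awk comparison concrete, here is a small sketch on made-up tab-delimited records (the job and sample names are illustrative only). For simple field selection the two tools give the same result; reordering fields is one case where awk is needed, since cut always emits fields in their original order.

```shell
# Made-up records: job name <TAB> sample name.
data=$(printf 'JA18001\tSA18001\nJA18002\tSA18003\n')

# Selecting the second field: cut and awk agree.
with_cut=$(echo "$data" | cut -f 2)
with_awk=$(echo "$data" | awk -F'\t' '{ print $2 }')

# Swapping the two fields: awk can do it, cut cannot.
swapped=$(echo "$data" | awk -F'\t' -v OFS='\t' '{ print $2, $1 }')
echo "$swapped"
```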

grep, sed and perl regular expressions

Regular expressions are incredibly powerful and should be in every programmer's toolbox. But every tool seems to implement slightly different standards! What to do? I'll describe some of my practices below, with the understanding that they represent one of many ways to skin the cat.

First, let's be clear: perl has the best regular expression capabilities on the planet, and they serve as the gold standard against which all other regex implementations are judged. Nonetheless, perl is not as convenient to use as grep from the command line, because grep has so many handy options.

grep pattern matching

For basic command-line pattern matching, I first try grep. Even though there are several grep utilities (grep, egrep, fgrep), I tend to use the original grep command. While by default it implements clunky POSIX-style pattern matching, its -P option (available in GNU grep) asks it to honor perl regular expression syntax. In most cases this works well.

Code Block
languagebash
echo "foo bar" | grep '(oo|ba)'         # no match!
echo "foo bar" | grep -P '(oo|ba)'      # match

echo -e "12\n23\n4\n5" | grep '\d\d'    # no matches!
echo -e "12\n23\n4\n5" | grep -P '\d\d' # 2 lines match
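
Where -P is not available (it is a GNU grep extension, and some grep builds lack it), the same two matches can usually be written with the -E option and POSIX character classes. This is only a sketch of one alternative, not a replacement for the approach above.

```shell
# -E (extended regex) understands alternation with unescaped ( | ).
alt=$(echo "foo bar" | grep -E '(oo|ba)')

# POSIX character classes work in plain grep, where \d does not.
digits=$(printf '12\n23\n4\n5\n' | grep '[[:digit:]][[:digit:]]')

echo "$alt"
echo "$digits"
```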

My other favorite grep options are:

  • -v  (inverse) – only print lines with no match
  • -i  (case insensitive) – ignore case when matching alphanumeric characters
  • -c  (count) – just return a count of the matching lines
  • -n  (line number) – prefix output with the line number of the match
  • -l  (list files) – instead of reporting each match, report only the names of files in which a match is found (uppercase -L reports the files with no match)
    • handy for checking a bunch of log files for errors or success
  • -A  (After) and -B (Before) – output the specified number of lines after and before a match
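
A short sketch of these options in action, run against two throwaway log files. The file names and their contents are invented for the example.

```shell
# Hypothetical log files: a.log contains an error, b.log does not.
tmpdir=$(mktemp -d)
printf 'start\nstep 1 ok\nERROR: disk full\ncleanup\ndone\n' > "$tmpdir/a.log"
printf 'start\nstep 1 ok\ndone\n'                            > "$tmpdir/b.log"

n_err=$(grep -c 'ERROR' "$tmpdir/a.log")            # number of matching lines
numbered=$(grep -n -i 'error' "$tmpdir/a.log")      # line number prefix, ignoring case
no_start=$(grep -v 'start' "$tmpdir/b.log")         # lines that do NOT match
context=$(grep -B 1 -A 1 'ERROR' "$tmpdir/a.log")   # one line before and after the match
failed=$(cd "$tmpdir" && grep -l 'ERROR' *.log)     # files that contain a match
clean=$(cd "$tmpdir" && grep -L 'ERROR' *.log)      # files that contain no match

echo "with errors: $failed; without errors: $clean"
```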

perl pattern matching

If grep isn't behaving the way I expect, I turn to perl. Here's how to invoke regex pattern matching from a command line:

Code Block
languagebash
perl -n -e 'print if $_=~/<some pattern>/;'

# for example:
echo -e "12\n23\n4\n5" | perl -n -e 'print if $_ =~/\d\d/'

Gory details:

  • -n tells perl to feed the input to the script one line at a time
  • -e introduces the perl script (always enclose it in single quotes to protect it from shell evaluation)
  • $_ is a built-in variable holding the current line
  • =~ is the perl pattern matching operator
  • the forward slashes ("/  /") enclose the regex pattern

sed pattern substitution

The sed command can be used to edit text using pattern substitution. While it is very powerful, the regex syntax for some of its more advanced features is quite different from "standard" grep or perl regexes. As a result, I tend to use it only for very simple substitutions, usually as a component of a multi-pipe expression.

Code Block
languagebash
# find runs in the SA18xxx series and report their job and run numbers, replacing the JA/SA prefixes with descriptive text
cat joblist.txt | grep -P 'SA18\d\d\d$' | sed 's/JA/job /' | sed 's/SA/run /'

# in the 1st sed expression below, note
#   use of backslash escaping of the forward slash character we want to strip
#   the g modifier to replace all instances of the forward slash
for dir in $(ls -d /stor/home/student0?/); do
  dir2=$( echo "$dir" | sed 's/\///g' | sed 's/stor//' | sed 's/home//' )
  echo "full path: $dir - directory name $dir2"
done
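
As a sketch of a slightly tidier variant: sed accepts other characters as the delimiter after s, which avoids backslash-escaping the forward slash, and several -e expressions can be given to a single sed call. The path below is one hypothetical value matching the student0? pattern above.

```shell
# Hypothetical directory name, standing in for one iteration of the loop.
dir=/stor/home/student01/

# Using | as the delimiter means "/" needs no escaping;
# the three substitutions run in order within one sed process.
dir2=$(echo "$dir" | sed -e 's|/||g' -e 's|stor||' -e 's|home||')
echo "full path: $dir - directory name $dir2"
```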

perl pattern substitution

If sed pattern substitution is not working as I expect, I again turn to perl. Here's how to invoke regex pattern substitution from a command line:

Code Block
languagebash
perl -p -e 's/<search pattern>/<replacement>/;'

# for example:
cat joblist.txt | perl -ne 'print if $_ =~/SA18\d\d\d$/' | \
  perl -pe 's/JA/job /' | perl -pe 's/SA/run /'

Gory details:

  • -p tells perl to read the input one line at a time (like -n) and print each line after the script has run on it
  • -e introduces the perl script (always enclose it in single quotes to protect it from shell evaluation)
  • s is the perl pattern substitution operator; with no variable named, it acts on $_, the current line
  • forward slashes ("/  /  /") enclose the regex search pattern and the replacement text