...

  • joblist.txt - contains job name/sample name pairs, tab-delimited, no header
  • sampleinfo.txt - contains information about all samples run on a particular run, along with the job each belongs to.
    • columns (tab-delimited) are job_name, job_id, sample_name, sample_id, date_string; these column names appear in a header line

...

Creating symbolic links to data files

When dealing with large data files, sometimes scattered across many directories, it is often convenient to create symbolic links to those files in a directory where you plan to work with them. A common way to make a symbolic link uses ln, e.g.:
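For instance, a single symlink can be created and read through like this (the /tmp paths and file contents here are illustrative stand-ins, not files from the workshop):

```shell
# a sketch: make a scratch data file, then symlink it from another location
mkdir -p /tmp/linkdemo
echo "some data" > /tmp/linkdemo/big_file.txt
ln -s -f /tmp/linkdemo/big_file.txt /tmp/big_file_link.txt   # -f replaces an existing link
ls -l /tmp/big_file_link.txt   # the -> arrow shows the link target
cat /tmp/big_file_link.txt     # reads through the link: some data
```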

...

Code Block
languagebash
cd; rm -f test/*.*
find ~/workshop/ -name "*.txt" | xargs ln -s -f -t ~/test

...


Removing file suffixes

Sometimes you want to take a file path like ~/my_file.something.txt and extract some or all of the parts before the suffix, for example, to end up with the text my_file here. To do this, first strip off any directories using the basename function. Then use the odd-looking syntax ${<variable-name>%%.<suffix-to-remove>}

Code Block
languagebash
path=~/my_file.something.txt; echo $path
filename=`basename $path`; echo $filename
prefix=${filename%%.something.txt}
echo $prefix
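Note that %% removes the longest suffix matching the pattern, while a single % removes the shortest. A quick illustration of the difference:

```shell
filename=my_file.something.txt
echo ${filename%.*}    # shortest match removed: my_file.something
echo ${filename%%.*}   # longest match removed: my_file
```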

Tricks with sort & uniq

The ~/test/joblist.txt file you just symlink'd describes sequencing job/run pairs, tab-separated. We can use sort and uniq to collapse and count entries in the run name field (column 2):

...

Expand
titleSolution
Code Block
languagebash
cat joblist.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1
# 23 SA13038

...

Multi-pipe expression considerations

Multi-pipe expressions can be used to great benefit in scripts, both on their own and to capture their output. For example:

...

Code Block
languagebash
set +o pipefail # only the exit code of the last pipe component is returned
cat joblist.txt | head -5000 | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?         # exit code will be 0
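The output of such a multi-pipe expression can also be captured into a variable with $( ). A sketch, using a tiny stand-in for joblist.txt (the file contents below are made-up demo data, not the workshop's):

```shell
# demo stand-in for joblist.txt: tab-delimited job<TAB>run pairs
printf 'JA101\tSA13038\nJA102\tSA13038\nJA103\tSA99999\n' > /tmp/joblist_demo.txt

# capture the count of the most common run name
maxCount=$( cut -f 2 /tmp/joblist_demo.txt | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}' )
echo "$maxCount"   # 2
```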

...

Quotes matter

We've already seen that quoting a variable when it is evaluated preserves the caller's argument quoting (see Quoting subtleties). More specifically, quoting preserves any special characters in the variable value's text (e.g. tab or linefeed characters).

...

Code Block
languagebash
runs=$( grep 'SA1903.$' joblist.txt | cut -f 2 )
echo "$runs"   # preserves linefeeds
echo $runs     # linefeeds converted to spaces

for run in $runs; do
  echo "Run name is: $run"
done

...

Reading file lines

The read builtin can be used to read input one line at a time. While the full details of read are complicated (see https://unix.stackexchange.com/questions/209123/understanding-ifs-read-r-line), this read-a-line-at-a-time idiom works nicely.

...

Expand
titleSolution
Code Block
languagebash
# be sure to use a different file descriptor (here 4)
while IFS= read line <&4; do
  jobName=$( echo "$line" | cut -f 1 )
  sampleName=$( echo "$line" | cut -f 3 )
  if [[ "$jobName" == "" ]]; then continue; fi
  echo "Project_${jobName}/${sampleName}.fastq.gz"
done 4< <(tail -n +2 sampleinfo.txt) | tee pathnames.txt

...

Field delimiter issues

As we've already seen, field delimiters are tricky! Be aware of the default field delimiter for the various bash utilities, and how to change them:

  • cut: default delimiter is tab; change with the -d or --delimiter option, e.g.
    cut -d ':' -f 1 /etc/passwd
  • sort: default delimiter is whitespace (one or more spaces or tabs); change with the -t or --field-separator option, e.g.
    sort -t ':' -k1,1 /etc/passwd
  • awk: default delimiter is spaces (one or more), for both input and output; change with the FS (input field separator) and/or OFS (output field separator) variables in a BEGIN{ } block, or the -F or --field-separator option, e.g.
    cat sampleinfo.txt | awk 'BEGIN{ FS=OFS="\t" }{print $1,$3}'
    cat sampleinfo.txt | awk -F "\t" '{ print $1,$3 }'
  • join: default delimiter is one or more spaces; change with the -t option, e.g.
    join -t $'\t' -j 2 file1 file2
  • perl: default delimiter is whitespace (one or more spaces or tabs), when auto-splitting input with -a; change with the -F'/<pattern>/' option, e.g.
    cat sampleinfo.txt | perl -F'/\t/' -ane 'print "$F[0]\t$F[2]\n";'
  • read: default delimiter is whitespace (one or more spaces or tabs); change by setting IFS (see example above)

...

Viewing special characters in text

When working in a terminal, it is sometimes difficult to determine what special characters (e.g. tabs) – if any – are in a file's data, or what line endings are being used. Your desktop GUI code editor may provide a mode for viewing "raw" file contents (usually as 2-digit hexadecimal codes representing each ASCII character). If not, here's an alias that can be used:

...

Note that hex 0x09 is a tab, and hex 0x0a is a linefeed (see http://www.asciitable.com/).
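If you don't have such an alias handy, one common approach (not necessarily the alias referenced above) is hexdump -C, which shows each byte as 2-digit hex alongside the printable characters:

```shell
# dump bytes as hex: tab appears as 09, linefeed as 0a
printf 'a\tb\n' | hexdump -C
```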

...

Parsing field-oriented text with cut and awk

The basic functions of cut and awk are similar – both are field oriented. Here are the main differences:

...

When to use these programs is partly a matter of taste. I often use either cut or awk to deal with field-oriented data. Even though awk is a full-featured programming language, I find its pattern matching and text processing facilities awkward (pun intended), and so prefer perl for complicated text manipulation.
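One practical difference worth knowing: cut always emits fields in their original file order, while awk can reorder them. A quick sketch:

```shell
# cut ignores the order given in -f and outputs fields in file order
printf 'one\ttwo\tthree\n' | cut -f 3,1                               # one  three
# awk prints fields in whatever order you ask for
printf 'one\ttwo\tthree\n' | awk 'BEGIN{FS=OFS="\t"}{print $3,$1}'    # three  one
```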

Regular expressions in grep, sed and perl

...

Regular expressions are incredibly powerful and should be in every programmer's toolbox. But every tool seems to implement slightly different standards! What to do? I'll describe some of my practices below, with the understanding that they represent one of many ways to skin the cat.

...