Tips and tricks

So we've covered a number of "framework" topics – argument handling, stream handling, error handling – in a systematic way. This section presents various tips and tricks for actually manipulating data, which can be useful both in writing scripts and in command line manipulations.

For data, we'll use a couple of files from the GSAF's (Genome Sequencing and Analysis Facility) automated processing to deliver sequencing data to customers. These files have information about sequencing runs (a machine run, with many samples), sequencing jobs (representing a set of customer samples), and samples (a library of DNA molecules to sequence on the machine).

Here are links to the files if you need to download them after this class is over (you don't need to download them now, since we'll create symbolic links to them).

joblist.txt - contains job name/sample name pairs, tab-delimited, no header
sampleinfo.txt - contains information about all samples run on a particular run, along with the job each belongs to.
- columns (tab-delimited) are job_name, job_id, sample_name, sample_id, date_string, which names are in a header line

create multiple symbolic links

When dealing with large data files, sometimes scattered in many directories, it is often convenient to create multiple symbolic links to those files in a directory where you plan to work with them. A common way to make symbolic link uses ln, e.g.:

mkdir ~/test; cd ~/test
ln -s -f ~/workshop/data/sampleinfo.txt
ls -l

Multiple files can be linked by providing multiple file name arguments along and using the -t (target) option to specify the directory where links to all the files can be created.

cd; rm -f test/*.*
ln -s -f -t test~/workshop/data/*.txt
ls -l

What about the case where the files you want are scattered in sub-directories? Here's a solution using find and xargs:

find returns a list of matching file paths on its standard output
the paths are piped to the standard input of xargs
xargs takes the data on its standard input and calls the specified function (here ln) with that data as the function's argument list.

sort & uniq tricks

The ~/test/joblist.txt file you just symlink'd describes sequencing job/run pairs, tab-separated. We can use sort and uniq to collapse and count entries in the run name field (column 2):

cd ~/test
cut -f 2 joblist.txt | sort | uniq | wc -l
# there are 1244 runs

Job names are in column 1 of the ~/test/sampleinfo.txt file. Here's how to create a histogram of job names showing the count of samples (lines) for each. the -c option to uniq addes a count of unique items, which we can then sort on (numerically) to show the jobs with the most samples first.

cat sampleinfo.txt | tail -n +2 | cut -f 1 | sort | uniq -c | sort -k1,1nr

exercise 1

How many unique job names are in the joblist.txt file?

Solution

cut -f 1 joblist.txt | sort | uniq | wc -l
# there are 3842

Are all the job/run pairs unique?

Solution

Yes. Compare the unique lines of the file to the total lines.

cat joblist.txt | sort | uniq | wc -l
wc -l joblist.txt
# both are 3842

Which run has the most jobs?

Solution

Add a count to the unique run lines then sort on it numerically, in reverse order. The 1st line will then be the job with the most lines (jobs).

cat joblist.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1
# 23 SA13038

reading file lines

The read function can be used to read input one line at a time. While the full details of read are complicated (see https://unix.stackexchange.com/questions/209123/understanding-ifs-read-r-line) this read-a-line-at-a-time idiom works nicely.

lineNo=1
while IFS= read line; do
  echo "Line $lineNo: '$line'"
  lineNo=$(( lineNo + 1 ))
done < sampleinfo.txt

The IFS= clears all of read's default input field separators (whitespace).
- This is needed so that read will set the line variable to exactly the contents of the input line, and not strip leading whitespace from it.
Note the odd syntax for incrementing the line number variable.

Once a line has been read, it can be parsed, for example, using cut, as shown below. Other notes:

The double quotes around "$line" are important to preserve special characters inside the original line (here tab characters)
- without the double quotes, the line fields would be separated by spaces, and the cut field delimiter would need to be changed.
Some lines have an empty job name field; we replace job and sample names in this case.
We assign file descriptor 4 to the file data being read (4< sampleinfo.txt after the done keyword), and read from it explicitly (read line <&4 in the while line). This avoids conflict with any global redirection of standard output (e.g. from automatic logging).

# be sure to use a different file descriptor (here 4)
while IFS= read line <&4; do
  jobName=$( echo "$line" | cut -f 1 )
  sampleName=$( echo "$line" | cut -f 3 )
  if [[ "$jobName" == "" ]]; then
    sampleName="Undetermined"; jobName="none"
  fi
  echo "job $jobName - sample $sampleName"
done 4< sampleinfo.txt | tee read_line_output.txt

Two final modifications:

Strip the header line off the input using anonymous pipe syntax <(tail -n +2 sampleinfo.txt).
- This syntax takes the standard output of the expression in parentheses (a sub-shell) and that can be used as input instead of a file:
  - <4< <(tail -n +2 sampleinfo.txt) instead of 4< sampleinfo.txt.
Save all output from the while loop to a file by piping to tee.

# be sure to use a different file descriptor (here 4)
while IFS= read line <&4; do
  jobName=$( echo "$line" | cut -f 1 )
  sampleName=$( echo "$line" | cut -f 3 )
  if [[ "$jobName" == "" ]]; then
    sampleName="Undetermined"; jobName="none"
  fi
  echo "job $jobName - sample $sampleName"
done 4< <(tail -n +2 sampleinfo.txt) | tee read_line_output.txt

exercise 2

Using the above code as a guide, use the job name and sample name information to construct a pathname of the form Project_<job name>/<sample name>.fastq.gz, and write these paths to a file. Skip any entries with no job name.

Solution

# be sure to use a different file descriptor (here 4)
while IFS= read line <&4; do
  jobName=$( echo "$line" | cut -f 1 )
  sampleName=$( echo "$line" | cut -f 3 )
  if [[ "$jobName" == "" ]]; then continue; fi
  echo "Project_${jobName}/${sampleName}.fastq.gz"
done 4< <(tail -n +2 sampleinfo.txt) | tee pathnames.txt

field delimiter issues

As we've already seen, field delimiters are tricky! Be aware of the default field delimiter for the various bash utilities, and how to change them:

utility	default delimiter	how to change	example
cut	tab	-d or --delimiter option	`cut -d ':' -f 1 /etc/passwd`
sort	whitespace (one ore more spaces or tabs)	-t or --field-separator option	`sort -t ':' -k1,1 /etc/passwd`
awk	whitespace (one ore more spaces or tabs) for both input and output	FS (input field separator) and/or OFS (output field separator) variable in BEGIN{ } block -F or --field-separator option	`cat sampleinfo.txt \| awk 'BEGIN{ FS=OFS="\t" }{print $1,$3}'` `cat sampleinfo.txt \| awk -F "\t" '{ print $1,$3 }'`
join	one or more spaces	-t option	`join -t $'\t' -j 2 file1 file12`
perl	whitespace (one ore more spaces or tabs) when auto-splitting input with -a	-F'/<pattern>/' option	`cat sampleinfo.txt \| perl -F'/\t/' -ane 'print "$F[0]\t$F[2]\n";'`
read	whitespace (one ore more spaces or tabs	IFS= option	see example above

cut versus awk

The basic functions of cut and awk are similar – both are field oriented. Here are the main differences:

Default field separators
- Tab is the default field separator for cut
- whitespace (one or more spaces) is the default field separator for awk
Re-ordering
- cut cannot re-order fields; awk can, based on the order you specify
awk is a full-featured programming language while cut is just a single-purpose utility.

When to use these programs is partly a matter of taste. I often use either cut or awk to deal with field-oriented data. Even though awk is a full-featured programming language, I find its pattern matching and text processing facilities awkward (pun intended), and so prefer perl for complicated text manipulation.