...
For some of the discussions below, we'll use a couple of data files from the GSAF's (Genome Sequencing and Analysis Facility) automated processing that delivers sequencing data to customers. These files have information about customer samples (libraries of DNA molecules to sequence on the machine), grouped into sets assigned as jobs, and sequenced on GSAF's sequencing machines as part of runs.
The files are in your ~/workshop/data directory:
- joblist.txt - contains job name/sample name pairs, tab-delimited, no header
- sampleinfo.txt - contains information about all samples run on a particular run, along with the job each belongs to
  - columns (tab-delimited) are job_name, job_id, sample_name, sample_id, date_string
  - column names are in a header line
...
When dealing with large data files, sometimes scattered across many directories, it is often convenient to create multiple symbolic links to those files in a directory where you plan to work with them. A common way to make a symbolic link uses ln -s, e.g.:
```
mkdir ~/test; cd ~/test
ln -s -f /stor/work/CCBB_Workshops_1/bash_scripting/data/sampleinfo.txt
ls -l
```
...
Sometimes you want to take a file path like ~/my_file.something.txt and extract some or all of the parts before the suffix, for example, to end up with the text my_file here. To do this, first strip off any directories using the basename function. Then use the odd-looking syntax:
- ${<variable-name>%%.<suffix-to-remove>}
- ${<variable-name>##<prefix-to-remove>}
...
```
path=~/my_file.something.txt; echo $path
filename=`basename $path`; echo $filename

# isolate the filename prefix by stripping the ".something.txt" suffix
prefix=${filename%%.something.txt}
echo $prefix

# isolate the filename suffix by stripping the "my_file.something." prefix
suffix=${filename##my_file.something.}
echo $suffix
```
Tricks with sort & uniq
The ~/workshop/data/joblist.txt file describes sequencing job/run pairs, tab-separated. We can use sort and uniq to collapse and count entries in the run name field (column 2):
```
cd ~/workshop/data
cut -f 2 joblist.txt | sort | uniq | wc -l   # there are 1234 unique runs
```
...
Multi-pipe expression considerations
Multiple-pipe expressions can be used to great benefit in scripts, both on their own and to capture their output. For example:
...
```
set +o pipefail  # only the exit code of the last pipe component is returned
cat joblist.txt | head -5000 | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?          # exit code will be 0
```
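For contrast, enabling pipefail makes a pipeline report failure if any component fails, not just the last one. A minimal sketch, using false as a stand-in for any failing step:

```
set +o pipefail                      # default behavior: last component's exit code wins
false | echo "pipe ran" > /dev/null
echo $?                              # prints 0 -- echo, the last component, succeeded

set -o pipefail                      # now any failing component fails the whole pipe
false | echo "pipe ran" > /dev/null
echo $?                              # prints 1 -- false's failure is propagated
set +o pipefail                      # restore the default
```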
Quotes matter
...
The bash for loop
The bash for loop has the basic syntax:
for <arg_name> in <list of whitespace-separated words>; do
<expression>
<expression>
done
Here's a simple example using the seq command, which returns a list of numbers:
```
for num in `seq 5`; do
    echo $num
done
```
Quotes matter
In the Quoting subtleties section, we saw that quoting variable evaluation preserves the caller's argument quoting. More specifically, quoting preserves any special characters in the variable value's text (e.g. tab or linefeed characters).
...
Suppose we read a list of values from a file and want to use them in a for loop after capturing the values into a variable. Here we want to evaluate the variable without quotes; otherwise, the result includes the original linefeed characters instead of being a list of words that a for loop wants:
```
runs=$( grep 'SA1903.$' joblist.txt | cut -f 2 )
echo "$runs" | wc -l   # quoted: preserves linefeeds
echo $runs | wc -l     # unquoted: linefeeds converted to spaces; all runs on one line
for run in $runs; do
    echo "Run name is: $run"
done
```
Basic awk
A basic awk script has the following form:
BEGIN {<expressions>}
{<body expressions>}
END {<expressions>}
The BEGIN and END clauses are executed only once, before and after input is processed respectively; the body expressions are executed for each line of input. Each line is parsed into fields based on the specified field separator (the default is whitespace). Fields can then be accessed via the built-in variables $1 (1st field), $2 (2nd field), and so forth.
Here's a simple awk script that takes the mean of the numbers passed to it:
```
seq 10 | awk '
BEGIN{sum=0; ct=0;}
{sum = sum + $1
 ct = ct + 1}
END{print sum/ct,"is the mean of",ct,"numbers"}'
```
Here is an excellent awk tutorial, very detailed and in-depth
Reading file lines
The read function can be used to read input one line at a time. While the full details of read are complicated (see https://unix.stackexchange.com/questions/209123/understanding-ifs-read-r-line), this read-a-line-at-a-time idiom works nicely.
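A minimal sketch of that idiom, feeding lines from seq instead of a real data file:

```
# IFS= and -r keep each line's leading whitespace and backslashes intact
seq 3 | while IFS= read -r line; do
    echo "Line is: $line"
done
```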
...
| utility | default delimiter | how to change | example |
|---|---|---|---|
| cut | tab | -d or --delimiter option | cut -d ':' -f 1 /etc/passwd |
| sort | whitespace (one or more spaces or Tabs) | -t or --field-separator option | sort -t ':' -k1,1 /etc/passwd |
| awk | whitespace (one or more spaces or Tabs). Note: some older versions of awk do not treat Tabs as field separators. | -F option | cat /etc/passwd \| awk -F ":" '{print $1}' |
| join | one or more spaces | -t option | |
| perl | whitespace (one or more spaces or Tabs) when auto-splitting input with -a | -F'/<pattern>/' option | cat sampleinfo.txt \| perl -F'/\t/' -a -n -e 'print "$F[0]\t$F[2]\n";' |
| read | whitespace (one or more spaces or Tabs) | setting the IFS variable | see example above |
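As an illustration of changing a delimiter, here is a tiny join example on made-up Tab-delimited files (f1.txt and f2.txt are throwaway names, not workshop files):

```
# create two tiny Tab-separated files, pre-sorted on the join field
printf 'a\t1\nb\t2\n' > f1.txt
printf 'a\tX\nb\tY\n' > f2.txt

# join uses spaces by default; -t $'\t' switches the delimiter to Tab
join -t $'\t' f1.txt f2.txt
```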
...
First, let's be clear: perl has the best regular expression capabilities on the planet, and they serve as the gold standard against which all other regex implementations are judged. Nonetheless, perl is not as convenient to use as grep from the command line, because command-line grep has so many handy options.
...
- -v (inverse) – only print lines with no match
- -n (line number) – prefix output with the line number of the match
- -i (case insensitive) – ignore case when matching alphanumeric characters
- -c (count) – just return a count of the matches
- -l (list) – instead of reporting each match, report only the name of each file in which any match is found
- -L – like -l, but only reports the name of files in which no match is found
- handy for checking a bunch of log files for success
- -A (after) and -B (before) – also output the specified number of lines after or before a match
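A few of these options in action, on a throwaway file (fruits.txt is made up for illustration):

```
printf 'apple\nBanana\ncherry\n' > fruits.txt   # demo file with 3 lines

grep -c 'an' fruits.txt      # count of matching lines
grep -i 'banana' fruits.txt  # case-insensitive match finds "Banana"
grep -v 'an' fruits.txt      # lines that do not match
grep -n 'cherry' fruits.txt  # match prefixed with its line number
```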
...
```
perl -n -e 'print if $_ =~ /<some pattern>/;'
# for example:
echo -e "12\n23\n4\n5" | perl -n -e 'print if $_ =~ /\d\d/'
# or, for lines not matching
echo -e "12\n23\n4\n5" | perl -n -e 'print if $_ !~ /\d\d/'
```
...
But what if we want to manipulate each of the 4 FASTQ files? For example, count the number of lines in each one. Let's start with a for loop to get their full paths, and just the FASTQ file names without the _R1_001.fastq.gz suffix:
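A sketch of how such a loop could look, combining basename with the %% suffix-stripping syntax from earlier (the fastq_demo directory and Sample file names here are made up; substitute the real delivery paths):

```
# hypothetical stand-ins for the delivered FASTQ files
mkdir -p fastq_demo
touch fastq_demo/Sample1_R1_001.fastq.gz fastq_demo/Sample2_R1_001.fastq.gz

for path in fastq_demo/*_R1_001.fastq.gz; do
    file=`basename $path`            # strip off the directory
    pfx=${file%%_R1_001.fastq.gz}    # strip off the suffix
    echo "$path has prefix $pfx"
done
```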
...