...

For some of the discussions below, we'll use a couple of data files from the GSAF's (Genome Sequencing and Analysis Facility) automated processing that delivers sequencing data to customers. These files have information about customer samples (libraries of DNA molecules to sequence on the machine), grouped into sets assigned as jobs, and sequenced on GSAF's sequencing machines as part of runs.

The files are in your ~/workshop/data directory:

  • joblist.txt - contains job name/sample name pairs, tab-delimited, no header
  • sampleinfo.txt - contains information about all samples run on a particular run, along with the job each belongs to.
    • columns (tab-delimited) are job_name, job_id, sample_name, sample_id, date_string
    • column names are in a header line
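To get oriented, it can help to build a tiny mock of joblist.txt and practice column operations on it first. The path and values below are invented for illustration; the real file lives in ~/workshop/data.

```shell
# two-column, tab-delimited, no header -- same layout as joblist.txt
# (the names and values here are made up)
printf 'Job001\tSA001\nJob002\tSA001\nJob003\tSA002\n' > /tmp/joblist_demo.txt

cut -f 1 /tmp/joblist_demo.txt              # first column only
cut -f 2 /tmp/joblist_demo.txt | sort -u    # unique values in column 2
```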

...

When dealing with large data files, sometimes scattered across many directories, it is often convenient to create multiple symbolic links to those files in the directory where you plan to work with them. A common way to make a symbolic link is with ln -s, e.g.:

Code Block
languagebash
mkdir ~/test; cd ~/test
ln -s -f /stor/work/CCBB_Workshops_1/bash_scripting/data/sampleinfo.txt
ls -l

...

Sometimes you want to take a file path like ~/my_file.something.txt and extract some or all of the parts before the suffix, for example, to end up with the text my_file here. To do this, first strip off any directories using the basename command. Then use the odd-looking syntax:

  • ${<variable-name>%%<suffix-to-remove>}
  • ${<variable-name>##<prefix-to-remove>}

...



Code Block
languagebash
path=~/my_file.something.txt; echo $path
filename=`basename $path`; echo $filename

# isolate the filename prefix by stripping the ".something.txt" suffix
prefix=${filename%%.something.txt}
echo $prefix

# isolate the filename suffix by stripping the "my_file.something." prefix
suffix=${filename##my_file.something.}
echo $suffix
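A related detail worth knowing: with a wildcard pattern, %% removes the longest matching suffix while a single % removes the shortest (and ## vs # behave the same way for prefixes). For example:

```shell
filename=my_file.something.txt

echo ${filename%.*}    # shortest suffix match removed -> my_file.something
echo ${filename%%.*}   # longest suffix match removed  -> my_file
echo ${filename#*.}    # shortest prefix match removed -> something.txt
echo ${filename##*.}   # longest prefix match removed  -> txt
```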

Tricks with sort & uniq

The ~/test/joblist.txt file you just symlink'd describes sequencing job/run pairs, tab-separated. We can use sort and uniq to collapse and count entries in the run name field (column 2):

Code Block
languagebash
cd ~/test
ln -s -f -t . /stor/work/CCBB_Workshops_1/bash_scripting/data/*.txt
cut -f 2 joblist.txt | sort | uniq | wc -l
# there are 1234 unique runs

...

Solution:


Code Block
languagebash
cat joblist.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1
# 23 SA13038


Multi-pipe expression considerations

Multiple-pipe expressions can be used to great benefit in scripts, both on their own or to capture their output. For example:

...

Code Block
languagebash
set +o pipefail # only the exit code of the last pipe component is returned
cat joblist.txt | head -5000 | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?         # exit code will be 0
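To see the difference pipefail makes, compare exit codes for a pipeline whose first command fails (the directory name below is made up so that ls fails):

```shell
# with pipefail on, any failing component makes the whole pipeline fail
set -o pipefail
ls /no_such_dir_demo 2>/dev/null | wc -l > /dev/null
status_on=$?

# with pipefail off (the default), only the last component's status counts
set +o pipefail
ls /no_such_dir_demo 2>/dev/null | wc -l > /dev/null
status_off=$?

echo "with pipefail: $status_on ; without: $status_off"
```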


The bash for loop

The bash for loop has the basic syntax:

for <arg_name> in <list of whitespace-separated words>; do
   <expression>
   <expression>
done

Here's a simple example using the seq command, which returns a list of numbers.

Code Block
languagebash
for num in $(seq 5); do
  echo $num
done
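Brace expansion offers an alternative to seq for simple numeric ranges:

```shell
# {1..5} expands to the words 1 2 3 4 5 before the loop runs
for num in {1..5}; do
  echo $num
done
```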

Quotes matter

In the Quoting subtleties section, we saw that quoting a variable evaluation preserves the caller's argument quoting. More specifically, quoting preserves any special characters in the variable value's text (e.g. Tab or linefeed characters).

...

Suppose we read a list of values from a file and want to use them in a for loop after capturing the values into a variable. Here we want to evaluate the variable without quotes; otherwise, the result includes the original linefeed characters instead of being a list of words that a for loop wants:

Code Block
languagebash
runs=$( grep 'SA1903.$' joblist.txt | cut -f 2 )
echo "$runs" | wc -l  # preserves linefeeds
echo $runs   | wc -l  # linefeeds converted to spaces; all runs on one line
for run in $runs; do
  echo "Run name is: $run"
done

Basic awk

A basic awk script has the following form:

BEGIN {<expressions>}
{<body expressions>}
END {<expressions>}

The BEGIN and END clauses are each executed only once: before and after input is processed, respectively. The body expressions are executed for each line of input. Each line is parsed into fields based on the specified field separator (default: whitespace). Fields can then be accessed via the built-in variables $1 (1st field), $2 (2nd field), and so forth.

Here's a simple awk script that takes the mean of the numbers passed to it:

Code Block
languagebash
seq 10 | awk '
BEGIN{sum=0; ct=0;}
{sum = sum + $1
 ct = ct + 1}
END{print sum/ct,"is the mean of",ct,"numbers"}'
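awk also maintains built-in counters, so the manual ct variable isn't strictly needed; NR (the number of records read) gives the same count, and awk variables default to 0 so the BEGIN initialization can be dropped:

```shell
# same mean calculation using the built-in NR record counter
seq 10 | awk '{sum = sum + $1}
END{print sum/NR,"is the mean of",NR,"numbers"}'
```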

Here is an excellent awk tutorial, very detailed and in-depth

Reading file lines

The read function can be used to read input one line at a time. While the full details of read are complicated (see https://unix.stackexchange.com/questions/209123/understanding-ifs-read-r-line) this read-a-line-at-a-time idiom works nicely. 
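A minimal sketch of that idiom, reading from a here-string so the example is self-contained:

```shell
# IFS= preserves leading/trailing whitespace; -r leaves backslashes alone
while IFS= read -r line; do
  echo "got: $line"
done <<< $'first line
second line'
```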

...

cut
  • default delimiter: Tab
  • how to change: -d or --delimiter option
  • example: cut -d ':' -f 1 /etc/passwd

sort
  • default delimiter: whitespace (one or more spaces or Tabs)
  • how to change: -t or --field-separator option
  • example: sort -t ':' -k1,1 /etc/passwd

awk
  • default delimiter: whitespace (one or more spaces or Tabs)
    • Note: some older versions of awk do not treat Tabs as field separators.
  • how to change: FS (input field separator) and/or OFS (output field separator) variables in the BEGIN{ } block, or the -F (--field-separator) option
  • examples:
    • cat sampleinfo.txt | awk 'BEGIN{ FS=OFS="\t" } {print $1,$3}'
    • cat /etc/passwd | awk -F ":" '{print $1}'

join
  • default delimiter: one or more spaces
  • how to change: -t option
  • example: join -t $'\t' -j 2 file1 file2

perl
  • default delimiter: whitespace (one or more spaces or Tabs), when auto-splitting input with -a
  • how to change: -F'/<pattern>/' option
  • example: cat sampleinfo.txt | perl -F'/\t/' -a -n -e 'print "$F[0]\t$F[2]\n";'

read
  • default delimiter: whitespace (one or more spaces or Tabs)
  • how to change: IFS variable
  • example: see example above
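The awk FS=OFS idiom above, applied to a two-line mock of sampleinfo.txt (the values are invented; the real file also has a header line):

```shell
# mock sampleinfo.txt: job_name, job_id, sample_name, sample_id, date_string
printf 'JobA\t1\tSamp1\t10\t2024-01-01\nJobB\t2\tSamp2\t20\t2024-01-02\n' \
  > /tmp/sampleinfo_demo.txt

# read and write tab-separated; print columns 1 and 3
awk 'BEGIN{ FS=OFS="\t" } {print $1,$3}' /tmp/sampleinfo_demo.txt
```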

...

First, let's be clear: perl has the best regular expression capabilities on the planet, and they serve as the gold standard against which all other regex implementations are judged. Nonetheless, perl is not as convenient to use as grep from the command line, because command-line grep has so many handy options.

...

  • -v  (inverse) – only print lines with no match
  • -n  (line number) – prefix output with the line number of the match
  • -i  (case insensitive) – ignore case when matching alphanumeric characters
  • -c  (count) – just return a count of the matches
  • -l  – instead of reporting each match, report only the name of files in which any match is found
  • -L  – like -l, but only reports the name of files in which no match is found
    • handy for checking a bunch of log files for success
  • -A  (After) and -B (Before) – also output the specified number of lines after or before a match
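A few of these options in action, on a small mock input file:

```shell
printf 'alpha\nBeta\nbeta\ngamma\n' > /tmp/grep_demo.txt

grep -c -i 'beta' /tmp/grep_demo.txt   # case-insensitive count of matches
grep -n 'beta'    /tmp/grep_demo.txt   # prefix each match with its line number
grep -v 'beta'    /tmp/grep_demo.txt   # lines that do NOT match
```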

...

Code Block
languagebash
perl -n -e 'print if $_=~/<some pattern>/;'

# for example:
echo -e "12\n23\n4\n5" | perl -n -e 'print if $_ =~/\d\d/'


# or, for lines not matching
echo -e "12\n23\n4\n5" | perl -n -e 'print if $_ !~/\d\d/'

...

But what if we want to manipulate each of the 4 FASTQ files? For example, count the number of lines in each one. Let's start with a for loop to get their full paths, and just the FASTQ file names without the _R1_001.fastq.gz suffix:
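A sketch of that pattern, combining basename with %% suffix-stripping (the paths below are hypothetical; substitute your real FASTQ directory):

```shell
for path in /data/run1/Samp1_R1_001.fastq.gz /data/run1/Samp2_R1_001.fastq.gz; do
  file=$(basename "$path")            # strip leading directories
  echo "${file%%_R1_001.fastq.gz}"    # strip the common suffix
done
```

Note that basename only manipulates the path text, so this loop runs even if the files themselves don't exist.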

...