...
For some of the discussions below, we'll use a couple of data files from the GSAF's (Genome Sequencing and Analysis Facility) automated processing that delivers sequencing data to customers. These files have information about GSAF's sequencing runs (a machine run, with many customer jobs), sequencing jobs (representing a set of customer samples), and samples (libraries of DNA molecules, grouped into sets assigned as jobs, and sequenced on GSAF's sequencing machines as part of runs).
Here are links to the files if you need to download them after this class is over (you don't need to download them now, since we'll create symbolic links to them).
The files are in your ~/workshop/data directory:
- joblist.txt - contains job name/run name pairs, tab-delimited, no header
- sampleinfo.txt - contains information about all samples run on a particular run, along with the job each belongs to.
  - column names are in a header line
  - columns (tab-delimited) are: job_name, job_id, sample_name, sample_id, date_string
...
When dealing with large data files, sometimes scattered in many directories, it is often convenient to create multiple symbolic links (symlinks) to those files in a directory where you plan to work with them. A common way to make a symbolic link is with ln -s, e.g.:
```bash
mkdir ~/test; cd ~/test
ln -s -f /stor/work/CCBB_Workshops_1/bash_scripting/data/sampleinfo.txt
ls -l
```
Multiple files can be linked by providing multiple file name arguments and using the -t (target) option to specify the directory where links to all the files will be created.
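Here's a minimal sketch of multi-file linking with -t. It uses throw-away stand-in files under /tmp (the /tmp/lndemo paths are made up for illustration) rather than the workshop data, so it can be run anywhere:

```bash
# Create stand-in data files (substitutes for the real workshop data paths)
mkdir -p /tmp/lndemo/data /tmp/lndemo/test
touch /tmp/lndemo/data/joblist.txt /tmp/lndemo/data/sampleinfo.txt

# -t . says "create the links in the current directory";
# each file name argument gets its own symlink
cd /tmp/lndemo/test
ln -sf -t . /tmp/lndemo/data/joblist.txt /tmp/lndemo/data/sampleinfo.txt
ls -l
```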
...
What about the case where the files you want are scattered in sub-directories? Consider a typical GSAF project directory structure, where FASTQ files are nested in subdirectories:
Here's a solution using find and xargs:
- find returns a list of matching file paths on its standard output
- the paths are piped to the standard input of xargs
- xargs takes the data on its standard input and calls the specified command (here ln) with that data as the command's argument list.
```bash
rm -rf ~/test; mkdir ~/test; cd ~/test
find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.gz" | xargs ln -sf -t .
```
Removing file suffixes and prefixes
Sometimes you want to take a file path like ~/my_file.something.txt and extract some or all of the parts before the suffix, for example to end up with the text my_file. To do this, first strip off any directories using the basename command, then use the odd-looking syntax:
- ${<variable-name>%%.<suffix-to-remove>}
...
- ${<variable-name>##<prefix-to-remove>}
```bash
path=~/my_file.something.txt; echo $path
filename=`basename $path`; echo $filename
# isolate the filename prefix by stripping the ".something.txt" suffix
prefix=${filename%%.something.txt}
echo $prefix
# isolate the filename suffix by stripping the "my_file.something." prefix
suffix=${filename##my_file.something.}
echo $suffix
```
Tricks with sort & uniq
The ~/workshop/data/joblist.txt file describes sequencing job/run pairs, tab-separated. We can use sort and uniq to collapse and count entries in the run name field (column 2):
```bash
cd ~/workshop/data
cut -f 2 joblist.txt | sort | uniq | wc -l   # there are 1234 unique runs
```
...
Multi-pipe expression considerations
Multiple-pipe expressions can be used to great benefit in scripts, both on their own and to capture their output. For example:
```bash
# how many jobs does the run with the most jobs have?
numJob=$( cat joblist.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}' )
echo $numJob   # will be 23
```
One complication is that, by default, pipe expression execution does not stop if one of the steps encounters an error – and the exit code for the expression as a whole may be 0 (success). For example:
...
```bash
set +o pipefail   # only the exit code of the last pipe component is returned
cat joblist.txt | head -5000 | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?           # exit code will be 0
```
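By contrast, enabling pipefail makes the expression report failure when any component fails. A minimal sketch, using the always-failing false command as a stand-in for a broken pipe step:

```bash
set -o pipefail    # report failure if any pipe component fails
false | wc -l      # wc succeeds, but false (the first component) fails
echo $?            # non-zero now that pipefail is enabled
set +o pipefail    # restore the default behavior
```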
Quotes matter
In the Quoting subtleties section, we saw that quoting variable evaluation preserves the caller's argument quoting. More specifically, quoting preserves any special characters in the variable value's text (e.g. tab or linefeed characters).
Consider this case where a captured string contains tabs, as illustrated below.
- evaluating "$txt" inside of double quotes preserves the tabs
- evaluating $txt without double quotes converts each tab to a single space
Since we usually want to preserve tabs, we want to evaluate these variables inside double quotes.
```bash
txt=$( echo -e "aa\tbb\tcc" )
echo $txt     # tabs converted to spaces
echo "$txt"   # tabs preserved
```
...
The bash for loop
The bash for loop has the basic syntax:
for <arg_name> in <list of whitespace-separated words>
do
<expression>
<expression>
done
See https://www.gnu.org/software/bash/manual/html_node/Looping-Constructs.html.
Here's a simple example using the seq command, which returns a list of numbers.
```bash
for num in `seq 5`; do
  echo $num
done
```
Suppose we read a list of values from a file and want to use them in a for loop after capturing the values into a variable. Here we want to evaluate the variable without quotes; otherwise, the result includes the original linefeed characters instead of being a list of words that a for loop wants:
```bash
nums=$( seq 5 )
echo "$nums" | wc -l   # preserves linefeeds
echo $nums | wc -l     # linefeeds converted to spaces; all numbers on one line
for num in $nums; do
  echo "Number is: $num"
done
```
Basic awk
A basic awk script has the following form:
BEGIN {<expressions>}
{<body expressions>}
END {<expressions>}
The BEGIN and END clauses are executed only once, before and after input is processed respectively; the body expressions are executed for each line of input. Each line is parsed into fields based on the specified field separator (default is whitespace). Fields can then be accessed via the built-in variables $1 (1st field), $2 (2nd field), and so forth.
Here's a simple awk script that takes the average of the numbers passed to it:
```bash
seq 10 | awk '
BEGIN{sum=0; ct=0}
{sum = sum + $1
 ct = ct + 1}
END{print sum/ct,"is the mean of",ct,"numbers"}'
```
Here is an excellent awk tutorial, very detailed and in-depth
Reading file lines
The read function can be used to read input one line at a time. While the full details of read are complicated (see https://unix.stackexchange.com/questions/209123/understanding-ifs-read-r-line) this read-a-line-at-a-time idiom works nicely.
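Here's a sketch of the idiom, reading from a small throw-away file (the /tmp file name is made up for illustration):

```bash
# Create a 3-line test file
printf 'line one\nline two\nline three\n' > /tmp/read_demo.txt

# IFS= preserves leading/trailing whitespace; -r leaves backslashes alone
while IFS= read -r line; do
  echo "Read: $line"
done < /tmp/read_demo.txt
```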
...
| utility | default delimiter | how to change | example |
|---|---|---|---|
| cut | Tab | -d or --delimiter option | `cut -d ':' -f 1 /etc/passwd` |
| sort | whitespace (one or more spaces or Tabs) | -t or --field-separator option | `sort -t ':' -k1,1 /etc/passwd` |
| awk | whitespace (one or more spaces or Tabs). Note: some older versions of awk do not treat Tabs as field separators. | -F option | `cat sampleinfo.txt \| awk -F "\t" '{print $1,$3}'` |
| join | one or more spaces | -t option | |
| perl | whitespace (one or more spaces or Tabs) when auto-splitting input with -a | -F'/<pattern>/' option | `cat sampleinfo.txt \| perl -F'/\t/' -ane 'print "$F[0]\t$F[2]\n";'` |
| read | whitespace (one or more spaces or Tabs) | IFS= setting | see example above. Note that a bare IFS= removes any field separator, so whole lines are read each loop iteration. |
Viewing special characters in text
...
- Default field separators
  - Tab is the default field separator for cut
    - and the field separator can only be a single character
  - whitespace (one or more spaces or Tabs) is the default field separator for awk
    - note that some older versions of awk do not include Tab as a default delimiter
- Re-ordering
  - cut cannot re-order fields; `cut -f 3,2` is the same as `cut -f 2,3`
  - awk does re-order fields, based on the order you specify
- awk is a full-featured programming language while cut is just a single-purpose utility.
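The re-ordering difference is easy to demonstrate on a throw-away tab-separated line:

```bash
# cut outputs selected fields in file order, regardless of how you list them...
printf 'a\tb\tc\n' | cut -f 3,2
# ...while awk outputs fields in exactly the order you specify
printf 'a\tb\tc\n' | awk -F'\t' '{print $3,$2}'
```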
...
First, let's be clear: perl has the best regular expression capabilities on the planet, and they serve as the gold standard against which all other regex implementations are judged. Nonetheless, perl is not as convenient to use as grep from the command line, because command-line grep has so many handy options.
...
- -v (inverse) – only print lines with no match
- -n (line number) – prefix output with the line number of the match
- -i (case insensitive) – ignore case when matching alphanumeric characters
- -c (count) – just return a count of the matches
- -l – instead of reporting each match, report only the names of files in which any match is found
- -L – like -l, but report only the names of files in which no match is found
  - handy for checking a bunch of log files for success
- -A (after) and -B (before) – also output the specified number of lines after or before a match
...
If grep pattern matching isn't behaving the way I expect, I turn to perl. Here's how to invoke regex pattern matching from a command line using perl:
...
```bash
perl -n -e 'print if $_ =~ /<pattern>/'
```
For example:
```bash
echo -e "12\n23\n4\n5" | perl -n -e 'print if $_ =~/\d\d/'
# or, for lines not matching
echo -e "12\n23\n4\n5" | perl -n -e 'print if $_ !~/\d\d/'
```
...
- -n tells perl to feed the input to the script one line at a time
- -e introduces the perl script
  - always enclose a command-line perl script in single quotes to protect it from shell evaluation
- $_ is a built-in variable holding the current line (including any invisible line-ending characters)
- =~ is the perl pattern matching operator (=~ says the pattern must match; !~ says the pattern must not match)
- the forward slashes ("/ /") enclose the regex pattern.
sed pattern substitution
The sed command can be used to edit text using pattern substitution. While it is very powerful, the regex syntax for some of its more advanced features is quite different from "standard" grep or perl regular expressions. As a result, I tend to use it only for very simple substitutions, usually as a component of a multi-pipe expression.
```bash
# Look for runs in the SA18xxx series and report their job and run numbers
# without the JA/SA prefixes but with other text
cat joblist.txt | grep -P 'SA18\d\d\d$' | sed 's/JA/job /' | sed 's/SA/run /'

# List some student home directories, both the full paths and the student account sub-directory
# In the 1st sed expression below, note the use of backslash escaping of the
# forward slash character we want to strip.
# The g modifier says to replace all instances of the forward slash
for dir in `ls -d /stor/home/student0?/`; do
  subdir=$( echo $dir | sed 's/\///g' | sed 's/stor//' | sed 's/home//' )
  echo "full path: $dir - directory name $subdir"
done
```
perl pattern substitution
...
If I have a more complicated pattern, or if sed pattern substitution is not working as I expect (which happens frequently!), I again turn to perl. Here's how to invoke regex pattern substitution from a command line:
perl -p -e '~s/<search pattern>/<replacement>/'
For example:
```bash
cat joblist.txt | perl -ne 'print if $_ =~/SA18\d\d\d$/' | \
  perl -pe '~s/JA/job /' | perl -pe '~s/SA/run /'

# Or, using parentheses to capture part of the search pattern:
cat joblist.txt | perl -ne 'print if $_ =~/SA18\d\d\d$/' | \
  perl -pe '~s/JA(\d+)\tSA(\d+)/job $1\trun $2/'
```
Gory details:
- -p tells perl to print its substitution results
- -e introduces the perl script (always enclose it in single quotes to protect it from shell evaluation)
- ~s is the perl pattern substitution operator
- forward slashes ("/ / /") enclose the regex search pattern and the replacement text
- parentheses ( ) enclosing a pattern "capture" text matching the pattern in a built-in positional variable
- $1 for the 1st captured text, $2 for the 2nd captured text, etc.
Handling multiple FASTQ files example
Here's an example of how to work with multiple FASTQ files, produced when a Next Generation Sequencing (NGS) core facility such as the GSAF sequences a library of DNA molecules provided to them. These FASTQ files, generally compressed with gzip to save space, are often provided in sub-directories, each associated with a sample. For example, the output of running tree on one such directory is shown below.
```bash
tree /stor/work/CCBB_Workshops_1/bash_scripting/fastq/
```
Here's a one-liner that isolates just the unique sample names, where there are 2 files for each sample name:
```bash
find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" \
  | perl -pe 's|.*/||' | perl -pe 's/_S\d.*//' | sort | uniq
```
But what if we want to manipulate each of the 4 FASTQ files? For example, count the number of lines in each one. Let's start with a for loop to get their full paths, and just the FASTQ file names without the _R1_001.fastq.gz suffix:
```bash
# This is how to get all 4 full pathnames:
find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz"

# In a for loop, strip off the directory and the common _R1_001.fastq.gz file suffix
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
  file=`basename $path`
  pfx=${file%%_R1_001.fastq.gz}
  echo "$pfx - $file"
done
```
...
```bash
# Shorten the sample prefix some more...
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
  file=`basename $path`
  pfx=${file%%_R1_001.fastq.gz}
  pfx=$( echo $pfx | perl -pe '~s/_S\d+//' | perl -pe '~s/L00/L/' )
  echo "$pfx - $file"
done
```
Now that we have nice sample names, count the number of sequences in each file. To un-compress the gzip'd files "on the fly" (without creating another file), we use zcat (like cat but for gzip'd files) and count the lines, e.g.:
...
```bash
zcat <path> | wc -l
```
But FASTQ files have 4 lines for every sequence read. So to count the sequences properly we need to divide this number by 4.
```bash
# Clunky way to do arithmetic in bash -- but bash only does integer arithmetic!
echo $(( `zcat <gzipped fq file> | wc -l` / 4 ))
# Better way using awk
zcat <gzipped fq file> | wc -l | awk '{print $1/4}'
```
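To check the arithmetic, here's a self-contained sketch using a tiny gzip'd FASTQ created on the fly (the /tmp file path and read contents are made up):

```bash
# Make a 2-read FASTQ (4 lines per read = 8 lines total), gzip'd
printf '@r1\nACGT\n+\nIIII\n@r2\nTTAA\n+\nIIII\n' | gzip > /tmp/demo.fastq.gz

# Integer arithmetic in bash: 8 / 4 = 2 reads
echo $(( $(zcat /tmp/demo.fastq.gz | wc -l) / 4 ))
# Same result using awk
zcat /tmp/demo.fastq.gz | wc -l | awk '{print $1/4}'
```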
...
```bash
# Count the sequence reads in each FASTQ file and save info to a file
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
  file=`basename $path`
  pfx=${file%%_R1_001.fastq.gz}
  pfx=$( echo $pfx | sed 's/00//' | perl -pe '~s/_S\d+//' )
  echo "Counting reads in $file..." 1>&2
  nSeq=$( zcat $path | wc -l | awk '{print $1/4}' )
  echo -e "$nSeq\t$pfx\t$file\t$path"
done | tee fastq_stats.txt
cut -f 1-3 fastq_stats.txt
# produces this output:
1302707 WT-1_L2 WT-1_S5_L002_R1_001.fastq.gz
1409369 WT-1_L1 WT-1_S5_L001_R1_001.fastq.gz
2281687 WT-2_L2 WT-2_S12_L002_R1_001.fastq.gz
2496141 WT-2_L1 WT-2_S12_L001_R1_001.fastq.gz
```
...
```bash
cat fastq_stats.txt | awk '
BEGIN{FS="\t"; tot=0; ct=0}
{tot = tot + $1
 ct = ct + 1}
END{print "Total of",tot,"sequences in",ct,"files";
    printf("Mean reads per file: %d\n", tot/ct)}'
# produces this output:
Total of 7489904 sequences in 4 files
Mean reads per file: 1872476
```
Note that the submitter actually provided GSAF with only 2 samples, labeled WT-1 and WT-2, but for various reasons each sample was sequenced on more than one lane of the sequencer's flowcell. To see how many FASTQ files were produced for each sample, we strip off the lane number and count the unique results:
```bash
cut -f 2 fastq_stats.txt | perl -pe '~s/_L\d+//' | sort | uniq -c
# produces this output:
2 WT-1
2 WT-2
```
What if we want to know the total sequences for each sample rather than for each file? Get a list of all unique sample names, then total the reads in the fastq_stats.txt entries for that sample only:
```bash
for samp in `cut -f 2 fastq_stats.txt | perl -pe '~s/_L\d+//' | sort | uniq`; do
  echo "sample $samp" 1>&2
  cat fastq_stats.txt | grep -P "\t${samp}_L\d" | awk -v sample=$samp '
    BEGIN{FS="\t"; tot=0}
    {tot = tot + $1}
    END{print sample,"has",tot,"total reads"}'
done | tee sample_stats.txt
cat sample_stats.txt
# produces this output:
WT-1 has 2712076 total reads
WT-2 has 4777828 total reads
```