...
For some of the discussions below, we'll use a couple of data files from the GSAF's (Genome Sequencing and Analysis Facility) automated processing that delivers sequencing data to customers. These files have information about customer samples (libraries of DNA molecules to sequence on the machine), grouped into sets assigned as jobs, and sequenced on GSAF's sequencing machines as part of runs.
The files are in your ~/workshop/data directory:
- joblist.txt - contains job name/sample name pairs, tab-delimited, no header
- sampleinfo.txt - contains information about all samples run on a particular run, along with the job each belongs to
  - columns (tab-delimited) are job_name, job_id, sample_name, sample_id, date_string
  - column names are in a header line
...
When dealing with large data files, sometimes scattered across many directories, it is often convenient to create multiple symbolic links to those files in a directory where you plan to work with them. A common way to make a symbolic link uses ln -s, e.g.:
```
mkdir ~/test; cd ~/test
ln -s -f /stor/work/CCBB_Workshops_1/bash_scripting/data/sampleinfo.txt
ls -l
```
...
Sometimes you want to take a file path like ~/my_file.something.txt and extract some or all of the parts before the suffix, for example, to end up with the text my_file here. To do this, first strip off any directories using the basename function. Then use the odd-looking syntax:
- ${<variable-name>%%.<suffix-to-remove>}
- ${<variable-name>##<prefix-to-remove>}
...
```
path=~/my_file.something.txt; echo $path
filename=`basename $path`; echo $filename

# isolate the filename prefix by stripping the ".something.txt" suffix
prefix=${filename%%.something.txt}
echo $prefix

# isolate the filename suffix by stripping the "my_file.something." prefix
suffix=${filename##my_file.something.}
echo $suffix
```
Tricks with sort & uniq
The ~/workshop/data/joblist.txt file describes sequencing job/run pairs, tab-separated. We can use sort and uniq to collapse and count entries in the run name field (column 2):
```
cd ~/workshop/data
cut -f 2 joblist.txt | sort | uniq | wc -l   # there are 1234 unique runs
```
...
Multi-pipe expression considerations
Multiple-pipe expressions can be used to great benefit in scripts, both on their own and to capture their output. For example:
...
```
set +o pipefail  # only the exit code of the last pipe component is returned
cat joblist.txt | head -5000 | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?          # exit code will be 0
```
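For contrast, enabling pipefail makes a pipeline report failure if any component fails, not just the last one. A minimal sketch, using false as a stand-in for any failing step:

```
set +o pipefail                      # default behavior: last component's exit code wins
false | echo "pipe ran" > /dev/null
echo $?                              # prints 0 -- echo, the last component, succeeded

set -o pipefail                      # now any failing component fails the whole pipe
false | echo "pipe ran" > /dev/null
echo $?                              # prints 1 -- false's failure is propagated
set +o pipefail                      # restore the default
```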
Quotes matter
...
The bash for loop
The bash for loop has the basic syntax:
for <arg_name> in <list of whitespace-separated words>; do
<expression>
<expression>
done
Here's a simple example using the seq command, which returns a list of numbers:
```
for num in `seq 5`; do
    echo $num
done
```
Quotes matter
In the Quoting subtleties section, we saw that quoting variable evaluation preserves the caller's argument quoting. More specifically, quoting preserves any special characters in the variable value's text (e.g. tab or linefeed characters).
...
Suppose we read a list of values from a file and want to use them in a for loop after capturing the values into a variable. Here we want to evaluate the variable without quotes; otherwise, the result includes the original linefeed characters instead of being a list of words that a for loop wants:
```
runs=$( grep 'SA1903.$' joblist.txt | cut -f 2 )
echo "$runs" | wc -l   # quoted: preserves linefeeds
echo $runs | wc -l     # unquoted: linefeeds converted to spaces; all runs on one line
for run in $runs; do
    echo "Run name is: $run"
done
```
Basic awk
A basic awk script has the following form:
BEGIN {<expressions>}
{<body expressions>}
END {<expressions>}
The BEGIN and END clauses are executed only once, before and after input is processed respectively; the body expressions are executed for each line of input. Each line is parsed into fields based on the specified field separator (the default is whitespace). Fields can then be accessed via the built-in variables $1 (1st field), $2 (2nd field), and so forth.
Here's a simple awk script that takes the mean of the numbers passed to it:
```
seq 10 | awk '
BEGIN{sum=0; ct=0;}
{sum = sum + $1
 ct = ct + 1}
END{print sum/ct,"is the mean of",ct,"numbers"}'
```
Here is an excellent awk tutorial, very detailed and in-depth
Reading file lines
The read function can be used to read input one line at a time. While the full details of read are complicated (see https://unix.stackexchange.com/questions/209123/understanding-ifs-read-r-line), this read-a-line-at-a-time idiom works nicely.
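A minimal sketch of that idiom, feeding lines from seq instead of a real data file:

```
# IFS= and -r keep each line's leading whitespace and backslashes intact
seq 3 | while IFS= read -r line; do
    echo "Line is: $line"
done
```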
...
| utility | default delimiter | how to change | example |
|---|---|---|---|
| cut | tab | -d or --delimiter option | cut -d ':' -f 1 /etc/passwd |
| sort | whitespace (one or more spaces or Tabs) | -t or --field-separator option | sort -t ':' -k1,1 /etc/passwd |
| awk | whitespace (one or more spaces or Tabs). Note: some older versions of awk do not treat Tabs as field separators. | -F option | cat /etc/passwd \| awk -F ":" '{print $1}' |
| join | one or more spaces | -t option | |
| perl | whitespace (one or more spaces or Tabs) when auto-splitting input with -a | -F'/<pattern>/' option | cat sampleinfo.txt \| perl -F'/\t/' -a -n -e 'print "$F[0]\t$F[2]\n";' |
| read | whitespace (one or more spaces or Tabs) | setting the IFS variable | see example above |
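As an illustration of changing a delimiter, here is a tiny join example on made-up Tab-delimited files (f1.txt and f2.txt are throwaway names, not workshop files):

```
# create two tiny Tab-separated files, pre-sorted on the join field
printf 'a\t1\nb\t2\n' > f1.txt
printf 'a\tX\nb\tY\n' > f2.txt

# join uses spaces by default; -t $'\t' switches the delimiter to Tab
join -t $'\t' f1.txt f2.txt
```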
...
First, let's be clear: perl has the best regular expression capabilities on the planet, and they serve as the gold standard against which all other regex implementations are judged. Nonetheless, perl is not as convenient to use as grep from the command line, because command-line grep has so many handy options.
...
- -v (inverse) – only print lines with no match
- -n (line number) – prefix output with the line number of the match
- -i (case insensitive) – ignore case when matching alphanumeric characters
- -c (count) – just return a count of the matches
- -l (list) – instead of reporting each match, report only the name of each file in which any match is found
- -L – like -l, but only reports the name of files in which no match is found
- handy for checking a bunch of log files for success
- -A (after) and -B (before) – also output the specified number of lines after or before a match
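A few of these options in action, on a throwaway file (fruits.txt is made up for illustration):

```
printf 'apple\nBanana\ncherry\n' > fruits.txt   # demo file with 3 lines

grep -c 'an' fruits.txt      # count of matching lines
grep -i 'banana' fruits.txt  # case-insensitive match finds "Banana"
grep -v 'an' fruits.txt      # lines that do not match
grep -n 'cherry' fruits.txt  # match prefixed with its line number
```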
...
```
perl -n -e 'print if $_ =~ /<some pattern>/;'
# for example:
echo -e "12\n23\n4\n5" | perl -n -e 'print if $_ =~ /\d\d/'
# or, for lines not matching
echo -e "12\n23\n4\n5" | perl -n -e 'print if $_ !~ /\d\d/'
```
...
But what if we want to manipulate each of the 4 FASTQ files? For example, count the number of lines in each one. Let's start with a for loop to get their full paths, and just the FASTQ file names without the _R1_001.fastq.gz suffix:
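A sketch of how such a loop could look, combining basename with the %% suffix-stripping syntax from earlier (the fastq_demo directory and Sample file names here are made up; substitute the real delivery paths):

```
# hypothetical stand-ins for the delivered FASTQ files
mkdir -p fastq_demo
touch fastq_demo/Sample1_R1_001.fastq.gz fastq_demo/Sample2_R1_001.fastq.gz

for path in fastq_demo/*_R1_001.fastq.gz; do
    file=`basename $path`            # strip off the directory
    pfx=${file%%_R1_001.fastq.gz}    # strip off the suffix
    echo "$path has prefix $pfx"
done
```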
...