Tips and tricks
Overview
So we've covered a number of "framework" topics – argument handling, stream handling, error handling – in a systematic way. This section presents various tips and tricks for actually manipulating data, which can be useful both in writing scripts and in command line manipulations.
For some of the discussions below, we'll use a couple of data files from the GSAF's (Genome Sequencing and Analysis Facility) automated processing that delivers sequencing data to customers. These files have information about customer samples (libraries of DNA molecules to sequence on the machine), grouped into sets assigned as jobs, and sequenced on GSAF's sequencing machines as part of runs.
The files are in your ~/workshop/data directory:
- joblist.txt - contains job name/run name pairs, tab-delimited, no header
- sampleinfo.txt - contains information about all samples sequenced on a particular run, along with the job each belongs to
  - column names are in a header line
  - columns (tab-delimited) are: job_name, job_id, sample_name, sample_id, date_string
Creating symbolic links to data files
When dealing with large data files, sometimes scattered in many directories, it is often convenient to create multiple symbolic links (symlinks) to those files in a directory where you plan to work with them. A common way to make a symbolic link uses ln -s, e.g.:
mkdir ~/test; cd ~/test
ln -s -f /stor/work/CCBB_Workshops_1/bash_scripting/data/sampleinfo.txt
ls -l
Multiple files can be linked by providing multiple file name arguments and using the -t (target) option to specify the directory where the links should be created.
rm -rf ~/test; mkdir ~/test; cd ~/test
ln -s -f -t . /stor/work/CCBB_Workshops_1/bash_scripting/data/*.txt
ls -l
What about the case where the files you want are scattered in sub-directories? Consider a typical GSAF project directory structure, where FASTQ files are nested in subdirectories.
Here's a solution using find and xargs:
- find returns a list of matching file paths on its standard output
- the paths are piped to the standard input of xargs
- xargs takes the data on its standard input and calls the specified command (here ln) with that data as the command's argument list.
cd ~/test
find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.gz" | xargs ln -sf -t .
Removing file suffixes and prefixes
Sometimes you want to take a file path like ~/my_file.something.txt and extract some or all of the parts before the suffix, for example, to end up with the text my_file here. To do this, first strip off any directories using the basename command. Then use the odd-looking syntax:
- ${<variable-name>%%<suffix-to-remove>} – removes the longest matching suffix from the end of the value
- ${<variable-name>##<prefix-to-remove>} – removes the longest matching prefix from the beginning of the value
path=~/my_file.something.txt; echo $path
filename=`basename $path`; echo $filename

# isolate the filename prefix by stripping the ".something.txt" suffix
prefix=${filename%%.something.txt}
echo $prefix

# isolate the filename suffix by stripping the "my_file.something." prefix
suffix=${filename##my_file.something.}
echo $suffix
Tricks with sort & uniq
The ~/workshop/data/joblist.txt file describes sequencing job/run pairs, tab-separated. We can use sort and uniq to collapse and count entries in the run name field (column 2):
cd ~/workshop/data
cut -f 2 joblist.txt | sort | uniq | wc -l    # there are 1234 unique runs
Job names are in column 1 of the ~/test/sampleinfo.txt file. Here's how to create a histogram of job names showing the count of samples (lines) for each. The -c option to uniq adds a count of unique items, which we can then sort on (numerically) to show the jobs with the most samples first.
cat sampleinfo.txt | tail -n +2 | cut -f 1 | sort | uniq -c | sort -k1,1nr
exercise 1
How many unique job names (column 1 values) are in the joblist.txt file?
Are all the job/run pairs unique?
Which run has the most jobs?
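One possible approach, as a sketch (the exact numbers depend on the data):

# unique job names (column 1)
cut -f 1 joblist.txt | sort | uniq | wc -l

# are all job/run pairs unique? compare the total line count to the unique line count
wc -l < joblist.txt
sort joblist.txt | uniq | wc -l

# which run has the most jobs? count lines per run, sort descending, take the top line
cut -f 2 joblist.txt | sort | uniq -c | sort -k1,1nr | head -1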
Multi-pipe expression considerations
Multiple-pipe expressions can be used to great benefit in scripts, both on their own and to capture their output. For example:
# how many jobs does the run with the most jobs have?
numJob=$( cat joblist.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}' )
echo $numJob    # will be 23
One complication is that, by default, pipe expression execution does not stop if one of the steps encounters an error – and the exit code for the expression as a whole may be 0 (success). For example:
cat joblistTypo.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?    # exit code will be 0
To force a non-0 exit code to be returned if any part of a multi-pipe expression fails, enable the pipefail shell option:
set -o pipefail    # a non-0 exit code from any pipe component will terminate pipe execution
cat joblistTypo.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?    # exit code will be 1
But of course there's always a case where you don't want a pipe to return an error – for example, you might want to grab only the first few thousand lines of a file with head. If the file is longer than that, head exits as soon as it has read the lines it needs, the upstream command then receives a SIGPIPE signal and exits with a non-0 code – and with pipefail set, the whole pipe reports failure even though you got exactly the lines you asked for.
set +o pipefail    # only the exit code of the last pipe component is returned
cat joblist.txt | head -5000 | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1 | awk '{print $1}'
echo $?    # exit code will be 0
The bash for loop
The bash for loop has the basic syntax:
for <arg_name> in <list of whitespace-separated words>
do
<expression>
<expression>
done
See https://www.gnu.org/software/bash/manual/html_node/Looping-Constructs.html.
Here's a simple example using the seq command, which prints a list of numbers.
for num in `seq 5`; do
    echo $num
done
Quotes matter
In the Quoting subtleties section, we saw that quoting variable evaluation preserves the caller's argument quoting. More specifically, quoting preserves any special characters in the variable value's text (e.g. tab or linefeed characters).
Consider this case where a captured string contains tabs, as illustrated below.
- evaluating "$txt" inside of double quotes preserves the tabs
- evaluating $txt without double quotes converts each tab to a single space
Since we usually want to preserve tabs, we want to evaluate these variables inside double quotes.
txt=$( echo -e "aa\tbb\tcc" )
echo $txt      # tabs converted to spaces
echo "$txt"    # tabs preserved
Suppose we read a list of values from a file and want to use them in a for loop after capturing the values into a variable. Here we want to evaluate the variable without quotes; otherwise, the result includes the original linefeed characters instead of being a list of words that a for loop wants:
nums=$( seq 5 )
echo "$nums" | wc -l    # preserves linefeeds
echo $nums | wc -l      # linefeeds converted to spaces; all numbers on one line
for num in $nums; do
    echo "Number is: $num"
done
Basic awk
A basic awk script has the following form:
BEGIN {<expressions>}
{<body expressions>}
END {<expressions>}
The BEGIN and END clauses are executed only once, before and after input is processed respectively; the body expressions are executed for each line of input. Each line is parsed into fields based on the specified field separator (default is whitespace). Fields can then be accessed via the built-in variables $1 (1st field), $2 (2nd field) and so forth.
Here's a simple awk script that takes the average of the numbers passed to it:
seq 10 | awk '
BEGIN{sum=0; ct=0;}
{sum = sum + $1
 ct = ct + 1}
END{print sum/ct,"is the mean of",ct,"numbers"}'
There are excellent, detailed awk tutorials available online if you want to go deeper.
Reading file lines
The read built-in can be used to read input one line at a time. While the full details of read are complicated (see https://unix.stackexchange.com/questions/209123/understanding-ifs-read-r-line), this read-a-line-at-a-time idiom works nicely.
lineNo=0
while IFS= read line; do
    lineNo=$(( lineNo + 1 ))
    echo "Line $lineNo: '$line'"
done < sampleinfo.txt
echo -e "\nRead $lineNo lines"
- The IFS= clears read's default input field separators, which are normally whitespace (one or more space characters or tabs).
- This is needed so that read will set the line variable to exactly the contents of the input line, and not strip leading whitespace from it.
- Note the odd syntax for incrementing the line number variable.
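That $(( )) arithmetic expansion syntax can be used anywhere a value is needed; here's a minimal sketch (remember that bash arithmetic is integer-only):

i=0
i=$(( i + 1 ))       # evaluate the expression and substitute the result
echo $(( i * 10 ))   # prints 10
echo $(( 7 / 2 ))    # prints 3 -- integer division truncates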
Once a line has been read, it can be parsed, for example, using cut, as shown below. Other notes:
- The double quotes around "$line" are important to preserve special characters inside the original line (here, tab characters).
- Without the double quotes, the line fields would be separated by spaces, and the cut field delimiter would need to be changed.
- Some lines have an empty job name field; we substitute placeholder job and sample names in this case.
- We assign file descriptor 4 to the file data being read (4< sampleinfo.txt after the done keyword), and read from it explicitly (read line <&4 in the while line).
  - This avoids conflict with any global redirection of standard output (e.g. from automatic logging).
# be sure to use a different file descriptor (here 4)
while IFS= read line <&4; do
    jobName=$( echo "$line" | cut -f 1 )
    sampleName=$( echo "$line" | cut -f 3 )
    if [[ "$jobName" == "" ]]; then
        sampleName="Undetermined"; jobName="none"
    fi
    echo "job $jobName - sample $sampleName"
done 4< sampleinfo.txt
Two final modifications:
- Strip the header line off the input using process substitution (anonymous pipe) syntax: <(tail -n +2 sampleinfo.txt).
  - This syntax takes the standard output of the expression in parentheses (a sub-shell) and lets it be used as input in place of a file: 4< <(tail -n +2 sampleinfo.txt) instead of 4< sampleinfo.txt.
- Save all output from the while loop to a file by piping to tee.
# be sure to use a different file descriptor (here 4)
while IFS= read line <&4; do
    jobName=$( echo "$line" | cut -f 1 )
    sampleName=$( echo "$line" | cut -f 3 )
    if [[ "$jobName" == "" ]]; then
        sampleName="Undetermined"; jobName="none"
    fi
    echo "job $jobName - sample $sampleName"
done 4< <(tail -n +2 sampleinfo.txt) | tee read_line_output.txt
exercise 2
Using the above code as a guide, use the job name and sample name information to construct a pathname of the form Project_<job name>/<sample name>.fastq.gz, and write these paths to a file. Skip any entries with no job name.
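One possible solution sketch, following the read-a-line pattern above (the output file name fastq_paths.txt is just an example):

while IFS= read line <&4; do
    jobName=$( echo "$line" | cut -f 1 )
    sampleName=$( echo "$line" | cut -f 3 )
    if [[ "$jobName" == "" ]]; then continue; fi    # skip entries with no job name
    echo "Project_${jobName}/${sampleName}.fastq.gz"
done 4< <(tail -n +2 sampleinfo.txt) > fastq_paths.txt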
Field delimiter issues
As we've already seen, field delimiters are tricky! Be aware of the default field delimiter for the various bash utilities, and how to change them:
| utility | default delimiter | how to change | example |
| --- | --- | --- | --- |
| cut | tab | -d or --delimiter option | cut -d ':' -f 1 /etc/passwd |
| sort | whitespace (one or more spaces or tabs) | -t or --field-separator option | sort -t ':' -k1,1 /etc/passwd |
| awk | whitespace (one or more spaces or tabs); note: some older versions of awk do not treat tabs as field separators | -F option | cat /etc/passwd \| awk -F ":" '{print $1}' |
| join | one or more spaces | -t option | see the example below |
| perl | whitespace (one or more spaces or tabs) when auto-splitting input with -a | -F'/<pattern>/' option | cat sampleinfo.txt \| perl -F'/\t/' -a -n -e 'print "$F[0]\t$F[2]\n";' |
| read | whitespace (one or more spaces or tabs) | set the IFS variable | see example above; note that a bare IFS= removes any field separator, so whole lines are read each loop iteration |
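For join, here's a minimal sketch of changing its delimiter to tab (assuming two hypothetical files, file1.txt and file2.txt, that are tab-delimited and sorted on their first column, the join key):

# $'\t' is bash ANSI-C quoting for a literal tab character
join -t $'\t' -1 1 -2 1 file1.txt file2.txt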
Viewing special characters in text
When working in a terminal, it is sometimes difficult to determine what special characters (e.g. tabs) – if any – are in a file's data, or what line endings are being used. Your desktop GUI code editor may provide a mode for viewing "raw" file contents (usually as 2-digit hexadecimal codes representing each ASCII character). If not, here's an alias that can be used:
alias hexdump='od -A x -t x1z -v'
head -5 joblist.txt | hexdump
Output will look like this:
000000 4a 41 31 31 31 38 30 09 53 41 31 32 30 31 33 0a  >JA11180.SA12013.<
000010 4a 41 31 31 32 30 36 09 53 41 31 32 30 31 33 0a  >JA11206.SA12013.<
000020 4a 41 31 31 32 30 37 09 53 41 31 32 30 30 31 0a  >JA11207.SA12001.<
000030 4a 41 31 31 32 30 37 09 53 41 31 32 30 30 34 0a  >JA11207.SA12004.<
000040 4a 41 31 31 32 30 38 09 53 41 31 32 30 30 31 0a  >JA11208.SA12001.<
000050
Note that hex 0x09 is a tab, and hex 0x0a is a linefeed (see http://www.asciitable.com/).
Parsing field-oriented text with cut and awk
The basic functions of cut and awk are similar – both are field oriented. Here are the main differences:
- Default field separators
  - Tab is the default field separator for cut, and the field separator can only be a single character
  - whitespace (one or more spaces or tabs) is the default field separator for awk
    - note that some older versions of awk do not include tab as a default delimiter
- Re-ordering
  - cut cannot re-order fields; cut -f 3,2 is the same as cut -f 2,3
  - awk does re-order fields, based on the order you specify (see the example below)
- awk is a full-featured programming language while cut is just a single-purpose utility.
When to use these programs is partly a matter of taste. I often use either cut or awk to deal with field-oriented data.
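For example, here's a quick illustration of the re-ordering difference, using the tab-delimited joblist.txt (column 1 is the job name, column 2 the run name):

# cut always outputs fields in their original file order, no matter how you list them
cut -f 2,1 joblist.txt | head -2      # still job <tab> run
# awk outputs fields in the order you specify
awk 'BEGIN{FS=OFS="\t"} {print $2, $1}' joblist.txt | head -2    # run <tab> job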
For more complex data manipulation, even though awk is a full-featured programming language, I find its pattern matching and text processing facilities awkward (pun intended), so I prefer perl with its rich regular expression capabilities.
Regular expressions in grep, sed and perl
Regular expressions are incredibly powerful and should be in every programmer's toolbox. But every tool seems to implement slightly different standards! What to do? I'll describe some of my practices below, with the understanding that they represent one of many ways to skin the cat.
First, let's be clear: perl has the best regular expression capabilities on the planet, and they serve as the gold standard against which all other regex implementations are judged. Nonetheless, perl is not as convenient to use as grep from the command line, because command-line grep has so many handy options.
grep pattern matching
For basic command-line pattern matching, I first try grep. Even though there are several grep utilities (grep, egrep, fgrep), I tend to use the original grep command. While by default it implements clunky POSIX-style pattern matching, its -P argument asks it to honor perl regular expression syntax. In most cases this works well.
echo "foo bar" | grep '(oo|ba)' # no match! echo "foo bar" | grep -P '(oo|ba)' # match echo -e "12\n23\n4\n5" | grep '\d\d' # no matches! echo -e "12\n23\n4\n5" | grep -P'\d\d' # 2 lines match
My other favorite grep options are:
- -v (inverse) – only print lines with no match
- -n (line number) – prefix output with the line number of the match
- -i (case insensitive) – ignore case when matching alphanumeric characters
- -c (count) – just return a count of the matches
- -l – instead of reporting each match, report only the name of files in which any match is found
- -L – like -l, but only reports the name of files in which no match is found
- handy for checking a bunch of log files for success
- -A (After) and -B (Before) – output the specified number of lines after and before a match, respectively
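For example, the -L option gives a quick way to scan a set of log files for ones missing an expected success message (the file names and message text here are hypothetical):

# report log files that do NOT contain the success message anywhere
grep -L 'job complete' *.log
# add -i if the message's capitalization varies
grep -i -L 'job complete' *.log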
perl pattern matching
If grep pattern matching isn't behaving the way I expect, I turn to perl. Here's how to invoke regex pattern matching from a command line using perl:
perl -n -e 'print if $_=~/<pattern>/'
For example:
echo -e "12\n23\n4\n5" | perl -n -e 'print if $_ =~/\d\d/' # or, for lines not matching echo -e "12\n23\n4\n5" | perl -n -e 'print if $_ !~/\d\d/'
Gory details:
- -n tells perl to feed the input to the script one line at a time
- -e introduces the perl script
- Always enclose a command-line perl script in single quotes to protect it from shell evaluation
- $_ is a built-in variable holding the current line (including any invisible line-ending characters)
- =~ is the perl pattern matching (binding) operator (=~ says the pattern must match; !~ says the pattern must not match)
- the forward slashes (/.../) enclose the regex pattern.
sed pattern substitution
The sed command can be used to edit text using pattern substitution. While it is very powerful, the regex syntax for some of its more advanced features is quite different from "standard" grep or perl regular expressions. As a result, I tend to use it only for very simple substitutions, usually as a component of a multi-pipe expression.
# Look for runs in the SA18xxx range and report their job and run numbers
# without the JA/SA prefixes but with other text
cat joblist.txt | grep -P 'SA18\d\d\d$' | sed 's/JA/job /' | sed 's/SA/run /'

# List some student home directories, both the full paths and the student account sub-directory.
# In the 1st sed expression below, note the use of backslash escaping of the
# forward slash character we want to strip.
# The g modifier says to replace all instances of the forward slash.
for dir in `ls -d /stor/home/student0?/`; do
    subdir=$( echo $dir | sed 's/\///g' | sed 's/stor//' | sed 's/home//' )
    echo "full path: $dir - directory name $subdir"
done
perl pattern substitution
If I have a more complicated pattern, or if sed pattern substitution is not working as I expect (which happens frequently!), I again turn to perl. Here's how to invoke regex pattern substitution from a command line:
perl -p -e '~s/<search pattern>/<replacement>/'
For example:
cat joblist.txt | perl -ne 'print if $_ =~/SA18\d\d\d$/' | \
    perl -pe '~s/JA/job /' | perl -pe '~s/SA/run /'

# Or, using parentheses to capture part of the search pattern:
cat joblist.txt | perl -ne 'print if $_ =~/SA18\d\d\d$/' | \
    perl -pe '~s/JA(\d+)\tSA(\d+)/job $1 - run $2/'
Gory details:
- -p tells perl to print its substitution results
- -e introduces the perl script (always encode it in single quotes to protect it from shell evaluation)
- ~s is the perl pattern substitution operator
- forward slashes (s/<search>/<replacement>/) enclose the regex search pattern and the replacement text
- parentheses ( ) enclosing a pattern "capture" text matching the pattern in a built-in positional variable
- $1 for the 1st captured text, $2 for the 2nd captured text, etc.
Handling multiple FASTQ files example
Here's an example of how to work with multiple FASTQ files, produced when a Next Generation Sequencing (NGS) core facility such as the GSAF sequences a library of DNA molecules provided to them. These FASTQ files (generally compressed with gzip to save space) are often provided in sub-directories, each associated with a sample. For example, the output of running tree on one such directory is shown below.
tree /stor/work/CCBB_Workshops_1/bash_scripting/fastq/
Here's a one-liner that isolates just the unique sample names, where there are 2 files for each sample name:
find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" \
    | perl -pe 's|.*/||' | perl -pe 's/_S\d.*//' | sort | uniq
But what if we want to manipulate each of the 4 FASTQ files? For example, count the number of lines in each one. Let's start with a for loop to get their full paths, and just the FASTQ file names without the _R1_001.fastq.gz suffix:
# This is how to get all 4 full pathnames:
find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz"

# In a for loop, strip off the directory and the common _R1_001.fastq.gz file suffix
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
    file=`basename $path`
    pfx=${file%%_R1_001.fastq.gz}
    echo "$pfx - $file"
done
Now shorten the lane numbers by removing "00", and remove the _S\d+ sample number (which is not part of the user's sample name):
# Shorten the sample prefix some more...
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
    file=`basename $path`
    pfx=${file%%_R1_001.fastq.gz}
    pfx=$( echo $pfx | perl -pe '~s/_S\d+//' | perl -pe '~s/L00/L/' )
    echo "$pfx - $file"
done
Now that we have nice sample names, count the number of sequences in each file. To un-compress the gzip'd files "on the fly" (without creating another file), we use zcat (like cat but for gzip'd files) and count the lines, e.g.:
zcat <path> | wc -l
But FASTQ files have 4 lines for every sequence read. So to count the sequences properly we need to divide this number by 4.
# Clunky way to do arithmetic in bash -- but bash only does integer arithmetic!
echo $(( `zcat <gzipped fq file> | wc -l` / 4 ))

# Better way using awk
zcat <gzipped fq file> | wc -l | awk '{print $1/4}'
So now add this to our for loop and save the results to a file:
# Count the sequence reads in each FASTQ file and save info to a file
for path in $( find /stor/work/CCBB_Workshops_1/bash_scripting/fastq -name "*.fastq.gz" ); do
    file=`basename $path`
    pfx=${file%%_R1_001.fastq.gz}
    pfx=$( echo $pfx | sed 's/00//' | perl -pe '~s/_S\d+//' )
    echo "Counting reads in $file..." 1>&2
    nSeq=$( zcat $path | wc -l | awk '{print $1/4}' )
    echo -e "$nSeq\t$pfx\t$file\t$path"
done | tee fastq_stats.txt

cut -f 1-3 fastq_stats.txt
# produces this output:
1302707 WT-1_L2 WT-1_S5_L002_R1_001.fastq.gz
1409369 WT-1_L1 WT-1_S5_L001_R1_001.fastq.gz
2281687 WT-2_L2 WT-2_S12_L002_R1_001.fastq.gz
2496141 WT-2_L1 WT-2_S12_L001_R1_001.fastq.gz
Now use awk to total the number of sequences we received:
cat fastq_stats.txt | awk '
BEGIN{FS="\t"; tot=0; ct=0}
{tot = tot + $1
 ct = ct + 1}
END{print "Total of",tot,"sequences in",ct,"files";
    printf("Mean reads per file: %d\n", tot/ct)}'

# produces this output:
Total of 7489904 sequences in 4 files
Mean reads per file: 1872476
Note that the submitter actually provided GSAF with only 2 samples, labeled WT-1 and WT-2, but for various reasons each sample was sequenced on more than one lane of the sequencer's flowcell. To see how many FASTQ files were produced for each sample, we strip off the lane number and count the unique results:
cut -f 2 fastq_stats.txt | perl -pe '~s/_L\d+//' | sort | uniq -c

# produces this output:
      2 WT-1
      2 WT-2
What if we want to know the total sequences for each sample rather than for each file? Get a list of all unique sample names, then total the reads recorded in fastq_stats.txt for each sample:
for samp in `cut -f 2 fastq_stats.txt | perl -pe '~s/_L\d+//' | sort | uniq`; do
    echo "sample $samp" 1>&2
    cat fastq_stats.txt | grep -P "\t${samp}_L\d" | awk -vsample=$samp '
        BEGIN{FS="\t"; tot=0}
        {tot = tot + $1; pfx = $2}
        END{print sample,"has",tot,"total reads"}'
done | tee sample_stats.txt

cat sample_stats.txt
# produces this output:
WT-1 has 2712076 total reads
WT-2 has 4777828 total reads