Part 3: Advanced text manipulation

Example data files

For some of the discussions below, we'll use some files in your ~/data directory.

The ~/data/walrus_sounds.tsv file lists the types of sounds made by several well-known walruses, and the length of each occurrence. Its Tab-delimited fields are:

  • column 1 - walrus name
  • column 2 - sound type
  • column 3 - length of sound

Take a look at the first few lines of this file:

cd ~/data
head walrus_sounds.tsv

The .tsv filename extension stands for tab-separated values, indicating that the field separator (the character separating fields) is Tab. We can verify this using the handy hexdump alias we defined for you, as discussed at Intro Unix: What is text?

cd ~/data
head walrus_sounds.tsv | hexdump

In the output, each Tab appears as the hexadecimal character 0x09.
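
For reference, here's what a Tab looks like in generic hexdump -C output (an illustration, not the actual file contents; the course's hexdump alias may format things slightly differently):

echo -e "JA123\tSA456" | hexdump -C
# 00000000  4a 41 31 32 33 09 53 41  34 35 36 0a              |JA123.SA456.|
# 0000000c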

We will also use two data files from the GSAF's (Genome Sequencing and Analysis Facility) automated processing that delivers sequencing data to customers. These files have information about customer Samples (libraries of DNA molecules to sequence on the machine), grouped into sets assigned as Jobs, and sequenced on GSAF's sequencing machines as part of sequencer Runs.

The files are also in your ~/data directory:

  • joblist.txt - contains job name/sample name pairs, Tab-delimited, no header
    • the "JAnnnnn" items in the 1st column are Jobs
    • the "SAnnnnn" items in the 2nd column are Run
  • sampleinfo.txt - contains information about all samples run on a particular run, along with the job each belongs to.
    • columns (Tab-delimited) are job_name, job_id, sample_name, sample_id, date_string
    • column names are in a header line

Take a look at the first few lines of these files also:

cd ~/data
head joblist.txt
head sampleinfo.txt

Exercise 3-1

What field separators are used in ~/data/joblist.txt and ~/data/sampleinfo.txt?

 Answer...
cd ~/data
head joblist.txt | hexdump
head sampleinfo.txt | hexdump

The hexdump output shows that both files use Tab to separate fields.

How many lines do these data files have?

 Hint...

Use the word count command with the -l (count lines) option:

wc -l <file(s)>

 Answer...
cd ~/data
wc -l *.txt *.tsv

shows:

 3841 joblist.txt
   44 sampleinfo.txt
  200 walrus_sounds.tsv
 4085 total

Cut, sort, uniq

cut

The cut command lets you isolate ranges of data from its input lines (from files or standard input):

  • cut -f <field_number(s)> extracts one or more fields (-f) from each line
    • the default field delimiter is Tab
    • use -d <delim> to change the field delimiter
  • cut -c <character_number(s)> extracts one or more characters (-c) from each line
  • the <numbers> can be
    • a comma-separated list of numbers (e.g. 1,4,7)
    • a hyphen-separated range (e.g. 2-5)
    • a trailing hyphen says "and all items after that" (e.g. 3,7-)
  • cut does not re-order fields, so cut -f 5,3,1 acts like -f 1,3,5

Examples:

cd ~/data
cut -f 2 joblist.txt | head
head joblist.txt | cut -c 9-13

# no field reordering, so these two produce the same output
cut -f1,2 walrus_sounds.tsv | head
cut -f2,1 walrus_sounds.tsv | head   
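
The -d option mentioned above changes the delimiter for non-Tab data. A quick illustration with comma-separated input:

echo "one,two,three" | cut -d ',' -f 2    # displays 'two'
echo "one,two,three" | cut -d ',' -f 2-   # displays 'two,three'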

Exercise 3-2

How would you extract the job_name and sample_name fields from the first 5 data lines of ~/data/sampleinfo.txt, without including the header? Recall that job_name and sample_name are fields 1 and 3 of ~/data/sampleinfo.txt.

 Hint...

Use tail -n +2 (or tail +2) to skip the header line and start at line 2.
Then use cut -f to isolate the desired fields.

 Answer...

tail -n +2 ~/data/sampleinfo.txt | head -5 | cut -f 1,3

sort

sort sorts its input lines using an efficient algorithm

  • by default sorts each line lexically (as strings), low to high
    • use -n to sort numerically
    • use -V for Version sort (numbers with surrounding text)
    • use -r to reverse the sort order
  • use one or more -k <start_field_number>,<end_field_number> options to specify a range of "keys" (fields) to sort on
    • use this option when you want to preserve all data on the input lines, but sort on part(s) of the line
    • e.g. -k1,1 -k2,2nr to sort field 1 lexically then field 2 as a number, high-to-low
    • by default, fields are delimited by whitespace -- one or more spaces or Tabs 
      • use -t <delim> to change the field delimiter (e.g. -t $'\t' in bash for Tab only, ignoring spaces; see the example below)

Examples:

Here we use cut to isolate text we want to sort:

cd ~/data
cut -f 2 joblist.txt | head | sort     # sort the 1st 10 Runs in "joblist.txt"  
cut -f 2 joblist.txt | head | sort -r  # reverse-sort the 1st 10 Runs in "joblist.txt"

# reverse-sort the Jobs in "joblist.txt" then look at the 1st 10
cut -f 1 joblist.txt | sort -r | head  

But we can also sort lines based on one or more fields specified by the -k option:

cd ~/data
# sort the lines of "joblist.txt" according to the data in field 1 (Job), high-to-low
# then view the top 10 lines
sort -k1,1r joblist.txt | head

# sort lines of "walrus_sounds.tsv" by sound type (field 2) 
# then by walrus (field 1) & look at 20
cat walrus_sounds.tsv | sort -k2,2 -k1,1 | head -20
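
And here's a quick illustration of the -t option, using the colon in the MM:SS sound lengths as the field delimiter (a sketch; it sorts by the SS part only):

cut -f 3 walrus_sounds.tsv | sort -t ':' -k2,2n | head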

Exercise 3-3

Which walruses make the longest sounds?

 Answer...

sort -k3,3r ~/data/walrus_sounds.tsv | head

Looks like ET and Antje make the longest sounds.

Here's an example of using the -V (Version sort) option to sort numbers-with-text:

# produce 4 lines of output with integers then an "x"
echo -e "12x\n2x\n91x\n31x"

# "Version sort" these lines with -V, high number to low
echo -e "12x\n2x\n91x\n31x" | sort -Vr

uniq

uniq takes sorted input and collapses adjacent groups of identical values

  • uniq -c also reports a count of the number of members in each group (before collapsing)

Examples:

cd ~/data
head walrus_sounds.tsv | cut -f 2 | sort            # look at the 1st 10 walrus sounds, sorted
head walrus_sounds.tsv | cut -f 2 | sort | uniq     # collapse the 1st 10 (sorted) sounds
head walrus_sounds.tsv | cut -f 2 | sort | uniq -c  # add a count of the items in each group

piping a histogram with cut/sort/uniq -c

One of my favorite "Unix tricks" is combining sort and uniq calls to produce a histogram-like ordered list of count/value pairs.

cd ~/data
cut -f 2 walrus_sounds.tsv | sort | uniq -c   # report counts of each type of walrus sound 
# output:      
     33 bellow
     52 chortle
     34 gong
     36 grunt
     45 whistle

Since each line of this output consists of a number then a sound name, separated by whitespace (one or more spaces or Tabs), we can use sort's -k1,1nr option to sort it by count, highest to lowest:

# Take the reported sound counts and reverse sort it numerically by column 1 (the count)
# to see the most common sounds made
cut -f 2 walrus_sounds.tsv | sort | uniq -c | sort -k1,1nr
# output:
    52 chortle
    45 whistle
    36 grunt
    34 gong
    33 bellow

We affectionately refer to this "cut | sort | uniq -c | sort -k1,1nr" idiom as "piping a histogram".

Exercise 3-4

How many different walruses are represented in the ~/data/walrus_sounds.tsv file? Which one has the most recorded sounds?

 Answer...

cut -f 1 walrus_sounds.tsv | sort | uniq | wc -l     
reports 3 different walrus names

cut -f 1 walrus_sounds.tsv | sort | uniq -c | sort -k1,1nr
Looks like ET has the most sounds:
  69 ET
  68 Antje 
  63 Jocko

Job names are in column 1 of the ~/data/sampleinfo.txt file. Create a histogram of Job names showing the count of samples (lines) for each, and show Jobs with the most samples first.

 Answer...

Job JA19060 has the most samples (35)

tail -n +2 sampleinfo.txt | cut -f 1 | sort | uniq -c | sort -k1,1nr | head

The Run names in joblist.txt start with SAyy where yy are year numbers. Report how many runs occurred in each year.

 Hint...
First isolate the Run field, then take characters 1-4 (or just 3-4) to get the years. Then sort, count unique, sort...
 Answer...


cut -f 2 joblist.txt | cut -c1-4 | sort | uniq -c | sort -k1,1nr

# output
   753 SA15
   713 SA14
   625 SA16
   531 SA17
   462 SA13
   422 SA18
   260 SA12
    74 SA19
     1 SA99

Which Run in joblist.txt has the most jobs?

 Answer...
cat joblist.txt | cut -f 2 | sort | uniq -c | sort -k1,1nr | head -1
# 23 SA13038

Ensuring uniqueness of field combinations

Sometimes you'll get a table of data that should contain unique values of a certain field, or a certain combination of fields. Combining cut, sort, uniq and wc -l can help verify this, and find which values are duplicated.

Example: Are all the Job names in joblist.txt unique?

cd ~/data
wc -l joblist.txt                           # Reports 3841 Job/Run entries
cut -f 1 joblist.txt | sort | uniq | wc -l  # But there are only 3167 unique Jobs

# So there are some Jobs that appear more than once -- but which ones?
# Use our "piping a histogram" trick but only look at the highest-count entries
cut -f 1 joblist.txt | sort | uniq -c | sort -k1,1nr | head
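
An alternative is uniq's -d option, which reports only the values that appear more than once in its (sorted) input:

cut -f 1 joblist.txt | sort | uniq -d | head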

Exercise 3-5

Are all combinations of Job/Run in joblist.txt unique?

 Answer...

Yes, all entries are unique:

cd ~/data
wc -l joblist.txt                             # Reports 3841 Job/Run entries
cut -f 1,2 joblist.txt | sort | uniq | wc -l  # And 3841 unique Job/Run combinations

# or just, since there are only 2 fields:
sort joblist.txt | uniq | wc -l

Introducing awk

awk is a powerful scripting language that is easily invoked from the command line. It is especially useful for handling tabular data.

One way of using it:

  • awk '<script>' - the '<script>'  is applied to each line of input (generally piped in)
    • always enclose '<script>' in single quotes to inhibit shell evaluation
    • awk has its own set of metacharacters that are different from the shell's

A basic awk script has the following form:

BEGIN {<expressions>}
{<body expressions>}
END {<expressions>}

Here's a simple awk script that takes the average of the numbers passed to it, using the seq command to generate the numbers 1 through 10, each on its own line.

seq 10 | awk '
BEGIN{sum=0; ct=0;}
{sum = sum + $1
 ct = ct + 1}
END{print sum/ct,"is the mean of",ct,"numbers"}'
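
Since the numbers 1 through 10 sum to 55, this displays:

5.5 is the mean of 10 numbers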

Notes:

  • Once the single quote to start the script is seen on the 1st line, there is no need for special line-continuation.
    • Just enter the script text then finish with a closing single quote when done.
    • Multiple expressions can appear on the same line if separated by a semicolon ( ; )
  • The BEGIN and END clauses are optional, and are executed only once, before and after input is processed respectively
  • BEGIN {<expressions>}  –  use to initialize variables before any script body lines are executed
    • Script variables ct and sum are initialized to 0 in the BEGIN block above
    • Some important built-in variables you may want to initialize (see the short example after these notes):
      • FS (input Field Separator) - used to delimit fields
        • default is whitespace -- one or more spaces or Tabs 
        • e.g. FS=":" to specify a colon
      • OFS (Output Field Separator) - use to delimit output fields
        • default is a single space
        • e.g. OFS="\t" to specify a Tab
  • The body expressions are executed for each line of input.
    • Each line is parsed into fields based on the specified input field separator 
    • Fields can then be accessed via the built-in variables $1 (1st field), $2 (2nd field) and so forth.
      • the built-in NF variable represents the Number of Fields in a given line
      • the built-in NR variable represents the Number of the current Record (line)
    • awk has the usual set of arithmetic operators (+, /, etc)
      • and comparison operators (==, >, <, etc)
      • and an if ( <expression> ) { <action> } conditional construct
  • The END block is executed when there is no more input
    • The print statement in the END block takes a comma-separated list of values
      • each value is separated by awk's default output field separator (a single space)
      • literal text is specified using double quotes ("is the mean of")
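
Here's a minimal sketch showing several of these pieces together: colon-delimited input (FS), Tab-delimited output (OFS), and the NR and NF variables:

echo -e "a:b:c\nd:e" | awk '
BEGIN{FS=":"; OFS="\t"}
{print NR, NF, $1, $NF}'

# output (Tab-delimited): line number, field count, 1st and last fields
# 1  3  a  c
# 2  2  d  e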

Exercise 3-6

Use awk to print out the highest Job (JA) and Run (SA) in joblist.txt.

 Hints...
  • Initialize variables to hold the highest Job and Run in a BEGIN block.
  • In the body, use $1 to refer to the Job, $2 to refer to the Run.
  • Use the > (greater than) comparison operator to compare the current value to the highest seen so far
  • Use the if ( <expression> ) { <action> } conditional construct for comparisons
  • Display the results in an END block.
 Answer...
cd ~/data
cat joblist.txt | awk '
BEGIN{maxJA=""; maxSA=""}
{
 if ($1 > maxJA) { maxJA = $1 }
 if ($2 > maxSA) { maxSA = $2 }
}
END{print "The highest Run is:",maxSA,"The highest Job is:",maxJA}'

The highest Run is: SA99999 The highest Job is: JA19142

A more complicated awk script

Now let's write a more complicated awk script to explore its capabilities further. Our goal is to sum up the walrus sound times in ~/data/walrus_sounds.tsv and print that total in seconds, minutes and hours. And let's write the script bit-by-bit to show how we can "debug as we go" on the command line.

# First isolate the sound length field (field 3)
# We'll use head to test our code on just a few lines until we like our script
head walrus_sounds.tsv | cut -f 3 | awk '{print}'

# We see the times are in MM:SS (minutes, seconds) so we'll use FS=":" to
# specify colon as the input field separator
head walrus_sounds.tsv | cut -f3 | awk 'BEGIN{FS=":"}{print $1,$2}'
# Looks good - minutes are coming out as field 1 and seconds as field 2

# Now calculate the number of seconds for each line with some math
head walrus_sounds.tsv | cut -f3 | awk '
BEGIN{FS=":"}
{seconds = $2 + ($1 * 60)
 print $1,$2,seconds}'

# Now add each sound's seconds to a global total
head walrus_sounds.tsv | cut -f3 | awk '
BEGIN{FS=":"; total=0}
{seconds = $2 + ($1 * 60)
 total = total + seconds
 print $1,$2,seconds, total}'

# Now process all the input and just output the final totals
cat walrus_sounds.tsv | cut -f3 | awk '
BEGIN{FS=":"; total=0}
{seconds = $2 + $1 * 60
 total = total + seconds}
END{ print "total seconds:",total
     print "total minutes:",total/60
     print "total hours:  ",total/60/60}'

# One final improvement: use the printf function to format the
# output to control how many decimal places are shown.
cat walrus_sounds.tsv | cut -f3 | awk '
BEGIN{FS=":"; total=0}
{seconds = $2 + $1 * 60
 total = total + seconds}
END{ printf("total seconds: %d\n", total)
     printf("total minutes: %.2f\n",total/60)
     printf("total hours:   %.2f\n",total/60/60)}'

To learn more, here's an excellent awk tutorial, very detailed and in-depth.

printf and sprintf functions come from the C programming language, but many higher-level languages implement similar text formatting. Wikipedia has a nice table of printf format specifiers (https://en.wikipedia.org/wiki/Printf#Type_field) as part of its thorough printf page.

Parsing field-oriented text with cut and awk

The basic functions of cut and awk are similar – both are field oriented. Here are the main differences:

  • Default field separators
    • Tab is the default field separator for cut
      • and the field separator can only be a single character
    • whitespace (one or more spaces or Tabs) is the default field separator for awk
  • Re-ordering
    • cut cannot re-order fields; cut -f 3,2 is the same as cut -f 2,3
    • awk can re-order fields, printing them in whatever order you specify
  • awk is a full-featured programming language while cut is just a single-purpose utility.

Compare:

echo -e "A B\tC" | awk '{print $2}'    # displays 'B'
echo -e "A B\tC" | awk '{print $3}'    # displays 'C'

echo -e "A B\tC" | cut -f 1            # displays 'A B'
echo -e "A B\tC" | cut -f 2            # displays 'C'

When to use these programs is partly a matter of taste. I use either cut or awk to deal with field-oriented data: usually cut if it's Tab-separated and awk otherwise.

For more complex data manipulation, even though awk is a full-featured programming language I find its pattern matching and text processing facilities awkward (pun intended), and so prefer perl with its rich regular expression capabilities.

Regular expressions in grep, sed and perl

Regular expressions are incredibly powerful and should be in every programmer's toolbox!

Regular expression parsing is implemented in many tools and programming languages – but every tool seems to implement slightly different standards! What to do? I'll describe some of my practices below, with the understanding that they represent one of many ways to skin the cat.

First, let's be clear: perl has the best regular expression capabilities on the planet, and they serve as the gold standard against which all other implementations are judged. Nonetheless, perl is not as convenient to use as grep from the command line, because command-line grep has so many handy options.

regular expressions

A regular expression (regex) is a pattern of characters to search for and metacharacters that control and modify how matching is done.

The Intro Unix: Some Linux commands: Regular expressions section lists a nice set of "starter" metacharacters. Open that page now as a reference for this section.

grep pattern matching

Linux actually provides three slightly different grep commands: grep, egrep and fgrep. Feel free to learn them all; I stick to the basic grep.

Examples:

cd
grep 'the'     haiku.txt  # matches 2 lines
grep '^the'    haiku.txt  # nothing - no lines start with "the"
grep '^Is'     haiku.txt  # matches 1 line
grep 'th'      haiku.txt  # matches 5 lines
grep 'th[a-z]' haiku.txt  # matches 4 lines (does not match 'with ')
grep 'th.. '   haiku.txt  # matches 1 line with 'that' (th + any 2 characters + space)

For basic command-line pattern matching, I first try grep. While by default it implements clunky POSIX-style pattern matching, its -P argument asks it to honor Perl regular expression syntax. I always use that -P option, and in most cases this works well.

echo "foo bar" | grep '(oo|ba)'         # no match!
echo "foo bar" | grep -P '(oo|ba)'      # match

echo -e "12\n23\n4\n5" | grep '\d\d'    # no matches!
echo -e "12\n23\n4\n5" | grep -P'\d\d'  # 2 lines match

My favorite command-line grep options are:

  • -v  (inverse) – only print lines with no match
  • -n  (line number) – prefix output with the line number of the match
  • -i  (case insensitive) – ignore case when matching
  • -c  (count) – just return a count of the matches
  • -l – instead of reporting each match, report only the name of files in which any match is found
  • -L  – like -l, but only reports the name of files in which no match is found
    • handy for checking a bunch of log files for success or error
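
For example, to check a batch of log files at once (hypothetical file names):

grep -l 'ERROR' logs/*.log   # report only files containing at least one ERROR
grep -L 'ERROR' logs/*.log   # report only files containing no ERROR at all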

Exercise 3-7

How many lines of ~/haiku.txt have the word 'the', case sensitive? How many case-insensitive?

 Answer...
cd 
grep 'the' haiku.txt | wc -l
# or
grep -c 'the' haiku.txt

# both report 2 lines containing 'the', case sensitive
grep -c -i 'the' haiku.txt
# reports 4 lines containing 'the' when case is ignored

Use grep to display lines of haiku.txt and jabberwocky.txt that contain the word "the" (case sensitive), with line numbers.

 Answer...
cd 
grep -n 'the' haiku.txt jabberwocky.txt 

How many lines does each file (haiku.txt and jabberwocky.txt) have that contain the word "the"?

 Hint...

When given explicit files, grep -c counts and reports matches in each.

 Answer...
grep -c 'the' haiku.txt jabberwocky.txt 

# output:
haiku.txt:2
jabberwocky.txt:15

How many lines total do the files haiku.txt and jabberwocky.txt have that contain the word "the"?

 Hint...

When input is piped in, the grep -c match count is for all the input.

 Answer...
cat haiku.txt jabberwocky.txt | grep -c 'the'

Or just:

grep 'the' haiku.txt jabberwocky.txt | wc -l

Both report 17 lines total

Does the word "brillig" appear in either haiku.txt or jabberwocky.txt?

 Hint...

Use grep -l to report only the names of files containing the word 'brillig'

 Answer...
# This displays the 2 lines of jabberwocky.txt that contain "brillig"
grep -n 'brillig' haiku.txt jabberwocky.txt 

# But using -l reports that only the jabberwocky.txt file contains any mention of "brillig"
grep -l 'brillig' haiku.txt jabberwocky.txt

Which lines of haiku.txt contain the word 'not' or the word 'working'?

 Hint...

Use grep -P and parentheses in the pattern:  ( | )

 Answer...
grep -P -n '(working|not)' haiku.txt

# output:
2:Is not the true Tao, until
7:"My Thesis" not found.
10:Today it is not working

perl pattern matching

If grep pattern matching isn't behaving the way I expect, I turn to perl. Here's how to invoke regex pattern matching from a command line using perl:

perl -n -e 'print $_ if $_=~/<pattern>/'

Examples:

echo -e "12\n23\n4\n5" | perl -n -e 'print $_ if $_ =~/\d\d/'

# or, for lines not matching
echo -e "12\n23\n4\n5" | perl -ne 'print if $_ !~/\d\d/'

Gory details:

  • -n tells perl to feed the input one line at a time (here 4 lines)
  • -e introduces the perl script
    • Always enclose a command-line perl script in single quotes to protect it from shell evaluation
    • perl has its own set of metacharacters that are different from the shell's
  • $_ is a built-in Perl variable holding the current line (including any invisible line-ending characters)
  • =~ is the perl pattern matching (binding) operator
    • $_ =~ /pattern/ is true if the pattern matches the line
    • $_ !~ /pattern/ is true if the pattern does not match
  • the forward slashes ("/  /") enclose the regex pattern
  • the pattern matching operation returns true or false, to be used in a conditional statement
    • here "print current line if the pattern matches"

Exercise 3-8

Use perl pattern matching to count the number of Jobs (lines) in joblist.txt that were not run in 2015.

 Answer...
wc -l ~/data/joblist.txt
cat ~/data/joblist.txt | \
  perl -ne 'print $_ if $_ !~/SA15/;' | wc -l

# Of the 3841 job entries in joblist.txt, 3088 were not run in 2015

Use perl pattern matching to find all grunts and bellows made by walrus Antje.

 Hint...

The .* metacharacter pattern will match any number of characters (technically 0 or more).

 Answer...
# First filter only the sounds for walrus Antje
cat ~/data/walrus_sounds.tsv | \
  perl -ne 'print $_ if $_ =~/Antje/;' | head

# Now modify the pattern for grunt or bellow 
cat ~/data/walrus_sounds.tsv | \
  perl -ne 'print $_ if $_ =~/Antje.*(grunt|bellow)/;'

sed pattern substitution

The sed (stream editor) command can be used to edit text using pattern substitution.

sed 's/<search pattern>/<replacement>/'

While sed is very powerful, the regex syntax for its more advanced features is quite different from "standard" grep or perl regular expressions. As a result, I tend to use it only for very simple substitutions, usually as a component of a multi-pipe expression.

# Look at 20 jobs from year-19 runs (SA19xxx) and report their Job and Run numbers
# without the JA/SA prefixes, using other text instead
cat joblist.txt | grep -P 'SA19\d\d\d$' | head -20 | \
  sed 's/JA/job /' | sed 's/SA/run /'

Here's a nice idiom for making sure that text file line endings are just a linefeed ( \n ), not a carriage return plus a linefeed ( \r\n ), by deleting any carriage return at the end of each line.

cat some_file.txt | sed 's/\r$//' > some_file.fixed.txt

See also: Grymoire sed tutorial

perl pattern substitution

If I have a more complicated pattern, or if sed pattern substitution is not working as I expect (which happens frequently!), I again turn to perl. Here's how to invoke perl pattern substitution from a command line:

perl -p -e 's/<search pattern>/<replacement>/[modifiers]'

Parentheses ( ) around one or more text sections in the <search pattern> will cause matching text to be captured in built-in perl variables $1, $2, etc., following the order of the parenthesized text. The capture variables can then be used in the <replacement>.

All input lines are written, but substitution is only performed on lines matching the pattern. Here's a simple example that just performs the same substitution we did above with sed:

cd ~/data
cat joblist.txt | head | \
  perl -p -e 's/JA/job /' | perl -pe 's/SA/run /'

Gory details:

  • -p tells perl to print each line (whether there was a substitution or not)
  • -e introduces the perl script (always enclose it in single quotes to protect it from shell evaluation)
  • s is the perl pattern substitution operator; it operates on the current line ($_) by default
  • forward slashes /  /  / enclose the regex search pattern and the replacement text
  • parentheses ( ) enclosing a pattern "capture" text matching the pattern in built-in positional variables
    • $1 for the 1st captured text, $2 for the 2nd captured text, etc.
  • optional modifiers following the pattern control how the substitution is performed
    • g (global) - perform the substitutions on all occurrences of the pattern in each record of input
    • i (ignore case) - perform case-insensitive matching of the pattern text

Examples:

# Use parentheses to capture part of the search pattern:
cat joblist.txt | head | \
  perl -pe 's/JA(\d+)\tSA(\d+)/Job $1 on Run $2/'

# Illustrate use of optional modifiers
head joblist.txt | perl -pe 's/0/-/'      # replaces only the 1st 0 on a line
head joblist.txt | perl -pe 's/0/-/g'     # replaces all 0s on each line 
head joblist.txt | perl -pe 's/ja/Job /i' # performs case-insensitive search
head joblist.txt | perl -pe '
s/ja\d\d(\d\d\d).*Sa(\d\d)(\d\d\d)/year 20$2, run $3 job $1/i'

Exercise 3-9

Oops – we misspelled a walrus name: Antje should be Antie in ~/data/walrus_sounds.tsv. Use perl pattern substitution to fix this.

 Answer...
cat ~/data/walrus_sounds.tsv | perl -pe 's/^Antje/Antie/' | head
cat ~/data/walrus_sounds.tsv | perl -pe 's/^(An)tje/${1}tie/' | head

Use perl pattern substitution to transform the "MM:SS" times in ~/data/walrus_sounds.tsv to text like "MM minutes and SS seconds".

 Answer...
head ~/data/walrus_sounds.tsv | perl -pe '
s/(\d\d)[:](\d\d)/$1 minutes and $2 seconds/'

Bash control flow

The bash for loop

The bash for loop has the basic syntax:

for <arg_name> in <list of whitespace-separated words>
do
   <expression>
   <expression>
done

See https://www.gnu.org/software/bash/manual/html_node/Looping-Constructs.html.

Here's a simple example using the seq command, which prints a list of numbers:

for num in `seq 5` 
do
  echo "The number is: $num"
done

Gory details:

  • The `seq 5` expression uses backtick evaluation to generate a list of 5 numbers, 1-5
  • The do/done block expressions are executed once for each of the items in the list
  • Each time through the loop (the do/done block) the variable named num is assigned one of the values in the list
    • Then the value can be used by referencing the variable using $num
    • The variable name num is arbitrary – it can be any name we choose

Other ways of writing the same for loop:

for number in `seq 5`; do
  echo $number
done

for num in $(seq 5); do echo $num; done

Quotes matter

In the Review of some basics: Quoting in the shell section, we saw that double quotes allow the shell to evaluate certain metacharacters in the quoted text.

But more importantly, when a variable holds multiple lines of text, quoting the evaluated variable preserves any special characters in its value, such as Tab or newline characters.

Consider this case where a captured string contains newlines, as illustrated below.

echo -e "aa\nbb\ncc"
txt=$( echo -e "aa\nbb\ncc" )
echo $txt     # newlines converted to spaces
echo "$txt"   # newlines preserved
  • evaluating "$txt" inside double quotes preserves the newlines
  • evaluating $txt without double quotes converts each newline to a single space

This difference is very important!

  • you do want to preserve newlines when processing one line of text at a time
  • you do not want to preserve newlines when specifying the list of values a for loop processes, which must all be on one line

See the difference:

nums=$( seq 5 )
echo $nums    # newlines converted to spaces, values appear all on one line
echo "$nums"  # newlines preserved; values appear on 5 lines

echo $nums | wc -l    # newlines converted to spaces, so reports only one line
echo "$nums" | wc -l  # newlines preserved, so reports 5 lines

# This loop prints a line for each of the numbers
for n in $nums; do
  echo "the number is: '$n'"
done

# But this loop prints only one line
for n in "$nums"; do
  echo "the number is: '$n'"
done

Let's use a for loop to process some file names. We'll use find to list the full paths of some student Home directories then use the basename function to isolate the last path component, which will be the account name.

homedirs=$( find /stor/home -maxdepth 1 -name "student2?" -type d )
echo "$homedirs" | sort
for dir in $homedirs; do
  dirname=`basename $dir`
  echo "account name: $dirname; Home directory: $dir"
done

# A different way to skin the cat:
dirs=$( ls /stor/home | grep -P 'student2\d' )
echo "$dirs"
for acct in $dirs; do
  dirpath="/stor/home/$acct"
  echo "account name: $acct; Home directory: $dirpath"
done

Exercise 3-10

Use a for loop to sum the numbers from 1 to 10.

 Hint...

Here's the weird bash syntax for arithmetic (integer arithmetic only!):

n=0
n=$(( $n + 5 ))
echo $n
 Answer...
sum=0
for num in `seq 10`; do
  sum=$(( $sum + $num ))
done
echo "Sum is: $sum"

Use an awk script to sum the numbers from 1 to 10. 

 Answer...
echo "$(seq 10)" | awk '
BEGIN{sum=0}
{sum = sum + $1}
END{print "sum is:",sum}'

The if statement

The general form of an if/then/else statement in bash is:

if [ <test expression> ]
then <expression> [ expression... ]
else <expression> [ expression... ]
fi

Where

  • The <test expression> is any expression that evaluates to true or false
    • In a bracket test like [ $val ], an empty value is false
    • Any non-empty value is true
    • There must be at least one space around the <test expression> separating it from the enclosing bracket [ ].
    • Double brackets [[  ]] can also be used to enclose the <test expression>.
  • When the <test expression> is true the then expressions are evaluated.
  • When the <test expression> is false the else expressions are evaluated.

Examples:

for val in `seq 1 5`; do
  if [ $val -gt 3 ]
    then echo "Value '$val' is greater than 3"
    else echo "Value '$val' is less than or equal to 3"
  fi
done

for val in Foo "$emptyvar" 7 '' $?; do
  if [ $val ]
    then echo "Value '$val' is true"
    else echo "Value '$val' is false"
  fi
done

A good reference on the many built-in bash conditionals: https://www.gnu.org/software/bash/manual/html_node/Bash-Conditional-Expressions.html
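
Among the most useful of those built-in conditionals are the file tests. For example (using files from earlier in this section):

if [ -e ~/haiku.txt ]; then echo "haiku.txt exists"; fi
if [ -d ~/data ]; then echo "data is a directory"; fi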

Reading file lines with while

The read function can be used to read input one line at a time, in a bash while loop.

While the full details of the read command are complicated (see https://unix.stackexchange.com/questions/209123/understanding-ifs-read-r-line), this read-a-line-at-a-time idiom works nicely.

cat -n haiku.txt | \
  while IFS= read line; do
    echo "Line is: '$line'"
  done 
  • The IFS= clears the Input Field Separator that read would otherwise use, which is normally whitespace (one or more space characters or Tabs).
    • This is needed so that read will set the line variable to exactly the contents of the input line, and not specially process any whitespace in it.
  • The lines of ~/haiku.txt are piped into the while loop

If the input data is well structured, its fields can be read directly into variables. Notice we can pipe all the output to more – or could redirect it to a file.

while read walrus sound duration; do
  echo "walrus $walrus, sound: $sound, duration: $duration"
done < ~/data/walrus_sounds.tsv | more

Here's a more complex example:

tail -n +2 ~/data/sampleinfo.txt | \
while IFS= read line; do
  jobName=$(    echo "$line" | cut -f 1 )
  sampleName=$( echo "$line" | cut -f 3 )
  if [ "$jobName" == "" ]; then
    sampleName="Undetermined"; jobName="none"
  fi
  echo "job $jobName - sample $sampleName"
done | more

Once a line has been read, it can be parsed, for example, using cut, as shown above. Other notes:

  • The double quotes around "$line" are important to preserve special characters inside the original line (here Tab characters).
    • Without the double quotes, the line's fields would be separated by spaces, and the cut field delimiter would need to be changed.
  • Some lines have an empty Job name field; we replace Job and Sample names in this case.

Exercise 3-11

Using the above code as a guide, use the Job name and Sample name information in ~/data/sampleinfo.txt to construct a pathname of the form Project_<job name>/<sample name>.fastq.gz, and write these paths to a file. Skip any entries with no Job name by using the keyword continue, which skips any remaining code in the loop and starts the next iteration.

 Answer...
tail -n +2 ~/data/sampleinfo.txt | \
while IFS= read line; do
  jobName=$( echo "$line" | cut -f 1 )
  sampleName=$( echo "$line" | cut -f 3 )
  if [[ "$jobName" == "" ]]; then continue; fi
  echo "Project_${jobName}/${sampleName}.fastq.gz"
done | tee pathnames.txt

A few odds and ends

Arithmetic in bash

Arithmetic in bash is very weird:

echo $(( 50 * 2 + 1 ))

n=0
n=$(( $n + 5 ))
echo $n

And it only returns integer values, after truncation.

echo $(( 4 / 2 ))
echo $(( 5 / 2 ))

echo $(( 24 / 5 ))

As a result, if I need to do anything other than the simplest arithmetic, I use awk:

awk 'BEGIN{print 4/2}'
echo 3 2 | awk '{print ($1+$2)/2}'

You can also use the printf function in awk to control formatting. Just remember that a linefeed ( \n ) has to be included in the format string:

echo 3.1415926 | awk '{ printf("%.2f\n", $1) }'

You can even use it to convert a decimal number to hexadecimal using the %x printf format specifier. Note that the convention is to denote hexadecimal numbers with an initial 0x.

echo 65 | awk '{ printf("0x%x\n", $1) }'

Removing file suffixes and prefixes

Sometimes you want to take a file path like ~/my_file.something.txt and extract some or all of the parts before the suffix, for example, to end up with the text my_file here. To do this, first strip off any directories using the basename function. Then use the odd-looking syntax:

  •  ${<variable-name>%%.<suffix-to-remove>}
  •  ${<variable-name>##<prefix-to-remove>}

pathname=~/my_file.something.txt; echo $pathname
filename=`basename $pathname`; echo $filename

# isolate the filename prefix by stripping the ".something.txt" suffix
prefix=${filename%%.something.txt}
echo $prefix

# isolate the filename suffix by stripping the "my_file.something." prefix
suffix=${filename##my_file.something.}
echo $suffix
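
Note that %% strips the longest matching suffix while a single % strips the shortest (likewise ## versus # for prefixes). This matters when the pattern contains a wildcard:

filename=my_file.something.txt
echo ${filename%.*}    # my_file.something - shortest ".*" suffix stripped
echo ${filename%%.*}   # my_file - longest ".*" suffix stripped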

Exercise 3-12

Use the suffix-removal syntax above to strip the .bed suffix off files in ~/data/bedfiles.

 Answer...
cd ~/data/bedfiles
for bedf in *.bed; do
  echo "BED file is $bedf"
  pfx=${bedf%%.bed}
  echo " .. prefix is $pfx (after the .bed suffix is stripped)"
done

Input from a sub-shell

When parentheses ( ) enclose an expression, they direct that the expression be evaluated in a sub-shell of the calling parent shell. Recall also that the less-than sign < redirects standard input. Combined as <( ... ), these two pieces of syntax can take the place of a file in some command contexts.

# These two expressions are equivalent
head -3 haiku.txt
cat <(head -3 haiku.txt)

Now suppose we have a file and we want to add a header line to just its 1st 5 lines. Here are two different ways of doing that, where the 2nd avoids having to write any intermediate files:

cd
echo -e "walrus\tsound\tlength" > hdr.txt
head -5 ~/data/walrus_sounds.tsv > ws_head.txt
cat hdr.txt ws_head.txt > ws2.txt

echo -e "walrus\tsound\tlength" | \
  cat - <(head -5 ~/data/walrus_sounds.tsv)
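
Another handy use of sub-shell input is comparing the output of two commands directly, with no temporary files; for example:

# compare the 1st and last 3 lines of haiku.txt
diff <(head -3 haiku.txt) <(tail -3 haiku.txt)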

Input redirection like this will not work all the time – a program must be written to accept sub-shell redirection. Most built-in Linux commands are written this way, but many 3rd party programs are not.

Using a heredoc for multiple text lines

In addition to the methods of writing multi-line text discussed in Intro Unix: Writing text: Multi-line text, there's another one that can be useful for composing a large block of text for output to a file. This is done using the heredoc syntax to define a block of text between two user-supplied block delimiters, sending the text to a specified command.

The general form of a heredoc is:

COMMAND << DELIMITER
..text...
..text...
DELIMITER

For example, using the delimiter EOF and the cat command; here the block of text is just displayed on the Terminal:

cat << EOF
This text will be output
And this USER environment variable will be evaluated: $USER
EOF

To write multi-line text to a file just use the 1> or > redirection syntax after the block delimiter you name:

cat << EOF 1> out.txt
This text will be output
And this USER environment variable will be evaluated: $USER
EOF

The out.txt file will then contain this text:

This text will be output
And this USER environment variable will be evaluated: student01

The 2nd (ending) block delimiter you specify for a heredoc must appear at the start of a new line.
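
One more heredoc detail worth knowing: quoting the delimiter on the first line (e.g. <<'EOF') tells the shell not to evaluate variables or backticks inside the block:

cat << 'EOF'
This $USER variable will NOT be evaluated, because the delimiter was quoted.
EOF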