...

The output looks like this, where the hexadecimal 0x09 character is a Tab.

We will also use two data files from the automated processing at the GSAF (Genome Sequencing and Analysis Facility) that delivers sequencing data to customers. These files have information about customer Samples (libraries of DNA molecules to sequence on the machine), grouped into sets assigned as Jobs, and sequenced on GSAF's sequencing machines as part of sequencer Runs.

...

A regular expression (regex) is a pattern of characters to search for and metacharacters that control and modify how matching is done.

The Intro Unix: Some Linux commands: Regular expressions section lists a nice set of "starter" metacharacters. Open that page now as a reference for this section.

...

  • -n tells perl to feed the input one line at a time (here 4 lines)
  • -e introduces the perl script
    • Always enclose a command-line perl script in single quotes to protect it from shell evaluation
    • perl has its own set of metacharacters that are different from the shell's
  • $_ is a built-in Perl variable holding the current line (including any invisible line-ending characters)
  • =~ and !~ are the perl pattern matching operators
    • =~ says "matches the pattern"
    • !~ says "does not match the pattern"
  • the forward slashes ("/  /") enclose the regex pattern
  • the pattern matching operation returns true or false, to be used in a conditional statement
    • here "print current line if the pattern matches"

...

Use perl pattern matching to count the number of Runs in joblist.txt that were not run in 2015.

Hint:


Code Block
languagebash
wc -l ~/data/joblist.txt
cat ~/data/joblist.txt | \
  perl -ne 'print $_ if $_ !~/SA15/;' | wc -l

# Of the 3841 entries in joblist.txt, 3088 were not run in 2015


...

Code Block
languagebash
for number in `seq 5`; do
  echo $number
done

for num in $(seq 5); do echo $num; done

Quotes matter

In the Review of some basics: Quoting in the shell section, we saw that double quotes allow the shell to evaluate certain metacharacters in the quoted text.
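
As a quick reminder of that difference, here is a minimal sketch; the variable name and text are made up for illustration.

```bash
name="haiku"               # hypothetical variable for this sketch

single='Value is $name'    # single quotes: no evaluation, the text stays literal
double="Value is $name"    # double quotes: $name is replaced by its value

echo "$single"
echo "$double"
```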

...

Hint:

Here's the weird bash syntax for arithmetic (integer arithmetic only!):

Code Block
languagebash
n=0
n=$(( $n + 5 ))
echo $n
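
The same syntax can keep a running total inside a loop. The sketch below uses made-up variable names, and also shows what "integer arithmetic only" means for division.

```bash
# Add up the numbers 1 through 5
total=0
for num in $(seq 5); do
  total=$(( total + num ))
done
echo "$total"

# Integer division: the fractional part is discarded
half=$(( 7 / 2 ))
echo "$half"
```

Inside $(( )) the leading $ on variable names is optional, so $(( total + num )) and $(( $total + $num )) behave the same here.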


...

Code Block
languagebash
cat -n haiku.txt | \
  while IFS= read line; do
    echo "Line is: '$line'"
  done 

  • IFS= clears read's default Input Field Separator, which is normally whitespace (one or more space characters or tabs).
    • This is needed so that read sets the line variable to exactly the contents of the input line, without specially processing any whitespace in it.
  • The lines of haiku.txt, numbered by cat -n, are piped into the while loop.
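
To see what IFS= changes, compare the two loops in this sketch. The input line is made up and simply has three leading spaces.

```bash
input='   indented line'   # hypothetical line with three leading spaces

# Without IFS=, read strips the leading whitespace
trimmed=$(echo "$input" | while read line; do echo "[$line]"; done)

# With IFS=, the line is kept exactly as written
kept=$(echo "$input" | while IFS= read line; do echo "[$line]"; done)

echo "$trimmed"
echo "$kept"
```

The square brackets are there only to make the leading spaces visible in the output.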

...

Code Block
languagebash
tail -n +2 ~/data/sampleinfo.txt | \
while IFS= read line; do
  jobName=$(    echo "$line" | cut -f 1 )
  sampleName=$( echo "$line" | cut -f 3 )
  if [ "$jobName" == "" ]; then
    sampleName="Undetermined"; jobName="none"
  fi
  echo "job $jobName - sample $sampleName"
done | more

...

  • The double quotes around "$line" are important to preserve special characters inside the original line (here Tab characters).
    • Without the double quotes, the line's fields would be separated by spaces, and the cut field delimiter would need to be changed.
  • Some lines have an empty Job name field; we replace Job and Sample names in this case.
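
The effect of the double quotes can be seen in a small sketch. The Tab-separated line below is made up for illustration and is not from sampleinfo.txt.

```bash
line=$(printf 'JA001\tSA15\tsampleA')   # hypothetical Tab-separated line

# Quoted: Tabs are preserved, so cut finds field 3
quoted=$(   echo "$line" | cut -f 3 )

# Unquoted: the shell splits the line into words and echo joins them with
# spaces, so cut finds no Tab delimiter and returns the whole line
unquoted=$( echo $line   | cut -f 3 )

echo "quoted:   $quoted"
echo "unquoted: $unquoted"
```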

...

Sometimes you want to take a file path like ~/my_file.something.txt and extract some or all of the parts before the suffix, for example, to end up with the text my_file here. To do this, first strip off any directories using the basename function. Then use the odd-looking syntax:

  • ${<variable-name>%%.<suffix-to-remove>}
  • ${<variable-name>##<prefix-to-remove>}

Code Block
languagebash
pathname=~/my_file.something.txt; echo $pathname
filename=`basename $pathname`; echo $filename

# isolate the filename prefix by stripping the ".something.txt" suffix
prefix=${filename%%.something.txt}
echo $prefix

# isolate the filename suffix by stripping the "my_file.something." prefix
suffix=${filename##my_file.something.}
echo $suffix
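
The text to remove can also be a wildcard pattern, and bash has single-character forms (% and #) that remove the shortest match where the doubled forms (%% and ##) remove the longest. A brief sketch of the difference, using variable names made up for this example:

```bash
filename=my_file.something.txt

short_sfx=${filename%.*}     # shortest suffix match removed: my_file.something
long_sfx=${filename%%.*}     # longest suffix match removed:  my_file
short_pfx=${filename#*.}     # shortest prefix match removed: something.txt
long_pfx=${filename##*.}     # longest prefix match removed:  txt

echo "$short_sfx"
echo "$long_sfx"
echo "$short_pfx"
echo "$long_pfx"
```

When the text to remove is a literal string with no wildcard, as in the examples above, the single and doubled forms give the same result.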

Exercise 3-12

Use the suffix-removal syntax above to strip the .bed suffix off files in ~/data/bedfiles.

Answer:


Code Block
languagebash
cd ~/data/bedfiles
for bedf in *.bed; do
  echo "BED file is $bedf"
  pfx=${bedf%%.bed}
  echo " .. prefix is $pfx (after the .bed suffix is stripped)"
done


A few odds and ends

Input from a sub-shell

When parentheses ( ) enclose an expression, that expression is evaluated in a sub-shell of the calling parent shell. Recall also that the less-than sign < redirects standard input. We can combine these two pieces of syntax to supply a command's output in place of a file in some command contexts.
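
One way these two pieces combine is the <( ) form, often called process substitution. The sketch below assumes bash (plain sh may not support it) and uses seq to stand in for any command whose output you want to read.

```bash
# Read a command's output where a file would normally be redirected in
count=$(wc -l < <(seq 5))
echo "$count"

# Feed a while loop from a command without a pipe; because the loop is not
# on the receiving end of a pipe, the variable it sets is still visible afterwards
total=0
while read num; do
  total=$(( total + num ))
done < <(seq 5)
echo "$total"
```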

...

In addition to the methods of writing multi-line text discussed in Intro Unix: Writing text: Multi-line text, there's another one that can be useful for composing a large block of text for output to a file. This is done using the heredoc syntax to define a block of text between two user-supplied block delimiters, sending the text to a specified command.
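
A minimal heredoc sketch follows. The delimiter EOF, the variable name, and the temporary output file are all choices made for this illustration; any delimiter word that does not appear in the text block works.

```bash
outfile=$(mktemp)    # hypothetical output file for this sketch
name="GSAF"

# Everything between the two EOF delimiters is sent to cat's standard input
cat > "$outfile" <<EOF
Report for $name
Line two of the block
EOF

cat "$outfile"
nlines=$(wc -l < "$outfile")
echo "$nlines"
```

With an unquoted delimiter as here, the shell still evaluates $name inside the block, much as it would inside double quotes.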

...