Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagebash
titleUsing the head command
# shows 1st 10 lines
head small.fq

# shows 1st 100 lines -- might want to pipe this to more to see a bit at a time
head -100 small.fq | more

The vertical bar ( | ) above is the pipe operator, which connects one program's standard output to the next program's standard input. (Read more about Piping)

piping.pngImage Removed

So what if you want to see line numbers on your head or tail output? Neither command seems to have an option to do this.

...

Expand
titleAnswer


Code Block
languagebash
cat -n small.fq | tail


piping

So what is that vertical bar ( | ) all about? It is the pipe operator!

The pipe operator ( | ) connects one program's standard output to the next program's standard input. The power of the Linux command line is due in no small part to the power of piping.  Read (Read more about piping here: Piping. And read more about standard Piping and Standard Unix I/O streams here: Standard Streams in Unix/Linux.)

...

  • When you execute the head -100 small.fq | more command, head starts writing lines of the small.fq file to standard output.
  • Because of the pipe, the output does not go to the terminal Terminal, but is connected to the standard input  of the more command.
  • Instead of reading lines from a file you specify as a command-line argument, more obtains its input from standard input.

...

  • The more command writes a page of text to standard output, which is displayed on the Terminal.

Tip
titlePrograms often allow input from standard input

Most Linux commands are designed to accept input from standard input in addition to (or instead of) command line arguments so that data can be piped in.

Many bioinformatics programs also allow data to be piped in. Typically Often they will require you provide a special argument, such as stdin or -, to tell the program data is coming from standard input instead of a file.

tail

The yang to head's ying is tail, which by default it displays the last 10 lines of its data, and also uses the -NNN syntax to show the last NNN lines. (Note that with very large files it may take a while for tail to start producing output because it has to read through the file sequentially to get to the end.)

But what's really cool about tail is its -n +NNNNN syntax. This displays all the lines starting at line NNN NN. Note this syntax: the -n option switch follows by a plus sign ( + ) in front of a number – the plus sign is what says "starting at this line"! Try these examples:

...

Ok, now you know how to navigate an un-compressed file using head and tail, more or less. But what if your FASTQ file has been compressed by gzip? You don't want to un-compress the file, remember?

...

Tip

There will be times when you forget to pipe your large zcat or gunzip -c output somewhere – somewhere – even the experienced among us still make this mistake! This leads to pages and pages of data spewing across your terminal Terminal.

If you're lucky you can kill the output with Ctrl-c. But if that doesn't work (and often it doesn't) just close your Terminal window. This terminates the process on the server (like hanging up the phone), then you just can log back in.

...

One of the first thing to check is that your FASTQ files are the same length, and that length is evenly divisible by 4. The wc command (word countword count) using the -l switch to tell it to count lines, not words, is perfect for this. It's so handy that you'll end up using wc -l a lot to count things. It's especially powerful when used with filename wild carding.

...

Here's another trick: backticks evaluation. When you enclose a command expression in backtick quotes ( ` ) the enclosed expression is evaluated and its standard output substituted into the string. (Read more about Quoting in the shell).

Here's how you would combine this math expression with zcat line counting on your file using the magic of backtick evaluation. Notice that the wc -l expression is what is reading from standard input.

...

In the code below we pipe the output from wc -l (number of lines in the FASTQ file) to awk, which executes its body (the statements between the curly braces ( {  } ) for each line of input. Here the input is just one line, with one field – the line count. The awk body just divides the 1st input field ($1) by 4 and writes the result to standard output. (Read more about awk in Advanced commands: awk)

Expand
titleSetup (if needed)


Code Block
languagebash
# Setup (if needed)
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
ln -sf $CORENGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -sf $CORENGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz


...

Code Block
languagebash
titleCounting FASTQ sequences with awk
cd $SCRATCH/core_ngs/fastq_prep
zcat Sample_Yeast_L005_R1.cat.fastq.gz | wc -l | awk '{print $1 / 4}'

Note that $1 means something different in awk – the 1st whitespace-delimited input field – than it does in bash, where it represents the 1st argument to a script or function (technically, the 1 environment variable). This is an example of where a metacharacter- the dollar sign ( $ ) here – has  a different meaning for two different programs.

The bash shell treats dollar sign ( $ ) as an evaluation operator, so will normally attempt to evaluate the environment variable name following the $ and substitute its value in the output (e.g. echo $SCRATCH). But we don't want that evaluation to be applied to the {print $1 / 4} script argument passed to awk; instead we want awk to see the literal string {print $1 / 4} as its script. To achieve this result we surround the script argument with single quotes ( ' ' ), which tells the shell to treat everything enclosed by the quotes as literal text, and not perform any metacharacter evaluation.

Processing multiple compressed files

...