
...

The first technique is the use of pagers – we've already seen this with the more command. Review its use now on our small uncompressed file:

Expand

title	Setup (if needed)

Code Block

language	bash

# Setup (if needed)
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
cp $CORENGS/misc/small.fq .

Code Block

language	bash
title	Using the more pager

# Use spacebar to advance a page; Ctrl-c to exit
more small.fq

...

For a really quick peek at the first few lines of your data, there's nothing like the head command. By default head displays the first 10 lines of data from the file you give it or from its standard input. With an argument -NNN (that is a dash followed by some number), it will show that many lines of data.

Expand

title	Setup (if needed)

Code Block

language	bash

# Setup (if needed)
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
cp $CORENGS/misc/small.fq .

Code Block

language	bash
title	Using the head command

# shows 1st 10 lines
head small.fq

# shows 1st 100 lines -- might want to pipe this to more to see a bit at a time
head -100 small.fq | more
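The dash-number form shown above can also be written with the -n option. The sketch below is only an illustration: it uses seq to generate numbered lines as stand-in data instead of a real FASTQ file, and the variable name first_four is made up for the example.

```bash
# Generate 1000 numbered lines as stand-in data, then keep the first 4.
# head -n 4 asks for the same thing as the shorter head -4 form.
first_four=$(seq 1 1000 | head -n 4)
echo "$first_four"
```

Because each stand-in line contains its own line number, it is easy to confirm which lines head kept.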

...

But what's really cool about tail is its -n +NNN syntax. This displays all the lines starting at line NNN. Note this syntax: the -n option switch is followed by a plus sign ( + ) in front of a number – the plus sign is what says "starting at this line"! Try these examples:

Expand

title	Setup (if needed)

Code Block

language	bash

# Setup (if needed)
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
cp $CORENGS/misc/small.fq .

Code Block

language	bash
title	Using the tail command

# shows the last 10 lines
tail small.fq

# shows the last 100 lines -- might want to pipe this to more to see a bit at a time
tail -100 small.fq | more

# shows all the lines starting at line 900 -- better pipe it to a pager!
# cat -n adds line numbers to its output so we can see where we are in the file
cat -n small.fq | tail -n +900 | more

# shows 15 lines starting at line 900 because we pipe to head -15
tail -n +900 small.fq | head -15
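Combining tail -n +NNN with head, as in the last example above, is a general way to pull a block of lines out of the middle of a file. Here is a minimal sketch of that idea, assuming seq output as stand-in data so that each line's content is its own line number (the variable name middle is hypothetical):

```bash
# tail -n +900 starts output at line 900; head -n 4 then keeps 4 lines,
# so this selects lines 900 through 903 of the 1000-line stream.
middle=$(seq 1 1000 | tail -n +900 | head -n 4)
echo "$middle"
```

The same pattern should work on small.fq; the numbered lines just make the result easy to check by eye.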

...

Let's illustrate this using one of the compressed files in your fastq_prep sub-directory:

Expand

title	Setup (if needed)

Code Block

language	bash

# Setup (if needed)
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
cp $CORENGS/misc/small.fq .

Code Block

language	bash
title	Uncompressing output on the fly with gunzip -c

# make sure you're in your $SCRATCH/core_ngs/fastq_prep directory
cd $SCRATCH/core_ngs/fastq_prep

gunzip -c Sample_Yeast_L005_R1.cat.fastq.gz | more
gunzip -c Sample_Yeast_L005_R1.cat.fastq.gz | head
gunzip -c Sample_Yeast_L005_R1.cat.fastq.gz | tail
gunzip -c Sample_Yeast_L005_R1.cat.fastq.gz | tail -n +901 | head -8

# Note that less will display .gz file contents automatically
less -N Sample_Yeast_L005_R1.cat.fastq.gz
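To experiment with this pattern without touching your real data, the sketch below builds a small compressed file in a temporary directory and reads it back through a pipe. The file name demo.txt.gz, the variables tmpdir and block, and the numbered stand-in lines are all assumptions made for illustration.

```bash
# Work in a throwaway directory so nothing in fastq_prep is touched.
tmpdir=$(mktemp -d)

# Write 1000 numbered lines (stand-in data) to a gzip-compressed file.
seq 1 1000 | gzip -c > "$tmpdir/demo.txt.gz"

# gunzip -c writes the uncompressed data to standard output and leaves
# the .gz file in place, so the usual tools can read it from a pipe.
block=$(gunzip -c "$tmpdir/demo.txt.gz" | tail -n +901 | head -n 8)
echo "$block"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

In a FASTQ file with 4 lines per record, line 901 is the first line of a record (900 is a multiple of 4), so starting there and taking 8 lines yields two whole records.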

...

One of the first things to check is that your FASTQ files are the same length, and that the length is evenly divisible by 4. The wc command (word count), with the -l switch telling it to count lines rather than words, is perfect for this. It's so handy that you'll end up using wc -l a lot to count things. It's especially powerful when used with filename wildcarding.

Expand

title	Setup (if needed)

Code Block

language	bash

# Setup (if needed)
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
cp $CORENGS/misc/small.fq .

Code Block

language	bash
title	Counting lines with wc -l

wc -l small.fq
head -100 small.fq > small2.fq
wc -l small*.fq
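The divisible-by-4 check itself can be scripted. The following is a rough sketch, not part of the course materials: it writes two made-up 4-line records to a hypothetical demo.fq in a temporary directory, counts the lines, and tests the remainder with bash's % operator.

```bash
# Work in a throwaway directory.
tmpdir=$(mktemp -d)

# Two fake 4-line records (stand-in data, not real reads).
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$tmpdir/demo.fq"

# Reading from standard input makes wc -l print only the count, no file name.
nlines=$(wc -l < "$tmpdir/demo.fq")

# The remainder operator % gives 0 when the count is a multiple of 4.
if [ $(( nlines % 4 )) -eq 0 ]; then
    status="ok"
else
    status="not a multiple of 4"
fi
echo "$nlines lines: $status"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

Redirecting the file into wc -l, rather than naming it as an argument, keeps the file name out of the output so the count can be used directly in arithmetic.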

...

Here's how you would combine this math expression with zcat line counting on your file using the magic of backtick evaluation. Notice that the wc -l expression is what is reading from standard input.

Expand

title	Setup (if needed)

Code Block

language	bash

# Setup (if needed)
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
ln -sf $CORENGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -sf $CORENGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz

Code Block

language	bash
title	Counting sequences in a FASTQ file

cd $SCRATCH/core_ngs/fastq_prep
zcat Sample_Yeast_L005_R1.cat.fastq.gz | echo "$((`wc -l` / 4))"

Whew!

Warning

title	bash arithmetic is integer valued only

Note that arithmetic in the bash shell is integer valued only, so don't use it for anything that requires decimal places!
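A quick way to see the integer-only behavior is to divide two numbers that do not divide evenly. This small sketch uses arbitrary values (10 and 4) and hypothetical variable names:

```bash
# Integer division discards the fractional part: 10 / 4 gives 2, not 2.5.
quotient=$(( 10 / 4 ))

# The remainder is available separately through the % operator.
remainder=$(( 10 % 4 ))

echo "10 / 4 = $quotient remainder $remainder"
```

The result is truncated rather than rounded, so a line count that is not a multiple of 4 would silently yield a sequence count that looks valid.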

A better way to do math

Well, doing math in bash is pretty awful – there has to be something better. There is! It's called awk, which is a powerful scripting language that is easily invoked from the command line.

In the code below we pipe the output from wc -l (the number of lines in the FASTQ file) to awk, which executes its body (the statements between the curly braces, { }) for each line of input. Here the input is just one line with one field: the line count. The awk body divides the first input field ($1) by 4 and writes the result to standard output. (Read more about awk in Advanced commands: awk.)

Expand

title	Setup (if needed)

Code Block

language	bash

# Setup (if needed)
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
ln -sf $CORENGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -sf $CORENGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz

Code Block

language	bash
title	Counting FASTQ sequences with awk

cd $SCRATCH/core_ngs/fastq_prep
zcat Sample_Yeast_L005_R1.cat.fastq.gz | wc -l | awk '{print $1 / 4}'
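Unlike bash, awk does its arithmetic in floating point, which is easy to confirm with a number that is not a multiple of 4. A minimal sketch, using echo to supply a made-up one-field input line in place of the wc -l output:

```bash
# awk keeps the fractional part, so 10 / 4 prints as 2.5.
result=$(echo 10 | awk '{print $1 / 4}')
echo "$result"
```

For a well-formed FASTQ file the line count is a multiple of 4, so the awk result is a whole number; a fractional result is a hint that the file is truncated or malformed.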

...
