Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The first technique is the use of pagers – we've already seen this with the more command. Review its use now on our small uncompressed file:

Expand
titleSetup (if needed)


Code Block
languagebash
# Setup (if needed)
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
cp $CORENGS/misc/small.fq .



Code Block
languagebash
titleUsing the more pager
# Use spacebar to advance a page; Ctrl-c to exit
more small.fq

...

For a really quick peek at the first few lines of your data, there's nothing like the head command. By default head displays the first 10 lines of data from the file you give it or from its standard input. With an argument -NNN (that is a dash followed by some number), it will show that many lines of data.

Code Blockexpand
languagebash
titletitleSetup (if needed)


Code Block
languagebash
# Setup (if needed)
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
cp $CORENGS/misc/small.fq .



Code Block
languagebash
titleUsing the head command
# shows 1st 10 lines
head small.fq

# shows 1st 100 lines -- might want to pipe this to more to see a bit at a time
head -100 small.fq | more

...

But what's really cool about tail is its -n +NNN syntax. This displays all the lines starting at line NNN. Note this syntax: the -n option switch follows by a plus sign ( + ) in front of a number – the plus sign is what says "starting at this line"! Try these examples:

Code Blockexpand
languagetitleSetup (if needed)


Code Block
languagebash
# Setup (if needed)
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
cp $CORENGS/misc/small.fq .



Code Block
languagebash
titleUsing the tail command
# shows the last 10 lines
tail small.fq

# shows the last 100 lines -- might want to pipe this to more to see a bit at a time
tail -100 small.fq | more

# shows all the lines starting at line 900 -- better pipe it to a pager!
# cat -n adds line numbers to its output so we can see where we are in the file
cat -n small.fq | tail -n +900 | more

# shows 15 lines starting at line 900 because we pipe to head -15
tail -n +900 small.fq | head -15

...

Let's illustrate this using one of the compressed files in your fastq_prep sub-directory:

Code Blockexpand
languagebash
titletitleSetup (if needed)


Code Block
languagebash
# Setup (if needed)
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
cp $CORENGS/misc/small.fq .



Code Block
languagebash
titleUncompressing output on the fly with gunzip -c
# make sure you're in your $SCRATCH/core_ngs/fastq_prep directory
cd $SCRATCH/core_ngs/fastq_prep

gunzip -c Sample_Yeast_L005_R1.cat.fastq.gz | more
gunzip -c Sample_Yeast_L005_R1.cat.fastq.gz | head
gunzip -c Sample_Yeast_L005_R1.cat.fastq.gz | tail
gunzip -c Sample_Yeast_L005_R1.cat.fastq.gz | tail -n +901 | head -8

# Note that less will display .gz file contents automatically
less -N Sample_Yeast_L005_R1.cat.fastq.gz

...

One of the first thing to check is that your FASTQ files are the same length, and that length is evenly divisible by 4. The wc command (word count) using the -l switch to tell it to count lines, not words, is perfect for this. It's so handy that you'll end up using wc -l a lot to count things. It's especially powerful when used with filename wildcarding.wild carding.

Expand
titleSetup (if needed)


Code Block
languagebash
# Setup (if needed)
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
cp $CORENGS/misc/small.fq .



Code Block
languagebash
titleCounting lines with wc -l
wc -l small.fq
head -100 small.fq > small2.fq
wc -l small*.fq

...

Here's how you would combine this math expression with zcat line counting on your file using the magic of backtick evaluation. Notice that the wc -l expression is what is reading from standard input.

Code Blockexpand
languagebash
titleCounting sequences in a FASTQ file
titleSetup (if needed)


Code Block
languagebash
# Setup (if needed)
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
zcat
ln -sf $CORENGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
| echo "$((`wc -l` / 4))"

Whew!

Warning
titlebash arithmetic is integer valued only

Note that arithmetic in the bash shell is integer valued only, so don't use it for anything that requires decimal places!

A better way to do math

...

ln -sf $CORENGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz



Code Block
languagebash
titleCounting sequences in a FASTQ file
cd $SCRATCH/core_ngs/fastq_prep
zcat Sample_Yeast_L005_R1.cat.fastq.gz | echo "$((`wc -l` / 4))"

Whew!

Warning
titlebash arithmetic is integer valued only

Note that arithmetic in the bash shell is integer valued only, so don't use it for anything that requires decimal places!

A better way to do math

Well, doing math in bash is pretty awful – there has to be something better. There is! It's called awk, which is a powerful scripting language that is easily invoked from the command line.

In the code below we pipe the output from wc -l (number of lines in the FASTQ file) to awk, which executes its body (the statements between the curly braces ( {  } ) for each line of input. Here the input is just one line, with one field – the line count. The awk body just divides the 1st input field ($1) by 4 and writes the result to standard output. (Read more about awk in Advanced commands: awk)

Expand
titleSetup (if needed)


Code Block
languagebash
# Setup (if needed)
mkdir -p $SCRATCH/core_ngs/fastq_prep
cd $SCRATCH/core_ngs/fastq_prep
ln -sf $CORENGS/yeast_stuff/Sample_Yeast_L005_R1.cat.fastq.gz
ln -sf $CORENGS/yeast_stuff/Sample_Yeast_L005_R2.cat.fastq.gz



Code Block
languagebash
titleCounting FASTQ sequences with awk
cd $SCRATCH/core_ngs/fastq_prep
zcat Sample_Yeast_L005_R1.cat.fastq.gz | wc -l | awk '{print $1 / 4}'

...