...

Line 1 is the unique read name. The format for Illumina reads is as follows, using the read name above:

...

@HWI-ST1097:127:C0W5VACXX:5:1101:4820:2124  1:N:0:CTCAGA

  • The line as a whole will be unique for this read fragment. However,
    • the corresponding R1 and R2 reads will have identical machine_id:lane:flowcell_grid_coordinates information. This common part of the name ties the two read ends together.
    • the end_number:failed_qc:0:barcode information will be different for R1 and R2
  • Most sequencing facilities will not give you qc-failed reads (failed_qc = Y) unless you ask for them.
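The relationship between the shared and end-specific parts of the name can be sketched in bash. This is only an illustration using the example read name above; the variable names (common_part, end_part, third_field and so on) are made up for this sketch and are not part of any tool.

```bash
# Example read name from the text above
read_name='@HWI-ST1097:127:C0W5VACXX:5:1101:4820:2124  1:N:0:CTCAGA'

# The part shared by R1 and R2 is everything before the first whitespace;
# the end-specific part is everything after the last whitespace.
common_part=${read_name%%[[:space:]]*}
end_part=${read_name##*[[:space:]]}

# Split the end-specific part on colons: end_number:failed_qc:0:barcode
# (third_field is a hypothetical name for the "0" field)
IFS=: read -r end_number failed_qc third_field barcode <<< "$end_part"

echo "common part: $common_part"
echo "end number:  $end_number"
echo "failed QC:   $failed_qc"
echo "barcode:     $barcode"
```

For a matching R2 read, common_part would be the same while end_number would differ.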

Line 2 is the sequence reported by the machine, starting with the first base of the insert (the 5' adapter has usually been removed by the sequencing facility). These are ACGT or N uppercase characters.

Line 3 always starts with '+' in reads from GSAF (it can optionally include a sequence description)

...

Exercise: What character in the quality score string in the FASTQ entry above represents the best base quality? Roughly what is the error probability estimated by the sequencer?
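If you want to check your arithmetic, here is a minimal sketch for converting one quality character into a Phred score and an error probability. It assumes Phred+33 encoding (ASCII code minus 33), which is an assumption about your data rather than something this page states; the function names and the sample character 'I' are made up for illustration.

```bash
# Convert one quality character to a Phred score.
# ASSUMPTION: Phred+33 encoding (score = ASCII code - 33).
qual_to_phred() {
  local ascii
  ascii=$(printf '%d' "'$1")   # a leading quote makes printf emit the character code
  echo $(( ascii - 33 ))
}

# Convert a Phred score Q to an error probability: 10^(-Q/10)
phred_to_error() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

q=$(qual_to_phred 'I')          # 'I' is just a sample character
echo "Phred score: $q"
echo "Error probability: $(phred_to_error "$q")"
```

Under that assumption, a higher ASCII character means a higher score and a lower estimated error probability.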

...

The most common compression program used for individual files is gzip and its counterpart gunzip, whose compressed files have the .gz extension. The tar and zip programs are most commonly used for compressing directories.
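As a quick illustration of the gzip / gunzip round trip, the sketch below builds a small throwaway text file, compresses it, and restores it. The file name demo.txt and its contents are hypothetical stand-ins; nothing here touches the course data.

```bash
# Work in a throwaway directory so nothing real is touched
workdir=$(mktemp -d)
cd "$workdir"

# Build a small, repetitive text file (hypothetical stand-in for real data)
for i in $(seq 1 1000); do echo "ACGTACGTACGTACGTNNNN"; done > demo.txt
orig_size=$(wc -c < demo.txt | tr -d ' ')

gzip demo.txt            # replaces demo.txt with demo.txt.gz
gz_size=$(wc -c < demo.txt.gz | tr -d ' ')
echo "original: $orig_size bytes, compressed: $gz_size bytes"

gunzip demo.txt.gz       # restores demo.txt and removes demo.txt.gz
ls -lh demo.txt
```

Note that by default gzip and gunzip replace the input file rather than leaving both copies side by side.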

...

Code Block
languagebash
titleCompare compressed and uncompressed files
ls -lh $CORENGS/yeast_stuff/*L005*.fastq
ls -lh $CORENGS/yeast_stuff/*L005*.fastq.gz
Tip
titlePathname wildcarding

The asterisk character ( * ) is a pathname wildcard that matches 0 or more characters.

Read more about pathname wildcards here: Pathname wildcards and special characters

...

Expand
titleAnswer

FASTQs are ~150 MB
Compressed they are ~50 MB
This is about 3x compression

You may be tempted to want to un-compress your sequencing files in order to manipulate them more directly – but resist that temptation! Nearly all modern bioinformatics tools are able to work on .gz files, and there are tools and techniques for working with the contents of compressed files without ever un-compressing them.

...

Code Block
languagebash
titleUsing the tail command
# shows the last 10 lines
tail small.fq

# shows the last 100 lines -- might want to pipe this to more to see a bit at a time
tail -100 small.fq | more

# shows all the lines starting at line 900 -- better pipe it to a pager!
# cat -n adds line numbers to its output so we can see where we are in the file
cat -n small.fq | tail -n +900 | more

# shows 15 lines starting at line 900 because we pipe to head -15
tail -n +900 small.fq | head -15

...

Code Block
languagebash
titleUncompressing output on the fly with gunzip -c
# make sure you're in your $SCRATCH/core_ngs/fastq_prep directory
cd $SCRATCH/core_ngs/fastq_prep

gunzip -c Sample_Yeast_L005_R1.cat.fastq.gz | more
gunzip -c Sample_Yeast_L005_R1.cat.fastq.gz | head
gunzip -c Sample_Yeast_L005_R1.cat.fastq.gz | tail
gunzip -c Sample_Yeast_L005_R1.cat.fastq.gz | tail -n +901 | head -8

# Note that less will display .gz file contents automatically
less Sample_Yeast_L005_R1.cat.fastq.gz

...

Code Block
languagebash
titleCounting lines with wc -l
zcat Sample_Yeast_L005_R1.cat.fastq.gz | more
zcat Sample_Yeast_L005_R1.cat.fastq.gz | head
zcat Sample_Yeast_L005_R1.cat.fastq.gz | tail
zcat Sample_Yeast_L005_R1.cat.fastq.gz | tail -n +901 | head -8
Tip

There will be times when you forget to pipe your large zcat or gunzip -c output somewhere – even the experienced among us still make this mistake! This leads to pages and pages of data spewing across your terminal.

If you're lucky you can kill the output with Ctrl-c. But if that doesn't work (and often it doesn't) just close your Terminal window. This terminates the process on the server (like hanging up the phone), and then you can just log back in.

...

Here's how you would combine this math expression with zcat line counting on your file using the magic of backtick evaluation. Notice that the wc -l expression is what is reading from standard input – not echo.

Code Block
languagebash
titleCounting sequences in a FASTQ file
cd $SCRATCH/core_ngs/fastq_prep
zcat Sample_Yeast_L005_R1.cat.fastq.gz | echo "$((`wc -l` / 4))"

...

In the code below we pipe the output from wc -l (the number of lines in the FASTQ file) to awk, which executes its body (the statements between the curly braces, { }) for each line of input. Here the input is just one line, with one field: the line count. The awk body just divides the 1st input field ($1) by 4 and writes the result to standard output. (Read more about awk in Advanced commands: awk)
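A minimal, self-contained sketch of this wc -l to awk pipeline is shown here. So that it runs anywhere, it first builds a tiny gzipped FASTQ with two made-up records; the file name tiny.fastq.gz and the variable names are assumptions for illustration, and you would substitute your own .gz file in practice.

```bash
# Build a tiny hypothetical gzipped FASTQ (2 records, 8 lines) to run against
tmp_dir=$(mktemp -d)
tmp_fq="$tmp_dir/tiny.fastq.gz"
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip -c > "$tmp_fq"

# wc -l emits one line containing the line count;
# awk divides that first field ($1) by 4 to get the number of sequences
seq_count=$(gunzip -c "$tmp_fq" | wc -l | awk '{ print $1 / 4 }')
echo "$seq_count sequences"

# Clean up the throwaway file
rm -rf "$tmp_dir"
```

Because awk treats its input as whitespace-separated fields, any leading spaces that wc -l prints on some systems do not affect the result.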

...

Code Block
languagebash
titleFor loop to count sequences in multiple FASTQs
for fname in *.gz; do
  echo "Processing $fname"
  echo "...$fname has $((`zcat $fname | wc -l` / 4)) sequences"
done

...