Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Before you start the alignment and analysis processes, it can be useful to perform some initial quality checks on your raw data. If you don't do this (or even if you do), you may notice later that something looks fishy in the the output: for example, many of your reads are not mapping or the ends of many of your reads do not align. Both can give you clues about whether you need to process the reads to improve the quality of data that you are putting into your analysis. Each year this tutorial is discussed at some length for if it should be included as a main tutorial, if it should be included as an optional tutorial, or if it should be ignored all together as the quality of data increases. Recently a colleague of ours spent several days working with and trying to understand some data he got back before reaching out for help, after a few hours of running into a wall, fastqc was used to determine Here we will assume you have data from GSAF's Illumina HiSeq or MiSeq sequencer.that the library was not constructed correctly in less than 30 minutes. So here we present the tutorial as quick thing to check which may save you significant amounts of time later on. 

Learning Objectives

This tutorial covers the commands necessary to use several common programs for evaluating read files in FASTQ format and for processing them (if necessary).

...

Table of Contents

Table of Contents

idev -m 60 -q development

As we discussed yesterday, we do not want to run things on the head node, so we should start a new idev session. The reservation name has changed from CCBB_Bio_Summer_School_2016_Day1 to CCBB_Bio_Summer_School_2016_Day2. See if you can figure out how to start your own idev session.

Warning

When following along here, please start an idev session for running any example commands:

Code Block
Code Block
titleclick here for idev command solution
collapsetrue
idev  -m 240 -r CCBB_Bio_Summer_School_2016_Day2 -A UT-2015-05-18 -N 2 -n 8

Illumina sequence data format (FASTQ)

...

  1. Line 1 is the read identifier, which describes the machine, flowcell, cluster, grid coordinate, end and barcode for the read. Except for the barcode information, read identifiers will be identical for corresponding entries in the R1 and R2 fastq files.
  2. Line 2 is the sequence reported by the machine.
  3. Line 3 is always '+' from GSAF (it can optionally include a sequence description but rarely or never actually does)
  4. Line 4 is a string of Ascii-encoded base quality scores, one character per base in the sequence. For each base, an integer quality score = -10 log(probabilty base is wrong) is calculated, then added to 33 to make a number in the ASCII printable character range.

See the Wikipedia FASTQ format page for more information.

Exercise: Examine the 2nd sequence in a FASTQ file

What is the 2nd sequence in the file $BI$WORK/gva_courseGVA2016/mapping/data/SRR030257_1.fastq ?

head $BI/gva_course

Use the head command.

Expand
titleHintStuck? click here for a hint
Code Block
Expand
titleAnswer

head command.

Code Block
titleStill stuck? Click here for the full Head command
collapsetrue
head $WORK/GVA2016/mapping/data/SRR030257_1.fastq
Executing the command above reports that the
Expand
titleAnswer

The 2nd sequence has ID = @SRR030257.2 HWI-EAS_4_PE-FC20GCB:6:1:407:

767/1, and the sequence TAAGCCAGTCGCCATGGAATATCTGCTTTATTTAGC

767/1, and the sequence TAAGCCAGTCGCCATGGAATATCTGCTTTATTTAGC

If thats what you thought it was congratulations, if it is different, do you see where we got it from? If it doesn't make sense ask for help.

Counting sequences

If you get an error from running a program, one of the first thing to check is that the length of your FASTQ files is evenly divisible by four and — if the program expects paired reads — that the R1 and R2 files have the same number of reads. The wc command (word count) using the -l switch to tell it to count lines, not words, is perfect for this:

Code Block
titleUsing wc -l to count lines
collapsetrue
wc -l $BI$WORK/gva_courseGVA2016/mapping/data/SRR030257_1.fastq

...

Expand
titleAnswer

The wc -l command says there are 15200720 lines. FASTQ files have 4 lines per sequence, so the file has 15,200,720/4 or 3,800,180 sequences.

What if your fastq file has been compressed, for example by gzip? By As mentioned many programs have problems if R1 and R2 do not have the same number of reads. While you can obviously change the wc -l command to check R2 rather than R1, fastq files are often stored in a compressed state to save disk space. By using pipes to link commands, you can still count the lines, and you don't have to uncompress the file to do it!to do it! Specifically, you can use gunzip -c to write decompressed data to standard output (-c means "to console", and leaves the original *.gz file untouched). You then pipe that output to wc -l to get the line count. 

Code Block
titleUsing wc -l on a compressed file
collapsetrue
gunzip -c $BI$WORK/GVA2016/webmapping/yeast_stuffdata/Sample_Yeast_L005_R1.cat.fastq.gz | wc -l

Here you use gunzip -c to write decompressed data to standard output (-c means "to console", and leaves the original *.gz file untouched). You then pipe that output to wc -l to get the line count.

Exercise: Counting compressed FASTQ lines

How many sequences are in the compressed FASTQ file above?

Expand
titleAnswer
The wc -l command says there are 2368720 lines so the file has 2,368,720/4 or 592,180 sequences.
SRR030257_2.fastq.gz | wc -l

How many lines/sequences does the compressed file contain? Does this agree with what you found for R1?

Expand
titleHow do I do math on the command line?

The bash shell has a really strange syntax for arithmetic: it uses a double-parenthesis operator. Go figureit uses a double-parenthesis operator. Additionally unlike a calculator that automatically prints the result to the screen when it performs an operation, we have to explicitly tell bash that we want to see what the result is. We do this using the echo command, and assigning the result to a non-named temporary variable.

Code Block
titleArithmetic in Bash
echo $((236872015200720 / 4))

While checking the number of reads a file has can solve some of the most basic problems, it doesn't really provide any direct evidence as to the quality of the sequencing data. To get this type of information before starting meaningful analysis other programs must be used.

FASTQ Evaluation Tools

The first order of business after receiving sequencing data should be to check your data quality. This often-overlooked step helps guide the manner in which you process the data, and can prevent many headaches that could require you to redo an entire analysis after they rear their ugly heads.

...