/
Differential gene expression analysis

Differential gene expression analysis

Overview

In this exercise, we will analyze RNA-seq data to measure changes in gene expression levels between wild-type and a mutant strain of the bacterium Listeria monocytogenes.

Learning Objectives

  • Review mapping reads with an example of how to use qsub to map many data sets in parallel on TACC.
  • Review samtools and SAM/BAM conversion.
  • Learn how to use bedtools/HTseq to count reads overlapping genes.
  • Become familiar with basic R usage and installing BioConductor modules.
  • Learn how to use edgeR/DESeq to identify differentially expressed genes.

Table of Contents

Preliminary

Download data files

Copy the data files for this example into your $SCRATCH space:

cds
cp -r $BI/ngs_course/listeria_RNA_seq/data listeria_RNA_seq

File Name

Description

Sample

SRR034450.fastq

Single-end Illumina 36-bp reads

wild-type, biological replicate 1

SRR034451.fastq

Single-end Illumina 36-bp reads

?sigB mutant, biological replicate 1

SRR034452.fastq

Single-end Illumina 36-bp reads

wild-type, biological replicate 2

SRR034453.fastq

Single-end Illumina 36-bp reads

?sigB mutant, biological replicate 2

NC_017544.1.fasta

Reference Genome sequence (FASTA)

Listeria monocytogenes strain 10403S

NC_017544.1.gff

Reference Genome features (GFF)

Listeria monocytogenes strain 10403S

This data was submitted to the Sequence Read Archive (SRA) to accompany this paper:

  • Oliver, H.F., et al. (2009) Deep RNA sequencing of L. monocytogenes reveals overlapping and extensive stationary phase and sigma B-dependent transcriptomes, including multiple highly transcribed noncoding RNAs. BMC Genomics 10:641. Pubmed

You can view the data in the ENA SRA here: http://www.ebi.ac.uk/ena/data/view/SRP001753

If you want to skip the read alignment step...

To get right to the new stuff, you can copy the mapped read BAM files and the reference sequence files that you will need using these commands:

cds
cp -r $BI/ngs_course/listeria_RNA_seq/mapped_data listeria_RNA_seq

Then, skip over the #Create BAM file of mapped reads section below.

Using the R environment for statistical computing

Many of the modules for doing statistical tests on NGS data have been written in the "R" language for statistical computing. If you're not familiar with R, then this section is probably going to be a bit confusing. (You might be thinking "Stop with the new languages already guys! Uncle!") To orient you, we are going to run the R command, which launches the R shell inside our terminal. Like the bash shell that we normally use, the R shell interprets commands, but now they are R commands rather than bash commands. The prompt changes from login1$ to > when you are in the R shell, to help clue you in to this fact. The R shell is inside the bash shell. So when you quit R, you will be back where you were in the bash shell.

R is the favorite language of pirates.

R is a very common scripting language used in statistics. There are whole courses on using R going on in other SSI classrooms as we speak! Inside the R universe, you have access to an incredibly large number of useful statistical functions (Fisher's exact test, nonlinear least-squares fitting, ANOVA ...). R also has advanced functionality for producing plots and graphs as output. We'll take advantage of all of this here. You are well on your way to becoming denizens of the polyglot bioinformatics community now.

Regrettably, R is a bit of it's own bizarro world, as far as how its commands work. (Futhermore, Googling "R" to get help can be very frustrating.) The conventions of most other programming and scripting languages seem to have been re-invented by someone who wanted to do everything their own way in R. Just like we wrote shell scripts in bash, you can write R scripts that carry out complicated analyses.

Do not copy the > characters in the R examples.

They are the R prompt to remind you which commands are to be run inside the R shell!

Hints for working with R

  • Don't forget: it's q() to quit.
  • For help, type ?command. Try ?read.table. The q key gets you out of help, just like for a man page.
  • The left arrow <- (less-than-dash) is the same as an equals sign =. You can use them interchangeably.
  • The prompt we will sometimes be showing for R is >. Don't type this for a command. It is like the login1$ at the beginning of the bash prompt when you log in to Lonestar. It just means that you are in the R shell.
  • You can type the name of a variable to have its value displayed. Like this...
    > x <- 10 + 5 + 6
    > x
    [1] 21
    

Bioconductor modules for R

Like other languages, R can be expanded by loading modules. The R equivalent of Bioperl or Biopython is Bioconductor. Bioconductor can theoretically do things for you like convert sequences (none of us use it for that), but where it really shines is in doing statistical tests (where is it second-to-none in this list of languages). Many functions for analyzing microarray data are implemented in R, and this strength has now carried over to the analysis of RNAseq data.

Here's how you install two modules that we will need for this exercise:

The install commands may take several minutes to complete. You can read ahead while they run or even open a new terminal window and connect it to Lonestar and continue onward in the tutorial as you wait for R.

Starting R and loading the modules for this tutorial
login1$ module load R
login1$ R

R version 2.15.3 (2013-03-01) -- "Security Blanket"
Copyright (C) 2013 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> source("http://bioconductor.org/biocLite.R")
Warning in install.packages("BiocInstaller", repos = a["BioCsoft", "URL"]) :
  'lib = "/opt/apps/R/2.15.3/lib64/R/library"' is not writable
Would you like to use a personal library instead?  (y/n) y
Would you like to create a personal library
~/R/x86_64-unknown-linux-gnu-library/2.15
to install packages into?  (y/n) y
...
> biocLite("DESeq")
...
> biocLite("edgeR")
...
> q()
Save workspace image? [y/n/c]: n

When you start R later, you will not need to re-install the modules. You can load them with just these commands:

Starting R and loading modules after they are installed
login1$ R
> library("DESeq")
> library("edgeR")

These commands will work for any Bioconductor module!

Create BAM file of mapped reads

Map reads using Bowtie

For RNA-seq analysis we're mainly counting the reads that align well, so we choose to use bowtie. (You could also use BWA or many other mappers.)

We've done this several times before, so you should be able to come up with the full command lines if you refer back to the original lesson.

Be careful we are now mapping single