Differential gene expression analysis
Overview
In this exercise, we will analyze RNA-seq data to measure changes in gene expression levels between wild-type and a mutant strain of the bacterium Listeria monocytogenes.
Learning Objectives
- Review mapping reads with an example of how to use qsub to map many data sets in parallel on TACC.
- Review samtools and SAM/BAM conversion.
- Learn how to use bedtools/HTseq to count reads overlapping genes.
- Become familiar with basic R usage and installing BioConductor modules.
- Learn how to use edgeR/DESeq to identify differentially expressed genes.
Table of Contents
Preliminary
Download data files
Copy the data files for this example into your $SCRATCH space:
cds cp -r $BI/ngs_course/listeria_RNA_seq/data listeria_RNA_seq
File Name |
Description |
Sample |
---|---|---|
|
Single-end Illumina 36-bp reads |
wild-type, biological replicate 1 |
|
Single-end Illumina 36-bp reads |
?sigB mutant, biological replicate 1 |
|
Single-end Illumina 36-bp reads |
wild-type, biological replicate 2 |
|
Single-end Illumina 36-bp reads |
?sigB mutant, biological replicate 2 |
|
Reference Genome sequence (FASTA) |
Listeria monocytogenes strain 10403S |
|
Reference Genome features (GFF) |
Listeria monocytogenes strain 10403S |
This data was submitted to the Sequence Read Archive (SRA) to accompany this paper:
- Oliver, H.F., et al. (2009) Deep RNA sequencing of L. monocytogenes reveals overlapping and extensive stationary phase and sigma B-dependent transcriptomes, including multiple highly transcribed noncoding RNAs. BMC Genomics 10:641. Pubmed
You can view the data in the ENA SRA here: http://www.ebi.ac.uk/ena/data/view/SRP001753
If you want to skip the read alignment step...
To get right to the new stuff, you can copy the mapped read BAM files and the reference sequence files that you will need using these commands:
cds cp -r $BI/ngs_course/listeria_RNA_seq/mapped_data listeria_RNA_seq
Then, skip over the #Create BAM file of mapped reads section below.
Using the R environment for statistical computing
Many of the modules for doing statistical tests on NGS data have been written in the "R" language for statistical computing. If you're not familiar with R, then this section is probably going to be a bit confusing. (You might be thinking "Stop with the new languages already guys! Uncle!") To orient you, we are going to run the R command, which launches the R shell inside our terminal. Like the bash shell that we normally use, the R shell interprets commands, but now they are R commands rather than bash commands. The prompt changes from login1$
to >
when you are in the R shell, to help clue you in to this fact. The R shell is inside the bash shell. So when you quit R, you will be back where you were in the bash shell.
R is the favorite language of pirates.
R is a very common scripting language used in statistics. There are whole courses on using R going on in other SSI classrooms as we speak! Inside the R universe, you have access to an incredibly large number of useful statistical functions (Fisher's exact test, nonlinear least-squares fitting, ANOVA ...). R also has advanced functionality for producing plots and graphs as output. We'll take advantage of all of this here. You are well on your way to becoming denizens of the polyglot bioinformatics community now.
Regrettably, R
is a bit of it's own bizarro world, as far as how its commands work. (Futhermore, Googling "R" to get help can be very frustrating.) The conventions of most other programming and scripting languages seem to have been re-invented by someone who wanted to do everything their own way in R. Just like we wrote shell scripts in bash
, you can write R
scripts that carry out complicated analyses.
Do not copy the > characters in the R examples.
They are the R prompt to remind you which commands are to be run inside the R shell!
Hints for working with R
- Don't forget: it's
q()
to quit. - For help, type
?command
. Try?read.table
. Theq
key gets you out of help, just like for aman
page. - The left arrow
<-
(less-than-dash) is the same as an equals sign=
. You can use them interchangeably. - The prompt we will sometimes be showing for R is
>
. Don't type this for a command. It is like thelogin1$
at the beginning of the bash prompt when you log in to Lonestar. It just means that you are in theR
shell. - You can type the name of a variable to have its value displayed. Like this...
> x <- 10 + 5 + 6 > x [1] 21
Bioconductor modules for R
Like other languages, R can be expanded by loading modules. The R equivalent of Bioperl or Biopython is Bioconductor. Bioconductor can theoretically do things for you like convert sequences (none of us use it for that), but where it really shines is in doing statistical tests (where is it second-to-none in this list of languages). Many functions for analyzing microarray data are implemented in R, and this strength has now carried over to the analysis of RNAseq data.
Here's how you install two modules that we will need for this exercise:
The install commands may take several minutes to complete. You can read ahead while they run or even open a new terminal window and connect it to Lonestar and continue onward in the tutorial as you wait for R.
login1$ module load R login1$ R R version 2.15.3 (2013-03-01) -- "Security Blanket" Copyright (C) 2013 The R Foundation for Statistical Computing ISBN 3-900051-07-0 Platform: x86_64-unknown-linux-gnu (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. > source("http://bioconductor.org/biocLite.R") Warning in install.packages("BiocInstaller", repos = a["BioCsoft", "URL"]) : 'lib = "/opt/apps/R/2.15.3/lib64/R/library"' is not writable Would you like to use a personal library instead? (y/n) y Would you like to create a personal library ~/R/x86_64-unknown-linux-gnu-library/2.15 to install packages into? (y/n) y ... > biocLite("DESeq") ... > biocLite("edgeR") ... > q() Save workspace image? [y/n/c]: n
When you start R later, you will not need to re-install the modules. You can load them with just these commands:
login1$ R > library("DESeq") > library("edgeR")
These commands will work for any Bioconductor module!
Create BAM file of mapped reads
Map reads using Bowtie
For RNA-seq analysis we're mainly counting the reads that align well, so we choose to use bowtie. (You could also use BWA or many other mappers.)
We've done this several times before, so you should be able to come up with the full command lines if you refer back to the original lesson.
Be careful we are now mapping single
Welcome to the University Wiki Service! Please use your IID (yourEID@eid.utexas.edu) when prompted for your email address during login or click here to enter your EID. If you are experiencing any issues loading content on pages, please try these steps to clear your browser cache. If you require further assistance, please email wikihelp@utexas.edu.