Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Simply put, all these tools do three steps (and they can vary in how they do these steps):

  • Normalization of gene counts

  • Represent the gene counts by a distribution that defines the relation between mean and variance (dispersion).

  • Perform a statistical test on this distribution to identify genes that are significantly different between the conditions.

  • Provide fold change,  P-value information, false discovery rate for each gene.

Why Normalize?

Normalization smooths out technical variations among the samples we are comparing so that we can more confidently attribute variations we see to biological reasons. 

...

  1. Sequencing depth: Say we are comparing gene counts in sample A against sample B.  If you start out with 10 million reads in sample A  vs 1 million reads in sample B, a 10 fold increase in expression in sample A is going to be purely due to its sequencing depth.
  2. Gene length: A gene that is twice as long is likely to have twice as many reads sampling it.
  3. Other factors like GC content

Most commonly done normalization

RPKM:  Normalizes for sequencing depth and gene length.

RPKM = reads per kilobase per million mapped reads

RPK=

...

No.of

...

Mapped reads

...

/

...

length of transcript in

...

kb (transcript length/1000)

RPKM = RPK/total no.of reads in million (total no of reads/ 1000000)

 

 DESeqedgeRDEXSeqCuffdiff
NormalizationBy library size using median Median scaling factorBy library size using median

Median scaling

By library size using median scaling

FPKM +intra and

inter-library scaling

factor

/TMM

Median scaling factor

FPKM (a slight variation on RPKM)

DistributionNegative binomialNegative binomialNegative binomial Negative binomial
DE TestFisher exact testFisher exact testModified T testT test
Advantages

Straightforward, has a method to

work on data with no replicates

 

Straightforward, good with

small number of replicates

Good for identifying

exon-level changes

Good for identifying

isoform-level changes, splicing changes

R and Bioconductor, very briefly...

...

,

promotor changes. Not as straightforward,

somewhat of a black box

DEG tools compared

From paper: Wesolowski et al, Biosensors, 2013

Image Added

Image AddedImage Added

 

R and Bioconductor, very briefly...

R is a very common scripting language used in statistics. There are whole courses on using R going on in other SSI classrooms as we speak! Inside the R universe, you have access to an incredibly large number of useful statistical functions (Fisher's exact test, nonlinear least-squares fitting, ANOVA ...). R also has advanced functionality for producing plots and graphs as output. 

...

Warning

The install commands may take several minutes to complete. You can read ahead while they run or even open a new terminal window and connect it to Lonestar and continue onward in the tutorial as you wait for R.

Code Block
titleStarting R and loading the modules for this tutorial
login1$ module load R login1$ R R version 2.15.3 (2013-03-01) -- "Security Blanket" Copyright (C) 2013 The R Foundation for Statistical Computing ISBN 3-900051-07-0 Platform: x86_64-unknown-linux-gnu (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.

.

Code Block
titleStarting R and loading the modules for this tutorial
login1$ module load R
login1$ R

> source("http://bioconductor.org/biocLite.R")
Warning in install.packages("BiocInstaller", repos = a["BioCsoft", "URL"]) :
  'lib = "/opt/apps/R/2.15.3/lib64/R/library"' is not writable
Would you like to use a personal library instead?  (y/n) y
Would you like to create a personal library
~/R/x86_64-unknown-linux-gnu-library/2.15
to install packages into?  (y/n) y
...
> biocLite("DESeq")
...
> biocLite("edgeR")
...
> q()
Save workspace image? [y/n/c]: n

...

 

Code Block
titleStarting R and loading modules after they are installed
login1$ R
> library("DESeq")
> library("edgeR")

These commands will work for any Bioconductor package!

DESeq

Input:

DESeq takes as input count data in table form, with each column representing a biological replicate/biological condition.  The count data must be raw counts of sequencing reads, not already normalized data. 

Example:

                  untreated1        untreated2        untreated3        untreated4       treated1         treated2       treated3

        FBgn0000003    0                       0                       0                       0                       0                    0                  1
        FBgn0000008    92                     161                   76                     70                     140                88                70
        FBgn0000014    5                       1                       0                       0                       4                    0                  0

Our gene_counts.gff file is almost there- we can remove some extra gene related information in the counts file to make it less complex.

Code Block
titleUsing DESeq
login1$ R R
#LOAD LIBRARY, READ IN GENE COUNTS, ADD METADATA
> library("DESeq")
> counts = read.delim("gene_counts.htseq.final.gff", header=F, row.names=1)
> head(counts)
> colnames(counts) = c("C11","C12","C13","C21","C22","C23")
> head(counts)
> my.design <- data.frame(
  row.names = colnames( counts ),
  condition = c( "C1", "C1", "C1", "C2", "C2", "C2"),
  libType = c( "paired-end", "paired-end", "paired-end", "paired-end", "paired-end", "paired-end" )
)
> my.design
> conds> conds <- factor(my.design$condition))
 
 
#CREATE A DESEQ DATA OBJECT FROM COUNTS
> cds <- newCountDataSet( counts, conds )
> cds

 
#NORMALIZATION: ESTIMATE SIZE FACTORS
> cds <- newCountDataSetestimateSizeFactors( counts, condscds )
> sizeFactors( cds )
> cds <- estimateSizeFactorshead( counts( cds, normalized=TRUE ) >)

sizeFactors( 
cds#ESTIMATE )DISPERSION/VARIANCE
> cds <- estimateDispersions( cds )
)


#DO TEST FOR DIFFERENTIAL EXPRESSION AND WRITE RESULTS INTO FILE
> result <- nbinomTest( cds, "C1", "C2" )
> head(result)
> result = result[order(result$pval), ]
> head(result)
> write.csv(result, "DESeq-C1-vs-C2.csv")


#GENERATE MA PLOT
> pdf("DESeq-MA-plot.pdf")
> plot(
  result$baseMean,
  result$log2FoldChange,
  log="x", pch=20, cex=.3,
  col = ifelse( result$padj < .1, "red", "black" ) )
> dev.off()

> q()

Save workspace image? [y/n/c]: n
login1$ head DESeq-wt-vs-mut.csv

edgeR

These commands use the negative binomial model, calculate the false discovery rate (FDR ~ adjusted p-value), and make a plot similar to the one from DESeq.

Code Block
titleUsing edgeR
login1$ R
...

#LOAD LIBRARY, READ IN COUNTS, ADD SOME METADATA
> library("edgeR")
> counts = read.delim("gene_counts.htseq.gff", header=F, row.names=1)
> colnames(counts) = c("C11", "C12", "C13", "C21", "C22", "C23")
> head(counts)
> group <- factor(c("C1", "C1", "C1", "C2", "C2", "C2", "C2")) "C2"))
 
 
#CREATE EDGER SPECIFIC DATA OBJECT FROM COUNTS DATA
> dge = DGEList(counts=counts,group=group)
 
 
#ESTIMATE DISPERSIONS(VARIANCE)
> dge <- estimateCommonDisp(dge)
> dge <- estimateTagwiseDisp(dge)
 
 
#DO TEST FOR DIFFERENTIAL EXPRESSION
> et <- exactTest(dge)
> etp <- topTags(et, n=100000)

 
#GENERATE MA PLOT
> pdf("edgeR-MA-plot.pdf")
> plot(
  etp$table$logCPM,
  etp$table$logFC,
  xlim=c(-3, 20), ylim=c(-12, 12), pch=20, cex=.3,
  col = ifelse( etp$table$FDR < .1, "red", "black" ) )
> dev.off()
> dev.off()

 
#WRITE OUT GENE RESULTS
> write.csv(etp$table, "edgeR-wt-vs-mut.csv")
> q()
Save workspace image? [y/n/c]: y
login1$ head edgeR-wt-vs-mut.csv
Note that the "FC" fold change calculated is initially the reverse of that for the DESeq example for the output here. It is wt relative to mut. To fix this, we put a negative in there for the log fold change.
mut.csv

DEXSeq

This package is meant for finding differential exon usage between samples from different conditions.

Relative usage of an exon  =   transcripts from the gene that contain this exon / all transcripts from the gene 

For each exon (or part of an exon) and each sample :

  1. count how many reads map to this exon
  2. count how many reads map to other exons of the same gene.
  3. calculate ratio of 1 to 2. 
  4. Look for changes in this ratio across conditions
  5. Look for statistically significant changes in this ratio across conditions, by using replicates.

This lets you identify changes in alternative splicing, changes in usage of alternative transcript start sites.

Cuffdiff

Cuffdiff (a part of the tuxedo suite) is a popular tool for testing for differential expression.  We will cover this along with the rest of the tuxedo suite tomorrow. 

 

Lets look at our results

Code Block
titleStarting R and loading modules after they are installed
login1$ R
> head DESeq-wt-vs-mut.csv
> head edgeR-wt-vs-mut.csv

 

The graphs we generated are located at:

http://web.corral.tacc.utexas.edu/BioITeam/rnaseq_course/DESeq-MA-plot.pdf

http://web.corral.tacc.utexas.edu/BioITeam/rnaseq_course/edgeR-MA-plot.pdf