Objectives
In this exercise, we will examine (but not run, since it takes too long with any file big enough to get a result worth looking at) the details of analyzing a RIP-seq experiment using Cuffdiff to identify enriched genes in the IP relative to the control (and vice versa). In this case, Ago2 was pulled-down and associated whole-mRNAs fragmented and sequenced. The entire Tuxedo pipeline was used to analyze the data, including a Tophat2 alignment coupled with differential expression analysis with Cuffdiff.
Primary Data
The data consists of a set of files located in this directory:
$BI/rnaseq_course/ripseq_exercises/
You will see two subdirectories called "RIP" and "CLIP", each with files we will look at during these exercises. Go ahead and copy the "ripseq_exercises" directory and all its contents to your scratch area and load up the cufflinks module (which contains cuffdiff):
cds export BI=/corral-repl/utexas/BioITeam/ cp -r $BI/rnaseq_course/ripseq_exercises/ . module load cufflinks/2.1.1
The cuffdiff command syntax looks like this, and has a TON of options:
cuffdiff cuffdiff v2.1.1 (4046M) ----------------------------- Usage: cuffdiff [options] <transcripts.gtf> <sample1_hits.sam> <sample2_hits.sam> [... sampleN_hits.sam] Supply replicate SAMs as comma separated lists for each condition: sample1_rep1.sam,sample1_rep2.sam,...sample1_repM.sam
Cuffdiff
Cuffdiff supports a wide range of dispersion estimation and library normalization methods, the details of which can be relevant when doing complex RIP-seq analysis. However, for most applications, the defaults of cuffdiff are perfectly fine, or at least a good place to start. If you look in the cuffdiff.cmds file located in the RIP subdirectory, you will find this:
cd $SCRATCH/ripseq_exercises/RIP/ cat cuffdiff.cmds #Returns: cuffdiff -p 6 --frag-bias-correct /BowtieIndex/genome.fa -u \ --library-type=fr-unstranded --emit-count-tables --multi-read-correct --min-alignment-count=5 \ --min-reps-for-js-test=1 --dispersion-method=per-condition --library-norm-method=geometric \ -o cuffdiff_out -L IP,INPUT ./genes.gtf \ ./ip_sample/accepted_hits.bam ./input_sample/accepted_hits.bam
Option | Parameter | Relevance |
---|---|---|
- o | Output directory | Directs the (many) output files and directories |
- p | Number of threads | Should match wayness setting in any batch jobs |
- u | Multi read correct | Identify multiply-mapped reads and modify feature counts |
- L | Labels | Labels to attach conditions; if not given, will just be numbers |
--frag-bias-correct (also -b) | Fragment bias correction | Adjusts output for sequence-specific library prep bias and requires the FASTA file used in alignment |
--min-alignment-count | Reads required to test region | STRONGLY affects runtime, since it determines the number of tests performed by cuffdiff |
--min-reps-for-js-test=1 | Replicates required to perform tests | Default is 3 replicates per condition, so if you have only 2 replicates, this must be set |
--library-type=fr-unstranded | Library type | Most of the time, this is the appropriate setting |
--dispersion-method=per-condition | Dispersion estimation method | Dispersions calculated for different conditions separately |
--library-norm-method=geometric | Library normalization method | Geometric mean of each library used to adjust for sequencing depth |
We're not going to run cuffdiff, since I didn't provide any BAM files. This is because, unlike many other programs, cuffdiff does not reduce the size of the library by removing duplicated reads, because it assumes that increases in total expression level indicate increases in expression level (at least most of the time). Instead, move into the cuffdiff_out directory, and start looking at the output files:
cd $SCRATCH/ripseq_exercises/RIP/cuffdiff_out/ ls -la less gene_exp.diff
The Cuffdiff manual page describes the content of each of these files, including many that simply record your command parameters or provide absolute abundance estimates (e.g. FPKMs). The most important for immediate analysis, though, is usually the "diff" files, which describes the results of differential expression tests on different sets of features. Here are a few commands that let you see interesting groups of genes and their associated expression and test values:
cut -f 1,10 gene_exp.diff | sort -g -k2,2 | grep -v 'e+308' | less # This gives a view of genes with the biggest fold change awk '{if ($14=="yes") print $0}' gene_exp.diff | cut -f 1,10 | sort -g -k2,2 | less # This gives genes that are called as differentially expressed AND are enriched in the IP relative to the Input, along with their fold changes
But what if you want to know about binding regions, not whole transcripts? And what if you don't have (or want to use) a reference annotation to define the coordinates that are tested between samples?