...
In this lab, you will explore a fairly typical RNA-seq analysis workflow using the Tuxedo pipeline to identify novel transcripts. Simulated RNA-seq data will be provided to you; the data contains 75 bp paired-end reads that have been generated in silico to replicate real gene count data from Drosophila. The data simulates two biological groups with three biological replicates per group (6 samples total). This simulated data has already been run through a basic RNA-seq analysis the workflow. We will look at:
- How the workflow was run and what steps are involved.
- What genes and isoforms are significantly differentially expressed and what novel transcripts were identified.
Resources
Useful resources for Tuxedo suite are:
- the original RNAseq analysis protocol using Tuxedo article in Nature Protocols, and
- the URL for Tuxedo resource bundles for selected organisms (gff annotations, pre-built bowtie references, etc.)
- the example data we'll use for this tutorial came from this experiment which has the raw fastq data in the SRA.
...
Back to the Big Picture
Let's revisit that pipeline diagram here.
You can see see that tuxedo suite can be used in two main ways:
Paths through the Tuxedo workflow
There are three major paths through this workflow:
NO NOVEL JUNCTIONS: Simple differential gene expression analysis against a set of known splice variants.
- A GTF/GFF file is provided, and you specify that no novel junctions should be explored
- This is by far the fastest path through the workflow.
NOVEL JUNCTIONS:
- Same as 1), but novel splice junctions should be explored in addition to known splice junctions
- A GTF/GFF file is provided, and you let the tool search for novel junctions also
- Use the input data to construct de novo splice junctions without reference to any known splice junctions
- No GTF/GFF is provided
Step I: What does
...
Tophat do?
...
As discovered in previous sections, tophat maps your data to your reference in a transcriptome-aware manner, that will also identify junctions. We've already looked at how you can tell it to identify novel junctions.
tophat recap: options for novelty
--no-novel-juncs | Only look for reads across junctions indicated in the supplied GFF or junctions file. (ignored without -G/-j) |
-G/--GTF <GTF/GFF3 file> | Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file. |
Step 2: What does cufflinks do? and how does it do it?
For each sample, cufflinks assembles aligned reads into transcripts and calculates their abundance.
The new RABT feature
Cuuflinks uses RABT (Reference annotation based transcript assembly) as a method to use existing annotation to guide the assembly of transcripts.
Step 3: What does cuffmerge do? and how does it do it?
For each separate dataset representing a specific replicate and condition, cufflinks assembles a map of genomic areas enriched in aligned reads. cuffmerge Cuffmerge then takes the set of individual assemblies and merges them into a consensus assembly for all the provided datasets. The consensus may include known splice variant annotations if you have provided those to the program.
assembly figure
rabt figure
OUTput files
CUFFMERGE CUFFCOMPARE
.
...
Step 4: What does cuffcompare do? (Optional)
Cuffcompare allows you to compare your assembled transcripts to existing annotation.
Step 5: What does cuffdiff do?
Next, cuffdiff uses the consensus splice variant annotations (and/or the known splice variants) to quantify expression levels of genes and isoforms, using FPKM (fragments per kilobase per million reads) metrics.
Step 2: Run cufflinks
Code Block | ||
---|---|---|
| ||
cufflinks [options] <hits.bam>
|
Look at $BI/ngs_course/tophat_cufflinks/run_commands/cufflinks.commands to see how it was run.
...
cat $BI/ngs_course/tophat_cufflinks/run_commands/cufflinks.commands
Take a minute to look at the output files produced by one cufflinks run.
The important file is transcripts.gtf, which contains Tophat's assembled junctions for C1_R1.
Code Block | ||
---|---|---|
| ||
cd $BI/ngs_course/tophat_cufflinks/C1_R1_clout
ls -l
drwxrwxr-x 2 nsabell G-801021 32768 May 22 15:10 cuffcmp
-rwxr-xr-x 1 daras G-803889 14M Aug 16 12:49 transcripts.gtf
-rwxr-xr-x 1 daras G-803889 597K Aug 16 12:49 genes.fpkm_tracking
-rwxr-xr-x 1 daras G-803889 960K Aug 16 12:49 isoforms.fpkm_tracking
-rwxr-xr-x 1 daras G-803889 0 Aug 16 12:33 skipped.gtf
|
Step 3: Merging assemblies using cuffmerge
Create a file listing the paths of all per-sample transcripts.gtf files so far, then pass that to cuffmerge:
Code Block |
---|
cd $BI/ngs_course/tophat_cufflinks
find . -name transcripts.gtf > assembly_list.txt
cuffmerge <assembly_list.txt>
|
...
Code Block |
---|
cat $BI/ngs_course/tophat_cufflinks/assembly_list.txt
|
Take a minute to look at the output files produced by cuffmerge. The most important file is merged.gif, which contains the consensus transcriptome annotations cuffmerge has calculated.
Code Block | ||
---|---|---|
| ||
cd $BI/ngs_course/tophat_cufflinks/merged_asm
ls -l
-rwxrwxr-x 1 daras G-803889 1571816 Aug 16 2012 genes.fpkm_tracking
-rwxrwxr-x 1 daras G-803889 2281319 Aug 16 2012 isoforms.fpkm_tracking
drwxrwxr-x 2 daras G-803889 32768 Aug 16 2012 logs
-r-xrwxr-x 1 daras G-803889 32090408 Aug 16 2012 merged.gtf
-rwxrwxr-x 1 daras G-803889 0 Aug 16 2012 skipped.gtf
drwxrwxr-x 2 daras G-803889 32768 Aug 16 2012 tmp
-rwxrwxr-x 1 daras G-803889 34844830 Aug 16 2012 transcripts.gtf
|
Step 4: Finding differentially expressed genes and isoforms using cuffdiff
Code Block |
---|
cuffdiff [options] <merged.gtf> <sample1_rep1.bam,sample1_rep2.bam> <sample2_rep1.bam,sample2_rep2.bam>
|
Exercise: What does cuffdiff -b do?
...
-b is for enabling fragment bias correction.
Let's look at the commands to perform these steps and how the output files look...