Objectives
In this lab, you will explore a fairly typical RNA-seq analysis workflow using the Tuxedo pipeline. Simulated RNA-seq data will be provided to you; the data contains 75 bp paired-end reads that have been generated in silico to replicate real gene count data from Drosophila. The data simulates two biological groups with three biological replicates per group (6 samples total). This simulated data has already been run through a basic RNA-seq analysis workflow. We will look at:
- How the workflow was run and what steps are involved.
- What genes and isoforms are significantly differentially expressed.
Resources
Useful resources for Tuxedo suite are:
- the original RNAseq analysis protocol using Tuxedo article in Nature Protocols, and
- the URL for Tuxedo resource bundles for selected organisms (gff annotations, pre-built bowtie references, etc.)
- the example data we'll use for this tutorial came from this experiment which has the raw fastq data in the SRA.
Let's revisit that pipeline diagram here.
You can see see that tuxedo suite can be used in two main ways:
Paths through the Tuxedo workflow
There are three major paths through this workflow:
NO NOVEL JUNCTIONS: Simple differential gene expression analysis against a set of known splice variants.
- A GTF/GFF file is provided, and you specify that no novel junctions should be explored
- This is by far the fastest path through the workflow.
NOVEL JUNCTIONS:
- Same as 1), but novel splice junctions should be explored in addition to known splice junctions
- A GTF/GFF file is provided, and you let the tool search for novel junctions also
- Use the input data to construct de novo splice junctions without reference to any known splice junctions
- No GTF/GFF is provided
What does tophat do?
As discovered in previous sections, tophat maps your data to your reference in a transcriptome-aware manner, that will also identify junctions. We've already looked at how you can tell it to identify novel junctions.
tophat recap: options for novelty
What does cufflinks do? and how does it do it?
For each separate dataset representing a specific replicate and condition, cufflinks assembles a map of genomic areas enriched in aligned reads. cuffmerge then takes the set of individual assemblies and merges them into a consensus assembly for all the provided datasets. The consensus may include known splice variant annotations if you have provided those to the program
assembly figure
rabt figure
OUTput files
CUFFMERGE CUFFCOMPARE
.
What cuffdiff does
Next, cuffdiff uses the consensus splice variant annotations (and/or the known splice variants) to quantify expression levels of genes and isoforms, using FPKM (fragments per kilobase per million reads) metrics.
Step 2: Run cufflinks
cufflinks [options] <hits.bam>
Look at $BI/ngs_course/tophat_cufflinks/run_commands/cufflinks.commands to see how it was run.
Take a minute to look at the output files produced by one cufflinks run.
The important file is transcripts.gtf, which contains Tophat's assembled junctions for C1_R1.
cd $BI/ngs_course/tophat_cufflinks/C1_R1_clout ls -l drwxrwxr-x 2 nsabell G-801021 32768 May 22 15:10 cuffcmp -rwxr-xr-x 1 daras G-803889 14M Aug 16 12:49 transcripts.gtf -rwxr-xr-x 1 daras G-803889 597K Aug 16 12:49 genes.fpkm_tracking -rwxr-xr-x 1 daras G-803889 960K Aug 16 12:49 isoforms.fpkm_tracking -rwxr-xr-x 1 daras G-803889 0 Aug 16 12:33 skipped.gtf
Step 3: Merging assemblies using cuffmerge
Create a file listing the paths of all per-sample transcripts.gtf files so far, then pass that to cuffmerge:
cd $BI/ngs_course/tophat_cufflinks find . -name transcripts.gtf > assembly_list.txt cuffmerge <assembly_list.txt>
Take a minute to look at the output files produced by cuffmerge. The most important file is merged.gif, which contains the consensus transcriptome annotations cuffmerge has calculated.
cd $BI/ngs_course/tophat_cufflinks/merged_asm ls -l -rwxrwxr-x 1 daras G-803889 1571816 Aug 16 2012 genes.fpkm_tracking -rwxrwxr-x 1 daras G-803889 2281319 Aug 16 2012 isoforms.fpkm_tracking drwxrwxr-x 2 daras G-803889 32768 Aug 16 2012 logs -r-xrwxr-x 1 daras G-803889 32090408 Aug 16 2012 merged.gtf -rwxrwxr-x 1 daras G-803889 0 Aug 16 2012 skipped.gtf drwxrwxr-x 2 daras G-803889 32768 Aug 16 2012 tmp -rwxrwxr-x 1 daras G-803889 34844830 Aug 16 2012 transcripts.gtf
Step 4: Finding differentially expressed genes and isoforms using cuffdiff
cuffdiff [options] <merged.gtf> <sample1_rep1.bam,sample1_rep2.bam> <sample2_rep1.bam,sample2_rep2.bam>
Exercise: What does cuffdiff -b do?