Tuxedo Suite For Splice Variant Analysis and Identifying Novel Transcripts

Objectives

In this lab, you will explore a fairly typical RNA-seq analysis workflow using the Tuxedo pipeline. Simulated RNA-seq data will be provided to you; the data contains 75 bp paired-end reads that have been generated in silico to replicate real gene count data from Drosophila. The data simulates two biological groups with three biological replicates per group (6 samples total). This simulated data has already been run through a basic RNA-seq analysis workflow. We will look at:

How the workflow was run and what steps are involved.
What genes and isoforms are significantly differentially expressed.

Resources

Useful resources for Tuxedo suite are:

the original RNAseq analysis protocol using Tuxedo article in Nature Protocols, and
the URL for Tuxedo resource bundles for selected organisms (gff annotations, pre-built bowtie references, etc.)
the example data we'll use for this tutorial came from this experiment which has the raw fastq data in the SRA.

Let's revisit that pipeline diagram here.

You can see see that tuxedo suite can be used in two main ways:

Paths through the Tuxedo workflow

There are three major paths through this workflow:

NO NOVEL JUNCTIONS: Simple differential gene expression analysis against a set of known splice variants.

- A GTF/GFF file is provided, and you specify that no novel junctions should be explored
- This is by far the fastest path through the workflow.

NOVEL JUNCTIONS:

Same as 1), but novel splice junctions should be explored in addition to known splice junctions
- A GTF/GFF file is provided, and you let the tool search for novel junctions also
Use the input data to construct de novo splice junctions without reference to any known splice junctions
- No GTF/GFF is provided

What does tophat do?

As discovered in previous sections, tophat maps your data to your reference in a transcriptome-aware manner, that will also identify junctions. We've already looked at how you can tell it to identify novel junctions.

tophat recap: options for novelty

What does cufflinks do? and how does it do it?

For each separate dataset representing a specific replicate and condition, cufflinks assembles a map of genomic areas enriched in aligned reads. cuffmerge then takes the set of individual assemblies and merges them into a consensus assembly for all the provided datasets. The consensus may include known splice variant annotations if you have provided those to the program

assembly figure

rabt figure

OUTput files

CUFFMERGE CUFFCOMPARE

.

What cuffdiff does

Next, cuffdiff uses the consensus splice variant annotations (and/or the known splice variants) to quantify expression levels of genes and isoforms, using FPKM (fragments per kilobase per million reads) metrics.

Step 2: Run cufflinks

General syntax for cufflinks command

cufflinks [options] <hits.bam>

Look at $BI/ngs_course/tophat_cufflinks/run_commands/cufflinks.commands to see how it was run.

How do I look into the file?

cat $BI/ngs_course/tophat_cufflinks/run_commands/cufflinks.commands

Take a minute to look at the output files produced by one cufflinks run.
The important file is transcripts.gtf, which contains Tophat's assembled junctions for C1_R1.

Cufflinks output files

cd $BI/ngs_course/tophat_cufflinks/C1_R1_clout
ls -l


drwxrwxr-x 2 nsabell G-801021    32768 May 22 15:10 cuffcmp
-rwxr-xr-x 1 daras G-803889  14M Aug 16 12:49 transcripts.gtf
-rwxr-xr-x 1 daras G-803889 597K Aug 16 12:49 genes.fpkm_tracking
-rwxr-xr-x 1 daras G-803889 960K Aug 16 12:49 isoforms.fpkm_tracking
-rwxr-xr-x 1 daras G-803889    0 Aug 16 12:33 skipped.gtf

Step 3: Merging assemblies using cuffmerge

Create a file listing the paths of all per-sample transcripts.gtf files so far, then pass that to cuffmerge:

cd $BI/ngs_course/tophat_cufflinks
find . -name transcripts.gtf > assembly_list.txt
cuffmerge <assembly_list.txt>

assembly_list.txt contents

cat $BI/ngs_course/tophat_cufflinks/assembly_list.txt

Take a minute to look at the output files produced by cuffmerge. The most important file is merged.gif, which contains the consensus transcriptome annotations cuffmerge has calculated.

cuffmerge output

cd $BI/ngs_course/tophat_cufflinks/merged_asm
ls -l

-rwxrwxr-x  1 daras G-803889  1571816 Aug 16  2012 genes.fpkm_tracking
-rwxrwxr-x  1 daras G-803889  2281319 Aug 16  2012 isoforms.fpkm_tracking
drwxrwxr-x  2 daras G-803889    32768 Aug 16  2012 logs
-r-xrwxr-x  1 daras G-803889 32090408 Aug 16  2012 merged.gtf
-rwxrwxr-x  1 daras G-803889        0 Aug 16  2012 skipped.gtf
drwxrwxr-x  2 daras G-803889    32768 Aug 16  2012 tmp
-rwxrwxr-x  1 daras G-803889 34844830 Aug 16  2012 transcripts.gtf

Step 4: Finding differentially expressed genes and isoforms using cuffdiff

cuffdiff [options] <merged.gtf> <sample1_rep1.bam,sample1_rep2.bam> <sample2_rep1.bam,sample2_rep2.bam>

Exercise: What does cuffdiff -b do?

Solution

-b is for enabling fragment bias correction.