Differential expression with splice variant analysis at the same time: the Tuxedo pipeline
...
- the original RNAseq analysis protocol using Tuxedo article in Nature Protocols, and
- the URL for Tuxedo resource bundles for selected organisms (gff annotations, pre-built bowtie references, etc.)
- the example data we'll use for this tutorial came from this experiment which has the raw fastq data in the SRA.
Objectives
In this lab, you will explore a fairly typical RNA-seq analysis workflow using the Tuxedo pipeline. Simulated RNA-seq data will be provided to you; the data contains 75 bp paired-end reads that have been generated in silico to replicate real gene count data from Drosophila. The data simulates two biological groups with three biological replicates per group (6 samples total). This simulated data has already been run through a basic RNA-seq analysis workflow. We will look at:
...
The 1st, Tophat step is always required and sets the stage for all that follows. Tophat does a transcriptome-aware alignment of the input sequences to a reference genome using either the Bowtie or Bowtie2 aligner (in theory it can use other aligners, but we do not recommend this).
Split Read Alignment (Splice Finding)
To do this, Tophat goes through several steps:
...
What cufflinks and cuffmerge do
Cuffmerge Assembly
For each separate dataset representing a specific replicate and condition, cufflinks assembles a map of genomic areas enriched in aligned reads. cuffmerge then takes the set of individual assemblies and merges them into a consensus assembly for all the provided datasets. The consensus may include known splice variant annotations if you have provided those to the program.
...
Although we won't cover these issues here, there are some issues you should consider before embarking on the Tuxedo pipeline:
Should my FASTQ sequences be trimmed to remove low-quality 3' bases?
Expand |
---|
|
Possibly, if FastQC or other base quality reports show the data is really poor. But generally the fact that Tophat splits long reads into smaller fragments mitigates the need to do this. |
Should I remove adapter sequences before running Tophat?
Expand |
---|
|
This is usually a good idea because un-template adapter bases have a more drastic effect on reducing mappability than do low-quality 3' bases. |
Should I attempt to remove sequences that map to undesired RNAs before running Tophat? (rRNA for example)
Expand |
---|
|
This is also usually a good idea, because such rRNA sequences can be a substantial proportion of your data (depending on library prep method), and this can skew cuffdiff's fragment counting statistics. |
How would, for example, rRNA sequence removal be done?
Expand |
---|
|
Maybe something like this: - Align your sequences to a reference "genome" consisting only of rRNA gene sequences.
- Extract only the sequences that do not align to the rRNA reference into a new FASTQ file and use that as Tophat input.
|
What other pre-processing steps might I consider?
Expand |
---|
|
There are many, and it will depend on your data and what you want to get out of it. If you have paired-end data, tophat asks you to provide the mean fragment (insert) size and the standard deviation for insert sizes in your library. One common pre-processing step to achieve this would be to do a quick paired-end alignment of, for example, about 1 million sequences to a reference genome. Then you could calculate the mean and standard deviation of insert sizes for properly paired reads from the resulting BAM file records, and pass these values to Tophat. |
...
Expand |
---|
| Reminder on how to scp files from lonestar |
---|
| Reminder on how to scp files from lonestar |
---|
|
On your computer's side: Go to the directory where you want to copy files to. Code Block |
---|
scp my_user_name@lonestar.tacc.utexas.edu:/home/.../stuff.fastq ./
|
Replace the "/home/..." with the "pwd" information obtained earlier. This command would transfer "stuff.fastq" from the specified directory on Lonestar to your current directory on your computer. |
...
On lonestar, to run tophat, cufflinks etc, following modules need to be loaded.
Code Block |
---|
module load boost/1.45.0
module load bowtie
module load tophat
module load cufflinks/2.0.2
|
Code Block |
---|
title | General syntax for tophat command |
---|
|
tophat [options] <bowtie_index_prefix> <reads1> <reads2>
|
...
Code Block |
---|
title | Take a minute to look at the output files produced by one tophat run. |
---|
|
cd $BI/ngs_course/tophat_cufflinks/C1_R1_thout
ls -l
-rwxr-xr-x 1 daras G-803889 323M Aug 16 11:47 accepted_hits.bam
-r-xr-xr-x 1 daras G-803889 237K Aug 16 11:46 accepted_hits.bam.bai
-rwxr-xr-x 1 daras G-803889 52 Aug 16 11:46 deletions.bed
-rwxr-xr-x 1 daras G-803889 54 Aug 16 11:46 insertions.bed
-rwxr-xr-x 1 daras G-803889 2.9M Aug 16 11:46 junctions.bed
-rwxr-xr-x 1 daras G-803889 70 Aug 16 11:46 left_kept_reads.info
drwxr-xr-x 2 daras G-803889 32K Aug 16 11:46 logs
-rwxr-xr-x 1 daras G-803889 70 Aug 16 11:46 right_kept_reads.info
-rwxr-xr-x 1 daras G-803889 9.7K Aug 16 11:46 unmapped_left.fq.z
-rwxr-xr-x 1 daras G-803889 9.9K Aug 16 11:46 unmapped_right.fq.z
|
...
Expand |
---|
|
Any one of these: Code Block |
---|
less $BI/ngs_course/tophat_cufflinks/reference/genes.gtf # :q to exit
cat $BI/ngs_course/tophat_cufflinks/reference/genes.gtf | head
cat $BI/ngs_course/tophat_cufflinks/reference/genes.gtf | cut -f 1-8 | more
cat $BI/ngs_course/tophat_cufflinks/reference/genes.gtf | cut -f 9 | more
|
|
...
Expand |
---|
|
Code Block |
---|
cd $BI/ngs_course/tophat_cufflinks/C1_R1_thout
ls -la
samtools view -x accepted_hits.bam | less # :q to quit
|
|
...
Expand |
---|
|
Code Block |
---|
| Looking at the CIGAR string |
---|
| Looking at the CIGAR string |
---|
| |
The CIGAR string "58M76N17M" representst a spliced sequence. The codes mean: - 56M - the first 58 bases match the reference
- 76N - there are then 76 bases on the reference with no corresponding bases in the sequence (an intron)
- 17M - the last 17 bases match the reference
|
...
Expand |
---|
|
Code Block |
---|
cd $BI/ngs_course/tophat_cufflinks/C1_R1_thout
samtools view accepted_hits.bam | cut -f 6 | grep 'N' | wc -l
|
|
...
Code Block |
---|
title | General syntax for cufflinks command |
---|
|
cufflinks [options] <hits.bam>
|
...
Code Block |
---|
title | Cufflinks output files |
---|
|
cd $BI/ngs_course/tophat_cufflinks/C1_R1_clout
ls -l
drwxrwxr-x 2 nsabell G-801021 32768 May 22 15:10 cuffcmp
-rwxr-xr-x 1 daras G-803889 14M Aug 16 12:49 transcripts.gtf
-rwxr-xr-x 1 daras G-803889 597K Aug 16 12:49 genes.fpkm_tracking
-rwxr-xr-x 1 daras G-803889 960K Aug 16 12:49 isoforms.fpkm_tracking
-rwxr-xr-x 1 daras G-803889 0 Aug 16 12:33 skipped.gtf
|
...
Create a file listing the paths of all per-sample transcripts.gtf files so far, then pass that to cuffmerge:
Code Block |
---|
cd $BI/ngs_course/tophat_cufflinks
find . -name transcripts.gtf > assembly_list.txt
cuffmerge <assembly_list.txt>
|
Expand |
---|
| assembly_list.txt contents |
---|
| assembly_list.txt contents |
---|
|
Code Block |
---|
cat $BI/ngs_course/tophat_cufflinks/assembly_list.txt
|
|
...
Code Block |
---|
|
cd $BI/ngs_course/tophat_cufflinks/merged_asm
ls -l
-rwxrwxr-x 1 daras G-803889 1571816 Aug 16 2012 genes.fpkm_tracking
-rwxrwxr-x 1 daras G-803889 2281319 Aug 16 2012 isoforms.fpkm_tracking
drwxrwxr-x 2 daras G-803889 32768 Aug 16 2012 logs
-r-xrwxr-x 1 daras G-803889 32090408 Aug 16 2012 merged.gtf
-rwxrwxr-x 1 daras G-803889 0 Aug 16 2012 skipped.gtf
drwxrwxr-x 2 daras G-803889 32768 Aug 16 2012 tmp
-rwxrwxr-x 1 daras G-803889 34844830 Aug 16 2012 transcripts.gtf
|
Step 4: Finding differentially expressed genes and isoforms using cuffdiff
Code Block |
---|
cuffdiff [options] <merged.gtf> <sample1_rep1.bam,sample1_rep2.bam> <sample2_rep1.bam,sample2_rep2.bam>
|
...
Expand |
---|
| Reminder on how to run samtools flagstat |
---|
| Reminder on how to run samtools flagstat |
---|
|
Code Block |
---|
samtools flagstat C1_R1.bam
|
|
...
Expand |
---|
| Reminder on how to download and run IGV |
---|
| Reminder on how to download and run IGV |
---|
|
Warning |
---|
Do this on your local machine, not on TACC |
Code Block |
---|
title | Downloading and running IGV |
---|
|
wget http://www.broadinstitute.org/igv/projects/downloads/IGV_2.1.14.zip
unzip IGV_2.1.14.zip
cd IGV_2.1.14
java -Xmx2g -jar igv.jar
|
|
...
The cuffdiff output is in a directory called diff_out. We are going to spend some time parsing through this output. So, copy it over to your scratch directory and move to your SCRATCH directory.
Code Block |
---|
cds
cp -r $BI/ngs_course/tophat_cufflinks/diff_out .
ls diff_out
|
...
Code Block |
---|
|
cds
cd diff_out
ls -l
-rwxr-x--- 1 daras G-801020 2691192 Aug 21 12:20 isoform_exp.diff : Differential expression testing for transcripts
-rwxr-x--- 1 daras G-801020 1483520 Aug 21 12:20 gene_exp.diff : Differential expression testing for genes
-rwxr-x--- 1 daras G-801020 1729831 Aug 21 12:20 tss_group_exp.diff: Differential expression testing for primary transcripts
-rwxr-x--- 1 daras G-801020 1369451 Aug 21 12:20 cds_exp.diff : Differential expression testing for coding sequences
-rwxr-x--- 1 daras G-801020 3277177 Aug 21 12:20 isoforms.fpkm_tracking
-rwxr-x--- 1 daras G-801020 1628659 Aug 21 12:20 genes.fpkm_tracking
-rwxr-x--- 1 daras G-801020 1885773 Aug 21 12:20 tss_groups.fpkm_tracking
-rwxr-x--- 1 daras G-801020 1477492 Aug 21 12:20 cds.fpkm_tracking
-rwxr-x--- 1 daras G-801020 1349574 Aug 21 12:20 splicing.diff : Differential splicing tests
-rwxr-x--- 1 daras G-801020 1158560 Aug 21 12:20 promoters.diff : Differential promoter usage
-rwxr-x--- 1 daras G-801020 919690 Aug 21 12:20 cds.diff : Differential coding output.
|
Here is a basic command useful for parsing/sorting the gene_exp.diff
or isoform_exp.diff
files:
Code Block |
---|
title | Linux one-liner for sorting cuffdiff output by log2 fold-change values |
---|
|
cat isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head
|
...
Expand |
---|
|
Code Block |
---|
title | One-line command to get 10 most up regulated genes |
---|
|
cat gene_exp.diff |grep 'yes'|sort -k10nr,10|head
|
Code Block |
---|
title | One-line command to get 10 most down regulated genes |
---|
|
cat gene_exp.diff |grep 'yes'|sort -k10n,10|head
|
|
...
Expand |
---|
|
Code Block |
---|
title | One-line command to get 10 most up-regulated isoforms |
---|
|
cat isoform_exp.diff |grep 'yes'|sort -k10nr,10|head
|
|
...
A) First load R and enter R environment
Code Block |
---|
module load R
R
|
B) Within R environment, set up cummeRbund
Code Block |
---|
source("http://bioconductor.org/biocLite.R")
biocLite("cummeRbund")
|
C) Load cummeRbund library and read in the differential expression results. If you save and exit the R environment and return, these commands must be executed again.
Code Block |
---|
library(cummeRbund)
cuff_data <- readCufflinks('diff_out')
|
...
NOTE: Any graphic outputs will be automatically saved as "Rplots.pdf" which can create problems when you want to create multiple plots with different names in the same process. To save different plots with different names, preface any of the commands below with the command:
Code Block |
---|
pdf(file="myPlotName.pdf")
|
And after executing the necessary commands, add the line:
Thus, to use the csScatter command and save the results in a file called scatterplot.pdf, one would execute the following commands:
Code Block |
---|
pdf(file="scatterplot.pdf")
csScatter(genes(cuff_data), 'C1', 'C2')
dev.off()
|
Code Block |
---|
title | To pull out significantly differentially expressed genes and isoforms |
---|
|
gene_diff_data <- diffData(genes(cuff_data))
sig_gene_data <- subset(gene_diff_data, (significant == 'yes'))
nrow(sig_gene_data)
isoform_diff_data <-diffData(isoforms(cuff_data))
sig_isoform_data <- subset(isoform_diff_data, (significant == 'yes'))
nrow(sig_isoform_data)
|
Code Block |
---|
title | To draw a scatterplot |
---|
|
csScatter(genes(cuff_data), 'C1', 'C2')
|
Code Block |
---|
title | To plot gene level and isoform level expression for gene regucalcin |
---|
|
mygene1 <- getGene(cuff_data,'regucalcin')
expressionBarplot(mygene1)
expressionBarplot(isoforms(mygene1))
|
Code Block |
---|
title | To plot gene level and isoform level expression for gene Rala |
---|
|
mygene2 <- getGene(cuff_data, 'Rala')
expressionBarplot(mygene2)
expressionBarplot(isoforms(mygene2))
|
...
Expand |
---|
|
Code Block |
---|
title | R command to generate box plot of gene level fpkms |
---|
|
csBoxplot(genes(cuff_data))
|
Code Block |
---|
title | R command to generate box plot of isoform level fpkms |
---|
|
csBoxplot(isoforms(cuff_data))
|
|
...
Expand |
---|
|
Code Block |
---|
title | One possible solution |
---|
|
gene_diff_data <- diffData(genes(cuff_data))
sig_gene_data <- subset(gene_diff_data, (ln_fold_change > 1.5))
head(sig_gene_data)
sig_geneids <- c(sig_gene_data$gene_id)
myGenes <- getGenes(cuff_data, sig_geneids)
expressionBarplot(myGenes)
|
|
...