Running the new tuxedo suite

The new tuxedo suite consists of three tools for transcript-level analysis of RNA-Seq data:

HISAT2: Spliced alignment of reads to the genome
Stringtie: Assembly of transcripts based on mapping to genome, including novel transcripts. Quantification of these transcripts.
Ballgown: Identification of differentially expressed genes and transcripts.

Get set up for the exercise

cds
cd my_rnaseq_course
cd day_3_partB/stringtie_ballgown

1.Stringtie to assemble the transcripts

Transcript assembly = assembly of mapped reads into transcriptional units.

Why?

• Define a precise map of all transcripts expressed in a sample.

• How does our transcriptome look in comparison to the known transcriptome?

• Look for novel transcripts between conditions/samples.

• Look for differences in expression for these novel transcripts between conditions/samples.

After assembly, stringtie calculates abundance for these assembled transcripts. These values are normalized using FPKM (Fragments Per Kilobase of Exon Per Million) (variation of RPKM)

INPUT: BAM files containing reads mapped to the genome (in our case, we'll use hisat mapping output)

COMMAND: Construct the stringtie command (defaults are fine) by looking at the general syntax and the usage statement. Either run one stringtie command or run all 6.

stringtie [options] <inputbam>

Some of the important options:

-p - number of threads

-G/--GTF-guide (both annotated and novel transcripts)

-o - output file name

-l - common label for every transcript assembled

-p brings up a key concept when submitting jobs on TACC: wayness. Go back to Submitting Jobs to Lonestar6 for more.

OUTPUT: Each stringtie command generates a new gtf file, so you will have one for each sample.

Look at a stringtie generated GTF file

head results/C1_R1.transcripts.gtf

2. Stringtie merge to merge assemblies of all samples

Stringtie merge is like a metaassembler which collects the transcripts assembled in each sample and generates one gtf file containing all of them. This may cause some transcripts to be extended upon etc.

INPUT: A file indicating the location of all the sample gtf files that need to be merged will be the input to stringtie merge. That is already provided to you and its called mergelist.

COMMAND: Construct the stringtie merge command (defaults are fine) by looking at the the usage statement- Type stringtie to see usage information for stringtie merge at the bottom.

OUTPUT: One gtf file containing transcripts from every sample.

Look at merged GTF file

head results/stringtie_merged.gtf

3. Compare the newly assembled transcriptome to the existing transcriptome

gffcompare will let us compare the transcriptome that is representative of our samples to the annotated transcriptome. This lets us identify transcripts that are different from the annotation and therefore potentially novel.

COMMAND:

Gffcompare command

gffcompare -r ../reference/genes.formatted.gtf -R -o stringtie_merged results/stringtie_merged.gtf

OUTPUT: One gtf file containing transcripts from every sample.

Look at gffcompare results

less results/stringtie_merged.stats
head results/stringtie_merged.annotated.gtf

CLASS CODES

Priority	Code	Description
1	=	Complete match of intron chain
2	c	Contained
3	j	Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript
4	e	Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment.
5	i	A transfrag falling entirely within a reference intron
6	o	Generic exonic overlap with a reference transcript
7	p	Possible polymerase run-on fragment (within 2Kbases of a reference transcript)
8	r	Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case
9	u	Unknown, intergenic transcript
10	x	Exonic overlap with reference on the opposite strand
11	s	An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors

Pull out top 10 genes with potentially novel (class code j) transcripts

Construct a unix command to pull out top 10 genes with potentially novel (class code j) transcripts
Hint: Use a combination of cut and grep (looking for class_code "j")

4. Redo stringtie abundance calculations using the newly merged transcripts.

This needs to be done to generate count files for each sample, that can then be used by ballgown R package for differential expression analysis.

Commands to recalculate abundance

stringtie -e -p 8 -B -G results/stringtie_merged.gtf -o ballgown/C1_R1/C1_R1_ballgown.gtf ../hisat_results/C1_R1.bam

stringtie -e -p 8 -B -G results/stringtie_merged.gtf -o ballgown/C1_R2/C1_R2_ballgown.gtf ../hisat_results/C1_R2.bam

stringtie -e -p 8 -B -G results/stringtie_merged.gtf -o ballgown/C1_R3/C1_R3_ballgown.gtf ../hisat_results/C1_R3.bam

stringtie -e -p 8 -B -G results/stringtie_merged.gtf -o ballgown/C2_R3/C2_R3_ballgown.gtf ../hisat_results/C2_R3.bam

stringtie -e -p 8 -B -G results/stringtie_merged.gtf -o ballgown/C2_R2/C2_R2_ballgown.gtf ../hisat_results/C2_R2.bam

stringtie -e -p 8 -B -G results/stringtie_merged.gtf -o ballgown/C2_R1/C2_R1_ballgown.gtf ../hisat_results/C2_R1.bam

OUTPUT: This generates a directory per sample within the ballgown directory. Each directory contains count files with counts for every transcript we assembled.

stringtie e results

ls ballgown
ls ballgown/C1_R1

5. Run ballgown R package to identify differentially expressed transcripts or genes, using our newly assembled transcriptome (gtf file).

INPUT: Transcript counts for each sample, merged gtf file representing all the transcripts assembled in our samples.

Ballgown R script

#OPEN R and run the following within the R console:
#install ballgown bioconductor package
source("http://bioconductor.org/biocLite.R")
biocLite("ballgown") 


#load libraries
library(ballgown)
library(dplyr)
library(genefilter)
library(devtools)

#load csv file with information about samples
pheno_data = read.csv("phenotype.csv")
pheno_data

#load all the count files for each sample from the ballgown directory
bg <- ballgown(dataDir = "ballgown", samplePattern = "C", pData=pheno_data)
bg

#filter all genes with a variance of less than 1
bg_filt = subset(bg,"rowVars(texpr(bg)) >1",genomesubset=TRUE)

#run stat test looking for differences based on condition- both in transcript and gene level
results_transcripts = stattest(bg, feature="transcript",covariate="Condition",getFC=TRUE, meas="FPKM")
results_genes = stattest(bg, feature="gene",covariate="Condition",getFC=TRUE, meas="FPKM")

#sort by pvalue and write to file
results_transcripts = arrange(results_transcripts,pval)
results_genes = arrange(results_genes,pval)
write.csv(results_transcripts, "transcript_results.csv", row.names=FALSE)
write.csv(results_genes, "gene_results.csv", row.names=FALSE)

#boxplot of fpkm values across samples
fpkm = texpr(bg,meas="FPKM")
fpkm = log2(fpkm+1)
boxplot(fpkm,col=as.numeric(pheno_data$Condition),las=2,ylab='log2(FPKM+1)')
save.image(file="ballgown.Rdata")
savehistory(file="ballgown.history")

OUTPUT: A file containing fold changes, p values and qvalues of every gene and every transcript from our assembly.

Parse ballgown results

head results/gene_results.csv
 
#DEGS by pvalue
awk -F ',' '{if ((($3>=1)||($3<=-1))&&($4<=0.05)) print }' results/gene_results.csv |wc -l
#DEGS by qvalue
awk -F ',' '{if ((($3>=1)||($3<=-1))&&($5<=0.05)) print }' results/gene_results.csv |wc -l

Parse ballgown results

head results/transcript_results.csv
 
#DEGS by pvalue
awk -F ',' '{if ((($3>=1)||($3<=-1))&&($4<=0.05)) print }' results/transcript_results.csv |wc -l
#DEGS by qvalue
awk -F ',' '{if ((($3>=1)||($3<=-1))&&($5<=0.05)) print }' results/transcript_results.csv |wc -l

Ballgown can also be used to generate many plots of your difference expression results. The most interesting plot is one showing the structure and level of expression of different transcripts belonging the same gene. Example is below.

BACK TO COURSE OUTLINE