Running the new tuxedo suite


The new tuxedo suite consists of three tools for transcript-level analysis of RNA-Seq data:

  • HISAT2:  Spliced alignment of reads to the genome
  • Stringtie:  Assembly of transcripts based on mapping to genome, including novel transcripts. Quantification of these transcripts.
  • Ballgown: Identification of differentially expressed genes and transcripts.


Get set up for the exercise
cds
cd my_rnaseq_course
cd day_3_partB/stringtie_ballgown 

1.Stringtie to assemble the transcripts

Transcript assembly = assembly of mapped reads into transcriptional units.

 Why?

• Define a precise map of all transcripts expressed in a sample.
• How does our transcriptome look in comparison to the known transcriptome?
• Look for novel transcripts between conditions/samples.
• Look for differences in expression for these novel transcripts between conditions/samples.

After assembly, stringtie calculates abundance for these assembled transcripts. These values are normalized using FPKM (Fragments Per Kilobase of Exon Per Million)  (variation of RPKM)
INPUT:  BAM files containing reads mapped to the genome (in our case, we'll use hisat mapping output)

COMMAND: Construct the stringtie command (defaults are fine) by looking at the general syntax and the usage statement.  Either run one stringtie command or run all 6.

stringtie [options] <inputbam>

Some of the important options:

 -p  - number of threads

  -G/--GTF-guide  (both annotated and novel transcripts)

  -o  - output file name

  -l   - common label for every transcript assembled

-p brings up a key concept when submitting jobs on TACC: wayness. Go back to Submitting Jobs to Lonestar6 for more.

OUTPUT: Each stringtie command generates a new gtf file, so you will have one for each sample.

Look at a stringtie generated GTF file
head results/C1_R1.transcripts.gtf 

2. Stringtie merge to merge assemblies of all samples

Stringtie merge is like a metaassembler which collects the transcripts assembled in each sample and generates one gtf file containing all of them. This may cause some transcripts to be extended upon etc.

INPUT:  A file indicating the location of all the sample gtf files that need to be merged will be the input to stringtie merge. That is already provided to you and its called mergelist.

COMMAND: Construct the stringtie merge command (defaults are fine) by looking at the the usage statement- Type stringtie to see usage information for stringtie merge at the bottom

OUTPUT: One gtf file containing transcripts from every sample.

Look at merged GTF file
head results/stringtie_merged.gtf  


3. Compare the newly assembled transcriptome to the existing transcriptome

gffcompare will let us compare the transcriptome that is representative of our samples to the annotated transcriptome. This lets us identify transcripts that are different from the annotation and therefore potentially novel.

COMMAND: 

Gffcompare command
gffcompare -r ../reference/genes.formatted.gtf -R -o stringtie_merged results/stringtie_merged.gtf


OUTPUT: One gtf file containing transcripts from every sample.

Look at gffcompare results
less results/stringtie_merged.stats
head results/stringtie_merged.annotated.gtf

CLASS CODES

PriorityCodeDescription
1=Complete match of intron chain
2cContained
3jPotentially novel isoform (fragment): at least one splice junction is shared with a reference transcript
4eSingle exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment.
5iA transfrag falling entirely within a reference intron
6oGeneric exonic overlap with a reference transcript
7pPossible polymerase run-on fragment (within 2Kbases of a reference transcript)
8rRepeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case
9uUnknown, intergenic transcript
10xExonic overlap with reference on the opposite strand
11sAn intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors
Pull out top 10 genes with potentially novel (class code j) transcripts
Construct a unix command to pull out top 10 genes with potentially novel (class code j) transcripts
Hint: Use a combination of cut and grep (looking for class_code "j")


4. Redo stringtie abundance calculations using the newly merged transcripts. 

This needs to be done to generate count files for each sample, that can then be used by ballgown R package for differential expression analysis.

Commands to recalculate abundance
stringtie -e -p 8 -B -G results/stringtie_merged.gtf -o ballgown/C1_R1/C1_R1_ballgown.gtf ../hisat_results/C1_R1.bam

stringtie -e -p 8 -B -G results/stringtie_merged.gtf -o ballgown/C1_R2/C1_R2_ballgown.gtf ../hisat_results/C1_R2.bam

stringtie -e -p 8 -B -G results/stringtie_merged.gtf -o ballgown/C1_R3/C1_R3_ballgown.gtf ../hisat_results/C1_R3.bam

stringtie -e -p 8 -B -G results/stringtie_merged.gtf -o ballgown/C2_R3/C2_R3_ballgown.gtf ../hisat_results/C2_R3.bam

stringtie -e -p 8 -B -G results/stringtie_merged.gtf -o ballgown/C2_R2/C2_R2_ballgown.gtf ../hisat_results/C2_R2.bam

stringtie -e -p 8 -B -G results/stringtie_merged.gtf -o ballgown/C2_R1/C2_R1_ballgown.gtf ../hisat_results/C2_R1.bam

OUTPUT: This generates a directory per sample within the ballgown directory. Each directory contains count files with counts for every transcript we assembled.

stringtie e results
ls ballgown
ls ballgown/C1_R1


5. Run ballgown R package to identify differentially expressed transcripts or genes, using our newly assembled transcriptome (gtf file).

INPUT:  Transcript counts for each sample, merged gtf file representing all the transcripts assembled in our samples.


Ballgown R script
#OPEN R and run the following within the R console:
#install ballgown bioconductor package
source("http://bioconductor.org/biocLite.R")
biocLite("ballgown") 


#load libraries
library(ballgown)
library(dplyr)
library(genefilter)
library(devtools)

#load csv file with information about samples
pheno_data = read.csv("phenotype.csv")
pheno_data

#load all the count files for each sample from the ballgown directory
bg <- ballgown(dataDir = "ballgown", samplePattern = "C", pData=pheno_data)
bg

#filter all genes with a variance of less than 1
bg_filt = subset(bg,"rowVars(texpr(bg)) >1",genomesubset=TRUE)

#run stat test looking for differences based on condition- both in transcript and gene level
results_transcripts = stattest(bg, feature="transcript",covariate="Condition",getFC=TRUE, meas="FPKM")
results_genes = stattest(bg, feature="gene",covariate="Condition",getFC=TRUE, meas="FPKM")

#sort by pvalue and write to file
results_transcripts = arrange(results_transcripts,pval)
results_genes = arrange(results_genes,pval)
write.csv(results_transcripts, "transcript_results.csv", row.names=FALSE)
write.csv(results_genes, "gene_results.csv", row.names=FALSE)

#boxplot of fpkm values across samples
fpkm = texpr(bg,meas="FPKM")
fpkm = log2(fpkm+1)
boxplot(fpkm,col=as.numeric(pheno_data$Condition),las=2,ylab='log2(FPKM+1)')
save.image(file="ballgown.Rdata")
savehistory(file="ballgown.history")


OUTPUT: A file containing fold changes, p values and qvalues of every gene and every transcript from our assembly. 

Parse ballgown results
head results/gene_results.csv
 
#DEGS by pvalue
awk -F ',' '{if ((($3>=1)||($3<=-1))&&($4<=0.05)) print }' results/gene_results.csv |wc -l
#DEGS by qvalue
awk -F ',' '{if ((($3>=1)||($3<=-1))&&($5<=0.05)) print }' results/gene_results.csv |wc -l
Parse ballgown results
head results/transcript_results.csv
 
#DEGS by pvalue
awk -F ',' '{if ((($3>=1)||($3<=-1))&&($4<=0.05)) print }' results/transcript_results.csv |wc -l
#DEGS by qvalue
awk -F ',' '{if ((($3>=1)||($3<=-1))&&($5<=0.05)) print }' results/transcript_results.csv |wc -l

Ballgown can also be used to generate many plots of your difference expression results. The most interesting plot is one showing the structure and level of expression of different transcripts belonging the same gene. Example is below.

BACK TO COURSE OUTLINE