Running the new tuxedo suite
The new tuxedo suite consists of three tools for transcript-level analysis of RNA-Seq data:
- HISAT2: Spliced alignment of reads to the genome
- Stringtie: Assembly of transcripts based on mapping to genome, including novel transcripts. Quantification of these transcripts.
- Ballgown: Identification of differentially expressed genes and transcripts.
cds cd my_rnaseq_course cd day_3_partB/stringtie_ballgown
1.Stringtie to assemble the transcripts
Transcript assembly = assembly of mapped reads into transcriptional units.
Why?
COMMAND: Construct the stringtie command (defaults are fine) by looking at the general syntax and the usage statement. Either run one stringtie command or run all 6.
stringtie [options] <inputbam>
Some of the important options:
-p - number of threads
-G/--GTF-guide (both annotated and novel transcripts)
-o - output file name
-l - common label for every transcript assembled
-p brings up a key concept when submitting jobs on TACC: wayness. Go back to Submitting Jobs to Lonestar6 for more.
OUTPUT: Each stringtie command generates a new gtf file, so you will have one for each sample.
head results/C1_R1.transcripts.gtf
2. Stringtie merge to merge assemblies of all samples
Stringtie merge is like a metaassembler which collects the transcripts assembled in each sample and generates one gtf file containing all of them. This may cause some transcripts to be extended upon etc.
INPUT: A file indicating the location of all the sample gtf files that need to be merged will be the input to stringtie merge. That is already provided to you and its called mergelist.
COMMAND: Construct the stringtie merge command (defaults are fine) by looking at the the usage statement- Type stringtie to see usage information for stringtie merge at the bottom.
OUTPUT: One gtf file containing transcripts from every sample.
head results/stringtie_merged.gtf
3. Compare the newly assembled transcriptome to the existing transcriptome
gffcompare will let us compare the transcriptome that is representative of our samples to the annotated transcriptome. This lets us identify transcripts that are different from the annotation and therefore potentially novel.
COMMAND:
gffcompare -r ../reference/genes.formatted.gtf -R -o stringtie_merged results/stringtie_merged.gtf
OUTPUT: One gtf file containing transcripts from every sample.
less results/stringtie_merged.stats head results/stringtie_merged.annotated.gtf
CLASS CODES
Priority | Code | Description | |
1 | = | Complete match of intron chain | |
2 | c | Contained | |
3 | j | Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript | |
4 | e | Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment. | |
5 | i | A transfrag falling entirely within a reference intron | |
6 | o | Generic exonic overlap with a reference transcript | |
7 | p | Possible polymerase run-on fragment (within 2Kbases of a reference transcript) | |
8 | r | Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case | |
9 | u | Unknown, intergenic transcript | |
10 | x | Exonic overlap with reference on the opposite strand | |
11 | s | An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors |
Construct a unix command to pull out top 10 genes with potentially novel (class code j) transcripts Hint: Use a combination of cut and grep (looking for class_code "j")
4. Redo stringtie abundance calculations using the newly merged transcripts.
This needs to be done to generate count files for each sample, that can then be used by ballgown R package for differential expression analysis.
stringtie -e -p 8 -B -G results/stringtie_merged.gtf -o ballgown/C1_R1/C1_R1_ballgown.gtf ../hisat_results/C1_R1.bam stringtie -e -p 8 -B -G results/stringtie_merged.gtf -o ballgown/C1_R2/C1_R2_ballgown.gtf ../hisat_results/C1_R2.bam stringtie -e -p 8 -B -G results/stringtie_merged.gtf -o ballgown/C1_R3/C1_R3_ballgown.gtf ../hisat_results/C1_R3.bam stringtie -e -p 8 -B -G results/stringtie_merged.gtf -o ballgown/C2_R3/C2_R3_ballgown.gtf ../hisat_results/C2_R3.bam stringtie -e -p 8 -B -G results/stringtie_merged.gtf -o ballgown/C2_R2/C2_R2_ballgown.gtf ../hisat_results/C2_R2.bam stringtie -e -p 8 -B -G results/stringtie_merged.gtf -o ballgown/C2_R1/C2_R1_ballgown.gtf ../hisat_results/C2_R1.bam
OUTPUT: This generates a directory per sample within the ballgown directory. Each directory contains count files with counts for every transcript we assembled.
ls ballgown ls ballgown/C1_R1
5. Run ballgown R package to identify differentially expressed transcripts or genes, using our newly assembled transcriptome (gtf file).
INPUT: Transcript counts for each sample, merged gtf file representing all the transcripts assembled in our samples.
#OPEN R and run the following within the R console: #install ballgown bioconductor package source("http://bioconductor.org/biocLite.R") biocLite("ballgown") #load libraries library(ballgown) library(dplyr) library(genefilter) library(devtools) #load csv file with information about samples pheno_data = read.csv("phenotype.csv") pheno_data #load all the count files for each sample from the ballgown directory bg <- ballgown(dataDir = "ballgown", samplePattern = "C", pData=pheno_data) bg #filter all genes with a variance of less than 1 bg_filt = subset(bg,"rowVars(texpr(bg)) >1",genomesubset=TRUE) #run stat test looking for differences based on condition- both in transcript and gene level results_transcripts = stattest(bg, feature="transcript",covariate="Condition",getFC=TRUE, meas="FPKM") results_genes = stattest(bg, feature="gene",covariate="Condition",getFC=TRUE, meas="FPKM") #sort by pvalue and write to file results_transcripts = arrange(results_transcripts,pval) results_genes = arrange(results_genes,pval) write.csv(results_transcripts, "transcript_results.csv", row.names=FALSE) write.csv(results_genes, "gene_results.csv", row.names=FALSE) #boxplot of fpkm values across samples fpkm = texpr(bg,meas="FPKM") fpkm = log2(fpkm+1) boxplot(fpkm,col=as.numeric(pheno_data$Condition),las=2,ylab='log2(FPKM+1)') save.image(file="ballgown.Rdata") savehistory(file="ballgown.history")
OUTPUT: A file containing fold changes, p values and qvalues of every gene and every transcript from our assembly.
head results/gene_results.csv #DEGS by pvalue awk -F ',' '{if ((($3>=1)||($3<=-1))&&($4<=0.05)) print }' results/gene_results.csv |wc -l #DEGS by qvalue awk -F ',' '{if ((($3>=1)||($3<=-1))&&($5<=0.05)) print }' results/gene_results.csv |wc -l
head results/transcript_results.csv #DEGS by pvalue awk -F ',' '{if ((($3>=1)||($3<=-1))&&($4<=0.05)) print }' results/transcript_results.csv |wc -l #DEGS by qvalue awk -F ',' '{if ((($3>=1)||($3<=-1))&&($5<=0.05)) print }' results/transcript_results.csv |wc -l
Ballgown can also be used to generate many plots of your difference expression results. The most interesting plot is one showing the structure and level of expression of different transcripts belonging the same gene. Example is below.
BACK TO COURSE OUTLINE
Welcome to the University Wiki Service! Please use your IID (yourEID@eid.utexas.edu) when prompted for your email address during login or click here to enter your EID. If you are experiencing any issues loading content on pages, please try these steps to clear your browser cache.