Day 3 takeaways

 

We looked at the parts of the RNA-Seq analysis pipeline that come after mapping: gene/transcript quantification and finding differentially expressed genes, for cases where we are not interested in novel genes. We also looked at the tuxedo pipeline, for cases where we are interested in finding genes and transcripts not already in our annotated transcriptome.

 

PART A

 1. Gene counting/transcript counting/quantification/abundance

    • All of these terms mean the same thing: getting gene/transcript counts from the mapped results.
    • There are naive ways of doing this (bedtools) and more sophisticated ways (htseq).
    • If you use kallisto (for "mapping"), it already gives you transcript counts, so you will not need to do this step.
    • Output is a count table for each sample in your study.
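As a rough illustration of what "naive" counting means, the sketch below counts every read that overlaps a gene interval. The gene coordinates, read coordinates, and the function name count_reads_per_gene are made up for this example; this is not how bedtools or htseq are implemented.

```python
def count_reads_per_gene(genes, reads):
    """Naive overlap counting (illustrative only).

    genes: dict mapping gene id -> (chrom, start, end), half-open intervals
    reads: iterable of (chrom, start, end), half-open intervals
    A read overlapping two genes is counted once for each of them.
    """
    counts = {gene_id: 0 for gene_id in genes}
    for chrom, r_start, r_end in reads:
        for gene_id, (g_chrom, g_start, g_end) in genes.items():
            # Two half-open intervals overlap if each starts before the other ends.
            if chrom == g_chrom and r_start < g_end and r_end > g_start:
                counts[gene_id] += 1
    return counts


# Hypothetical toy data: two overlapping genes on chr1 and four mapped reads.
genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 180, 300)}
reads = [
    ("chr1", 90, 140),   # overlaps geneA only
    ("chr1", 190, 240),  # overlaps both genes, so it is counted twice
    ("chr2", 100, 150),  # different chromosome, not counted
    ("chr1", 250, 290),  # overlaps geneB only
]
print(count_reads_per_gene(genes, reads))  # {'geneA': 2, 'geneB': 2}
```

Note how the second read is counted for both genes. Handling such ambiguous reads more carefully is one of the things a more sophisticated counter has to decide on.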

 2. Differential expression analysis

    • Input is a count table (rows are genes, columns are all your samples).
    • Counts are normalized so that samples can be compared.
    • Statistical testing using count data to identify genes whose expression profiles vary significantly between conditions.
    • Output is again a table with genes as rows. For every gene, you will now have a log2 fold change, a p-value, and an adjusted p-value.
    • Typically, you impose cutoffs on fold change and adjusted p-value to identify DEGs (differentially expressed genes).
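The last two points can be sketched in a few lines. The example below assumes the adjusted p-value is a Benjamini-Hochberg correction (the notes above do not say which method is used), and the gene names, values, and cutoffs (|log2 fold change| >= 1, adjusted p-value < 0.05) are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def select_degs(results, lfc_cutoff=1.0, padj_cutoff=0.05):
    """results: list of (gene, log2_fold_change, pvalue) tuples.

    Returns the genes passing both the fold-change and adjusted p-value cutoffs.
    """
    padj = benjamini_hochberg([p for _, _, p in results])
    return [
        gene
        for (gene, lfc, _), q in zip(results, padj)
        if abs(lfc) >= lfc_cutoff and q < padj_cutoff
    ]


# Hypothetical results table: (gene, log2 fold change, raw p-value).
results = [
    ("g1", 2.5, 0.001),
    ("g2", 0.3, 0.002),   # significant, but the fold change is too small
    ("g3", -1.8, 0.01),
    ("g4", 1.2, 0.2),     # large fold change, but not significant
]
print(select_degs(results))  # ['g1', 'g3']
```

In practice the differential expression package produces the adjusted p-values for you; only the final filtering step is usually done by hand.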

PART B

Use the new tuxedo pipeline to assemble transcripts representative of your samples and identify DEGs among them. The new tuxedo pipeline is much faster than the previous one. Unfortunately, it has many steps, and most of them are sequential: the output of one step becomes the input of the next.

    • HISAT to map the raw data to the genome.
    • StringTie to assemble and quantify transcripts using the mapped output.
    • StringTie merge to merge the transcripts for all samples into one representative transcriptome (gtf/gff file).
    • gffcompare to compare the newly assembled transcripts to annotated transcripts, in an attempt to identify potentially novel ones.
    • StringTie again, run in -e mode to recalculate quantification of transcripts in the newly merged transcriptome. Output is a count table for every sample.
    • Ballgown to run statistical testing on the count data to identify differentially expressed genes/transcripts.
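To make the order of the steps concrete, here is a minimal sketch that only builds the command lines (it does not run anything). All file names, the index name, and the output layout are hypothetical. The samtools sort step is an assumption added here because StringTie expects coordinate-sorted alignments; it is not one of the steps listed above. Ballgown is an R package, so that final step is not shown.

```python
def per_sample_commands(sample, index, annotation, reads1, reads2):
    """Map, sort, and assemble one paired-end sample (file names are hypothetical)."""
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    return [
        # Map reads to the genome; --dta tailors alignments for transcript assemblers.
        ["hisat2", "--dta", "-x", index, "-1", reads1, "-2", reads2, "-S", sam],
        # Assumed extra step: coordinate-sort the alignments for StringTie.
        ["samtools", "sort", "-o", bam, sam],
        # Assemble and quantify transcripts, guided by the reference annotation.
        ["stringtie", bam, "-G", annotation, "-o", f"{sample}.gtf"],
    ]


def merge_commands(annotation, gtf_list_file, merged="merged.gtf"):
    """Merge per-sample assemblies, then compare them to the annotation."""
    return [
        ["stringtie", "--merge", "-G", annotation, "-o", merged, gtf_list_file],
        ["gffcompare", "-r", annotation, "-o", "compared", merged],
    ]


def requantify_command(sample, merged="merged.gtf"):
    """Re-estimate abundances against the merged transcriptome (-e);
    -B additionally writes the table files that Ballgown reads."""
    out = f"ballgown/{sample}/{sample}.gtf"
    return ["stringtie", "-e", "-B", "-G", merged, "-o", out, f"{sample}.sorted.bam"]


if __name__ == "__main__":
    samples = ["ctrl_1", "treat_1"]  # hypothetical sample names
    for s in samples:
        for cmd in per_sample_commands(s, "genome_index", "genes.gtf",
                                       f"{s}_R1.fastq.gz", f"{s}_R2.fastq.gz"):
            print(" ".join(cmd))
    for cmd in merge_commands("genes.gtf", "mergelist.txt"):
        print(" ".join(cmd))
    for s in samples:
        print(" ".join(requantify_command(s)))
```

The per-sample steps can run independently for each sample, but the merge has to wait for all of them, and the -e re-quantification has to wait for the merge. That dependency is what makes the pipeline largely sequential.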

 
