Let's recap what we learned yesterday:
We looked at finding differentially expressed genes when we are not interested in novel genes.
- Spliced mapping: Tophat. We used tophat to map reads from two conditions C1 and C2 to our genome.
- Spliced mapping is more conducive for rna-seq data.
- Spliced alignments look different from unspliced alignments in their cigar scores ("N")
- How to convert mapping results (spliced or unspliced) to gene counts? We looked at two tools which count overlaps of reads to known genes.
- Inputs: mapping output (sam or bam file) and annotation file (gff/gtf file)
- Bedtools- for simple counting. Any time a read overlaps a gene, it's counted towards that gene.
HtSeq - for fine tuned counting. You can choose how you want to count reads that map only partially to a gene, that map to multiple genes etc.
Output File ExampleFBgn0000008 304 311 273 264 296 296 FBgn0000014 47 40 39 36 63 43 FBgn0000015 41 35 28 22 35 35
Output: Gene id, following by raw counts for that gene.
How to take the gene counts for different conditions and compare then to identify genes that are differentially expressed?
- Lots of R packages to do this: DESeq2, edgeR are the most commonly used.
- normalize, calculate variance, statistical test, output genes along with fold change, p value, FDR
- DESEQ2 run: read in htseq output, specified the design (i.e. conditions we want to compare against and the levels in the conditions), made a DESEQ object, ran negative binomial test, got back a csv file with log2 fold changes, pvalues and adjusted pvalues for each gene in our input list.
We learned some unix as well!
- Very useful commands like sed, grep, awk, cut and wc.
BACK TO COURSE OUTLINE
- Very useful commands like sed, grep, awk, cut and wc.