Day 2 Take Away Points

Let's recap what we learned yesterday:

We looked at finding differentially expressed genes when we are not interested in novel genes.

Spliced mapping: TopHat. We used TopHat to map reads from two conditions, C1 and C2, to our genome.
1. Spliced mapping is better suited to RNA-seq data, because reads derived from mature mRNA can span exon-exon junctions.
2. Spliced alignments differ from unspliced alignments in their CIGAR strings: the intron skipped by a read appears as an "N" operation.
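To make the "N" point concrete, here is a minimal sketch that parses a CIGAR string and checks whether the alignment is spliced. The two CIGAR strings are made up for illustration and the helper names (`parse_cigar`, `is_spliced`, `reference_span`) are our own, not part of TopHat or samtools.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]

def is_spliced(cigar):
    """True if the alignment skips reference bases (an 'N' operation)."""
    return any(op == "N" for _, op in parse_cigar(cigar))

def reference_span(cigar):
    """Number of reference bases covered, including skipped introns."""
    # M, D, N, = and X consume reference bases; I, S, H and P do not.
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")

# Hypothetical CIGAR strings, not taken from the course data.
unspliced = "50M"
spliced = "20M1500N30M"

for cigar in (unspliced, spliced):
    print(cigar, is_spliced(cigar), reference_span(cigar))
```

The spliced read still contains only 50 aligned bases, but it spans a much longer stretch of the genome because the "N" block covers the intron.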
How do we convert mapping results (spliced or unspliced) to gene counts? We looked at two tools that count reads overlapping known genes.
1. Inputs: mapping output (SAM or BAM file) and annotation file (GFF/GTF file).
2. bedtools: for simple counting. Any time a read overlaps a gene, it is counted towards that gene.
3. HTSeq: for fine-tuned counting. You can choose how to count reads that map only partially to a gene, that map to multiple genes, etc.
  Output File Example
```
FBgn0000008 304 311 273 264 296 296
FBgn0000014 47 40 39 36 63 43
FBgn0000015 41 35 28 22 35 35
```
4. Output: gene ID, followed by the raw counts for that gene.
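The difference between the two counting styles can be sketched with a toy example. This is only an illustration of the idea, assuming made-up gene and read coordinates on a single chromosome. It is not the actual algorithm of bedtools or HTSeq, and the function names are hypothetical.

```python
# Hypothetical gene intervals (gene_id, start, end), 1-based and inclusive.
# geneA and geneB overlap each other between positions 450 and 500.
genes = [("geneA", 100, 500), ("geneB", 450, 900), ("geneC", 2000, 2500)]

# Hypothetical read alignments as (start, end) intervals.
reads = [(120, 170), (460, 510), (880, 930), (1000, 1050), (2100, 2150)]

def overlapping_genes(read, genes):
    """Return the ids of all genes sharing at least one base with the read."""
    r_start, r_end = read
    return [gid for gid, g_start, g_end in genes
            if r_start <= g_end and g_start <= r_end]

def count_any_overlap(reads, genes):
    """Simple counting: every gene a read overlaps gets one count."""
    counts = {gid: 0 for gid, _, _ in genes}
    for read in reads:
        for gid in overlapping_genes(read, genes):
            counts[gid] += 1
    return counts

def count_unambiguous(reads, genes):
    """Stricter counting: a read counts only if it hits exactly one gene."""
    counts = {gid: 0 for gid, _, _ in genes}
    for read in reads:
        hits = overlapping_genes(read, genes)
        if len(hits) == 1:
            counts[hits[0]] += 1
    return counts

print("any overlap:", count_any_overlap(reads, genes))
print("unambiguous:", count_unambiguous(reads, genes))
```

The read at 460-510 overlaps both geneA and geneB, so it is counted twice under simple counting and dropped under the stricter rule. The read at 1000-1050 overlaps no gene and is never counted.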
How do we take the gene counts for the different conditions and compare them to identify genes that are differentially expressed?
1. Many R packages do this; DESeq2 and edgeR are the most commonly used.
2. General workflow: normalize the counts, estimate the variance, run a statistical test, and output the genes along with fold change, p-value and FDR.
3. DESeq2 run: we read in the HTSeq output, specified the design (i.e. the conditions we want to compare and the levels within them), made a DESeq2 object, ran the negative binomial test, and got back a CSV file with log2 fold changes, p-values and adjusted p-values for each gene in our input list.
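To get a feel for the normalization and fold-change steps, here is a rough sketch that uses the three genes from the output example above. It computes per-sample size factors with the median-of-ratios approach that DESeq2 uses for normalization, then a plain log2 ratio of the normalized group means. It is only an illustration: it assumes the first three count columns are C1 and the last three are C2 (the recap does not say), it uses far too few genes for a meaningful size factor, and it performs no variance estimation or statistical test. The real analysis is done in R with DESeq2.

```python
import math
from statistics import median

# Counts copied from the output file example.
counts = {
    "FBgn0000008": [304, 311, 273, 264, 296, 296],
    "FBgn0000014": [47, 40, 39, 36, 63, 43],
    "FBgn0000015": [41, 35, 28, 22, 35, 35],
}
# ASSUMPTION: column-to-condition assignment is not given in the recap.
conditions = ["C1", "C1", "C1", "C2", "C2", "C2"]

def size_factors(counts):
    """Median-of-ratios size factors, one per sample."""
    # Genes with a zero count in any sample are skipped (log undefined).
    usable = [row for row in counts.values() if all(c > 0 for c in row)]
    # Log of each gene's geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / len(row) for row in usable]
    factors = []
    for j in range(len(usable[0])):
        log_ratios = [math.log(row[j]) - m
                      for row, m in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

def log2_fold_change(row, factors, conditions, numerator="C2", denominator="C1"):
    """log2 of the ratio of normalized mean counts between two conditions."""
    normalized = [c / f for c, f in zip(row, factors)]

    def mean_of(label):
        values = [v for v, cond in zip(normalized, conditions) if cond == label]
        return sum(values) / len(values)

    return math.log2(mean_of(numerator) / mean_of(denominator))

factors = size_factors(counts)
print("size factors:", [round(f, 3) for f in factors])
for gene, row in counts.items():
    print(gene, round(log2_fold_change(row, factors, conditions), 3))
```

Dividing by the size factors corrects for differences in sequencing depth between samples, so that a gene does not look up-regulated merely because one library was sequenced more deeply.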
We learned some Unix as well!
1. Very useful commands like sed, grep, awk, cut and wc.