This brief tutorial will walk you through data analysis of an RNA-seq experiment.
In this experiment, E. coli was inoculated into culture and the culture was then sampled at 4 hours and 24 hours post inoculation. The experiment was run in triplicate.
RNA was extracted from the 6 samples, fragmented, and sequenced. All sequencing runs were paired-end 2x100, so each fragment was read from both ends, 100 bp per end.
Here is a table showing the data we have:
Sample | Condition | Replicate | Sequencing Runs | Data Files |
---|---|---|---|---|
MURI_17 | 4 hr | 1 | SA13172 | MURI_17_SA13172_ATGTCA_L007 |
MURI_26 | 4 hr | 2 | SA14027 | MURI_26_SA14027_TTAGGC_L006 |
MURI_98 | 4 hr | 3 | SA14008 | MURI_98_SA14008_TTAGGC_L005, MURI_98_SA14008_TTAGGC_L006 |
MURI_21 | 24 hr | 1 | SA13172 | MURI_21_SA13172_GTGGCC_L007 |
MURI_30 | 24 hr | 2 | SA14027 | MURI_30_SA14027_CAGATC_L006 |
MURI_102 | 24 hr | 3 | SA14008, SA14032 | MURI_102_SA14008_CAGATC_L005, MURI_102_SA14008_CAGATC_L006, MURI_102_SA14032_CAGATC_L006 |
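Before any analysis it can help to hold this table in a small data structure, so that samples with more than one data file are easy to pick out. The following is a minimal sketch, not part of the course materials. The helper names are hypothetical, and reading the last two fields of each file name as a barcode and a lane is an assumption based only on how the names look.

```python
from collections import defaultdict

# Sample sheet transcribed from the table above:
# sample -> (condition, replicate, list of data files)
SAMPLES = {
    "MURI_17": ("4 hr", 1, ["MURI_17_SA13172_ATGTCA_L007"]),
    "MURI_26": ("4 hr", 2, ["MURI_26_SA14027_TTAGGC_L006"]),
    "MURI_98": ("4 hr", 3, ["MURI_98_SA14008_TTAGGC_L005",
                            "MURI_98_SA14008_TTAGGC_L006"]),
    "MURI_21": ("24 hr", 1, ["MURI_21_SA13172_GTGGCC_L007"]),
    "MURI_30": ("24 hr", 2, ["MURI_30_SA14027_CAGATC_L006"]),
    "MURI_102": ("24 hr", 3, ["MURI_102_SA14008_CAGATC_L005",
                              "MURI_102_SA14008_CAGATC_L006",
                              "MURI_102_SA14032_CAGATC_L006"]),
}


def parse_file_name(name):
    """Split a data file name into its underscore-separated fields.

    Assumption: names follow PREFIX_NUMBER_RUN_BARCODE_LANE; the labels
    'barcode' and 'lane' are a guess from the appearance of the names.
    """
    prefix, number, run, barcode, lane = name.split("_")
    return {"sample": f"{prefix}_{number}", "run": run,
            "barcode": barcode, "lane": lane}


def files_by_condition(samples):
    """Group all data files under their condition ('4 hr' or '24 hr')."""
    grouped = defaultdict(list)
    for condition, _replicate, files in samples.values():
        grouped[condition].extend(files)
    return dict(grouped)


def samples_needing_merge(samples):
    """Return the samples that have more than one data file."""
    return sorted(name for name, (_c, _r, files) in samples.items()
                  if len(files) > 1)


if __name__ == "__main__":
    for sample in samples_needing_merge(SAMPLES):
        runs = {parse_file_name(f)["run"] for f in SAMPLES[sample][2]}
        print(sample, "has files from runs:", sorted(runs))
```

Only the two replicate-3 samples have multiple files, and only MURI_102 has files from more than one sequencing run.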
In class, we will explore and characterize the raw data. Here are some elements (programs & techniques) we may use (you will need some of these for the homework):
For your homework, you will investigate the validity of combining data files from different sequencing runs. Only a few of these questions require working at a computer keyboard, but I encourage you to work in groups to solve the entire set of questions.
1. Based on what you learned about the t-test (that is, using terms associated with a t-test), explain what criteria you might use to decide that it is "invalid" to combine the multiple raw sequence data files from the same sample. (5 points)
2. Outline the steps needed to reduce the raw data to numbers suitable for evaluating your criteria from question 1. (5 points)
3. Perform the steps you outlined in question 2 and state whether or not it was valid to combine the data files. (20 points)
4. Starting with the raw "count" data, explore the effect on PCA of NOT normalizing. Turn in a printout of the new PCA plot. (10 points)
5. Although we did not explore this in class, DNA mutations were automatically tallied during our mapping process. These results are in the files ending in ".bcf". Using the tool "bcftools" to view these results, test the hypothesis that transitions are more common than transversions. Support your answer with data from this experiment. (20 points)
6. Continuing with the mutation analysis, examine whether the mutation frequency in this sample set differs between protein-coding and non-protein-coding regions of the genome. Support your answer with data from this experiment. (30 points)
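For the transition/transversion question, the classification itself is easy to state in code. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the input is a plain list of (reference base, alternate base) pairs, and the example data are made up. Extracting those pairs from the ".bcf" files is not shown.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_snv(ref, alt):
    """Classify a single-base substitution.

    Returns 'transition' (purine<->purine or pyrimidine<->pyrimidine),
    'transversion' (purine<->pyrimidine), or None for anything that is
    not a simple A/C/G/T substitution (indels, N, identical bases).
    """
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        return None
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def tally(variants):
    """Count transitions and transversions in (ref, alt) pairs."""
    counts = {"transition": 0, "transversion": 0}
    for ref, alt in variants:
        kind = classify_snv(ref, alt)
        if kind is not None:
            counts[kind] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical toy input, NOT data from this experiment.
    toy_variants = [("A", "G"), ("C", "T"), ("G", "T"), ("A", "AT")]
    print(tally(toy_variants))
```

Of the 12 possible single-base substitutions, 4 are transitions and 8 are transversions, so if every substitution were equally likely you would expect a transition-to-transversion ratio of 0.5. That gives a baseline to compare the observed counts against.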