NGS RNA-seq exploration short course

This brief tutorial will walk you through data analysis of an RNA-seq experiment.

In this experiment, E. coli was inoculated into culture, and the culture was then sampled at 4 hours and 24 hours post-inoculation. The experiment was run in triplicate.

RNA was extracted from the 6 samples, fragmented, and sequenced. All sequencing runs were of the paired-end 2x100 type, so each RNA fragment is read from both ends, 100 bp from each end.

Here is a table showing the data we have:

Sample	Condition	Replicate	Sequencing Runs	Data Files
MURI_17	4 hr	1	SA13172	MURI_17_SA13172_ATGTCA_L007
MURI_26	4 hr	2	SA14027	MURI_26_SA14027_TTAGGC_L006
MURI_98	4 hr	3	SA14008	MURI_98_SA14008_TTAGGC_L005, MURI_98_SA14008_TTAGGC_L006
MURI_21	24 hr	1	SA13172	MURI_21_SA13172_GTGGCC_L007
MURI_30	24 hr	2	SA14027	MURI_30_SA14027_CAGATC_L006
MURI_102	24 hr	3	SA14008, SA14032	MURI_102_SA14008_CAGATC_L005, MURI_102_SA14008_CAGATC_L006, MURI_102_SA14032_CAGATC_L006
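The data file names in the table appear to follow a sample_run_barcode_lane layout. As a minimal Python sketch, assuming that layout holds (the field names `barcode` and `lane` are an interpretation, not something stated above), the files can be grouped by sample to see which samples have more than one data file:

```python
import re
from collections import defaultdict

# Data file names copied from the table above.
DATA_FILES = [
    "MURI_17_SA13172_ATGTCA_L007",
    "MURI_26_SA14027_TTAGGC_L006",
    "MURI_98_SA14008_TTAGGC_L005",
    "MURI_98_SA14008_TTAGGC_L006",
    "MURI_21_SA13172_GTGGCC_L007",
    "MURI_30_SA14027_CAGATC_L006",
    "MURI_102_SA14008_CAGATC_L005",
    "MURI_102_SA14008_CAGATC_L006",
    "MURI_102_SA14032_CAGATC_L006",
]

# Assumed layout: <sample>_<run>_<barcode>_<lane>
NAME_PATTERN = re.compile(r"^(MURI_\d+)_(SA\d+)_([ACGT]+)_(L\d+)$")


def group_by_sample(names):
    """Return {sample: [(run, barcode, lane), ...]} for the given file names."""
    groups = defaultdict(list)
    for name in names:
        match = NAME_PATTERN.match(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        sample, run, barcode, lane = match.groups()
        groups[sample].append((run, barcode, lane))
    return dict(groups)


groups = group_by_sample(DATA_FILES)
# Samples with more than one data file (sorted as plain strings).
multi_file_samples = sorted(s for s, files in groups.items() if len(files) > 1)
print(multi_file_samples)
```

Only the third replicate of each condition has more than one data file, which is the situation the homework asks about.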

In class, we will explore and characterize the raw data. Below are some of the programs and techniques we may use; you will need some of them for the homework:

Assessing the raw data...

Is this E. coli?

Is this E. coli RNA?

Does this look like RNA-seq data?

Assessing the quality of the raw data:
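One simple quality check is the mean base quality at each read position. The following is a minimal Python sketch, not necessarily the program used in class; it assumes 4-line FASTQ records with Phred+33 quality encoding, and the two reads in the example are made up.

```python
import io


def mean_quality_per_position(handle, offset=33):
    """Mean Phred score at each read position of a FASTQ stream.

    Assumes 4-line FASTQ records and Phred+33 quality encoding.
    """
    totals, counts = [], []
    for line_number, line in enumerate(handle):
        if line_number % 4 != 3:  # the quality string is the 4th line of each record
            continue
        for pos, char in enumerate(line.rstrip("\n")):
            if pos == len(totals):  # first time we see a read this long
                totals.append(0)
                counts.append(0)
            totals[pos] += ord(char) - offset
            counts[pos] += 1
    return [t / c for t, c in zip(totals, counts)]


# Two made-up reads, for illustration only.
example = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGT\n+\nII5!\n"
)
means = mean_quality_per_position(example)
print(means)  # [40.0, 40.0, 30.0, 20.0]
```

For a real file, the same function can be given an open file handle instead of the `io.StringIO` example. A drop in mean quality toward the end of the reads is a common pattern worth looking for.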

Mapping data to a reference genome...

To map a pair of data files to the reference E. coli genome:
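The mapping program is not named here, so the following is only a sketch under an assumption: it builds (but does not run) a paired-end alignment command for Bowtie 2, one commonly used short-read aligner. The index prefix and file names are hypothetical placeholders, and the index is assumed to have been built beforehand with `bowtie2-build`.

```python
import shlex


def bowtie2_command(index_prefix, reads_1, reads_2, sam_out, threads=1):
    """Build (but do not run) a Bowtie 2 paired-end alignment command."""
    return [
        "bowtie2",
        "-p", str(threads),  # number of alignment threads
        "-x", index_prefix,  # prefix of an index built with bowtie2-build
        "-1", reads_1,       # file with the first mate of each pair
        "-2", reads_2,       # file with the second mate of each pair
        "-S", sam_out,       # SAM output file
    ]


# All paths below are hypothetical placeholders.
cmd = bowtie2_command("ecoli_index", "sample_R1.fastq", "sample_R2.fastq", "sample.sam")
print(shlex.join(cmd))
```

If Bowtie 2 is installed, the list could be passed to `subprocess.run(cmd, check=True)` to perform the alignment.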

Assessing the read mapping...

Assessing the read mapping:
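A basic measure of mapping success is the fraction of reads that mapped. As a minimal sketch, assuming the mapper wrote its output in SAM format, the FLAG column can be used to count mapped reads; the example records below are made up.

```python
import io

UNMAPPED = 0x4         # SAM flag bit: read is unmapped
SECONDARY = 0x100      # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment


def mapping_summary(handle):
    """Return (total, mapped) counts of primary records in a SAM stream."""
    total = mapped = 0
    for line in handle:
        if line.startswith("@"):  # header lines
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second column
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # skip extra alignments so each read is counted once
        total += 1
        if not (flag & UNMAPPED):
            mapped += 1
    return total, mapped


# Made-up SAM records: one mapped pair (flags 99/147), one unmapped pair (77/141).
example_sam = io.StringIO(
    "@HD\tVN:1.6\n"
    "r1\t99\tchr\t100\t42\t4M\t=\t200\t104\tACGT\tIIII\n"
    "r1\t147\tchr\t200\t42\t4M\t=\t100\t-104\tACGT\tIIII\n"
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
    "r2\t141\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
)
total, mapped = mapping_summary(example_sam)
print(f"{mapped} of {total} reads mapped")
```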

Counting reads per gene... (the result is also called "count" data)

Counting the number of sequence reads within a gene, for all genes in the genome:
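The idea behind read counting can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the program used in class: it assigns a read to a gene when the read's start position falls inside the gene, and it ignores strand, read length, and mate pairing. The gene names and coordinates are hypothetical.

```python
def count_reads_per_gene(genes, read_positions):
    """Count reads whose start position falls inside each gene.

    genes: list of (name, start, end) with 1-based inclusive coordinates.
    read_positions: iterable of 1-based read start positions.
    """
    counts = {name: 0 for name, _, _ in genes}
    for pos in read_positions:
        for name, start, end in genes:
            if start <= pos <= end:
                counts[name] += 1
    return counts


# Hypothetical genes and read start positions.
genes = [("geneA", 100, 500), ("geneB", 800, 1200)]
reads = [120, 450, 500, 700, 810, 1300]
gene_counts = count_reads_per_gene(genes, reads)
print(gene_counts)  # {'geneA': 3, 'geneB': 1}
```

Reads at positions 700 and 1300 fall outside both genes and are not counted, which is one reason the total of the count data is smaller than the number of mapped reads.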

Analyze the count data

Normalize (only for mapped reads in this case)
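One common way to normalize is to scale each sample's counts by its number of mapped reads, giving counts per million mapped reads. The sketch below assumes that interpretation of the step above; the sample names, counts, and totals are made up.

```python
def counts_per_million(counts, mapped_totals):
    """Scale each sample's gene counts to reads per million mapped reads.

    counts: {sample: {gene: raw_count}}
    mapped_totals: {sample: number_of_mapped_reads}
    """
    return {
        sample: {
            gene: n * 1_000_000 / mapped_totals[sample]
            for gene, n in gene_counts.items()
        }
        for sample, gene_counts in counts.items()
    }


# Made-up example: s2 was sequenced four times as deeply as s1.
raw = {"s1": {"geneA": 10, "geneB": 30}, "s2": {"geneA": 40, "geneB": 120}}
mapped_reads = {"s1": 1_000_000, "s2": 4_000_000}
cpm = counts_per_million(raw, mapped_reads)
print(cpm)
```

After scaling, the two samples have identical values, showing that the fourfold difference in the raw counts came from sequencing depth alone.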

Check Principal Component Analysis (PCA) and box plots
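To show what PCA computes, here is a minimal sketch that scores each sample on the first principal component using only the Python standard library (power iteration on the sample-by-sample Gram matrix). In practice a statistics package would normally be used; the expression values below are made up, with three "early" and three "late" samples.

```python
import math


def first_principal_component(samples, iterations=200):
    """Score each sample on PC1 via power iteration.

    samples: list of equal-length numeric vectors, one per sample.
    The sign of the returned scores is arbitrary.
    """
    n, m = len(samples), len(samples[0])
    # Center each feature (gene) on its mean across samples.
    means = [sum(row[j] for row in samples) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in samples]
    # Gram matrix X X^T; its top eigenvector gives the PC1 scores up to scale.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered] for r1 in centered]
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-constant start vector
    for _ in range(iterations):
        nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:  # no variation between samples
            return [0.0] * n
        vec = [x / norm for x in nxt]
    # Rayleigh quotient = top eigenvalue = squared singular value.
    eigenvalue = sum(v * sum(g * w for g, w in zip(row, vec)) for v, row in zip(vec, gram))
    return [v * math.sqrt(max(eigenvalue, 0.0)) for v in vec]


# Made-up log-scale expression values: 3 early samples, then 3 late samples.
expression = [
    [5.0, 1.0, 3.0],
    [5.2, 1.1, 2.9],
    [4.9, 0.9, 3.1],
    [1.0, 5.0, 3.0],
    [1.1, 5.2, 3.2],
    [0.9, 4.8, 2.8],
]
pc1_scores = first_principal_component(expression)
print([round(s, 2) for s in pc1_scores])
```

When replicates agree and the conditions differ, the early and late samples land on opposite sides of zero on PC1. A replicate that falls far from the rest of its group is worth a closer look.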

Calculate fold-change

Log transform
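Fold-change and the log transform are often combined into a log2 fold-change, so that a doubling and a halving are the same distance from zero. A minimal sketch follows; the pseudocount of 1 (to avoid taking the log of zero) and the example values are assumptions for illustration.

```python
import math


def log2_fold_change(early, late, pseudocount=1.0):
    """log2 of (mean late / mean early), with a pseudocount to avoid log(0)."""
    mean_early = sum(early) / len(early)
    mean_late = sum(late) / len(late)
    return math.log2((mean_late + pseudocount) / (mean_early + pseudocount))


# Made-up normalized counts for one gene in the three replicates of each condition.
up = log2_fold_change([7, 7, 7], [31, 31, 31])
down = log2_fold_change([31, 31, 31], [7, 7, 7])
print(up, down)  # 2.0 -2.0
```

A fourfold increase and a fourfold decrease give +2 and -2, which is the symmetry that makes the log scale convenient for plotting.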

Volcano plot
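A volcano plot shows each gene as a point with log2 fold-change on the x-axis and -log10 of the p-value on the y-axis. The sketch below only computes the coordinates, assuming the fold-changes and p-values have already been calculated; the cutoffs and the example values are hypothetical.

```python
import math


def volcano_points(results, fc_cutoff=1.0, p_cutoff=0.05):
    """Turn (gene, log2_fold_change, p_value) rows into volcano-plot points.

    Returns (gene, x, y, flagged) tuples, where x is the log2 fold-change,
    y is -log10(p), and flagged marks genes passing both cutoffs.
    p-values must be greater than zero.
    """
    points = []
    for gene, log2_fc, p_value in results:
        flagged = abs(log2_fc) >= fc_cutoff and p_value <= p_cutoff
        points.append((gene, log2_fc, -math.log10(p_value), flagged))
    return points


# Hypothetical values, for illustration only.
results = [("geneA", 2.0, 0.001), ("geneB", 0.2, 0.5), ("geneC", -3.0, 0.01)]
points = volcano_points(results)
for gene, x, y, flagged in points:
    print(gene, x, round(y, 2), flagged)
```

The resulting x and y values can be passed to any plotting library as a scatter plot. Genes with large changes and small p-values end up in the upper left and upper right corners.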

For your homework, you will investigate the validity of combining data files from different sequencing runs. Only a few of these questions require working at a computer keyboard, but I encourage you to work in groups to solve the entire set of questions.

1. Based on what you learned about the t-test (that is, using terms associated with a t-test), explain what criteria you might use to decide that it is "invalid" to combine the multiple raw sequence data files from the same sample.
2. Outline the steps needed to reduce the raw data to numbers suitable for evaluating your criteria from question #1.
3. Perform the steps you outlined in #2 and state whether or not it was valid to combine the data files.
4. Starting with the raw "count" data, explore the effect on PCA of NOT normalizing.