...
Sample | Condition | Replicate | Sequencing Runs | Data Files |
---|---|---|---|---|
MURI_17 | 4 hr | 1 | SA13172 | MURI_17_SA13172_ATGTCA_L007 |
MURI_26 | 4 hr | 2 | SA14027 | MURI_26_SA14027_TTAGGC_L006 |
MURI_98 | 4 hr | 3 | SA14008 | MURI_98_SA14008_TTAGGC_L005, MURI_98_SA14008_TTAGGC_L006 |
MURI_21 | 24 hr | 1 | SA13172 | MURI_21_SA13172_GTGGCC_L007 |
MURI_30 | 24 hr | 2 | SA14027 | MURI_30_SA14027_CAGATC_L006 |
MURI_102 | 24 hr | 3 | SA14008, SA14032 | MURI_102_SA14008_CAGATC_L005, MURI_102_SA14008_CAGATC_L006, MURI_102_SA14032_CAGATC_L006 |
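The data file names in the table appear to follow a fixed pattern. As a minimal sketch, assuming each name reads `<sample>_<sequencing run>_<barcode>_<lane>` (the "barcode" and "lane" labels are an interpretation, not stated in the table), the following groups the files by sample and picks out the samples whose data is spread over more than one file. The function names are hypothetical.

```python
from collections import defaultdict

# Data file names copied from the rows of the table above.
DATA_FILES = [
    "MURI_17_SA13172_ATGTCA_L007",
    "MURI_26_SA14027_TTAGGC_L006",
    "MURI_98_SA14008_TTAGGC_L005",
    "MURI_98_SA14008_TTAGGC_L006",
    "MURI_21_SA13172_GTGGCC_L007",
    "MURI_30_SA14027_CAGATC_L006",
    "MURI_102_SA14008_CAGATC_L005",
    "MURI_102_SA14008_CAGATC_L006",
    "MURI_102_SA14032_CAGATC_L006",
]


def parse_data_file(name):
    """Split a file name into fields.

    Assumed layout: <prefix>_<number>_<run>_<barcode>_<lane>.
    """
    prefix, number, run, barcode, lane = name.split("_")
    return {
        "sample": f"{prefix}_{number}",
        "run": run,
        "barcode": barcode,
        "lane": lane,
    }


def files_by_sample(names):
    """Collect the (run, lane) pairs that belong to each sample."""
    grouped = defaultdict(list)
    for name in names:
        fields = parse_data_file(name)
        grouped[fields["sample"]].append((fields["run"], fields["lane"]))
    return dict(grouped)


def multi_file_samples(names):
    """Return only the samples whose data spans more than one file."""
    return {
        sample: parts
        for sample, parts in files_by_sample(names).items()
        if len(parts) > 1
    }


for sample, parts in multi_file_samples(DATA_FILES).items():
    print(sample, parts)
```

Samples listed by this sketch are the ones where files from different lanes or runs would have to be combined.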
...
In-class demo
In class, we will explore and characterize the raw data.
1. Log in to your appsoma.com account.
2. Select the "Code" tab if you are not already there.
3. Select "Biolinux-03" from the drop-down menu to the right of the RUN button.
4. Select "Shell".
Now, within the shell window, use some of the Linux commands you've learned to move yourself into a working directory (called "scratch") and link to the data:
Here are some elements (programs & techniques) we may use (you will need some of these for the homework):
...
Mappers/aligners work by first creating a compact index of the reference genome.
Then, you tell the mapper to map your raw data (the fastq.gz files) to the indexed reference genome.
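To make the "index first, then map" idea concrete, here is a toy sketch in Python. It is not how the mapper used in class works internally: production mappers use far more compact index structures and tolerate mismatches and gaps, while this sketch uses a plain dictionary of k-mers and exact matching only. The reference string, the reads, and the function names are all invented for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k):
    """Return 0-based reference positions where the read matches exactly.

    The first k bases of the read are looked up in the index (the "seed"),
    then each candidate position is checked against the whole read.
    """
    hits = []
    for pos in index.get(read[:k], []):
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


# Toy reference and reads, invented for illustration.
reference = "ACGTACGTTAGCCGATAGCTAGGCTA"
index = build_kmer_index(reference, k=4)

print(map_read("TAGCCGAT", reference, index, k=4))
print(map_read("GGGGGGGG", reference, index, k=4))
```

The index is built once and reused for every read, which is why the indexing step is done separately before mapping.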
But in the true spirit of Linux applications, this is only one modular step of the whole process. The output of the mapper is in text form, not binary, so it's big and slow to access. It's also in the order of the raw reads, not the genome, so looking up a genomic location is really slow. And we don't yet have any summary data about how the mapping process went (except for the log created during mapping). So there is a series of common commands to post-process a run.
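The point about read order versus genome order can be illustrated with a small Python sketch. It does not reproduce the post-processing commands themselves; it only shows, on invented records, why sorting by genomic coordinate makes region lookups cheap: an unsorted list must be scanned in full, while a sorted one can be binary-searched. All names and values here are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Toy alignments in the order the reads came off the sequencer:
# (read name, chromosome, 1-based start position). All values are invented.
alignments = [
    ("read1", "chr2", 500),
    ("read2", "chr1", 9000),
    ("read3", "chr1", 120),
    ("read4", "chr2", 40),
    ("read5", "chr1", 4500),
]


def scan_region(records, chrom, start, end):
    """Unsorted input: every record has to be inspected."""
    return [r[0] for r in records if r[1] == chrom and start <= r[2] <= end]


def sort_by_coordinate(records):
    """Reorder records by (chromosome, position)."""
    return sorted(records, key=lambda r: (r[1], r[2]))


def build_keys(sorted_records):
    """Precompute the sort keys once so lookups can reuse them."""
    return [(r[1], r[2]) for r in sorted_records]


def lookup_region(sorted_records, keys, chrom, start, end):
    """Sorted input: binary search jumps straight to the region."""
    lo = bisect_left(keys, (chrom, start))
    hi = bisect_right(keys, (chrom, end))
    return [r[0] for r in sorted_records[lo:hi]]


sorted_alignments = sort_by_coordinate(alignments)
keys = build_keys(sorted_alignments)
print(scan_region(alignments, "chr1", 100, 5000))
print(lookup_region(sorted_alignments, keys, "chr1", 100, 5000))
```

Both calls return the same reads; the difference is that the second does not need to touch records outside the requested region.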
And, just for fun, let's ask these tools to look for SNPs:
(Note that there are LOTS of programs for finding SNPs... this happens to be a pretty good one that uses Bayesian statistics and is fast.)
Go take a look at some of the .flagstat files!
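If you would rather summarize a .flagstat file than read it by eye, a short parser can help. This is a sketch that assumes the files use the line layout written by `samtools flagstat` (`<QC-passed> + <QC-failed> <category>`, optionally followed by a percentage in parentheses); this page does not state which tool produced them. The example text is made up and is not output from the course data.

```python
import re

# Matches lines such as "1900 + 0 mapped (95.00% : N/A)".
LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")

# Trailing annotation to drop from the category name: the QC legend or a
# percentage. Qualifiers such as "(mapQ>=5)" are kept as part of the name.
TRAILER = re.compile(r"\s*\((?:QC-passed reads|[\d.]+%|N/A|-?nan%)[^()]*\)$")


def parse_flagstat(text):
    """Return {category: (qc_passed, qc_failed)} from flagstat-style text."""
    counts = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        passed, failed, label = match.groups()
        counts[TRAILER.sub("", label)] = (int(passed), int(failed))
    return counts


def mapped_fraction(counts):
    """Fraction of QC-passed reads that were mapped."""
    total = counts["in total"][0]
    return counts["mapped"][0] / total if total else 0.0


# Made-up example text, not output from the course data.
example = """\
2000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
1900 + 0 mapped (95.00% : N/A)
12 + 0 with mate mapped to a different chr
5 + 0 with mate mapped to a different chr (mapQ>=5)
"""

counts = parse_flagstat(example)
print(counts)
print(mapped_fraction(counts))
```

Comparing the mapped fraction across samples is a quick way to spot a file that behaved differently during mapping.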
...
Now, we'll switch from running bash commands to running commands within the R statistical package. Move into the "finaldata" directory and start R like this:
You should now see a ">" prompt instead of your Linux prompt, telling you that you are now in an R shell, not a bash shell. (Type "q()" to exit the R shell.)
Load some libraries and the raw data, and do some basic transforms of the raw data to get it ready for analysis in R:
Check Principal Components Analysis (PCA)
To view the plot you just created, go to the "Data" tab in Appsoma, navigate to your scratch/finaldata area and download the Rplots.pdf file.
Check a box plot
That's not very interesting or useful - we're plotting gene expression data on a linear scale! Let's go to log scale, fixing some issues with the raw data that would throw off the log calculation:
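To see why the linear scale is unhelpful and why zeros are a problem for logs, here is a small sketch. It is written in Python rather than R, on made-up counts, and it uses a pseudocount of 1 to keep zeros finite; that is one common fix and only an assumption about what "fixing some issues" involves in the class code. The function names are hypothetical.

```python
import math
import statistics

# Made-up expression counts for one sample, including zeros and one huge value.
counts = [0, 0, 3, 12, 40, 150, 800, 25000]


def log2_with_pseudocount(values, pseudocount=1.0):
    """Compute log2(x + pseudocount); the pseudocount avoids log2(0)."""
    return [math.log2(v + pseudocount) for v in values]


def five_number_summary(values):
    """Minimum, quartiles and maximum: the numbers a box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


print("linear:", five_number_summary(counts))
print("log2:  ", five_number_summary(log2_with_pseudocount(counts)))
```

On the linear scale the single large value stretches the axis so the box is squashed near zero; after the transform the quartiles are spread across the range.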
...
Homework
For your homework, you will investigate the validity of combining data files from different sequencing runs. Only a few of these questions require working at a computer keyboard, but I encourage you to work in groups to solve the entire set of questions.
...