In class, we will explore and characterize the raw data.

This is your homework due 11/25...

Log in to your appsoma.com account.

Select the "Code" tab if you are not already there.

Select "Biolinux-03" from the drop-down menu to the right of the RUN button.

Select "Shell".

Now, within the shell window, use some of the Linux commands you've learned to move yourself into a working directory (called "scratch") and link to the data:

Code Block
cd scratch
mkdir rawdata
cd rawdata; ln -s /home/scott/e0/* . ; cd ..
mkdir finaldata
cd finaldata; ln -s /home/scott/e3/* . ; cd ..
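
If the "ln -s" step is unfamiliar, the following is a minimal sketch of the same pattern in a throwaway directory. Every path and file name in it (the temporary directory, "shared", "sample_R1.fastq") is made up for illustration and is not part of the class data.

```shell
# Hypothetical sandbox: nothing here touches /home/scott or your scratch area.
demo=$(mktemp -d)
mkdir -p "$demo/shared" "$demo/scratch/rawdata"
echo "ACGT" > "$demo/shared/sample_R1.fastq"   # stand-in for a shared data file

cd "$demo/scratch/rawdata"
ln -s "$demo"/shared/* .    # same form as: ln -s /home/scott/e0/* .

ls -l                       # each link is listed as: name -> target
link_target=$(readlink sample_R1.fastq)   # where the link points
link_content=$(cat sample_R1.fastq)       # reading the link reads the original file
echo "$link_target"
echo "$link_content"
```

A symbolic link is only a pointer, so the data is not copied and takes up no extra space in your own directory.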

Analyze the count data

Now, we'll switch from running bash commands to running commands within the R statistical package.

Move into the "finaldata" directory and start R like this:

Code Block
cd finaldata
R

You should now see a ">" prompt instead of your Linux prompt, telling you that you are now in an R shell, not a bash shell. (Type "q()" to exit the R shell.)

Load some libraries and the raw data, then do some basic transforms of the raw data to get it ready for analysis in R:

Code Block

library(ggbiplot)
wall<-read.table(file="all3x3.counts",sep="\t",header=FALSE); # load the raw data into variable "wall"
wallt<-t(wall[,2:7]); # This just transposes the read count data into a new variable, wallt

Check Principal Components Analysis (PCA)

Code Block
wallt.pca<-prcomp(wallt)
print(ggbiplot(wallt.pca, groups=c("4hr","4hr","4hr","24hr","24hr","24hr"), ellipse = TRUE, circle=TRUE, obs.scale = 1, var.scale = 1, var.axes=FALSE))

To view the plot you just created, go to the "Data" tab in Appsoma, navigate to your scratch/finaldata area and download the Rplots.pdf file.

Check a box plot

Code Block
boxplot(wall)

That's not very interesting or useful - we're plotting gene expression data on a linear scale! Let's go to log scale, fixing some issues with the raw data that would throw off the log calculation:

Code Block
boxplot(log(wall[,2:7]))

Homework

For your homework, you will investigate the validity of combining data files from different sequencing runs. Only a few of these questions require working at a computer keyboard, but I encourage you to work in groups to solve the entire set of questions.

Based on what you learned about the T-test (that is, using terms associated with a T-test), explain what criteria you might use to consider it "invalid" to combine the multiple raw sequence data files from samples. (5 points)
~~Outline the steps needed to reduce the raw data to numbers suitable for evaluation of your criteria in question #1. (5 points)~~
~~Perform the steps you outlined in #2 and tell whether or not it was valid to combine the data files. (20 points)~~
~~Starting with the raw "count" data, explore the effect on PCA of NOT normalizing. Turn in a print out of the new PCA plot. (10 points)~~
Although we did not explore this in class, DNA mutations were automatically tallied during our mapping process. These results are in the files ending in ".bcf". Using the tool "bedtools" to view these results, test the hypothesis that transitions are more common than transversions. Support your answer with data from this experiment. (20 points)
Continuing with the mutation analysis, examine whether the mutation frequency in this sample set differs between protein coding and non-protein coding regions of the genome. Support your answer with data from this experiment. (30 points).

~~Email your answers/PDFs to shunickesmith <at> gmail.com, cc: Prof. Matouschek no later than TBD, 10:00 am (BETTER: before Thanksgiving break).~~

Homework - Revised 5:00 pm Thursday 11/20/14

For your homework, you will investigate the validity of combining data files from different sequencing runs. Only a few of these questions require working at a computer keyboard, but I encourage you to work in groups to solve the entire set of questions.

Before you begin, pull up this web site in a new window - it's an all-class, live group chat to which you can post questions, get answers and even answer questions of your fellow students. Dr. Hunicke-Smith and Benni will be monitoring it periodically (largely during the daytime and early evenings).

By next Tuesday, 11/25, 10:00 am do the following:

Follow the steps listed above on this web page for "This is your homework due 11/25..." to log in to Appsoma and set up access to the data you will need from here on.
Move into the "rawdata" directory, find the first four lines of the read 1 sequence file for the MURI 102 sample from sequencing run SA14008, put them into a new file in that directory called "s1.fq", and copy its contents into an email.
Move into the "finaldata" directory and make sure you can see the gene expression data file "all3x3.counts".
Using the Linux "sort" command, sort all3x3.counts 6 times, sorting on the expression values of each of the 6 samples separately from lowest to highest, redirecting the output of each sort into a separate file.
Using the Linux command "tail -1" on each of these 6 files, copy the name of the most abundant gene from each sample into the same email.
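
The commands named in these steps can be tried first on toy files. The sketch below is not the homework data: the file names, gene names and counts are invented, and it assumes a tab-separated layout with the gene name in the first column, as the read.table call above suggests. You will need to work out the right file and column numbers for the real data yourself.

```shell
# Hypothetical sandbox with made-up files.
demo=$(mktemp -d)
cd "$demo"

# A FASTQ record is four lines, so "head -4" captures exactly the first read.
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > toy.fq
head -4 toy.fq > first_record.fq
n_lines=$(wc -l < first_record.fq | tr -d ' ')

# Toy count table: gene name, then counts for two imaginary samples.
printf 'geneA\t10\t500\ngeneB\t200\t3\ngeneC\t9\t40\n' > toy.counts

tab=$(printf '\t')
# -t sets the field separator, -k2,2 restricts the key to field 2,
# and the trailing n sorts numerically (without it, "9" sorts after "200").
sort -t "$tab" -k2,2n toy.counts > toy.sorted_s1
sort -t "$tab" -k3,3n toy.counts > toy.sorted_s2

# After a lowest-to-highest sort, the last line holds the largest count.
top_s1=$(tail -1 toy.sorted_s1 | cut -f1)
top_s2=$(tail -1 toy.sorted_s2 | cut -f1)
echo "$n_lines $top_s1 $top_s2"
```

In this toy table the most abundant gene differs between the two samples, which is why each sample needs its own sort.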

Remember - use Etherpad to ask questions and get answers! Email your answers to shunickesmith <at> gmail.com, cc: Prof. Matouschek no later than 11/25, 10:00 am.
