Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

In this exercise, you will use a recently developed R package called RIPSeeker to identify binding regions in a small CLIP-seq dataset.  This dataset has been MASSIVELY down-sampled, since otherwise staging and processing the data would be difficult.  As a result, this dataset should produce only ONE peak in the entire human genome.  Let's try to find it!

Primary Data

This data is derived from a human (HeLa) Ago2 CLIP-seq experiment (Ameyar-zazoua M et al., Nat Struct Mol Biol. 2012;19(10):998-1004), and since you already copied it over at the start of the last exercise, it should be at the path:

Code Block
$SCRATCH/ripseq_exercises/CLIP

...

RIPSeeker is a Bioconductor R package, and we are going to run the job using the "Rscript" utility using the script and commands file that exist in the directory you just staged.  Even on TINY files like the ones we are using, it will take a while to run, so we're going to kick it off now, and then look at what we're doing while it's processing.

The actual R commands are in a script called ripseeker_script.R, which is then called by a commands file using Rscript, which is a utility to run R scripts from the command line (or, in this case, a batch job) when it can't be interactive.  So, to execute the script and its files, run these commands and use qstat to make sure the job is running.

Code Block
cd $SCRATCH/ripseq_exercises/CLIP/
$BI/bin/launcher_creator.py -n ripseeker -t 01:00:00 -j ripseeker.cmds -q development -a CCBB
qsub ripseeker.sge
qstat

The output (when it is done) should deposit two files in the directory you submitted the job from.  Those files are:

Code Block
RIPregions.gff3
seekOutput.RData

The important file is RIPregions.gff3, which is a coordinate file of the regions predicted to be enriched in the IP.  Because of the extremely slimmed-down dataset we used, there is only one peak, but this pipeline (with more reads that are actually real) would produce a file that is more interesting.  The seekOutput.Rdata file is HUGE, but it can be loaded directly back into R.  This is useful if you want to take the coordinate intervals in the GFF file (for example) and do something else with them.

 

Now, let's take a closer look at what we just did.  Check out the "ripseeker_script.R" file by using cat or less:

Code Block
source("http://bioconductor.org/biocLite.R")
biocLite("RIPSeeker")
library(RIPSeeker)
bamFiles = list.files(".", "\\.bam$", recursive=TRUE, full.names=TRUE)
ripGal=combineAlignGals(bamFiles[2], genomeBuild="hg19")
ctlGal=combineAlignGals(bamFiles[1], genomeBuild="hg19")

cNAME="input"
binSize=200
genomeBuild = "hg19"
outDir <- file.path(getwd())
reverseComplement = FALSE
uniqueHit = FALSE
assignMultihits = FALSE
rerunWithDisambiguatedMultihits = FALSE

seekOut.AGO2 <- ripSeek(bamPath = bamFiles, cNAME = cNAME, reverseComplement = FALSE, genomeBuild = "hg19", \
uniqueHit = FALSE, assignMultihits = FALSE, rerunWithDisambiguatedMultihits = FALSE, binSize=binSize, outDir=outDir)

Now, before running anything, letLet's go through this section by section.  First, we need to get the RIPSeeker packages from Bioconductor and load them into R.

...

Code Block
> binSize=200  #Set the bin size, or have it be found automatically (between 200 and 1200 bp)
> cNAME="input"  #Which files are input
> genomeBuild = "hg19"   #Genome to use
> outDir <- file.path(getwd())  #Output file prefix (here, the current directory)
> reverseComplement = FALSE   #Do not reverse complement BAM entries
> uniqueHit = FALSE   #If TRUE, use only unique hits to train initial model
> assignMultihits = FALSE  #If TRUE and uniqueHit is TRUE, assign multi-hits to regions based on initial model
> rerunWithDisambiguatedMultihits = FALSE   #If TRUE and prior two options are TRUE, re-train model after multi-hit assignment
 
> seekOut.AGO2 <- ripSeek(bamPath = bamFiles, cNAME = cNAME, reverseComplement = FALSE, genomeBuild = "hg19", \
uniqueHit = FALSE, assignMultihits = FALSE, rerunWithDisambiguatedMultihits = FALSE, binSize=binSize, outDir=outDir)

The way we're going to run this at TACC is by having the actual R commands in a script called ripseeker_script.R, which is then called by a commands file using Rscript, which is a utility to run R scripts from the command line (or, in this case, a batch job) when it can't be interactive.

Code Block
$BI/bin/launcher_creator.py -n ripseeker -t 01:00:00 -j ripseeker.cmds -q development -a CCBB
qsub ripseeker.sge

The output should deposit two files in your current directory:

Code Block
RIPregions.gff3
seekOutput.RData

The important file is RIPregions.gff3, which is a coordinate file of the regions predicted to be enriched in the IP.  Because of the extremely slimmed-down dataset we used, there is only one peak, but this pipeline (with more reads that are actually real) would produce a file that is more interesting.  The seekOutput.Rdata file is HUGE, but it can be loaded directly back into R.  This is useful if you want to take the coordinate intervals in the GFF file (for example) and do something else with them. 

So, now that you've generated all this data, go ahead and look at what the single peak is!

...