Objectives
In this exercise, you will use a recently developed R package called RIPSeeker to identify enriched transcripts binding regions in a small CLIP-seq dataset. This dataset has been MASSIVELY down-sampled, since otherwise staging and processing the data would be difficult. As is commonly the case with CLIP-seq (and other protocols that use SDS-PAGE purification before library preparation), there is no
Primary Data
a result, this dataset should produce only ONE peak in the entire human genome. Let's try to find it!
Primary Data
This data is derived from a human (HeLa) Ago2 CLIP-seq experiment (Ameyar-zazoua M et al., Nat Struct Mol Biol. 2012;19(10):998-1004), and since you already copied it over at the start of the last exercise, it should be at the path:
Code Block |
---|
ls $SCRATCH/ripseq_exercises/CLIP |
When you look in the "CLIP" subdirectory, you should see these files:
Code Block |
---|
input.chr19.bam
ip.chr19.bam
results
ripseeker.cmds
ripseeker_script.R |
Results is a directory containing pre-made, well, results. The two BAM files are our test datasets, and they are a sub-sample of reads from only chr19. The ripseeker.cmds and ripseeker_script.R files are both commands files, which we will examine later. Before that, we need to make some adjustments so we can install some new libraries. First, execute these commands:
Code Block |
---|
cd $SCRATCH/ripseq_exercises/CLIP
cp /scratch/02423/nsabell/RIPSeeker_1.4.0.tar.gz .
cp /scratch/02423/nsabell/ripseeker_script.R .
mkdir lib
chmod 777 lib |
RIPSeeker
RIPSeeker is a Bioconductor R package, and we are going to run the job using the "Rscript" utility using the script and commands file that exist in the directory you just staged. Even on TINY files like the ones we are using, it will take a while to run, so we're going to kick it off now, and then look at what we're doing while it's processing.
The actual R commands are in a script called ripseeker_script.R, which is then called by a commands file using Rscript, which is a utility to run R scripts from the command line (or, in this case, a batch job) when it can't be interactive. So, to execute the script and its files, run these commands and use qstat to make sure the job is running.
Code Block |
---|
cd $SCRATCH/ripseq_exercises/CLIP/
$BI/bin/launcher_creator.py -n ripseeker -t 01:00:00 -j ripseeker.cmds -q development -a CCBB
qsub ripseeker.sge
qstat |
The output (when it is done) should deposit two files in the directory you submitted the job from. Those files are:
Code Block |
---|
RIPregions.gff3
seekOutput.RData |
The important file is RIPregions.gff3, which is a coordinate file of the regions predicted to be enriched in the IP. Because of the extremely slimmed-down dataset we used, there is only one peak, but this pipeline (with more reads that are actually real) would produce a file that is more interesting. The seekOutput.Rdata file is HUGE, but it can be loaded directly back into R. This is useful if you want to take the coordinate intervals in the GFF file (for example) and do something else with them.
Now, let's take a closer look at what we just did. Check out the "ripseeker_script.R" file by using cat or less:
Code Block |
---|
install.packages(lib="./lib", pkgs="./RIPSeeker_1.4.0.tar.gz", repos=NULL)
library(RIPSeeker, lib.loc="./lib")
bamFiles = list.files(".", "\\.bam$", recursive=TRUE, full.names=TRUE)
ripGal=combineAlignGals(bamFiles[2], genomeBuild="hg19")
ctlGal=combineAlignGals(bamFiles[1], genomeBuild="hg19")
cNAME="input"
binSize=200
genomeBuild = "hg19"
outDir <- file.path(getwd())
reverseComplement = FALSE
uniqueHit = FALSE
assignMultihits = FALSE
rerunWithDisambiguatedMultihits = FALSE
seekOut.AGO2 <- ripSeek(bamPath = bamFiles, cNAME = cNAME, reverseComplement = FALSE, genomeBuild = "hg19", \
uniqueHit = FALSE, assignMultihits = FALSE, rerunWithDisambiguatedMultihits = FALSE, binSize=binSize, outDir=outDir) |
Let's go through this section by section. First, we need to get the RIPSeeker packages from Bioconductor and load them into R.
Code Block |
---|
> install.packages(lib="./lib", pkgs="./RIPSeeker_1.4.0.tar.gz", repos=NULL)
> library(RIPSeeker, lib.loc="./lib") |
Then, we need to read in the BAM files we're going to use with their paths and use the RIPSeeker (or rather GenomicRanges) function "combineAlignGals" to generate a "gapped alignment" object in R using hg19 genomic and contig coordinates.
Code Block |
---|
> bamFiles = list.files(".", "\\.bam$", recursive=TRUE, full.names=TRUE)
> ripGal=combineAlignGals(bamFiles[2], genomeBuild="hg19")
> ctlGal=combineAlignGals(bamFiles[1], genomeBuild="hg19") |
Now, we're ready to execute the primary RIPSeeker function, appropriately called ripSeek(). Since it takes so many options, we're going to declare a bunch of variables before executing the command, and then have the command refer back to those already-declared variables. This is what it would look like:
Code Block |
---|
> binSize=200 #Set the bin size, or have it be found automatically (between 200 and 1200 bp)
> cNAME="input" #Which files are input
> genomeBuild = "hg19" #Genome to use
> outDir <- file.path(getwd()) #Output file prefix (here, the current directory)
> reverseComplement = FALSE #Do not reverse complement BAM entries
> uniqueHit = FALSE #If TRUE, use only unique hits to train initial model
> assignMultihits = FALSE #If TRUE and uniqueHit is TRUE, assign multi-hits to regions based on initial model
> rerunWithDisambiguatedMultihits = FALSE #If TRUE and prior two options are TRUE, re-train model after multi-hit assignment
> seekOut.AGO2 <- ripSeek(bamPath = bamFiles, cNAME = cNAME, reverseComplement = FALSE, genomeBuild = "hg19", \
uniqueHit = FALSE, assignMultihits = FALSE, rerunWithDisambiguatedMultihits = FALSE, binSize=binSize, outDir=outDir) |
So, now that you've generated all this data, go ahead and look at what the single peak is!
Expand | ||
---|---|---|
Go here to look at it in the genome browser - see anything interesting? |
We can also open R interactively to look at the contents of the big "seekOutput.RData" file we made. We need to use the R set up by the BioITeam, rather than the one in the module system, so we can use a version 3.x. These are a possible sequence of commands:
Code Block |
---|
cd $SCRATCH/ripseq_exercises/CLIP/
$BI/bin/R
#Now we are in R
library(RIPSeeker, lib.loc="./lib")
rip_data = load("seekOutput.RData")
rip
# [1] "mainSeekOutputRIP" "RIPGRList" "mainSeekOutputCTL"
# [4] "annotatedRIPGR" |
Here, "rip_data" is a list of four objects that we just loaded into R, and each contains different sub-objects and information.
Object | Component Objects | Information Content |
---|---|---|
mainSeekOutputRIP | nbhGRList alignGal alignGalFiltered | Information such as the HMM posterior probabilities and filtered alignment objects for the RIP sample |
RIPGRList | a GRangesList | A list (in GRanges object form) of all peak regions called by RIPSeeker |
mainSeekOutputCTL | nbhGRList alignGal alignGalFiltered | Information such as the HMM posterior probabilities and filtered alignment objects for the CTL sample |
annotatedRIPGR | sigGRangesAnnotated enrichedGO | Annotations for each of the regions flagged in RIPGRList, as well as calling of significantly enriched GO terms for those regions; this only works if "ChIPpeakAnno" is loaded and you set the functions biomaRt, biomaRt_dataset, and goAnno (see below) |
Alternatives
We could have added these lines to the ripseeker_script.R file:
Code Block |
---|
> biocLite("ChIPpeakAnno")
> library(ChIPpeakAnno)
> biomart="ensembl" #Source of genome annotations
> biomaRt_dataset="hsapiens_gene_ensembl" #Specific dataset from biomaRt source
> goAnno="org.Hs.eg.db" #Annotation database |
Which would automatically annotate the resulting peak coordinates with their gene names. We didn't do this because the package that is used, ChIPpeakAnno, has quite a few dependencies and takes a while to download (maybe ~15 minutes) and run.