Exercise #2: CLIP-Seq

Objectives

In this exercise, you will use a recently developed R package called RIPSeeker to identify binding regions in a small CLIP-seq dataset. This dataset has been MASSIVELY down-sampled, since otherwise staging and processing the data would be difficult. As a result, this dataset should produce only ONE peak in the entire human genome. Let's try to find it!

Primary Data

This data is derived from a human (HeLa) Ago2 CLIP-seq experiment (Ameyar-zazoua M et al., Nat Struct Mol Biol. 2012;19(10):998-1004), and since you already copied it over at the start of the last exercise, it should be at the path:

$SCRATCH/ripseq_exercises/CLIP

When you look in the "CLIP" subdirectory, you should see these files:

input.chr19.bam
ip.chr19.bam
results
ripseeker.cmds
ripseeker_script.R

Results is a directory containing pre-made, well, results. The two BAM files are our test datasets, and they are a sub-sample of reads from only chr19. The ripseeker.cmds and ripseeker_script.R files are both commands files, which we will examine before running.

RIPSeeker

RIPSeeker is a Bioconductor R package, and we are going to run the job using the "Rscript" utility using the script and commands file that exist in the directory you just staged. Check out the "ripseeker_script.R" file by using cat or less:

source("http://bioconductor.org/biocLite.R")
biocLite("RIPSeeker")
library(RIPSeeker)
bamFiles = list.files(".", "\\.bam$", recursive=TRUE, full.names=TRUE)
ripGal=combineAlignGals(bamFiles[2], genomeBuild="hg19")
ctlGal=combineAlignGals(bamFiles[1], genomeBuild="hg19")

cNAME="input"
binSize=200
genomeBuild = "hg19"
outDir <- file.path(getwd())
reverseComplement = FALSE
uniqueHit = FALSE
assignMultihits = FALSE
rerunWithDisambiguatedMultihits = FALSE

seekOut.AGO2 <- ripSeek(bamPath = bamFiles, cNAME = cNAME, reverseComplement = FALSE, genomeBuild = "hg19", \
uniqueHit = FALSE, assignMultihits = FALSE, rerunWithDisambiguatedMultihits = FALSE, binSize=binSize, outDir=outDir)

Now, before running anything, let's go through this section by section. First, we need to get the RIPSeeker packages from Bioconductor and load them into R.

> source("http://bioconductor.org/biocLite.R")
> biocLite("RIPSeeker")
> library(RIPSeeker)

Then, we need to read in the BAM files we're going to use with their paths and use the RIPSeeker (or rather GenomicRanges) function "combineAlignGals" to generate a "gapped alignment" object in R using hg19 genomic and contig coordinates.

> bamFiles = list.files(".", "\\.bam$", recursive=TRUE, full.names=TRUE)
> ripGal=combineAlignGals(bamFiles[2], genomeBuild="hg19")
> ctlGal=combineAlignGals(bamFiles[1], genomeBuild="hg19")

Now, we're ready to execute the primary RIPSeeker function, appropriately called ripSeek(). Since it takes so many options, we're going to declare a bunch of variables before executing the command, and then have the command refer back to those already-declared variables. This is what it would look like:

> binSize=200  #Set the bin size, or have it be found automatically (between 200 and 1200 bp)
> cNAME="input"  #Which files are input
> genomeBuild = "hg19"   #Genome to use
> outDir <- file.path(getwd())  #Output file prefix (here, the current directory)
> reverseComplement = FALSE   #Do not reverse complement BAM entries
> uniqueHit = FALSE   #If TRUE, use only unique hits to train initial model
> assignMultihits = FALSE  #If TRUE and uniqueHit is TRUE, assign multi-hits to regions based on initial model
> rerunWithDisambiguatedMultihits = FALSE   #If TRUE and prior two options are TRUE, re-train model after multi-hit assignment
 
> seekOut.AGO2 <- ripSeek(bamPath = bamFiles, cNAME = cNAME, reverseComplement = FALSE, genomeBuild = "hg19", \
uniqueHit = FALSE, assignMultihits = FALSE, rerunWithDisambiguatedMultihits = FALSE, binSize=binSize, outDir=outDir)

The way we're going to run this at TACC is by having the actual R commands in a script called ripseeker_script.R, which is then called by a commands file using Rscript, which is a utility to run R scripts from the command line (or, in this case, a batch job) when it can't be interactive.

$BI/bin/launcher_creator.py -n ripseeker -t 01:00:00 -j ripseeker.cmds -q development -a CCBB
qsub ripseeker.sge

The output should deposit two files in your current directory:

RIPregions.gff3
seekOutput.RData

The important file is RIPregions.gff3, which is a coordinate file of the regions predicted to be enriched in the IP. Because of the extremely slimmed-down dataset we used, there is only one peak, but this pipeline (with more reads that are actually real) would produce a file that is more interesting. The seekOutput.Rdata file is HUGE, but it can be loaded directly back into R. This is useful if you want to take the coordinate intervals in the GFF file (for example) and do something else with them.

So, now that you've generated all this data, go ahead and look at what the single peak is!

Click here to expand...

cat RIPregions.gff3
chr19	rtracklayer	sequence_feature	47545201	47545400	.	*	.	count=829; fpk=4145; logOddScore=65.8422503917559; pval=0; pvalAdj=0; CTL_count=1491; CTL_fpk=7455; CTL_logOddScore=-61.5676075214777; eFDR=0

Go here to look at it in the genome browser - see anything interesting?

Alternatives

We could have added these lines to the ripseeker_script.R file:

> biocLite("ChIPpeakAnno")
> library(ChIPpeakAnno)
 
> biomart="ensembl"   #Source of genome annotations
> biomaRt_dataset="hsapiens_gene_ensembl"   #Specific dataset from biomaRt source
> goAnno="org.Hs.eg.db"   #Annotation database

Which would automatically annotate the resulting peak coordinates with their gene names. We didn't do this because the package that is used, ChIPpeakAnno, has quite a few dependencies and takes a while to download (maybe ~15 minutes) and run.