Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Objectives

In this exercise, you will use a recently developed R package called RIPSeeker to identify binding regions in a small CLIP-seq dataset.

...

Code Block
source("http://bioconductor.org/biocLite.R")
biocLite("RIPSeeker")
library(RIPSeeker)

bamFiles = list.files(".", "\\.bam$", recursive=TRUE, full.names=TRUE)
ripGal=combineAlignGals(bamFiles[2], genomeBuild="hg19")
ctlGal=combineAlignGals(bamFiles[1], genomeBuild="hg19")

cNAME="input"
binSize=200
genomeBuild = "hg19"
outDir <- file.path(getwd())
biomart="ensembl"
biomaRt_dataset="hsapiens_gene_ensembl"
goAnno="org.Hs.eg.db"
reverseComplement = FALSE
uniqueHit = FALSE
assignMultihits = TRUE
rerunWithDisambiguatedMultihits = TRUE

seekOut.AGO2 <- ripSeek(bamPath = bamFiles, cNAME = cNAME, reverseComplement = FALSE, genomeBuild = "hg19", uniqueHit = TRUE, assignMultihits = TRUE, rerunWithDisambiguatedMultihits = FALSE, binSize=binSize, biomart=biomart, biomaRt_dataset = biomaRt_dataset, goAnno = goAnno, outDir=outDir)

Now, before running anything, let's go through this section by section.  First, we need to get the RIPSeeker packages from Bioconductor and load them into R.

...

Then, we need to read in the BAM files we're going to use with their paths and use the RIPSeeker (or rather GenomicRanges) function "combineAlignGals" to generate a "gapped alignment" object in R using hg19 genomic and contig coordinates.  

Code Block
> bamFiles = list.files(".", "\\.bam$", recursive=TRUE, full.names=TRUE)
> ripGal=combineAlignGals(bamFiles[2], genomeBuild="hg19")
Processing ./ip.sort.bam ... All hits are returned with flags.
1 BAM files are combined
> ctlGal=combineAlignGals(bamFiles[1], genomeBuild="hg19")
Processing ./input.sort.bam ... All hits are returned with flags.
1 BAM files are combined

Now, we're ready to execute the primary RIPSeeker function, appropriately called ripSeek().  Since it takes so many options, we're going to declare a bunch of variables before executing the command, and then have the command refer back to those already-declared variables.  This is what it would look like:

Code Block
> binSize=NULL
> cNAME="input"  #Which files are input
> genomeBuild = "hg19"   #Genome to use
> outDir <- file.path(getwd())  #Output file prefix (here, the current directory)
> biomart="ensembl"   #Source of genome annotations
> biomaRt_dataset="hsapiens_gene_ensembl"   #Specific dataset from biomaRt source
> goAnno="org.Hs.eg.db"   #Annotation database
> reverseComplement = FALSE   #Do not reverse complement BAM entries
> uniqueHit = FALSE   #If TRUE, use only unique hits to train initial model
> assignMultihits = TRUE  #If TRUE and uniqueHit is TRUE, assign multi-hits to regions based on initial model
> rerunWithDisambiguatedMultihits = TRUE   #If TRUE and prior two options are TRUE, re-train model after multi-hit assignment
 
> seekOut.AGO2 <- ripSeek(bamPath = bamFiles, cNAME = cNAME, reverseComplement = FALSE, genomeBuild = "hg19", uniqueHit = TRUE, assignMultihits = TRUE, rerunWithDisambiguatedMultihits = FALSE, binSize=200binSize, biomart=biomart, biomaRt_dataset = biomaRt_dataset, goAnno = goAnno, outDir=outDir)

If we execute the ripSeek() command, we get an output that looks something like this:

...

The way we're going to run this at TACC is by having the actual R commands in a script called ripseeker_script.R, which is then called by a commands file using Rscript, which is a utility to run R scripts from the command line (or, in this case, a batch job) when it can't be interactive.

Code Block
$BI/bin/launcher_creator.py -n ripseeker -t 01:00:00 -j ripseeker.cmds -q development -a CCBB

qsub ripseeker.sge

The output should deposit several files in your current directory, which includes the following:

Code Block
$

 

 

 

 

Alternatives