Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

This data is derived from an a human (HeLa) Ago2 CLIP-seq experiment (Ameyar-zazoua M et al., Nat Struct Mol Biol. 2012;19(10):998-1004), and is located atsince you already copied it over at the start of the last exercise, it should be at the path:

Code Block
$BI$SCRATCH/rnaseq_course/ripseq_exercises/CLIP

Stage it in your Scratch area in When you look in the "CLIP" subdirectory, you should see these files:

Code Block
input.chr19.bam
ip.chr19.bam
results
ripseeker.cmds
ripseeker_script.R

Results is a directory containing pre-made, well, results.  The two BAM files are our test datasets, and they are a sub-sample of reads from only chr19.  The ripseeker.cmds and ripseeker_script.R files are both commands files, which we will examine before running.

RIPSeeker

RIPSeeker is a Bioconductor R package, and we are going to run the job using the "Rscript" utility using the script and commands file that exist in the directory you just staged.  Check out the "ripseeker_script.R" file .by using cat or less:

Code Block
source("http://bioconductor.org/biocLite.R")
biocLite("RIPSeeker")
library(RIPSeeker)
bamFiles = list.files(".", "\\.bam$", recursive=TRUE, full.names=TRUE)
ripGal=combineAlignGals(bamFiles[2], genomeBuild="hg19")
ctlGal=combineAlignGals(bamFiles[1], genomeBuild="hg19")

cNAME="input"
binSize=200
genomeBuild = "hg19"
outDir <- file.path(getwd())
reverseComplement = FALSE
uniqueHit = FALSE
assignMultihits = FALSE
rerunWithDisambiguatedMultihits = FALSE

seekOut.AGO2 <- ripSeek(bamPath = bamFiles, cNAME = cNAME, reverseComplement = FALSE, genomeBuild = "hg19", \
uniqueHit = FALSE, assignMultihits = FALSE, rerunWithDisambiguatedMultihits = FALSE, binSize=binSize, outDir=outDir)

...

Now, we're ready to execute the primary RIPSeeker function, appropriately called ripSeek().  Since it takes so many options, we're going to declare a bunch of variables before executing the command, and then have the command refer back to those already-declared variables.  This is what it would look like:

Code Block
> binSize=NULL200  #Set the bin size, or have it be found automatically (between 200 and 1200 bp)
> cNAME="input"  #Which files are input
> genomeBuild = "hg19"   #Genome to use
> outDir <- file.path(getwd())  #Output file prefix (here, the current directory)
> reverseComplement = FALSE   #Do not reverse complement BAM entries
> uniqueHit = FALSE   #If TRUE, use only unique hits to train initial model
> assignMultihits = FALSE  #If TRUE and uniqueHit is TRUE, assign multi-hits to regions based on initial model
> rerunWithDisambiguatedMultihits = FALSE   #If TRUE and prior two options are TRUE, re-train model after multi-hit assignment
 
> seekOut.AGO2 <- ripSeek(bamPath = bamFiles, cNAME = cNAME, reverseComplement = FALSE, genomeBuild = "hg19", \
uniqueHit = FALSE, assignMultihits = FALSE, rerunWithDisambiguatedMultihits = FALSE, binSize=binSize, outDir=outDir)

...

The important file is RIPregions.gff3, which is a coordinate file of the regions predicted to be enriched in the IP.  Because of the extremely slimmed-down dataset we used, there is only one peak, but this pipeline (with more reads that are actually real) would produce a file that is more interesting.  The seekOutput.Rdata file is HUGE, but it can be loaded directly back into R.  This is useful if you want to take the coordinate intervals in the GFF file (for example) and do something else with them.

So, now that you've generated all this data, go ahead and look at what the single peak is!

Expand
Code Block
cat RIPregions.gff3
chr19	rtracklayer	sequence_feature	47545201	47545400	.	*	.	count=829; fpk=4145; logOddScore=65.8422503917559; pval=0; pvalAdj=0; CTL_count=1491; CTL_fpk=7455; CTL_logOddScore=-61.5676075214777; eFDR=0

Go here to look at it in the genome browser - see anything interesting?

Alternatives

We could have added these lines to the ripseeker_script.R file:

Code Block
> biocLite("ChIPpeakAnno")
> library(ChIPpeakAnno)
 
> biomart="ensembl"   #Source of genome annotations
> biomaRt_dataset="hsapiens_gene_ensembl"   #Specific dataset from biomaRt source
> goAnno="org.Hs.eg.db"   #Annotation database

...