Theoretical and Experimental Background

RNA-Protein Immunoprecipitation Methods

Ribonucleoprotein (RNP) immunoprecipitation (IP), broadly, is an experimental protocol that isolates a protein and any associated RNA molecules from a mixed cellular lysate using an antibody with affinity for some portion of the target protein. Methods that instead target RNA sequences in order to profile the associated proteins by mass spectrometry exist, but they are far less common and are not the subject of this class. Beyond this shared definition, there is wide variation among the protocols used to reproducibly isolate RNP complexes and extract their RNA components for next-generation sequencing. While any number of protocol variants can give different experimental results, certain protocol parameters affect the end result so strongly that they can be thought of as defining protocol categories unto themselves. The following is a schematic of an example protocol representative of the most basic category, which we will call RIP-Seq:

Zhao J, Ohsumi TK, Kung JT, et al. Genome-wide identification of polycomb-associated RNAs by RIP-seq. Mol Cell. 2010;40(6):939-53.

Within this method, significant variation arises in several protocol steps, most commonly any fragmentation steps between RNA extraction and adaptor ligation, the size selected during gel purification, and the details of the Illumina sequencing protocol (single- vs. paired-end, read length, etc.). However, the most important detail is that, in the IP experiment, full-length RNA molecules are, in theory, recovered and no exogenous method is applied to covalently link RNA to protein. The "pull-down" relies only on the strength of the endogenous interactions, and can therefore be susceptible to associations that form in vitro (after lysis) rather than in vivo.

A second category of protocols is distinguished by the use of some form of cross-linking, and can be broadly called CLIP-Seq (standing for Cross-Linking and ImmunoPrecipitation). Formaldehyde, a popular cross-linking reagent in other applications, is not used, since the conditions necessary to reverse those cross-links (heating at 65°C for hours or overnight) rapidly degrade RNA. Instead, UV light is used, at varying wavelengths. When UV light alone is used to cross-link, the protocol is often called HITS-CLIP (standing for HIgh Throughput Sequencing of CrossLinked and ImmunoPrecipitated RNA). In another protocol variant, a modified nucleoside (typically 4-thiouridine) is added to cells during growth and is incorporated into newly synthesized RNA. When exposed to UV light, this nucleoside forms cross-links with nearby protein molecules and results in a predictable mutation at that position in the sequencing data. This is called PAR-CLIP (standing for PhotoActivatable Ribonucleoside CLIP). A third protocol variant, called iCLIP, involves circularization of the cDNA during library preparation, and can follow either HITS-CLIP or PAR-CLIP cross-linking. The following diagram, taken from a great review article, gives a sense of the relevant differences:

König J, Zarnack K, Luscombe NM, Ule J. Protein-RNA interactions: new genomic technologies and perspectives. Nat Rev Genet. 2012;13(2):77-83.

Each of these protocols can be executed in numerous ways, and the details of implementation (such as the specific enzyme used during RNase digestion in a PAR-CLIP experiment) can affect results in different ways. However, all of these experiments share two structural properties: 1) they produce a FASTQ file, which is then aligned to the genome/transcriptome to produce a BAM file, and 2) they produce these files for an IP sample and for an Input sample, an IgG sample, or both as controls.
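
To make the role of the control samples concrete, here is a minimal sketch of how read counts from an IP sample and a control sample might be compared for a single region. The function name, the counts-per-million normalization, and the pseudocount are illustrative assumptions, not the method of any particular tool; real analyses use dedicated peak callers and statistical models.

```python
import math

def log2_enrichment(ip_count, ip_total, control_count, control_total,
                    pseudocount=1.0):
    """Log2 ratio of IP over control signal for one region (hypothetical helper).

    Counts are scaled to counts per million (CPM) so that libraries of
    different sequencing depth are comparable. The pseudocount (an assumed
    value) avoids division by zero when the control has no reads.
    """
    if ip_total <= 0 or control_total <= 0:
        raise ValueError("library sizes must be positive")
    ip_cpm = ip_count * 1e6 / ip_total
    control_cpm = control_count * 1e6 / control_total
    return math.log2((ip_cpm + pseudocount) / (control_cpm + pseudocount))

# Example: 300 reads in a region out of 1 million IP reads,
# versus 100 reads in the same region out of 2 million Input reads.
score = log2_enrichment(300, 1_000_000, 100, 2_000_000)
print(f"log2 enrichment: {score:.2f}")
```

A positive value suggests the region is over-represented in the IP relative to the control; the same calculation can be repeated against the Input and IgG samples separately.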

Important Protocol-Data Structure Relationships

Variable	RIP-Seq	HITS-CLIP	PAR-CLIP
Cross-linking	None	With 254 nm radiation; Introduces somewhat less predictable mutations	After incubation with 4-thiouridine; With 365 nm radiation; Introduces predominantly T to C transitions in sequencing data
Lysis	Very gentle to avoid disrupting RNP complexes	Often uses RNases, sonication, or both to disrupt large masses of cross-linked RNPs	Similar to HITS-CLIP, with different optimization of biochemical details (e.g. enzyme choice and concentration)
RNA Selection	Can use poly-A selection and/or ribo-depletion; Size selection only after fragmentation since targeted RNAs may be of variable length; No RNP electrophoresis	No poly-A selection; Ribo-depletion possible; Population selection by excision from gel relative to native protein; Involves radioactive RNP electrophoresis	Similar to HITS-CLIP, with different optimization of biochemical details
Sequencing	Long, paired-end reads often used; Data spans the whole length of captured RNA molecules with some bias towards the ends	Short, single-end reads often used; Data piles up on the cross-linked sites with many mutations	Short, single-end reads often used, though paired-end reads can be employed on short fragments to provide extra confidence in T to C transitions
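
The T to C transitions listed for PAR-CLIP in the table above can be illustrated with a small sketch that tallies mismatch types between reads and the reference they align to. This assumes ungapped alignments given as equal-length strings in transcript orientation (for transcripts on the minus strand, the same event appears as A to G relative to the genome's plus strand); the function names are hypothetical.

```python
from collections import Counter

def mismatch_spectrum(reference, read):
    """Count (reference_base, read_base) pairs at mismatching positions.

    Assumes an ungapped alignment: both strings have the same length.
    Positions where either base is N are ignored.
    """
    if len(reference) != len(read):
        raise ValueError("reference and read must have equal length")
    spectrum = Counter()
    for ref_base, read_base in zip(reference.upper(), read.upper()):
        if ref_base != read_base and "N" not in (ref_base, read_base):
            spectrum[(ref_base, read_base)] += 1
    return spectrum

def t_to_c_fraction(aligned_pairs):
    """Fraction of all mismatches that are T-to-C, over many aligned reads."""
    total = Counter()
    for reference, read in aligned_pairs:
        total += mismatch_spectrum(reference, read)
    n_mismatches = sum(total.values())
    return total[("T", "C")] / n_mismatches if n_mismatches else 0.0

# Toy example: 2 of the 3 mismatches below are T-to-C.
pairs = [("ACGTTGCA", "ACGTCGCA"), ("TTTT", "TCTA")]
print(t_to_c_fraction(pairs))
```

In a PAR-CLIP library one would expect this fraction to be high at genuine cross-linked sites, whereas HITS-CLIP data would show a less predictable mix of mismatch types.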

High-level Data Views

As you have already learned, after getting your raw sequencing data, most pipelines will evaluate the quality of the raw data, followed by filtering/pre-processing and alignment. In RIP-Seq experiments, these steps (run with tools such as the Tuxedo pipeline, for example) can often make use of knowledge that depends on the protocol steps described above, such as the expected distribution of fragment sizes or how tolerant the aligner should be of mutations (i.e. mismatches). What you always end up with, though, are several SAM/BAM files, preferably one each for the Input, IgG, and IP samples. Let's take a look at some high-level differences between datasets that point to obvious protocol differences.
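
As a first sanity check on the Input, IgG, and IP alignments, it can help to count how many reads actually aligned in each file. The sketch below works on plain-text SAM lines using only the FLAG column defined in the SAM specification (0x4 unmapped, 0x100 secondary, 0x800 supplementary); the function name is hypothetical, and in practice a tool such as samtools would normally do this directly on BAM files.

```python
def count_primary_mapped(sam_lines):
    """Count mapped primary alignments in an iterable of SAM-format lines.

    Header lines (starting with '@') and blank lines are skipped.
    For paired-end data, each mate is counted separately.
    """
    UNMAPPED, SECONDARY, SUPPLEMENTARY = 0x4, 0x100, 0x800
    count = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # column 2 of a SAM record is FLAG
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        count += 1
    return count

# Toy records: one mapped read, one unmapped read, and one read with a
# primary (flag 16) and a secondary (flag 272 = 16 + 256) alignment.
example = [
    "@HD\tVN:1.6",
    "\t".join(["r1", "0", "chrX", "100", "60", "4M", "*", "0", "0", "ACGT", "IIII"]),
    "\t".join(["r2", "4", "*", "0", "0", "*", "*", "0", "0", "ACGT", "IIII"]),
    "\t".join(["r3", "16", "chrX", "200", "60", "4M", "*", "0", "0", "ACGT", "IIII"]),
    "\t".join(["r3", "272", "chr1", "500", "0", "4M", "*", "0", "0", "ACGT", "IIII"]),
]
print(count_primary_mapped(example))  # prints 2
```

For a real SAM file, the same function could be called on an open file handle, for example `count_primary_mapped(open(path))`, where `path` is a placeholder for your own file.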

First, below is a snapshot of what a RIP-Seq "peak" region might look like. These data come from a well-known PRC2 RIP-Seq dataset published in the same paper from which the RIP-Seq schematic diagram was taken:

The reads occur mostly in exons (though you can't see that in this view) and are distributed over the length of both Tsix and Xist, both validated binding partners of PRC2. Though the identification of a binding site is not really possible from this data, the presence of reads distributed along relatively long stretches of the transcript increases confidence that the observed enrichment is not a technical artifact of alignment (or something similar).
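
One way to put a number on "reads distributed along relatively long stretches of the transcript" is the fraction of the transcript covered by at least one read. The following is a rough sketch under simplifying assumptions: reads and the transcript are given as half-open (start, end) intervals on a single coordinate axis, and splicing is ignored. The function name is hypothetical.

```python
def covered_fraction(read_intervals, start, end):
    """Fraction of the region [start, end) covered by at least one read.

    read_intervals: iterable of half-open (read_start, read_end) tuples.
    Reads are clipped to the region, then overlapping or touching
    intervals are merged before summing the covered length.
    """
    if end <= start:
        raise ValueError("region must have positive length")
    clipped = sorted(
        (max(s, start), min(e, end))
        for s, e in read_intervals
        if e > start and s < end
    )
    covered = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (end - start)

# Three reads on a 100-base region: two overlap (covering 0-20), one stands alone.
print(covered_fraction([(0, 10), (5, 20), (50, 60)], 0, 100))  # prints 0.3
```

A RIP-Seq target would be expected to show a comparatively high covered fraction, while a pile of reads stacked at a single position would give a low one even if the total read count is large.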

In a CLIP or PAR-CLIP experiment, this "peak" would look quite different, something like this:

Figure from Hafner M, Lianoglou S, Tuschl T, Betel D. Genome-wide identification of miRNA targets by PAR-CLIP. Methods. 2012;58(2):94-105.
Data from Lipchina I, Elkabetz Y, Hafner M, et al. Genome-wide identification of microRNA targets in human ES cells reveals a role for miR-302 in modulating BMP response. Genes Dev. 2011;25(20):2173-86.

This is a peak from an Ago2 PAR-CLIP experiment in human embryonic stem cells. The bars indicate the read count at each position, where red indicates a matching base and yellow indicates a mismatching base. As the scale bar indicates, libraries with extensive digestion ("footprinting"), traceable mutations, and known motifs allow the discovery of extremely narrow target sites at single-nucleotide resolution. Those additional criteria also help filter out false positives. Short RNA libraries are particularly susceptible to false positives because filtering for PCR duplicates is complicated by the fact that low-complexity read distributions are expected even in the absence of duplication.
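
The per-position bars described above can be mimicked with a small pileup sketch that tallies, for each reference position, how many reads match and how many mismatch. It assumes ungapped reads given as (offset, sequence) pairs relative to a short reference string; the names are illustrative, and genome browsers compute this from BAM files rather than from strings.

```python
def pileup_match_mismatch(reference, reads):
    """Per-position counts of matching and mismatching read bases.

    reference: reference sequence as a string.
    reads: iterable of (offset, sequence) pairs, where offset is the
           0-based position of the read's first base on the reference.
    Returns (matches, mismatches), two lists as long as the reference.
    """
    matches = [0] * len(reference)
    mismatches = [0] * len(reference)
    for offset, sequence in reads:
        for i, base in enumerate(sequence):
            pos = offset + i
            if pos < 0 or pos >= len(reference):
                continue  # ignore bases that fall outside the reference
            if base.upper() == reference[pos].upper():
                matches[pos] += 1
            else:
                mismatches[pos] += 1
    return matches, mismatches

# Two short reads; the second carries a T-to-C mismatch at position 3.
matches, mismatches = pileup_match_mismatch("ACGTACGT", [(0, "ACGT"), (2, "GCAC")])
print(matches)     # prints [1, 1, 2, 1, 1, 1, 0, 0]
print(mismatches)  # prints [0, 0, 0, 1, 0, 0, 0, 0]
```

Positions where the mismatch count is a large share of the total depth are the kind of traceable mutation sites that help narrow down a PAR-CLIP target.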