Alternative Applications of RNA-seq
Table of Contents
Overview and Objectives
Most RNA-seq is probably done to study gene expression. This objective implies a set of experimental methods that specifically isolate mRNA molecules within a particular expected size range. The relevant steps can include processes like rRNA depletion, size selection, and/or fragmentation. Most of this course, accordingly, focuses on pipelines and tools that assume as input mRNA-targeted data (with all that implies), and will output lists of differentially expressed genes mapped to various conditions. However, RNA has numerous biological functions beyond acting as mRNA, and many of these functions can be studied using many of the same principles. Such studies simply require taking into account the expected differences between the data of interest and gene expression-oriented data. In this section, we will discuss two major categories of 'non-mRNA' RNA-seq data: (1) sequencing to study small RNAs such as microRNAs and (2) sequencing to study RNA-protein interactions.
Studying Small RNAs
Many types of small RNA have been characterized, and their biological functions are extremely wide-ranging. The table below describes the different forms and biological functions of some small and/or non-coding RNAs (though small RNAs are, almost by definition, non-coding).
Yao Y, Sun Q. Exploration of small non coding RNAs in wheat (Triticum aestivum L.). Plant Mol Biol. 2012;80(1):67-73.
Clearly, there are many biologically important functions executed by small RNA, and they can be studied by sequencing by simply cutting (for example) the 25-50bp range out of a size selection gel followed by otherwise normal library preparation. Otherwise, all these species share certain qualities that allow sequencing data derived from each to be analyzed in a similar fashion. These qualities can include (but are not limited to):
- Single-end sequencing: if an RNA-species is (for example) 16-25 bp, than paired-end sequencing of any kind provides little (though some) additional data compared to a single end run provided the reads are long enough
- Increased adapter contamination: as implied above, adapter sequence is almost always included in your reads requiring either pre-processing to remove such sequence or alignment adjustment to account for it
- Low-complexity libraries: there are often far fewer members of a category of small RNA in a genome than there are reads in the data, meaning the exact same sequence will occur in multiple reads
- Extensive genomic duplication: there are often many copies of the same sequence of a given small RNA in a genome, meaning most genomic alignments will contain numerous "multi-mappers"
All of these issues can be taken into account effectively, and in some regards can produce results simpler to understand and evaluate relative to standard gene expression data. Our first exercise will focus on one class of small RNAs, microRNAs, and will use principles that generalize to other interesting small RNAs.
Studying RNA-Protein Interactions
Similarly, RNA-protein interactions are required for an equally diverse set of biological functions, and hundreds of RNA-binding proteins have been identified. It is frequently interesting to isolate protein-RNA complexes, remove the protein, and sequence the resulting RNA. The methods involve combine components of RNA-seq, because the underlying molecule is RNA, and chromatin immunoprecipitation (ChIP), because the most common mechanism to isolate a protein-RNA complex is with an antibody raised against a fragment of the protein of interest. Below is a sample protocol flow for a RIP-seq experiment.
Zhao J, Ohsumi TK, Kung JT, et al. Genome-wide identification of polycomb-associated RNAs by RIP-seq. Mol Cell. 2010;40(6):939-53.
For "normal" RIP-seq, one usually expects to recover full RNA molecules regardless of where on an RNA molecule the protein was bound, since all of it is 'pulled down' together. However, such protocols generally do not use any chemical or physical means to covalently attach the RNA to the protein, which allows for the possibility that the RNA and protein complexes disassociate and re-associate from each other during sample preparation (there have been published papers that claim this - see here). Moreover, proteins will often bind to specific RNA sequence motifs or positions, and retrieval of the full RNA molecule provides no information about the specific binding site. To accommodate these concerns, methods have been developed to cross-link protein to RNA in a way that leaves a signature of interaction where the protein and RNA actually come into contact. Below is a table of the three methods that modify the RNA in various ways to enable binding site detection by sequencing.
König J, Zarnack K, Luscombe NM, Ule J. Protein-RNA interactions: new genomic technologies and perspectives. Nat Rev Genet. 2011;13(2):77-83.
In our second exercise, we will use a recently developed tool to analyze some sample PARCLIP data to identify specific binding sites of a protein across the entire human transcriptome.
Important Software
For standard RIP-seq, many of the methods already covered in this class are useful since one can expect to recover a full RNA molecule, and IP and Mock/No Antibody/IgG samples can be thought of as "conditions" to be compared by differential expression analysis. If the only comparison is between IP and Input, than the tools you have already learned about can be used to quantitate expression for each transcript, and fold changes can be subsequently calculated. However, more specific tools do exist, particularly for CLIP-seq and its variants. Below is a table from a semi-recent paper that summarizes some of the most widely used tools in RIP and CLIP experiments.
Li Y, Zhao DY, Greenblatt JF, Zhang Z. RIPSeeker: a statistical package for identifying protein-associated transcripts from RIP-seq experiments. Nucleic Acids Res. 2013;41(8):e94.
Some of these tools, like Cuffdiff (and similar tools like edgeR and DESeq) can be used as you would for normal differential expression analysis in standard RIP-seq experiments. These programs tend to be available directly at TACC. Others, like MACS, are available at TACC but are not really designed for use with RNA-seq data. Finally, programs like RIPseeker and PARalyzer are much less widely-used (since they are much more recent), but are designed for extremely specific experimental structures. PARalyzer, for example, is explicitly and only to be used with PAR-CLIP data.
There are many options beyond those listed in the above table. This is not an exhaustive list, but here are a few other options. Many of these can be used directly online, and do not even require TACC or command-line interaction:
- ASPeak - a really interesting, but EXTREMELY new (~ 7 months) R package; incorporates RNA-seq, mock IP, and real IP data
- PIPE-CLIP - a Galaxy-based tool for CLIP-seq analysis; you upload your files to their served, and get enriched regions (peaks) in return
- CLIPZ - a GUI for end-to-end pipeline analysis of CLIP-seq data; it uses BWA and an internal algorithm to identify binding sites and perform motif analysis
- dCLIP and PARCLIP-HMM - MATLAB based packages for analyzing any "footprinted" RIP-seq data; they have a lot of overlap with other tools
- PARma - a tool designed 100% for microRNA-Ago2 PAR-CLIP experiment seed identification
As you can see, the field is pretty active, with tools ranging from peak identification within certain constraints up to formal statistical testing and integration of other sources of information such as native RNA abundance and miRNA seed sequences.
In the exercises that follow, we will use samtools to generate miRNA profiles (Exercise #1) and implement PARalyzer to analyze a real (but down-sampled) PARCLIP data (Exercise #2).
Exercise #1: miRNA Sequencing and Profiling (miRNA-seq)
In this exercise, we will analyze a sample microRNA-seq dataset derived from H1 human embryonic stem cells that was generated for the ENCODE project and made publicly available a few years ago. These are 1x36 Illumina reads derived from all cellular RNA that is less than 200bp. Our end goal will be to obtain a microRNA profile, or counts of how many reads are derived from each microRNA.
Reference Building
Recall that, because these RNAs are very short, they may align multiple times throughout the genome. Moreover, our goal (as is frequently the case) is to quantitate all known small RNAs of a given class, rather than discover new members. Thus, it makes sense to align our sequences against a database of miRNA sequences (or snRNA, or tRNA, or...) where identical sequences are collapsed. As we will see, this facilitates down stream analysis and is also significantly faster since the genomic search space is dramatically reduced.
To obtain a FASTA file with all human miRNA sequences, execute these commands:
mkdir -p $SCRATCH/my_rnaseq_course/day_4b/mirbase cd $SCRATCH/my_rnaseq_course/day_4b/mirbase cp /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_4b/hairpin_cDNA_hsa.fa . less hairpin_cDNA_hsa.fa
If you run commands like wc -l or less on the reference FASTA, you will see how reduced the sequence space is. Additionally, each "contig" is a feature, meaning that rather than count reads in a genome (using something like HTSeq), we could simply count reads per contig, which is substantially faster. You can do this with ANY reference FASTA data, since most aligners can accept any FASTA as a reference index. To build the index:
module load perl module load bowtie/2.2.0 cd $SCRATCH/my_rnaseq_course/day_4b/mirbase bowtie2-build hairpin_cDNA_hsa.fa hairpin_cDNA_hsa.fa
That was fast! For comparison, doing the same command with the hg19 reference FASTA would take several hours (and should never be done on a login node). If you ls, you will see:
hairpin_cDNA_hsa.fa hairpin_cDNA_hsa.fa.1.bt2 hairpin_cDNA_hsa.fa.2.bt2 hairpin_cDNA_hsa.fa.3.bt2 hairpin_cDNA_hsa.fa.4.bt2 hairpin_cDNA_hsa.fa.rev.1.bt2 hairpin_cDNA_hsa.fa.rev.2.bt2
These files, together, constitute the bowtie2 reference index.
Data Staging
We will first copy them over from the BioITeam area on Corral, stage them in a directory in your scratch area, and look at them a little bit. The commands to do that would look something like this:
cd $SCRATCH/my_rnaseq_course/day_4b cp /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_4b/human_mirnaseq.fastq.gz . less human_mirnaseq.fastq.gz
A sample miRNA FASTQ entry, using less, might look like this:
@TUPAC_0037_FC62EE7AAXX:2:1:2000:1139#0/1 TAGCAGCACGTCAGTATTGNCGTAAAAAAAAAAAAG +TUPAC_0037_FC62EE7AAXX:2:1:2000:1139#0/1 ffafffffff\U_La[[W[B^a^abfffcccccccc
The third line has the name attached after the "+", which is an artifact of a storage method that we won't go into here. However, everything else is basically the same - read name, followed by sequence, strand, and quality scores. However, note the string of A's towards the end. This is because, as for many very short RNAs, our read extends past the actual RNA fragment. In this case, the 'adapter' sequence is obvious - it is just a poly-A string. However, what if it wasn't? Indeed, working with publicly available small RNA data, you will often not know what the adapter is (it may not be obvious), or you might not even know if for data coming from your lab (if we're being honest).
Bowtie2 Local Alignment
So, we would like to use an alignment strategy that can intelligently ignore the parts that won't align to a reference (the 'adapter') and align correctly the parts that align well. This is called a 'local' alignment, in contrast to a 'global' alignment, which would count the 'mismatches' in the adapter against the alignment score. Fortunately, you have already used a local-alignment-capable aligner in this class. Tophat2 runs on the Bowtie2 alignment engine, which (if used directly, i.e. not with Tophat2), can perform local alignment.
To run the alignment, we execute a command that is very similar to BWA or Tophat2, but with different syntax:
cd $SCRATCH/my_rnaseq_course/day_4b bowtie2 --local -N 1 -L 16 -x mirbase/hairpin_cDNA_hsa.fa -U human_mirnaseq.fastq.gz -S human_mirnaseq.sam
You can create and submit a cmds file like you have already learned how to do - this alignment would only take around 30 minutes or less. However, if TACC is being very slow (which is likely), than you can get what the results will look like by copying the result to your area with the following commands
cd $SCRATCH/my_rnaseq_course/day_4b cp /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_4b/human_mirnaseq.sam .
miRNA Profiling with SAMTools
Recall that, because each contig in our reference is a feature, we do not need a true annotation to quantify our miRNAs. Instead, we just need to count how many reads aligned to each contig, and sort them accordingly. Fortunately, samtools has a simple utility called idxstats that reports exactly this. The following commands will produce this information by converting the SAM file to BAM, sorting and indexing it, then running idxstats:
module load samtools/0.1.19 cd $SCRATCH/my_rnaseq_course/day_4b samtools view -bS human_mirnaseq.sam > human_mirnaseq.bam samtools sort human_mirnaseq.bam human_mirnaseq.sorted samtools index human_mirnaseq.sorted.bam samtools idxstats human_mirnaseq.sorted.bam > idxstats.txt
This will create a couple new files, but the one we care about is idxstats.txt, which contains our miRNA profile. If you wanted to look at the top ten most abundant miRNAs, then simply Unix commands will do the trick:
sort -r -n -k 3 idxstats.txt | head
This should report something like this:
hsa-mir-302b 73 147605 0 hsa-mir-302a 69 17719 0 hsa-mir-21 72 14999 0 hsa-mir-302c 68 9590 0 hsa-mir-5588 63 3147 0 hsa-mir-18a 71 2884 0 hsa-mir-20a 71 2246 0 hsa-mir-16-2 81 1972 0 hsa-mir-335 94 1499 0 hsa-mir-15b 98 1466 0
The second column is contig length and the fourth is "unaligned reads," which is obviously zero. So, the third column gives read count and the first gives contig name.
Exercise #2: Photo-Activatable Ribonucleoside Cross-Linking and Immuno-Precipitation (PARCLIP-Seq)
Photo-activatable ribonucleoside cross-linking and immunoprecipitation followed by sequencing (PARCLIP-seq) is one of several emerging technologies for studying protein-RNA interactions with dramatically increased nucleotide resolution compared to traditional RIP-seq methods. Earlier, we discussed the differences between RIP, CLIP, and PARCLIP. In this exercise, we will take some public PARCLIP data and analyze it using SAMtools and PARalyzer, which is a tool developed to identify the specific mutational signature indicating real protein-RNA interaction. We are using heavily reduced/downsampled data from a paper by Kishore et al (found here) that targeted human Argonaute2 (Ago2).
Data Staging and PARalyzer Usage
PARalyzer can be downloaded from (and its documentation found) here. It was created by one of the first groups to every successfully publish a PARCLIP dataset. Since it is not in the TACC module system, you can copy it and all necessary associated files to your scratch area by executing the following commands:
cd $SCRATCH/my_rnaseq_course/day_4b cp -r /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_4b/PARalyzer_v1_5 . cd PARalyzer_v1_5 ls -la
As you will see, this directory contains many files, including the PARalyzer executable itself. It also contains the SAM file that we will analyze. Go ahead and take a look at the SAM file using less. In particular, below is a sample read that contains the characteristic mutational signal that (in principle) indicates close proximity to a protein during crosslinking:
SRR189785.13517867 0 chr1 565558 255 24M * 0 0 CATACTCCTCAACTACCCACATAG ?=7>AC>B>@CA<BB<BB@>94AA XA:i:1 MD:Z:12T11 NM:i:1
The above read demonstrates the characteristic T>C mutation in the center of the read that (taken together with several such events), implies real cross-linking. In contrast to pure read counting, identification of clusters with many similar mutations is the objective of our analysis.
The other two files in the PARalyzer directory that are worth mentioning are "hg19.2bit", which is a special binary form of the hg19 human genome build, and "sample.ini," which contains all specifications for PARalyzer. In fact, to run PARalyzer, we provide it only with a memory allotment and the name of the sample.ini file. Go ahead and see what is in sample.ini using less or cat:
less sample.ini #BANDWIDTH=3 #CONVERSION=T>C #MINIMUM_READ_COUNT_PER_GROUP=10 #MINIMUM_READ_COUNT_PER_CLUSTER=1 #MINIMUM_READ_COUNT_FOR_KDE=1 #MINIMUM_CLUSTER_SIZE=1 #MINIMUM_CONVERSION_LOCATIONS_FOR_CLUSTER=2 #MINIMUM_CONVERSION_COUNT_FOR_CLUSTER=1 #MINIMUM_READ_COUNT_FOR_CLUSTER_INCLUSION=1 #MINIMUM_READ_LENGTH=1 #MAXIMUM_NUMBER_OF_NON_CONVERSION_MISMATCHES=5 #SAM_FILE=./Kishore_Ago2_parclip.sam #GENOME_2BIT_FILE=./hg19.2bit #OUTPUT_CLUSTERS_FILE=./Kishore_Ago2_clusters.csv #EXTEND_BY_READ
All of these are parameters used by PARalyzer, and we don't have time to get into each one. If you head to the PARalyzer website, they have a README file that contains full descriptions. The option we care about, for our purposes, is "SAM_FILE" which includes the path to our input SAM file, and "OUTPUT_CLUSTERS_FILE" which specifies the comma-separated output file.
Running PARalyzer
To bring up the PARalyzer help page (which is very minimal), execute these commands:
cd $SCRATCH/PARalyzer_v1_5 PARalyzer 4g
Since the bulk of the tool is written in Java, the '4g' specifies how much memory we allocate to the Java virtual machine. To actually perform the analysis, we would run the command (either at the command line or through a batch submission file):
cd $SCRATCH/PARalyzer_v1_5 PARalyzer 4g sample.ini
Running this command would produce the output file Kishore_Ago2_clusters.csv, but would take quite a bit time since the tool does not scale particularly well to increasing read depths. Consequently, I have already prepared the output file, and named it Kishore_Ago2_clusters_done.csv. Feel free to try to generate the output at some point, if you like, but in the interests of time, we will proceed to parsing the PARalyzer output to identify interesting interactions.
Results Parsing
Use head to look at the Kishore_Ago2_clusters_done.csv file using the following commands:
cd $SCRATCH/PARalyzer_v1_5 head Kishore_Ago2_clusters_done.csv #Chromosome,Strand,ClusterStart,ClusterEnd,ClusterID,ClusterSequence,ReadCount,ModeLocation,ModeScore,ConversionLocationCount,ConversionEventCount,NonConversionEventCount #chr1,+,565558,565581,G1.1,CATACTCCTCAATTACCCACATAG,19,565571,0.6521311984865596,2,4,110 #chr1,+,565635,565662,G2.1,ACTATTTATATTATCCTAACTACTACCG,36,565662,0.6357747408016309,2,2,352 #chr1,+,565663,565686,G2.2,CATTCCTACTACTCAACTTAAACT,85,565664,0.6581599027923404,7,10,630 #chr1,+,566884,566910,G3.1,ACCTAACCATCTTCTCCTTACACCTAG,22,566910,0.6731219485026998,3,4,172 #chr1,+,567716,567744,G4.1,TAAATCTAACTTTCTTCCCACAACACTTT,88,567726,0.6173693006052031,6,8,941 #chr1,+,16266654,16266679,G8.1,GTTTTTTTTAATCTGTGCCAAAAATG,41,16266668,0.8083176199348235,4,39,425 #chr1,+,32637046,32637078,G12.1,TTATTTAATTTTTTTTTCTTTTGCACATTTTTC,14,32637065,0.728509998913673,4,11,279 #chr1,+,36645990,36646010,G15.1,TTTTTAATCTGCACCTTATAG,18,36646000,0.8427833343143215,2,19,159 #chr1,+,41220043,41220064,G18.1,TGTAAACATCCTTGACTGGAAG,24,41220050,0.7405424084870008,3,12,131
The first four fields provide the genomic coordinates of each read cluster, while the other fields each provide various pieces of information about the cluster. The most important field is likely "ModeScore," which represents the "strength" of the signal from that cluster. Also, the "sequence" is fairly important as well, since PARalyzer has a clear bias towards read groups with poly-T chains (which is unsurprising, given the signal we are looking for). These can be accommodated with pre-processing or trimming for poly-A or poly-T sequences.
Now, we want to view the highest confidence clusters, and see which transcripts, if any, they interact with. To do so, we can pipe together a set of commands that will give us what we want. For example:
cd $SCRATCH/PARalyzer_v1_5 sort -r -t , -k 9 -n Kishore_Ago2_clusters_done.csv | head #chr17_gl000205_random,-,86334,86355,G751.1,TGCGCAGAACCTTCCCATCCTT,34,86355,0.9996957487817257,2,6,157 #chr14,+,99289965,99289986,G537.1,AAAAGCCGGGTTGAGAGGGTGA,15,99289986,0.9960425240641467,2,16,27 #chr17,+,76220266,76220302,G703.1,AATAGCACAAACTACAATTAAAACTAAGCACAAAGCC,33,76220266,0.9954430928200928,4,32,96 #chr3,-,49397535,49397566,G1141.1,CCTTTTTCATTTATCTATAATTTACCTAAGAT,10,49397535,0.9877804735048394,2,9,149 #chrM,+,4869,4900,G1739.1,ACAAAAACTAGCCCCCATCTCAATCATATACC,12,4869,0.9781133037398211,3,3,51 #chr19,+,33668019,33668058,G812.1,CGAATCCCAGCGGTGCCTCAACCGAGCGTCCAAGCTCTTT,11,33668030,0.9776451604002823,2,3,51 #chr7,+,10493027,10493065,G1469.1,CACAGGGGAAGGAAATAACATTGCACTTTATAAACACTG,52,10493027,0.9735726980290786,5,48,356 #chr20,+,3898183,3898212,G980.1,GGTCAAGCAGCATTGTACAGGGCTATGAAA,504,3898183,0.9717404318250604,6,290,2229 #chr4,-,109818065,109818085,G1236.1,TGTGGAATAGTTTAAAACTGT,32,109818085,0.9668473986530141,2,39,196 #chr12,+,31549462,31549500,G380.1,CACAGGGGAAGGAAATAACATTGCACTTTATAAACACTG,71,31549462,0.965539145175358,6,68,478
Above, we see the top ten clusters as identified by PARalyzer. This is not the only useful ranking metric, since one might want to rank clusters by read count, or length, or whatever - but it gives some useful starting information. Let's look at the mutational signature of the reads that are within a top cluster in more detail using SAMtools with the following commands:
module load samtools/0.1.19 cd $SCRATCH/PARalyzer_v1_5 samtools view Kishore_Ago2_parclip.bam chr17:76220260-76220302 | cut -f 13 | less
Clearly, there are MANY cases of T>C mutations that (in reality - you can't see this from the above commands alone) are all at the same nucleotide. If one were to view the reads in IGV or another browser, one would expect to see something like this:
Figure from Hafner M, Lianoglou S, Tuschl T, Betel D. Genome-wide identification of miRNA targets by PAR-CLIP. Methods. 2012;58(2):94-105.
Data from Lipchina I, Elkabetz Y, Hafner M, et al. Genome-wide identification of microRNA targets in human ES cells reveals a role for miR-302 in modulating BMP response. Genes Dev. 2011;25(20):2173-86.
In the figure, red represents reference bases and yellow represents T>C mutation events. miRNA binding sites are shown because this PAR-CLIP experiment targeted Ago2, so the regions retrieved were primarily miRNA binding sites.
Welcome to the University Wiki Service! Please use your IID (yourEID@eid.utexas.edu) when prompted for your email address during login or click here to enter your EID. If you are experiencing any issues loading content on pages, please try these steps to clear your browser cache.