Objectives
In this lab, you will explore a popular transcriptome-aware mapper called Tophat. You will be given simulated RNA-seq data: paired-end reads generated in silico to replicate real gene count data from Drosophila. The data simulate two biological groups with three biological replicates per group (six samples in total). The objectives of this lab are to:
- Learn how Tophat2 works and how to use it.
- Learn how it is different from using a mapper like BWA.
Twelve raw data files (a pair of read files for each of the six samples) have been provided for all further RNA-seq analysis:
- c1_r1, c1_r2, and c1_r3 from the first biological condition
- c2_r1, c2_r2, and c2_r3 from the second biological condition
Introduction
Tophat is part of the Tuxedo suite of RNA-Seq tools. It performs a transcriptome-aware alignment of the input sequences to a reference genome using either the Bowtie or Bowtie2 aligner (in theory it can use other aligners, but we do not recommend this).
How Tophat Works
Image from: http://genomebiology.com/2013/14/4/R36
- The input sequences are aligned to the transcriptome for your reference genome, if you provided a GTF/GFF file.
  - Sequences that align to the transcriptome are retained, and their coordinates are translated to genomic coordinates.
  - Sequences that do not align to the transcriptome are subjected to the further analysis below.
- Remaining sequences are broken into sub-fragments of at least 25 bases, and these sub-fragments are aligned to the reference genome.
  - If two adjacent sub-fragments align to non-adjacent genomic locations, they are "trans frags" that will be used to infer splice junctions.
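To make the sub-fragment step more concrete, here is a toy illustration of cutting one read into consecutive 25-base pieces. It is only a sketch: the read sequence is made up, the helper name `split_read` is hypothetical, and this is not how Tophat itself is implemented.

```shell
# Toy illustration only: cut one read into consecutive 25-base segments.
# The 75-base read below is made up; this is not Tophat's own code.
read_seq="ACGTACGTACGTACGTACGTACGTAGGCTAGCTAGGATCCGATCGATCGATTTGACCAGTACCGGTTAACCGGTA"
segment_length=25

split_read() {
    # fold wraps its input at a fixed width, giving one segment per line
    printf '%s\n' "$1" | fold -w "$2"
}

segments=$(split_read "$read_seq" "$segment_length")

# Print each segment with its index and length
printf '%s\n' "$segments" |
    awk '{ printf "segment %d (%d bases): %s\n", NR, length($0), $0 }'
```

If the first and second segments of a read map 25 bases apart on the genome, the read is contiguous there; if they map far apart, the gap between them is a candidate intron.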
At the end of the Tophat process, you have a BAM file describing the alignment of the input data to genomic coordinates. This file can be used as input for downstream applications such as Cufflinks, Cuffmerge, and Cuffdiff, which will be described in later sections. You will also have files describing the junctions found.
More documentation on tophat2 can be found here: http://tophat.cbcb.umd.edu/manual.shtml
Why is splice-aware (split) alignment important?
Get your data...
Raw data files for six samples were provided as the starting point:
- c1_r1, c1_r2, and c1_r3 from the first biological condition
- c2_r1, c2_r2, and c2_r3 from the second biological condition
cds
cd my_rnaseq_course
cp -r /corral-repl/utexas/BioITeam/rnaseq_course/tophat_exercise .
cd tophat_exercise
Due to the size of the data and length of run time, most of the programs have already been run for this exercise. The commands run are in the directory run_commands. We will spend some time looking through these commands to understand them. You will then be parsing the output, finding answers, and visualizing results (in the directory results).
Run tophat
On Lonestar, the following modules need to be loaded before running tophat:
module load boost/1.45.0
module load bowtie
module load tophat
tophat [options] <bowtie_index_prefix> <reads1> <reads2>
Look at run_commands/tophat.commands to see how it was run.
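As a rough sketch of how the general syntax might be filled in for the six samples, the loop below prints (but does not run) one command per sample. The index prefix, FASTQ file names, output directory names, and thread count are all hypothetical placeholders; the commands actually used for this exercise are the ones in run_commands/tophat.commands. The `-o` (output directory) and `-p` (threads) options are standard tophat options.

```shell
# Dry run: print (do not execute) one tophat command per sample.
# ASSUMPTIONS: the index prefix, thread count, and FASTQ file names below
# are hypothetical; see run_commands/tophat.commands for the real commands.
index_prefix="reference/genome"   # hypothetical Bowtie index prefix
threads=4                         # hypothetical thread count

build_tophat_command() {
    name="$1"
    # -o sets the output directory, -p the number of threads
    printf 'tophat -o %s_thout -p %s %s data/%s_1.fastq data/%s_2.fastq\n' \
        "$name" "$threads" "$index_prefix" "$name" "$name"
}

commands=$(for sample in c1_r1 c1_r2 c1_r3 c2_r1 c2_r2 c2_r3; do
    build_tophat_command "$sample"
done)

printf '%s\n' "$commands"
```

Printing the commands first makes it easy to check file names before submitting long-running jobs.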
cd results/C1_R1_thout
ls -l
-rw-rw---- 1 daras G-801020 331M May 16 23:35 accepted_hits.bam
-rw------- 1 daras G-801020  563 May 16 23:35 align_summary.txt
-rw------- 1 daras G-801020   52 May 16 23:35 deletions.bed
-rw------- 1 daras G-801020   54 May 16 23:35 insertions.bed
-rw------- 1 daras G-801020 2.9M May 16 23:35 junctions.bed
drwx------ 2 daras G-801020  32K May 16 23:35 logs
-rw------- 1 daras G-801020  184 May 16 23:35 prep_reads.info
-rw------- 1 daras G-801020  442 May 16 23:35 unmapped.bam
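The text files in this directory can be inspected directly. As a small sketch, the hypothetical helper `count_junctions` below counts the junction records in a junctions.bed file, assuming the file may begin with a "track" header line that should not be counted. It is demonstrated on a tiny made-up BED file rather than on real output.

```shell
# Count junction records in a junctions.bed-style file.
# ASSUMPTION: the file may start with a "track" header line, which is skipped.
count_junctions() {
    grep -vc '^track' "$1"
}

# Demonstration on a tiny made-up BED file (not real Tophat output)
demo_bed=$(mktemp)
printf 'track name=junctions description="TopHat junctions"\n' > "$demo_bed"
printf '2L\t100\t500\tJUNC00000001\t12\t+\t100\t500\t255,0,0\t2\t50,60\t0,340\n' >> "$demo_bed"
printf '2L\t900\t1500\tJUNC00000002\t3\t-\t900\t1500\t255,0,0\t2\t40,45\t0,555\n' >> "$demo_bed"

junction_count=$(count_junctions "$demo_bed")
echo "junction records: $junction_count"
rm -f "$demo_bed"
```

From inside results/C1_R1_thout, the same helper could be run as `count_junctions junctions.bed`, and `cat align_summary.txt` shows the mapping summary.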
Exercise 1a: Providing a transcript annotation file
Which tophat option is used to provide a transcript annotation file (GTF file) to use?
Exercise 1b: Using only annotated junctions
How would you tell tophat to use only the junctions in a supplied transcript annotation and not look for novel junctions?
As you can see, there are many other options for running tophat!
Exercise 2a: Examine a BAM file
Examine a few lines of the C1_R1 alignment file.
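One possible way to look at the first few records is sketched below. The helper name `peek_bam` is hypothetical, and the sketch assumes samtools is available on your path (on a cluster this may require loading a module first).

```shell
# Print the first few alignment records of a BAM file.
# ASSUMPTION: samtools is installed; peek_bam is a hypothetical helper name.
peek_bam() {
    bam="$1"
    n="${2:-5}"    # number of records to show, default 5
    if ! command -v samtools >/dev/null 2>&1; then
        echo "samtools not found; load or install it first" >&2
        return 1
    fi
    if [ ! -f "$bam" ]; then
        echo "no such file: $bam" >&2
        return 1
    fi
    # samtools view converts BAM to SAM text; head keeps the first n records
    samtools view "$bam" | head -n "$n"
}
```

From inside results/C1_R1_thout, this could be called as `peek_bam accepted_hits.bam 3`.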
Exercise 2b: Spliced sequences
Find a spliced alignment.
How is a spliced sequence represented in the BAM file?
Exercise 4: Count spliced sequences
How many spliced sequences are there in the C1_R1 alignment file?
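One possible approach is sketched below. It relies on the SAM convention that a skipped region of the reference (such as an intron) is written as an N operation in the CIGAR string, which is column 6 of each record. The helper name `count_spliced` is hypothetical, and the three demonstration records are made up.

```shell
# Count SAM records whose CIGAR string (column 6) contains an N operation.
# Reads SAM text on standard input; header lines (starting with @) are skipped.
count_spliced() {
    awk -F '\t' '!/^@/ && $6 ~ /N/ { n++ } END { print n + 0 }'
}

# Demonstration on three made-up records (not real data):
# read1 is unspliced (75M); read2 and read3 each span one gap (N).
spliced_count=$(
    {
        printf 'read1\t99\t2L\t1000\t50\t75M\t=\t1200\t275\t*\t*\n'
        printf 'read2\t163\t2L\t5000\t50\t30M1200N45M\t=\t4800\t-275\t*\t*\n'
        printf 'read3\t83\t3R\t700\t50\t10M85N65M\t=\t500\t-360\t*\t*\n'
    } | count_spliced
)
echo "spliced records: $spliced_count"
```

With samtools available, the same helper could be used on the real file as `samtools view accepted_hits.bam | count_spliced`. Note that this counts alignment records, not unique reads: each mate of a pair, and each location of a multi-mapped read, is a separate record.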
Let's see how this compares to BWA results...
cds
cd $SCRATCH/my_rnaseq_course/bwa_exercise/results/bwa_mem
Exercise 4b: Count spliced sequences in BWA results
How many spliced sequences are there in the BWA alignment file for C1_R1?
Exercise 5: How does a read with tophat spliced alignment look in the BWA results?