Mapping with Tophat

Objectives

In this lab, you will explore a popular transcriptome-aware mapper called Tophat. Simulated RNA-seq data will be provided to you; the data contains paired-end reads that have been generated in silico to replicate real gene count data from Drosophila. The data simulates two biological groups with three biological replicates per group (6 samples total). The objectives of this lab is to:

Learn how Tophat2 works and how to use it.
Learn how it is different from using a mapper like BWA.

12 raw data files have been provided for all our further RNA-seq analysis:

c1_r1, c1_r2, c1_r3 from the first biological condition
c2_r1, c2_r2, and c2_r3 from the second biological condition

Introduction

Tophat is part of the tuxedo suite of RNA-Seq tools. Tophat does a transcriptome-aware alignment of the input sequences to a reference genome using either the Bowtie or Bowtie2 aligner (in theory it can use other aligners, but we do not recommend this).

How Tophat Works

Image from: http://genomebiology.com/2013/14/4/R36

The input sequences are aligned to the transcriptome for your reference genome, if you provided a GTF/GFF file.
- sequences that align to the transcriptome are retained, and their coordinates are translated to genomic coordinates
- sequences that do not align to the transcriptome are subjected to further analysis below
Remaining sequences are broken into sub-fragments of at least 25 bases, and these sub-fragments are aligned to the reference genome.
- if two adjacent sub-fragments align to non-adjacent genomic locations, they are "trans frags" that will be used to infer splice junctions

At the end of the Tophat process, you have a BAM file describing the alignment of the input data to genomic coordinates. This file can be used as input for downstream applications like Cuffmerge-Cufflinks-Cuffdiff, which will be described in further sections. You will also have files describing the junctions found.

More documentation on tophat2 can be found here: http://tophat.cbcb.umd.edu/manual.shtml

Why splice aware/split alignment is important?

Split Read Alignment

Get your data...

Six raw data files were provided as the starting point:

c1_r1, c1_r2, c1_r3 from the first biological condition
c2_r1, c2_r2, and c2_r3 from the second biological condition
Due to the size of the data and length of run time, most of the programs have already been run for this exercise. The commands run are in the directory run_commands. We will spend some time looking through these commands to understand them. You will then be parsing the output, finding answers, and visualizing results (in the directory results).

Get set up for the exercises

cds
cd my_rnaseq_course
cp -r /corral-repl/utexas/BioITeam/rnaseq_course/tophat_exercise . &
cd tophat_exercise

Due to the size of the data and length of run time, most of the programs have already been run for this exercise. The commands run are in the directory run_commands. We will spend some time looking through these commands to understand them. You will then be parsing the output, finding answers, and visualizing results (in the directory results).

Run tophat

On lonestar, to run tophat, following modules need to be loaded.

module load boost/1.45.0
module load bowtie
module load tophat

General syntax for tophat command

tophat [options] <bowtie_index_prefix> <reads1> <reads2>

Look at run_commands/tophat.commands to see how it was run.

How do I look into the file?

cat run_commands/tophat.commands

tophat -p 2 -G reference/genes.gtf -o C1_R1_thout reference/genome.fa data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq

Take a minute to look at the output files produced by one tophat run.

cd results/C1_R1_thout
ls -l

-rw-rw---- 1 daras G-801020 331M May 16 23:35 accepted_hits.bam
-rw------- 1 daras G-801020  563 May 16 23:35 align_summary.txt
-rw------- 1 daras G-801020   52 May 16 23:35 deletions.bed
-rw------- 1 daras G-801020   54 May 16 23:35 insertions.bed
-rw------- 1 daras G-801020 2.9M May 16 23:35 junctions.bed
drwx------ 2 daras G-801020  32K May 16 23:35 logs
-rw------- 1 daras G-801020  184 May 16 23:35 prep_reads.info
-rw------- 1 daras G-801020  442 May 16 23:35 unmapped.bam

Exercise 1a: Providing a transcript annotation file

Which tophat option is used to provide a transcript annotation file (GTF file) to use?

Hints

Remember that you can type the command and hit enter to get help info about it.
Or Google tophat to find its online manual

Solution

The -G <GTF filename> to use the annotated splice junctions in the supplied GTF file

Exercise 1b: Using only annotated junctions
How would I tell tophat to only use a specified set of transcript annotation and not assemble any novel transcripts?

Solution

Specify -G <gtf filename> to have tophat use the annotated transcripts in the supplied GTF file
Also add the --no-novel-juncs option to suppress de novo junction detection

As you can see there are many many other options for running tophat!

Exercise 2a: Examine a BAM file

Examine a few lines of the C1_R1 alignment file.

Hint

samtools view

How to

samtools view -x accepted_hits.bam | less  # :q to quit

#make sure you are in results/C1_R1_thout to run this

Exercise 3b: Spliced sequences
Find a spliced alignment.

How is a spliced sequence represented in the BAM file?

Hint

The 6th BAM file field is the CIGAR string which tells you how your query sequence mapped to the reference.

Answer

Error rendering macro 'code': Invalid value specified for parameter 'lang'

 samtools view accepted_hits.bam | cut -f 1,6 | grep 'N'|head

The CIGAR string "66M76N9M" represents a spliced sequence. The codes mean:

56M - the first 58 bases match the reference
76N - there are then 76 bases on the reference with no corresponding bases in the sequence (an intron)
17M - the last 17 bases match the reference

Exercise 4: Count spliced sequences

How many spliced sequences are there in the C1_R1 alignment file?

Hint

samtools view and cut and grep

How to

samtools view accepted_hits.bam | cut -f 6 | grep 'N' | wc -l

Let's see how this compares to BWA results...

Back to BWA results

cds
cd $SCRATCH/my_rnaseq_course/bwa_exercise/results/bwa_mem

Exercise 4b: Count spliced sequences in BWA results
How many spliced sequences are there in the C1_R1 alignment file?

Hint

samtools view and cut and grep

How to

samtools view C1_R1.mem.bam| cut -f 6 | grep 'N' | wc -l

Exercise 5: How does a read with tophat spliced alignment look in the BWA results?

How to

samtools view C1_R1.mem.bam| awk '{if ($1 == 9) print $1,$6}'