Mapping with Tophat Exercises 2014

Get your data

Six raw data files were provided as the starting point:

  • c1_r1, c1_r2, c1_r3 from the first biological condition
  • c2_r1, c2_r2, and c2_r3 from the second biological condition

Get set up for the exercises
cds
cd my_rnaseq_course
cp -r /corral-repl/utexas/BioITeam/rnaseq_course/tophat_exercise . &
cp -r /corral-repl/utexas/BioITeam/rnaseq_course/tophat_results . & #pregenerated tophat results
#Lets get pregenerated bwa results as well for doing comparisons
cp -r /corral-repl/utexas/BioITeam/rnaseq_course/bwa_mem_results . &
 
cd tophat_exercise
 

Due to the size of the data and length of run time, many of the programs have already been run for this exercise. We will submit some jobs, but because we cannot wait for it to complete, we will be looking through already generated results. You will then be parsing the output, finding answers, and visualizing results (in the directory tophat_results).

Run tophat

On lonestar, to run tophat, following modules need to be loaded.

Load modules
module load boost/1.45.0
module load bowtie
module load tophat

 

Create reference index - this has already been done for you.

             bowtie2-build reference/genome.fa reference/genome.fa

Notice the index files in the reference/ directory.

 

Tophat command syntax

 

     tophat [options] <bowtie_index_prefix> <reads1> <reads2>


Create a tophat.commands file with all your tophat commands.

Tophat commands
nano commands.tophat
tophat -p 2 -G reference/genes.gtf -o C1_R1_thout reference/genome.fa data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq
tophat -p 2 -G reference/genes.gtf -o C1_R2_thout reference/genome.fa data/GSM794484_C1_R2_1.fq data/GSM794484_C1_R2_2.fq
tophat -p 2 -G reference/genes.gtf -o C1_R3_thout reference/genome.fa data/GSM794485_C1_R3_1.fq data/GSM794485_C1_R3_2.fq
tophat -p 2 -G reference/genes.gtf -o C2_R1_thout reference/genome.fa data/GSM794486_C2_R1_1.fq data/GSM794486_C2_R1_2.fq
tophat -p 2 -G reference/genes.gtf -o C2_R2_thout reference/genome.fa data/GSM794487_C2_R2_1.fq data/GSM794487_C2_R2_2.fq
tophat -p 2 -G reference/genes.gtf -o C2_R3_thout reference/genome.fa data/GSM794488_C2_R3_1.fq data/GSM794488_C2_R3_2.fq

Create a launcher file called tophat_launcher.sge to submit these tophat commands.

Create a launcher file called tophat_launcher.sge
module load python
launcher_creator.py -j commands.tophat -n tophat -q normal -t 08:00:00 -a CCBB -l tophat_launcher.sge

Submit your tophat job

Create a launcher file called tophat_launcher.sge
qsub tophat_launcher.sge

Examine tophat parameters

Why did we choose the tophat parameters we did and what do they mean? Here's our tophat command again:

tophat -p 2 -G reference/genes.gtf -o C1_R1_thout reference/genome.fa data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq


Exercise 1a: Why did we use the -G flag?

 Hints

Remember that you can type the command and hit enter to get help info about it.

 Solution

The -G <GTF filename> to use the annotated splice junctions in the supplied GTF file


Exercise 1b: Using only annotated junctions
How would I tell tophat to only use a specified set of transcript annotation and not assemble any novel transcripts?

 Solution

Specify -G <gtf filename> to have tophat use the annotated transcripts in the supplied GTF file
Also add the --no-novel-juncs option to suppress de novo junction detection

Exercise 1c:  Why did we use the -p flag? And why did we choose -p 2 specifically?

As you can see there are many many other options for running tophat!

Examine tophat results

 

We are going to be running commands that can take a couple of minutes to run. But with 20 people running the same commands, we don't want to run these on the head node. They are too small to be submitted too. So, let's switch over to the tab with our idev session.

When following along here, please switch to your idev session for running these example commands.

Get tophat results
cd ../tophat_results
Take a minute to look at the output files produced by one tophat run.
cd C1_R1_thout
ls -l

-rw-rw---- 1 daras G-801020 331M May 16 23:35 accepted_hits.bam
-rw------- 1 daras G-801020  563 May 16 23:35 align_summary.txt
-rw------- 1 daras G-801020   52 May 16 23:35 deletions.bed
-rw------- 1 daras G-801020   54 May 16 23:35 insertions.bed
-rw------- 1 daras G-801020 2.9M May 16 23:35 junctions.bed
drwx------ 2 daras G-801020  32K May 16 23:35 logs
-rw------- 1 daras G-801020  184 May 16 23:35 prep_reads.info
-rw------- 1 daras G-801020  442 May 16 23:35 unmapped.bam

 

Exercise 1: Examine a BAM file

Examine a few lines of the C1_R1 alignment file.

Lets try this first
head accepted_hits.bam
Let's try this next
samtools view -x accepted_hits.bam | head  

#make sure you are in tophat_results/C1_R1_thout to run this

Exercise 2a: Spliced sequences - Find a spliced alignment.

How is a spliced sequence represented in the BAM file?

Which field in our file contains the cigar score?

Finding Cigar score field
samtools view -x accepted_hits.bam | head  

Let's parse the bam file to pull out just spliced alignments.

Looking for spliced alignments
 samtools view accepted_hits.bam | cut -f 1,6 | grep 'N'|head

The CIGAR string "66M76N9M" represents a spliced sequence. The codes mean:

  • 66M - the first 66 bases match the reference
  • 76N - there are then 76 bases on the reference with no corresponding bases in the sequence (an intron)
  • 9M - the last 17 bases match the reference

Exercise 2b: What is the read ID for one of these reads with spliced alignment? 

Pick your favorite and remember it. I have picked read 9 with the cigar score 66M76N9M.

Exercise 3: Count spliced sequences

How many spliced sequences are there in the C1_R1 alignment file?

 Hint

samtools view and cut and grep

 How to
samtools view accepted_hits.bam | cut -f 6 | grep 'N' | wc -l

Exercise 4: Samtools flagstat

Lets run samtools flagstat on our tophat bam file:

samtools flagstat accepted_hits.bam

Let's see how this compares to BWA results...

 

Back to the directory with BWA results
cd $SCRATCH/my_rnaseq_course/bwa_mem_results

Exercise 5: Count spliced sequences in BWA results

How many spliced sequences are there in one of the alignment files,  C1_R1.mem.bam?

 Hint

samtools view and cut and grep

 How to
samtools view C1_R1.mem.bam| cut -f 6 | grep 'N' | wc -l

Exercise 6: How does a read with tophat spliced alignment look in the BWA results?

Let's how read 9 looks in BWA results. Here's how it looks in tophat results (picking out only the first 9 columns in the sam file):

9    99    2L    7954    50    75M    =    8051

9    147    2L    8051    50    66M76N9M    =    7954

 

 How to
samtools view C1_R1.mem.bam| awk '{if ($1 == 9) print}'

Help! I have a lots of reads and a large number of reads. Make tophat go faster!

  • Use threading option efficiently (tophat -p <number of threads>)

  • Split one data file into smaller chunks and run multiple instances of tophat. Finally concatenate the output.
    • WAIT! We have a pipeline for that!
    • Look for fastTophat in $BI/scripts  
  • Split mapping by chromosome- mapping to each chromosome=1 job.

BACK TO COURSE OUTLINE