Mapping with Tophat Exercises 2014
Get your data
Six raw data files were provided as the starting point:
- c1_r1, c1_r2, c1_r3 from the first biological condition
- c2_r1, c2_r2, and c2_r3 from the second biological condition
cds cd my_rnaseq_course cp -r /corral-repl/utexas/BioITeam/rnaseq_course/tophat_exercise . & cp -r /corral-repl/utexas/BioITeam/rnaseq_course/tophat_results . & #pregenerated tophat results #Lets get pregenerated bwa results as well for doing comparisons cp -r /corral-repl/utexas/BioITeam/rnaseq_course/bwa_mem_results . & cd tophat_exercise
Due to the size of the data and length of run time, many of the programs have already been run for this exercise. We will submit some jobs, but because we cannot wait for it to complete, we will be looking through already generated results. You will then be parsing the output, finding answers, and visualizing results (in the directory tophat_results).
Run tophat
On lonestar, to run tophat, following modules need to be loaded.
module load boost/1.45.0 module load bowtie module load tophat
Create reference index - this has already been done for you.
bowtie2-build reference/genome.fa reference/genome.fa
Notice the index files in the reference/ directory.
Tophat command syntax
tophat [options] <bowtie_index_prefix> <reads1> <reads2>
Create a tophat.commands file with all your tophat commands.
nano commands.tophat tophat -p 2 -G reference/genes.gtf -o C1_R1_thout reference/genome.fa data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq tophat -p 2 -G reference/genes.gtf -o C1_R2_thout reference/genome.fa data/GSM794484_C1_R2_1.fq data/GSM794484_C1_R2_2.fq tophat -p 2 -G reference/genes.gtf -o C1_R3_thout reference/genome.fa data/GSM794485_C1_R3_1.fq data/GSM794485_C1_R3_2.fq tophat -p 2 -G reference/genes.gtf -o C2_R1_thout reference/genome.fa data/GSM794486_C2_R1_1.fq data/GSM794486_C2_R1_2.fq tophat -p 2 -G reference/genes.gtf -o C2_R2_thout reference/genome.fa data/GSM794487_C2_R2_1.fq data/GSM794487_C2_R2_2.fq tophat -p 2 -G reference/genes.gtf -o C2_R3_thout reference/genome.fa data/GSM794488_C2_R3_1.fq data/GSM794488_C2_R3_2.fq
Create a launcher file called tophat_launcher.sge to submit these tophat commands.
module load python launcher_creator.py -j commands.tophat -n tophat -q normal -t 08:00:00 -a CCBB -l tophat_launcher.sge
Submit your tophat job
qsub tophat_launcher.sge
Examine tophat parameters
Why did we choose the tophat parameters we did and what do they mean? Here's our tophat command again:
tophat -p 2 -G reference/genes.gtf -o C1_R1_thout reference/genome.fa data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq
Exercise 1a: Why did we use the -G flag?
Exercise 1b: Using only annotated junctions
How would I tell tophat to only use a specified set of transcript annotation and not assemble any novel transcripts?
Exercise 1c: Why did we use the -p flag? And why did we choose -p 2 specifically?
As you can see there are many many other options for running tophat!
Examine tophat results
We are going to be running commands that can take a couple of minutes to run. But with 20 people running the same commands, we don't want to run these on the head node. They are too small to be submitted too. So, let's switch over to the tab with our idev session.
When following along here, please switch to your idev session for running these example commands.
cd ../tophat_results
cd C1_R1_thout ls -l -rw-rw---- 1 daras G-801020 331M May 16 23:35 accepted_hits.bam -rw------- 1 daras G-801020 563 May 16 23:35 align_summary.txt -rw------- 1 daras G-801020 52 May 16 23:35 deletions.bed -rw------- 1 daras G-801020 54 May 16 23:35 insertions.bed -rw------- 1 daras G-801020 2.9M May 16 23:35 junctions.bed drwx------ 2 daras G-801020 32K May 16 23:35 logs -rw------- 1 daras G-801020 184 May 16 23:35 prep_reads.info -rw------- 1 daras G-801020 442 May 16 23:35 unmapped.bam
Exercise 1: Examine a BAM file
Examine a few lines of the C1_R1 alignment file.
head accepted_hits.bam
samtools view -x accepted_hits.bam | head #make sure you are in tophat_results/C1_R1_thout to run this
Exercise 2a: Spliced sequences - Find a spliced alignment.
How is a spliced sequence represented in the BAM file?
Which field in our file contains the cigar score?
samtools view -x accepted_hits.bam | head
Let's parse the bam file to pull out just spliced alignments.
samtools view accepted_hits.bam | cut -f 1,6 | grep 'N'|head
The CIGAR string "66M76N9M" represents a spliced sequence. The codes mean:
- 66M - the first 66 bases match the reference
- 76N - there are then 76 bases on the reference with no corresponding bases in the sequence (an intron)
- 9M - the last 17 bases match the reference
Exercise 2b: What is the read ID for one of these reads with spliced alignment?
Pick your favorite and remember it. I have picked read 9 with the cigar score 66M76N9M.
Exercise 3: Count spliced sequences
How many spliced sequences are there in the C1_R1 alignment file?
Exercise 4: Samtools flagstat
Lets run samtools flagstat on our tophat bam file:
samtools flagstat accepted_hits.bam
Let's see how this compares to BWA results...
cd $SCRATCH/my_rnaseq_course/bwa_mem_results
Exercise 5: Count spliced sequences in BWA results
How many spliced sequences are there in one of the alignment files, C1_R1.mem.bam?
Exercise 6: How does a read with tophat spliced alignment look in the BWA results?
Let's how read 9 looks in BWA results. Here's how it looks in tophat results (picking out only the first 9 columns in the sam file):
9 99 2L 7954 50 75M = 8051
9 147 2L 8051 50 66M76N9M = 7954
Help! I have a lots of reads and a large number of reads. Make tophat go faster!
Use threading option efficiently (tophat -p <number of threads>)
- Split one data file into smaller chunks and run multiple instances of tophat. Finally concatenate the output.
- WAIT! We have a pipeline for that!
- Look for fastTophat in $BI/scripts
- Split mapping by chromosome- mapping to each chromosome=1 job.
BACK TO COURSE OUTLINE
Welcome to the University Wiki Service! Please use your IID (yourEID@eid.utexas.edu) when prompted for your email address during login or click here to enter your EID. If you are experiencing any issues loading content on pages, please try these steps to clear your browser cache.