...
Why splice aware/split alignment is important?
Get your data...
Six raw data files were provided as the starting point:
- c1_r1, c1_r2, c1_r3 from the first biological condition
- c2_r1, c2_r2, and c2_r3 from the second biological condition
- Due to the size of the data and length of run time, most of the programs have already been run for this exercise. The commands run are in the directory run_commands. We will spend some time looking through these commands to understand them. You will then be parsing the output, finding answers, and visualizing results (in the directory results).
Code Block | ||
---|---|---|
| ||
cds
cd my_rnaseq_course
cp -r /corral-repl/utexas/BioITeam/rnaseq_course/tophat_exercise . &
cd tophat_exercise |
Due to the size of the data and length of run time, most of the programs have already been run for this exercise. The commands run are in the directory run_commands. We will spend some time looking through these commands to understand them. You will then be parsing the output, finding answers, and visualizing results (in the directory results).
Run tophat
On lonestar, to run tophat, following modules need to be loaded.
Code Block |
---|
module load boost/1.45.0
module load bowtie
module load tophat
|
Code Block | ||
---|---|---|
| ||
tophat [options] <bowtie_index_prefix> <reads1> <reads2>
|
Look at run_commands/tophat.commands to see how it was run.
...
cat run_commands/tophat.commands
tophat -p 2 -G reference/genes.gtf -o C1_R1_thout reference/genome.fa data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq
Code Block | ||
---|---|---|
| ||
cd results/C1_R1_thout
ls -l
-rw-rw---- 1 daras G-801020 331M May 16 23:35 accepted_hits.bam
-rw------- 1 daras G-801020 563 May 16 23:35 align_summary.txt
-rw------- 1 daras G-801020 52 May 16 23:35 deletions.bed
-rw------- 1 daras G-801020 54 May 16 23:35 insertions.bed
-rw------- 1 daras G-801020 2.9M May 16 23:35 junctions.bed
drwx------ 2 daras G-801020 32K May 16 23:35 logs
-rw------- 1 daras G-801020 184 May 16 23:35 prep_reads.info
-rw------- 1 daras G-801020 442 May 16 23:35 unmapped.bam
|
Exercise 1a: Providing a transcript annotation file
Which tophat option is used to provide a transcript annotation file (GTF file) to use?
...
Remember that you can type the command and hit enter to get help info about it.
Or Google tophat to find its online manual
...
The -G <GTF filename> to use the annotated splice junctions in the supplied GTF file
Exercise 1b: Using only annotated junctions
How would I tell tophat to only use a specified set of transcript annotation and not assemble any novel transcripts?
...
Specify -G <gtf filename> to have tophat use the annotated transcripts in the supplied GTF file
Also add the --no-novel-juncs option to suppress de novo junction detection
As you can see there are many many other options for running tophat!
Exercise 2a: Examine a BAM file
Examine a few lines of the C1_R1 alignment file.
...
samtools view
...
Code Block |
---|
samtools view -x accepted_hits.bam | less # :q to quit
#make sure you are in results/C1_R1_thout to run this |
Exercise 3b: Spliced sequences
Find a spliced alignment.
How is a spliced sequence represented in the BAM file?
...
The 6th BAM file field is the CIGAR string which tells you how your query sequence mapped to the reference.
...
samtools view accepted_hits.bam | cut -f 1,6 | grep 'N'|head
The CIGAR string "66M76N9M" represents a spliced sequence. The codes mean:
- 56M - the first 58 bases match the reference
- 76N - there are then 76 bases on the reference with no corresponding bases in the sequence (an intron)
- 17M - the last 17 bases match the reference
Exercise 4: Count spliced sequences
How many spliced sequences are there in the C1_R1 alignment file?
...
samtools view and cut and grep
...
Code Block |
---|
samtools view accepted_hits.bam | cut -f 6 | grep 'N' | wc -l
|
Let's see how this compares to BWA results...
Code Block | ||
---|---|---|
| ||
cds
cd $SCRATCH/my_rnaseq_course/bwa_exercise/results/bwa_mem |
Exercise 4b: Count spliced sequences in BWA results
How many spliced sequences are there in the C1_R1 alignment file?
...
samtools view and cut and grep
...
Code Block |
---|
samtools view C1_R1.mem.bam| cut -f 6 | grep 'N' | wc -l
|
Exercise 5: How does a read with tophat spliced alignment look in the BWA results?
...
Now on to our tophat exercises.