...
We have installed SVdetect for you under $BI/bin
so it .
It should be in your $PATH.
Installation is a bit difficult, so you can please skip this part.!
Install SVDetect scripts
Navigate to the SVDetect project page
...
Here's an E. coli genome re-sequencing sample where a key mutation producing a new structural variant was responsible for a new phenotype.
Code Block |
---|
cds
cp -r $BI/gva_course/structural_variation/data structural_variation
cd structural_variation |
...
File Name | Description | Sample |
---|---|---|
| Paired-end Illumina, First of mate-pair, FASTQ format | Re-sequenced E. coli genome |
| Paired-end Illumina, Second of mate-pair, FASTQ format | Re-sequenced E. coli genome |
| Reference Genome in FASTA format | E. coli B strain REL606 |
Map data using BWA
You should submit the bwa aln
and bwa sampe
commands as jobs to the queue, one after the other.
...
First we need to (surprise!) map the data. We'll use bowtie2 for this tutorial, but BWA would also work with the proper options.
Warning | |||||
---|---|---|---|---|---|
| |||||
Many of the commands past this point are computationally intensive. You should run them through an
|
Code Block |
---|
bowtie2-build NC_012967.1.fasta 61FTVAAXX_2_1.fastq bwa aln -f 61FTVAAXX_2_2.sai NC_012967.1.fasta 61FTVAAXX_2_2.fastq bwa sampe -n 0 -N 100 -f 61FTVAAXX.sam bowtie2 -p 12 -X 5000 --rf -x NC_012967.1 -1.fasta 61FTVAAXX_2_1.sai 61FTVAAXX_2_2.saifastq -2 61FTVAAXX_2_12.fastq -S 61FTVAAXX_2_2.fastq .sam |
Possibly unfamiliar options:
-
...
-rf
tells bowtie2 that your read pairs are in the "reverse-forward" orientation of a mate-pair library-X 5000
tells bowtie2 to not mark read pairs as discordant unless their insert size is greater than 5000 bases.
If you use bowtie (version 1) to do your mapping, you won't predict any read SVs. Why?
Expand | ||||
---|---|---|---|---|
| ||||
bowtie doesn't map discordant output discordantly mapped pairs! |
Run SVDetect
The first step is to look at all mapped read pairs and whittle down the list only to those that have an unusual insert sizes (distances between the two reads in a pair). You should submit this command to the TACC queue.
...
Code Block | ||
---|---|---|
| ||
<general> input_format=sam sv_type=all mates_orientation=RF read1_length=35 read2_length=35 mates_file=/full/path/to/61FTVAAXX.ab.sam cmap_file=/full/path/to/NC_012967.1.lengths num_threads=1 </general> <detection> split_mate_file=0 window_size=2000 step_length=1000 </detection> <filtering> split_link_file=0 nb_pairs_threshold=3 strand_filtering=1 </filtering> <bed> <colorcode> 255,0,0=1,4 0,255,0=5,10 0,0,255=11,100000 </colorcode> </bed> |
Note |
---|
You'll need to substitute your own paths for /full/path/to/61FTVAAXX.ab.sam and /full/path/to/NC_012967.1.lengths . |
You also need to create a tab-delimited file of chromosome lengths. Use the tab key rather than writing out <tab>!!
Code Block | ||
---|---|---|
| ||
1<tab>NC_012967<tab>4629812 |
You'll want to submit the first two commands to the TACC queue or do them in an idev
shell. They take a while.
Consult the manual for a full description of what these commands and options are doing.
Code Block | ||
---|---|---|
| ||
SVDetect linking -conf ./svdetect.conf SVDetect filtering -conf ./svdetect.conf SVDetect links2SV -conf ./svdetect.conf |
Take a look at the resulting file: 61FTVAAXX.ab.sam.links.filtered.sv.txt
.
...
Code Block |
---|
chr_type SV_type BAL_type chromosome1 start1-end1 average_dist chromosome2 start2-end2 nb_pairs score_strand_filtering score_order_filtering score_insert_size_filtering final_score breakpoint1_start1-end1 breakpoint2_start2-end2 ... INTRA NORMAL_SENSE - chrNC_012967 599566-601002601025 - chrNC_012967 662994663036-664998664898 430 1577 100% - - 1 - - ... INTRA NORMAL_SENSE - chrNC_012967 3-2025 - chrNC_012967 4627019-4628998 989288 100% - - 1 - - ... INTRA REVERSE_SENSE - chrNC_012967 16999-1890319033 - chrNC_012967 27759792775082-27778382777014 873274 100% - - 1 - - |
Any idea what sorts of mutations produced these three structural variants?
Expand | ||||
---|---|---|---|---|
| ||||
1. This is a tandem head-to-tail duplication of the region from approximately 600000 to 663000. ... Many of the others are due to new insertions of transposable elements. |
Very, Very Advanced Exercise
...