Structural Variant (SV) calling with SVdetect 2020
Overview
Most approaches for predicting structural variants require you to have paired-end or mate-pair reads. They use the distribution of distances separating these reads to find outliers and also look at pairs with incorrect orientations. As mentioned during several of the presentations, many researchers choose to ignore these types of mutations and combined with the increased difficulty of accurately identifying them, the community is less settled on the "best" way to analyze them. Here we present a tutorial on SVDetect based on the quality of its instructions, and easy of installation despite its use of relatively hefty configuration files.
Other possible tools:
- BreakDancer - hard to install prerequisites on TACC. Requires installing libgd and the notoriously difficult GD Perl module.
- PEMer - hard to install prerequisites on TACC. Requires "ROOT" package.
Good discussion of some of the issues of predicting structural variation:
Learning objectives:
Identify structural variants in a new data set.
- Work with a new type of program that uses configuration files rather than entering all information on a single command at the command line. This is very similar to the queue system TACC uses making this a good introduction.
Calling SV with SVdetect:
Here we'll look an E. coli genome re-sequencing sample where a key mutation producing a new structural variant was responsible for a new phenotype involving citrate, something the Barrick lab has studied.
Prepare your directories
cds cp -r $BI/gva_course/structural_variation/data GVA_sv_tutorial cd GVA_sv_tutorial
Possible Input/Output errors experienced with the above command
At least 1 student was experiencing an issue with the above command where an "Input/Output" error message was generated, and the files were copied, but the files were
This is Illumina mate-paired data (having a larger insert size than paired-end data) from genome re-sequencing of an E. coli clone.
File Name | Description | Sample |
---|---|---|
| Paired-end Illumina, First of mate-pair, FASTQ format | Re-sequenced E. coli genome |
| Paired-end Illumina, Second of mate-pair, FASTQ format | Re-sequenced E. coli genome |
| Reference Genome in FASTA format | E. coli B strain REL606 |
NC_012967.1.lengths | Simple tab delimtered file based on the size of the reference needed for SVDetect so you don't have to create it yourself |
Map data using bowtie2
First we need to (surprise!) map the data. This will hopefully reinforce the bowtie2 tutorial you just completed.
Do not run on head node
Use showq -u to verify you are still on the idev node.
If not, and you need help getting a new idev node, see this tutorial.
bowtie2-build NC_012967.1.fasta NC_012967.1 bowtie2 -p 48 -X 5000 --rf -x NC_012967.1 -1 61FTVAAXX_2_1.fastq -2 61FTVAAXX_2_2.fastq -S 61FTVAAXX.sam
Possibly unfamiliar options:
--rf
tells bowtie2 that your read pairs are in the "reverse-forward" orientation of a mate-pair library-X 5000
tells bowtie2 to not mark read pairs as discordant unless their insert size is greater than 5000 bases.
You may notice that these commands complete pretty quickly. Always remember speed is not necessarily representative of how taxing something is for TACC's head node, and always try to be a good TACC citizen and do as much as you can on idev nodes or as job submissions
People have previously asked what makes bowtie2 better/different than 'bowtie'
In short, No. A detailed explanation can be found at this link. This is mentioned here somewhat out of posterity ... years ago when multiple different mappers were used (including bowtie and bowtie2) it was pointed out that SV is very difficult to detect when reads are mapped with bowtie as it does not identify discordantly mapped read pairs.
Analyze read mapping distribution
The first step is to look at all mapped read pairs and whittle down the list only to those that have an unusual insert sizes (distances between the two reads in a pair).
BAM_preprocessingPairs.pl -p 0 61FTVAAXX.sam
As we discussed in our earlier presentation, SV are often detected by looking for variations in library insert sizes. The stdout of the pearl script will answer the questions:
Running SVDetect
SVDetect demonstrates a common strategy in some programs with complex input where instead of including a lot of options on the command line, it reads in a simple text file that sets all of the required options. Lets look at how to create a configuration file:
This is one of the most common mistakes people make during the course and the nature of zoom make this very likely to be true this year as well
Notice the next block contains line numbers. On lines 7 and 8 you see ##### and <USERNAME> ... these need to correspond to your scratch directory locations. You can easily check this with the pwd command.
Do not change anything else on the line.
If you are unsure what you should be replacing those place holders with please get my attention on zoom and I'll help you through it.
<general> input_format=sam sv_type=all mates_orientation=RF read1_length=35 read2_length=35 mates_file=/scratch/#####/<USERNAME>/GVA_sv_tutorial/61FTVAAXX.ab.sam cmap_file=/scratch/#####/<USERNAME>/GVA_sv_tutorial/NC_012967.1.lengths num_threads=48 </general> <detection> split_mate_file=0 window_size=2000 step_length=1000 </detection> <filtering> split_link_file=0 nb_pairs_threshold=3 strand_filtering=1 </filtering> <bed> <colorcode> 255,0,0=1,4 0,255,0=5,10 0,0,255=11,100000 </colorcode> </bed>
The following commands will take a while and must be completed in order, so no advantages/ability to have them run in the background. Consult the manual for a full description of what these commands and options are doing while the commands are running.
SVDetect linking -conf svdetect.conf SVDetect filtering -conf svdetect.conf SVDetect links2SV -conf svdetect.conf
Possible errors on idev nodes
In reviewing these tutorials these commands were not executing for me in idev sessions for unknown reasons. By chance I had an idev session time out with me noticing and I noticed it did run on the head node. Try the above commands 1 at a time, but if you see error messages like the following logout of your idev session with the logout command, and then execute them 1 at a time on the head node. While this is not the best citizenship, the program gave no indications of being a problem.
Feedback from other students says this is not a problem limited to me and that multiple people are experiencing the same problem. Running these commands on the head node should be acceptable, for reasons that will be discussed in zoom.
Config/General.pm did not return a true value at /corral-repl/utexas/BioITeam/bin/SVDetect line 48. BEGIN failed--compilation aborted at /corral-repl/utexas/BioITeam/bin/SVDetect line 48.
Take a look at the resulting file: 61FTVAAXX.ab.sam.links.filtered.sv.txt
. Another downside of command line applications is that while you can print files to the screen, the formatting is not always the nicest. On the plus side in 95% of cases, you can directly copy the output from the terminal window to excel and make better sense of what the columns actually are
I've highlighted a few lines below:
chr_type SV_type BAL_type chromosome1 start1-end1 average_dist chromosome2 start2-end2 nb_pairs score_strand_filtering score_order_filtering score_insert_size_filtering final_score breakpoint1_start1-end1 breakpoint2_start2-end2 ... INTRA NORMAL_SENSE - chrNC_012967 599566-601025 - chrNC_012967 663036-664898 430 100% - - 1 - - ... INTRA NORMAL_SENSE - chrNC_012967 3-2025 - chrNC_012967 4627019-4628998 288 100% - - 1 - - ... INTRA REVERSE_SENSE - chrNC_012967 16999-19033 - chrNC_012967 2775082-2777014 274 100% - - 1 - -
Welcome to the University Wiki Service! Please use your IID (yourEID@eid.utexas.edu) when prompted for your email address during login or click here to enter your EID. If you are experiencing any issues loading content on pages, please try these steps to clear your browser cache.