...
Learning objectives:
Identify structural variants in a new data set.
- Work with a new type of program that uses configuration files rather than entering all information on a single command at the command line. This is very similar to the queue system TACC uses making this a good introduction.
...
Code Block |
---|
cd $SCRATCH/GVA_sv_tutorial
conda activate GVA-SV
BAM_preprocessingPairs.pl -p 0 61FTVAAXX.sam &> preprocessing_results.txt |
As we discussed in our earlier presentation, SV are often detected by looking for variations in library insert sizes. The stdout of . Look at the preprocessing_results.txt file created by the pearl script will answer the questions:
...
Note | ||
---|---|---|
| ||
Notice the next block contains line numbers. On lines 7 and 8 you see ##### and <USERNAME> ... these need to correspond to your scratch directory locations. You can easily check this with the pwd command. Do not change anything else on the line. If you are unsure what you should be replacing those place holders with please get my attention on zoom and I'll help you through it. |
Code Block | ||||
---|---|---|---|---|
| ||||
<general> input_format=sam sv_type=all mates_orientation=RF read1_length=35 read2_length=35 mates_file=/scratch/#####/<USERNAME>/GVA_sv_tutorial/61FTVAAXX.ab.sam cmap_file=/scratch/#####/<USERNAME>/GVA_sv_tutorial/NC_012967.1.lengths num_threads=48 </general> <detection> split_mate_file=0 window_size=20005000 step_length=10002500 </detection> <filtering> split_link_file=0 nb_pairs_threshold=37 strand_filtering=1 insert_size_filtering=1 mu_length=unknown1 sigma_length=unknown2 order_filtering=1 nb_pairs_order_threshold=5 </filtering> <bed> <colorcode> 255,0,0=1,4 0,255,0=5,10 0,0,255=11,100000 </colorcode> </bed> |
The following commands will take a few minutes each and must be completed in order, so no advantages/ability to have them run in the background. Consult the manual for a full description of what these commands and options are doing while the commands are running.
Code Block | ||
---|---|---|
| ||
SVDetect linking -conf svdetect.conf
SVDetect filtering -conf svdetect.conf
SVDetect links2SV -conf svdetect.conf
|
Each command will finish faster than the one before it with the longest period with no advancement seeming to be after "
# Defining precise link coordinates..."
is printed to the screen.
...
title | Cheat Sheat |
---|
If you decide that SVdetect is the program you would like to use for calling your structural variants the following 1 line command will become invaluable:
SVDetect linking -conf svdetect.conf && SVDetect filtering -conf svdetect.conf && SVDetect links2SV -conf svdetect.conf && echo "SVDetect finished without Errors"
...
Take a look at the final output file: 61FTVAAXX.ab.sam.links.filtered.sv.txt
. Another downside of command line applications is that while you can print files to the screen, the formatting is not always the nicest. On the plus side in 95% of cases, you can directly copy the output from the terminal window to excel and make better sense of what the columns actually are
I've highlighted a few lines below:
Code Block |
---|
chr_type SV_type BAL_type chromosome1 start1-end1 average_dist chromosome2 start2-end2 nb_pairs score_strand_filtering score_order_filtering score_insert_size_filtering final_score breakpoint1_start1-end1 breakpoint2_start2-end2
...
INTRA NORMAL_SENSE - chrNC_012967 599566-601025 - chrNC_012967 663036-664898 430 100% - - 1 - -
...
INTRA NORMAL_SENSE - chrNC_012967 3-2025 - chrNC_012967 4627019-4628998 288 100% - - 1 - -
...
INTRA REVERSE_SENSE - chrNC_012967 16999-19033 - chrNC_012967 2775082-2777014 274 100% - - 1 - -
|
...
title | Any idea what sorts of mutations produced these three structural variants? |
---|
1. This is a tandem head-to-tail duplication of the region from approximately 600000 to 663000.
2. This is just the origin of the circular chromosome, connecting its end to the beginning!
3. This is a big chromosomal inversion mediated by recombination between repeated IS elements in the genome. It would not have been detected if the insert size of the library wasn't > ~1,500 bp!
...
Notice that lines 23 and 24 are listed as unknown1 and unknown2 respectively. While several of the other values are related to the actual library, these 2 values are actually critical to enabling and guiding the analysis (pick too small and even normal reads appear to support structural variants, pick too large and even variant reads appear as normal reads). Luckily these values are actually empirically determined as part of the preprocessing_results.txt file that was created with the perl script.
No Format |
---|
-- mu length = 2223, sigma length = 1122 |
You can either look up the values and replace unknown1 and unknown2 with 2223 and 1122 respectively
Code Block | ||
---|---|---|
| ||
mu_sigma=$(grep "^-- mu length = [0-9]*, sigma length = [0-9]*$" preprocessing_results.txt | \
sed -E 's/-- mu length = ([0-9]+), sigma length = ([0-9]+)/\1,\2/'); \
mu=$(echo $mu_sigma | cut -d "," -f 1); \
sigma=$(echo $mu_sigma | cut -d "," -f 2); \
sed -i -E "s/mu_length=unknown1/mu_length=$mu/" svdetect.conf; \
sed -i -E "s/sigma_length=unknown2/sigma_length=$sigma/" svdetect.conf |
Note that the above is considered a '1-liner' even though it is spread across 6 lines for clarity. When \ is encountered at the end of the line the command line waits until it reaches the end of a line without finding a "\" before executing anything. As the following commands will take a few minutes each and must be completed in order, launch them in order, but read ahead as to what values in our svdetect.conf file are critical to analysis, and why we picked the values we did.
Code Block | ||
---|---|---|
| ||
SVDetect linking -conf svdetect.conf
SVDetect filtering -conf svdetect.conf
SVDetect links2SV -conf svdetect.conf
|
Each command will finish faster than the one before it with the longest period with no advancement seeming to be after "
# Defining precise link coordinates..."
is printed to the screen. While waiting:
Expand title Critical values and their choice option value reason mu_length 2223 this is taken directly from the preprocessing perl script, it enables classification of the different discordant read mappings sigma_length 1122 Again taken from the preprocessing perl script, it enables classification of the different discordant read mappings window_size 5000 Must be set to at least 2mu + (2sigma)0.5 to enable balance classification (4540 would be minimum value in this case) step_length 2500 Documentation suggests this be 25-50% of the window size, smaller sizes enable more precise endpoint determination. strand; insert size; ordering_filtering 1 Setting each to 1 turns on such filtering, absent being listed, or being listed as 0, these tests would not run You may also consult the manual for a full description of what the commands and options inside of the svdetect.conf file.
Expand title Click here for a breakdown of what this one liner is doing First lets break this one liner down into its 6 parts:
Code Block linenumbers true mu_sigma=$(grep "^-- mu length = [0-9]*, sigma length = [0-9]*$" preprocessing_results.txt | \ sed -E 's/-- mu length = ([0-9]+), sigma length = ([0-9]+)/\1,\2/'); \ mu=$(echo $mu_sigma | cut -d "," -f 1); \ sigma=$(echo $mu_sigma | cut -d "," -f 2); \ sed -i -E "s/mu_length=unknown1/mu_length=$mu/" svdetect.conf; \ sed -i -E "s/sigma_length=unknown2/sigma_length=$sigma/" svdetect.conf
- Line 1 uses grep on the preprocessing_results.txt file looking for any line that matches "^-- mu length = [0-9]*, sigma length = [0-9]*$" and stores it in a variable mu_sigma.
- mu_sigma=$(......) stores everything between the () marks in a command line variable named mu_sigma ... you should notice that the closing ) mark is actually on line 2
- the | at the end of the line is the "pipe" command which passes the output to the next command (line in this case as we have broken the command up into parts)
- Line 2 uses the sed command to delete everything that is not a number or , and finishes storing the output in the mu_sigma variable
- sed commands can be broken down as follows: 's/find/replace/'
- in this case, find:
- -- mu length = ([0-9]+), sigma length = ([0-9]+)
- where :
- [0-9] is any number
- the + sign means find whatever is to the left 1 or more times
- and things between () should be remembered for use in the replace portion of the command
- likewise, in this case, replace:
- \1,\2/
- where
- \1 means whatever was between the first set of () marks in the find portion
- , is a literal comma
- \2 means whatever was between the sescond set of () marks in the find portion
- in this case, find:
- at the end of line 2 we now have a new variable named mu_sigma with a value of "2223,1122"
- sed commands can be broken down as follows: 's/find/replace/'
- Line 3 creates a new variable named mu and gives it the value of whatever is to the left of the first , it finds.
- echo $mu_sigma |
- pass the value of $mu_sigma to whatever is on the other side of |
- cut -d "," -f 1
- divide whatever the cut command sees at all the "," marks and then print whatever is to the left of the 1st
- at the end of line 3 we now have a variable named mu with the value "2223"
- echo $mu_sigma |
- Line 4 does the same thing as line 3 except for a variable named sigma, and takes whatever is between the 2nd comma and 3rd comma (since we only have 1 comma, its taking whatever comes after the comma)
- at the end of line 4 we now have a variable named sigma with the value "1122"
- Line 5 looks through the entire svdetect.conf file looking for a line that matches mu_length=unknown1 and replaces all that text with mu_length=$mu (except the computer knows $mu is the variable with the value 2223.
- the -i option tells the sed command to do the replacement in place meaning you are changing the contents of the file
- the "" marks tell the command line that you want to evaluate whatever is between the ""marks, in this case, the mu variable
- at the end of line 5, our svdetect.conf file line 23 now reads mu_length=2223
- Line 5 looks through the entire svdetect.conf file looking for a line that matches sigma_length=unknown2 and replaces all that text with sigma_length=$sigma (except the computer knows $sigma is the variable with the value 1122.
- the -i option tells the sed command to do the replacement in place meaning you are changing the contents of the file
- the "" marks tell the command line that you want to evaluate whatever is between the ""marks, in this case, the sigma variable
- at the end of line 5, our svdetect.conf file line 24 now reads sigma_length=1122
- Line 1 uses grep on the preprocessing_results.txt file looking for any line that matches "^-- mu length = [0-9]*, sigma length = [0-9]*$" and stores it in a variable mu_sigma.
Tip | ||
---|---|---|
| ||
If you decide that SVdetect is the program you would like to use for calling your structural variants the following 1 line command will become invaluable:
Note that because it is the svdetect.conf file that contains the sample specific information, this is the only command you will ever need for running SVDetect (you will still need to run the preprocessing script and edit the svdetect.conf file for each sample). |
Take a look at the final output file: 61FTVAAXX.ab.sam.links.filtered.sv.txt
. Another downside of command line applications is that while you can print files to the screen, the formatting is not always the nicest. On the plus side in 95% of cases, you can directly copy the output from the terminal window to excel and make better sense of what the columns actually are. Unfortunately, this is not one of those times. On Friday I'll explain how I work around this.
I've copied a few of the lines after pasting into excel below :
chr_type | SV_type | BAL_type | chromosome1 | start1-end1 | average_dist | chromosome2 | start2-end2 | nb_pairs | score_strand_filtering | score_order_filtering | score_insert_size_filtering | final_score | breakpoint1_start1-end1 | breakpoint2_start2-end2 |
INTRA | UNDEFINED | UNBAL | chrNC_012967 | 624987-629995 | 305 | chrNC_012967 | 624988-629996 | 3201 | 82% | 100% | 98% | 0.821 | 624305-624987 | 629996-630678 |
INTRA | UNDEFINED | UNBAL | chrNC_012967 | 624699-627533 | 340 | chrNC_012967 | 624988-628769 | 1889 | 85% | 100% | 98% | 0.835 | 621843-624699 | 628769-630678 |
INTRA | UNDEFINED | UNBAL | chrNC_012967 | 625953-629995 | 321 | chrNC_012967 | 627473-630044 | 1646 | 83% | 100% | 97% | 0.817 | 624305-625953 | 630044-633163 |
INTRA | LARGE_DUPLI | UNBAL | chrNC_012967 | 599566-602498 | 63831 | chrNC_012967 | 662625-666126 | 658 | 100% | 100% | 100% | 1 | 596808-599566 | 666126-668315 |
INTRA | LARGE_DUPLI | UNBAL | chrNC_012967 | 599966-602869 | 63804 | chrNC_012967 | 663105-666126 | 512 | 100% | 100% | 100% | 1 | 597179-599966 | 666126-668795 |
INTRA | LARGE_DUPLI | UNBAL | chrNC_012967 | 3-2025 | 4627075 | chrNC_012967 | 4626530-4629804 | 436 | 100% | 99% | 100% | 0.995 | 3-Jan | 4629804-4629812 |
INTRA | INVERSION | UNBAL | chrNC_012967 | 17471-20179 | 2757879 | chrNC_012967 | 2774440-2777242 | 237 | 100% | 100% | - | 1 | 14489-17471 | 2771552-2774440 |
Expand | ||
---|---|---|
| ||
|