...
This next exercise will give you some idea of how Annovar works; we've taken the liberty of writing the bash script annovar_pipe.sh
around the existing summarize_annovar.pl
wrapper (a wrapper within a wrapper - a common trick) to even further simplify the process for this course.
...
Get some data:
First we want to move to a new location on $SCRATCH
...
Code Block |
---|
language | bash |
---|
title | Use only 1 of the following copy commands to get some data |
---|
collapse | true |
---|
|
# # if you have already done the trios tutorial
cp $SCRATCH/GVA_Human_trios/raw_files/N*.vcf .
# if you have not done the trios tutorial
cp $BI/ngs_course/human_variation/N*.vcf .
|
...
...
|
---|
Expand |
---|
title | Click here for a hint |
---|
|
The which command is used to give the location of a program or script that is in your $PATH.
Code Block |
---|
| Here is an easy one-liner to cat the Copy the necessary database containing indexes that annovar will use |
|
# note this can not be done on an idev node. The command is expected to take several minutes, and generate no output.
cp -r $BI/ref_genome/annovar/hg19_annovar_db .
|
Setting up the commands
The BioITeam has set up 2 helpful wrapper commands for using annovar. The first (summarize_annovar.pl) calls annovar and summarizes the results. The second (annovarPipe.sh) does some file conversions to prepare for annovar and then calls the summarize_annovar.pl script.
Expand |
---|
title | Click here for a hint |
---|
|
The which command is used to give the location of a program or script that is in your $PATH. Code Block |
---|
language | bash |
---|
title | Here is an easy one-liner to cat the contents of a script (note ` is a back-tick, not apostrophe) |
---|
collapse | true |
---|
| cat `which annovar_pipeannovarPipe.sh` # the command within the `` is evaluated first and the output placed within the `` marks to be evaluated by the outer command
|
This script simply does a format conversion and then calls summarize_annovar.pl . Which begs the question what does summarize_annovar.pl do? Expand |
---|
title | Figure out what summarize_annovar.pl is doing using the back tick trick you just learned |
---|
| Again note the ` characters are "backtick", not apostrophes Code Block |
---|
title | Print out the text of the perl script summarize_annovar.pl |
---|
| more `which summarize_annovar.pl`
|
The more command was used to make it slightly easier to read and to remind you that there are multiple options of how to look a things (less, head, tail, etc), eventually you will develop your own habits but the more you are exposed to the fewer instances of angry eureka moments you will have. When working with unknown files or file types, cat should typically be avoided unless you run a wc -l command else you risk crashing your terminal window. |
This script is written in perl (a programing language that while powerful, is not very favored among most computational biologists who typically prefer: Python, R, or Bash shell scripts) as a method of standardizing input to call a series of external commands. Such wrappers (or wrappers within wrappers) make your life much easier. breseq itself has wrappers for using bowtie2 and samtools which you ran separately on other data. As you increase your understanding of scripts and the command line by looking at what others have done you may begin to make your own wrappers and small scripts, but such wrappers are a great example of the BioITeam and other community resources can provide you with. While I have told you no one will ever care as much about your data as you do, quality of life issues regarding repetitive treatment of similar inputs and outputs may be easily solved by someone else. |
...
Code Block |
---|
title | Create the "commands" file to run annovar on six vcf files - use Perl to extract the sample name and mapper so the log files have meaningful, but shorter, names |
---|
|
ls $BI/ngs_course/human_variation/N*.vcf | \
perl -n -e 'chomp; $_=~/(NA\d+).*(sam|GATK)/; print "annovar_pipeannovarPipe.sh $_ hg19_annovar_db >$1.$2.log 2>&1 &1\n";' > commands
|
Note how the print statement includes redirections for generating log files, and the entire output is redirected to a file named commands.
Note |
---|
title | A helpful note to make your life better. |
---|
|
"commands" files are 1 way to both create a record of what it is that you are doing as well as an easy way to execute multiple commands simultaneously. In a previous tutorial (the advanced breseq tutorial) we showed you how to do this by adding an & as the last character of the line to force the command to execute in the background. Tomorrow we will go over the more common method of submitting "commands" files to the que to run as jobs rather than being spoiled with interacting with everything on idev nodes. In computational biology this instructor has often found that once I learn to do something a particular way, it takes an incredible amount of inertia to change how I do something even when i know it would help to do so. In an effort to learn from me making things harder on myself than need be, I urge you to STRONGLY consider not naming every list of commands you want to run "commands" but rather something that is actually descriptive, even "commands_DATE" would be helpful. As you start submitting your own jobs you will quickly be able to build bad habits, try to be aware of and avoid this one if possible. |
Expand |
---|
title | investigate the commands file to to determine what the piping command actually did (click for answer) |
---|
|
Code Block |
---|
language | bash |
---|
title | Think about the commands that let you look at what the contents of a file are |
---|
collapse | true |
---|
| cat commands |
|
Executing the commands
Warning |
---|
title | Remember to make sure you are on an idev done |
---|
|
For reasons discussed numerous times throughout the course already, please be sure you are on an idev done. It is unlikely that you are currently on an idev node as copying the files while on an idev node seems to be having problems as discussed. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial. |
...
The easiest way to check if this is working is to use use ls and and see an explosion of new files. This will take quite a bit of time to complete running. As such, we have ALREADY You can check that things are running using the ps command, and looking at the new .log files which are created. As the commands take some time to run, pre-computed versions of these outputs are available so you can begin evaluating the results if you don't want to wait for them to finish.
Canned results
...
Accessing pre-computed results
Code Block |
---|
language | bash |
---|
title | Use this code block to copy the completed results to immediately begin evaluating the results. Additionally, when your results have completed running (likely >90 minutes as annovar not configured to use multiple processes), compare your results to these provided results |
---|
|
...
|
Code Block |
---|
language | bash |
---|
title | Use this code block to copy the completed resultscollapse | true |
---|
|
mkdir provided_results
cd provided_results
cp $BI/ngs_course/human_variation/*chrom20.samtools* .
cp $BI/ngs_course/human_variation/*chrom20.GATK* . |
...