Because these tools draw in information from may disparate sources, they can be very difficult to install, configure, use, and maintain. For example, the vcf files from the 1000 Genomes project are arranged in a deep ftp tree by date of data generation. Large genome centers spend significant resources managing these tools.

Pre-packaged programs

Annovar - one of the most powerful yet simple to run variant annotators available

Annovar is a variant annotator. Given a vcf file from an unknown sample and a host of existing data about genes, other known SNPs, gene variants, etc., Annovar will place the discovered variants in context.

...

Code Block

language	bash
title	Solution to make a new directory on scratch. You should know this answer.
collapse	true

mkdir $SCRATCH/GVA_Annovar
cd $SCRATCH/GVA_Annovar

...

Code Block

...

The which command is used to give the location of a program or script that is in your $PATH.

Here is an easy one-liner to cat the contents of a script (note ` is a back-tick, not apostrophe)

Expand

title	Click here for a hint

Code Block

title
language	bash

Use only 1 of the following copy commands to get some data
collapse	true

cat

# if

`which annovar_pipe.sh` #

you have already done the

command

trios

within

tutorial

the `` is evaluated first and the output placed within the `` marks to be evaluated by the outer command

This script simply does a format conversion and then calls summarize_annovar.pl. Which begs the question what does summarize_annovar.pl do?

Expand

title	Figure out what summarize_annovar.pl is doing using the back tick trick you just learned

Again note the ` characters are "backtick", not apostrophes

Code Block

title	Print out the text of the perl script summarize_annovar.pl

more `which summarize_annovar.pl`

The more command was used to make it slightly easier to read and to remind you that there are multiple options of how to look a things (less, head, tail, etc), eventually you will develop your own habits but the more you are exposed to the fewer instances of angry eureka moments you will have. When working with unknown files or file types, cat should typically be avoided unless you run a wc -l command else you risk crashing your terminal window.

This script is written in perl (a programing language that while powerful, is not very favored among most computational biologists who typically prefer: Python, R, or Bash shell scripts) as a method of standardizing input to call a series of external commands. Such wrappers (or wrappers within wrappers) make your life much easier. breseq itself has wrappers for using bowtie2 and samtools which you ran separately on other data. As you increase your understanding of scripts and the command line by looking at what others have done you may begin to make your own wrappers and small scripts, but such wrappers are a great example of the BioITeam and other community resources can provide you with. While I have told you no one will ever care as much about your data as you do, quality of life issues regarding repetitive treatment of similar inputs and outputs may be easily solved by someone else.

Now let's run it on the .vcf files from the 3 individuals (NA12878, NA12891, and NA12892) from both the samtools and gatk output in the $BI/ngs_course/human_variation/ directory. (You may recognize these as the same individuals that we worked with on the Trios tutorial. Throughout the class we've been teaching you how to create a commands file using nano, but here we provide a more complex example of how you can generate a commands file. As you become more proficient with the command line, it is likely you will use various piping techniques to generate commands file. The following calls Perl to custom-create the 6 command lines needed and put them straight into a commands file:

ls

cp $SCRATCH/GVA_Human_trios/raw_files/N*.vcf .  
 
# if you have not done the trios tutorial
cp $BI/ngs_course/human_variation/N*.vcf .

Warning

title	Do not run this command yet

Code Block

title	Create the "commands" file to run annovar on six vcf files - use Perl to extract the sample name and mapper so the log files have meaningful, but shorter, names

Next, look at the code for our annovar_pipe.sh command.

Expand

title	Click here for a hint

The which command is used to give the location of a program or script that is in your $PATH.

Code Block

language	bash
title	Here is an easy one-liner to cat the contents of a script (note ` is a back-tick, not apostrophe)
collapse	true

cat `which annovar_pipe.sh`  # the command within the `` is evaluated first and the output placed within the `` marks to be evaluated by the outer command

This script simply does a format conversion and then calls summarize_annovar.pl. Which begs the question what does summarize_annovar.pl do?

Expand

title	Figure out what summarize_annovar.pl is doing using the back tick trick you just learned

Again note the ` characters are "backtick", not apostrophes

Code Block

title	Print out the text of the perl script summarize_annovar.pl

more `which summarize_annovar.pl`

The more command was used to make it slightly easier to read and to remind you that there are multiple options of how to look a things (less, head, tail, etc), eventually you will develop your own habits but the more you are exposed to the fewer instances of angry eureka moments you will have. When working with unknown files or file types, cat should typically be avoided unless you run a wc -l command else you risk crashing your terminal window.

This script is written in perl (a programing language that while powerful, is not very favored among most computational biologists who typically prefer: Python, R, or Bash shell scripts) as a method of standardizing input to call a series of external commands. Such wrappers (or wrappers within wrappers) make your life much easier. breseq itself has wrappers for using bowtie2 and samtools which you ran separately on other data. As you increase your understanding of scripts and the command line by looking at what others have done you may begin to make your own wrappers and small scripts, but such wrappers are a great example of the BioITeam and other community resources can provide you with. While I have told you no one will ever care as much about your data as you do, quality of life issues regarding repetitive treatment of similar inputs and outputs may be easily solved by someone else.

Now let's run it on the .vcf files from the 3 individuals (NA12878, NA12891, and NA12892) from both the samtools and gatk output in the $BI/ngs_course/human_variation/

...

language	bash
title	Click here to check your work
collapse	true

...

directory. (You may recognize these as the same individuals that we worked with on the Trios tutorial. Throughout the class we've been teaching you how to create a commands file using nano, but here we provide a more complex example of how you can generate a commands file. As you become more proficient with the command line, it is likely you will use various piping techniques to generate commands file. The following calls Perl to custom-create the 6 command lines needed and put them straight into a commands file:

Code Block

title	Create the "commands" file to run annovar on six vcf files - use Perl to extract the sample name and mapper so the log files have meaningful, but shorter, names

ls $BI/ngs_course/human_variation/N*.vcf .| ls\
*.vcf | perl -n -e 'chomp; $_=~/(NA\d+).*(sam|GATK)/; print "annovar_pipe.sh $_ >$1.$2.log 2>&1 &\n";' > commands

Note

title	A helpful note to make your life better.

"commands" files are 1 way to both create a record of what it is that you are doing as well as an easy way to execute multiple commands simultaneously. In a previous tutorial (the advanced breseq tutorial) we showed you how to do this by adding an & as the last character of the line to force the command to execute in the background. Tomorrow we will go over the more common method of submitting "commands" files to the que to run as jobs rather than being spoiled with interacting with everything on idev nodes. In computational biology this instructor has often found that once I learn to do something a particular way, it takes an incredible amount of inertia to change how I do something even when i know it would help to do so. In an effort to learn from me making things harder on myself than need be, I urge you to STRONGLY consider not naming every list of commands you want to run "commands" but rather something that is actually descriptive, even "commands_DATE" would be helpful. As you start submitting your own jobs you will quickly be able to build bad habits, try to be aware of and avoid this one if possible.

Expand

title	investigate the commands file to to determine what the piping command actually did (click for answer)

Code Block

language	bash
title	Think about the commands that let you look at what the contents of a file are
collapse	true

cat commands

# if doing this tutorial on Thursday: idev -m 180 -r CCBB_Day_3 -A UT-2015-05-18 # if doing this tutorial on Friday: idev -m 180 -r CCBB_Day_4 -A UT-2015-05-18

Warning

title	Make Remember to make sure you are on an IDEV node using the showq -u command

Code Block

language	bash
title	How to request an idev node if needed
collapse	true

idev done

For reasons discussed numerous times throughout the course already, please be sure you are on an idev done. It is unlikely that you are currently on an idev node as copying the files while on an idev node seems to be having problems as discussed. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial.

Code Block

title	Create the submission script and submit

chmod +x commands
./commands

The easiest way to check if this is working is to use ls and see an explosion of new files. This will take quite a bit of time to complete running. As such, we have ALREADY pre-computed these outputs so you can begin evaluating the results.

...

language	bash
title	Use this code block to copy the completed results

...

so you can begin evaluating the results.

Canned results

Later, when your results have completed running (likely >90 minutes as annovar not configured to use multiple processes, compare your results to these provided results).

For the scavenger hunts below you should also copy over the GATK results.

Code Block

language	bash
title	Use this code block to copy the completed resultsthe completed results

mkdir provided_results
cd provided_results
cp $BI/ngs_course/human_variation/*chrom20.samtools* .
cp $BI/ngs_course/human_variation/*chrom20.GATK* .

...

Expand

title	Thinking back to our first example where we looked for a gene with 2 frameshift deletions, can you think of a way to look at all 3 patients' GATK analysis results at once, and look for the total number of mutations detected (sorted by gene) displayed in descending order?

The final form of the command that you had was:

grep "frameshift" NA12878.chrom20.GATK.vcf.exome_summary.csv | awk -F"," '{print $2"\t"$3}' | uniq -c | sort -r

Code Block

language	bash
title	Try to come up with a solution on your own before checking your answer here
collapse	true

cat *GATK.vcf.exome_summary.csv | grep "exonic" | awk -F"," '{print $2,"\t"$3}' | sort | uniq -c | sort -nr

Some questions for thought:

Why might this type of analysis be useful?
What other greps would make this more useful?

Other variant annotators:

http://www.yandell-lab.org/software/vaast.html
http://www.broadinstitute.org/gatk/gatkdocs/ VariantAnnotator annotations
http://www.bioconductor.org/help/workflows/variants/http://vat.gersteinlab.org/
http://code.google.com/p/mu2a/

Notes

Variants consist of single base base changes, insertions and deletions, and larger scale structural changes. "Larger scale" is usually defined relative to the capabilities of the technology; for example, a "small indel" usually means "detectable within a single sequence read". In 2009, sequence reads were about 50 bp, by 2011 they were 100 bp, and today the standard hiSEQ run is 150bp and miSEQ runs can be 600bp.

...

Return to GVA2020

Versions Compared

Old Version 1

New Version 2

Key

Pre-packaged programs

Annovar - one of the most powerful yet simple to run variant annotators available

Canned results

Other variant annotators:

Notes

Page Comparison

Versions Compared

Old Version 1

New Version 2

Key

Pre-packaged programs

Annovar - one of the most powerful yet simple to run variant annotators available

Canned results

Other variant annotators:

Notes