Genome Assembly
First, some background
De novo assembly is the construction of a genome sequence without a reference genome. Assembling reads with the help of a reference genome is called mapping assembly.
This paper is an excellent review of the theory and practice of NGS assemblers as of 2010. Read lengths will continue to get longer, error rates lower, and coverage higher, but the basic concepts in that paper will probably remain useful for several more years.
The figures embedded in this wiki page for educational purposes are from that paper.
Up front, we need to discuss the two basic assembler types: overlap graph and de Bruijn graph.
In either case, more and longer reads are better. With an overlap graph (also called the overlap-layout-consensus or overlap-layout algorithm), the assembly improves much more effectively with longer reads, and there are few parameters you can tweak. With a de Bruijn approach, the choice of k (the k-mer length) can have a strong impact on the assembly.
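To make the effect of k concrete, here is a minimal Python sketch of de Bruijn graph construction. It is not how Velvet or any particular assembler is implemented. The function names `de_bruijn_edges` and `count_branching_nodes` and the three toy reads are made up for illustration, and the sketch ignores sequencing errors and reverse complements.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a toy de Bruijn graph.

    Nodes are (k-1)-mers; each distinct k-mer in the reads contributes
    one edge from its (k-1)-base prefix to its (k-1)-base suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def count_branching_nodes(graph):
    """Count nodes with more than one successor (ambiguous paths)."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


# Hypothetical error-free reads overlapping along one short sequence.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]

for k in (3, 5):
    graph = de_bruijn_edges(reads, k)
    print(f"k={k}: {len(graph)} nodes with outgoing edges, "
          f"{count_branching_nodes(graph)} branching")
```

With these toy reads, k=3 leaves two branching nodes, because the 2-mers TG and GC each occur twice in the underlying sequence, while k=5 gives a single unbranched path. On real data the trade-off runs the other way too: a larger k needs longer reads and more coverage for neighboring k-mers to overlap, which is why k is worth tuning.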
Effect of trade-off in read length and coverage
k-mer distributions inherent in select genomes
Some example assembly statistics
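One statistic commonly reported for assemblies is N50: with contigs sorted from longest to shortest, it is the length of the contig at which the running total first reaches half of the total assembled length. That N50 is among the statistics meant here is an assumption, and the helper name `n50` and the contig lengths below are made up for illustration. A minimal sketch:

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Hypothetical contig lengths (in bp) from an imaginary assembly.
lengths = [80, 70, 50, 40, 30, 20]
print("total bases:", sum(lengths))
print("number of contigs:", len(lengths))
print("N50:", n50(lengths))
```

N50 rewards long contigs but says nothing about whether they are correct, so it is best read alongside the total assembly size and the number of contigs.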
Many (many) assemblers are available. A list of assemblers can be found here.
We'll take a look at Velvet, a fast and easy-to-use de Bruijn assembler.