/
Genome Assembly (velvet)

Genome Assembly (velvet)

Overview

Velvet is a De Bruijn graph assembler works fairly rapidly on short (microbial) genomes. In this tutorial we will use velvet to assemble an E. coli genome from simulated Illumina reads.

Learning Objectives

  • Run velvet to perform de novo assembly on fragment, paired-end, and mate-paired data.
  • Use contig_stats.pl to display assembly statistics.
  • Find proteins of interest in an assembly using Blast.

Table of Contents

Data

First, let's copy the fastq read files.

Move to scratch, copy the raw data, and change into this directory for the tutorial
cds
mkdir velvet_tutorial
cp $BI/ngs_course/velvet/data/*/* velvet_tutorial
cd velvet_tutorial

Now we have a bunch of Illumina reads. These are simulated reads. If you'd ever like to simulate some on your own, you might try using Mason.

Files in the tutorial directory
login1$ ls
paired_end_2x100_ins_1500_c_20.fastq  paired_end_2x100_ins_400_c_20.fastq  single_end_100_c_50.fastq
paired_end_2x100_ins_3000_c_20.fastq  paired_end_2x100_ins_400_c_25.fastq
paired_end_2x100_ins_3000_c_25.fastq  paired_end_2x100_ins_400_c_50.fastq

There are 4 sets of simulated reads:

 

Set 1

Set 2

Set 3

Set 4

Read Size

100

100

100

100

Paired/Single Reads

Single

Paired

Paired

Paired

Gap Sizes

NA

400

400, 3000

400, 3000, 1500

Coverage

50

50

25 for each subset

20 for each subset

Number of Subsets

1

1

2

3

Note that these fastq files are "interleaved", with each read pair together one-after-the-other in the file. The #/1 and #/2 in the read names indicate the pairs.

Interleaved fastq
login1$ head paired_end_2x100_ins_1500_c_20.fastq
@READ-1/1
TTTCACCGTTGACCAGCACCCAGTTCAGCGCCGCGCGACCACGATATTTTGGTAACAGCGAACCATGCAGATTAAATGCACCTGCGGGAGCGAGCTGCAA
+
*@A+<at:var at:name="55G" />T@@I&+@A+@@<at:var at:name="II" />G@+++A++GG++@++I@+@+G&/+I+GD+II@++G@@I?<at:var at:name="I" />@<at:var at:name="IIGGI" /><at:var at:name="A4" />6@A,+AT=<at:var at:name="G" />+@AA+GAG++@
@READ-1/2
TTAACACCGGGCTATAAGTACAATCTGACCGATATTAACGCCGCGATTGCCCTGACACAGTTAGTCAAATTAGAGCACCTCAACACCCGTCGGCGCGAAA
+
I@@H+A+@G+&+@AG+I>G+I@+CAIA++$+T<at:var at:name="GG" />@+++1+<at:var at:name="GI" />+ICI+A+@<at:var at:name="I" />++A+@@A.@<G@@+)GCGC%I@IIAA++++G+A;@+++@@@@6

Often your read pairs will be "separate" with the corresponding paired reads at the same index in two different files (each with exactly the same number of reads).

Velvet Assembly

Now let's use Velvet to assemble the reads.

First, you need to load Velvet via module.

 I need to see the command...
Load the velvet module
 module load velvet

Using velvet consists of a sequence of two commands:

  1. velveth - analyzes kmers in the reads in preparation for assembly
  2. velvetg - constructs the assembly and filters contigs from the graph

Look at the help for each program.

The <hash_length> parameter of velveth is the kmer value that is key to the assembly process. Choosing it controls the tradeoff between sensitivity (lower hash_length, more reads included, longer contigs) and specificity (higher hash length, less chance of misassembly, more reads ignored, shorter contigs)). There is more discussion about choosing an appropriate kmer value in the Velvet manual and in this blog post.

Velvet has an option of specifying the insertion size of a paired read file (-ins_length). If no size is given, Velvet will guess the insertion size. We'll just have Velvet guess the size.

Velvet also has an option to specify the expected coverage of the genome, which helps it choose how to resolve repeated sequences (-exp_cov). We set this parameter to estimate this from the data. A common problem with using Velvet is that you have many very short contigs and the last line of output tells you that it used 0 of your reads. This is caused by not setting this parameter. The default is NOT auto.

We'll need to create a commands file and submit it to TACC. Let's make the commands file say:

For a "commands" file - to run four velvet assemblies in parallel. If you copy and paste, be sure that there are ONLY four lines in your file.
velveth single_out 61 -fastq single_end_100_c_50.fastq && velvetg single_out -exp_cov auto -amos_file yes
velveth pairedc20_out 61 -fastq -shortPaired paired_end_2x100_ins_3000_c_20.fastq paired_end_2x100_ins_1500_c_20.fastq paired_end_2x100_ins_400_c_20.fastq && velvetg pairedc20_out -exp_cov auto -amos_file yes
velveth pairedc25_out 61 -fastq -shortPaired paired_end_2x100_ins_3000_c_25.fastq paired_end_2x100_ins_400_c_25.fastq && velvetg pairedc25_out -exp_cov auto -amos_file yes
velveth pairedc50_out 61 -fastq -shortPaired paired_end_2x100_ins_400_c_50.fastq && velvetg pairedc50_out -exp_cov auto -amos_file yes

Now use launcher_creator.py to make a *.sge for the commands file and qsub it.

Use the correct "wayness"

Velvet and other assemblers usually take large amounts of RAM to complete. Running these 4 commands on a single node will use more RAM than is available on a single node so it's necessary to change the number of commands per node (wayness) from the default of 12 to 1. When you use launcher_creator.py, you set the "wayness" (number of commands per node) using the -w flag.

You can set the allotted time for this job to just 10 minutes.

 I need some help with my launcher_creator.py command
launcher_creator.py -w 1 -n velvet -q normal -j commands -t 00:10:00

If you are assembling large genomes or have high coverage depth data in the future, you will probably need to submit your jobs to the "largemem" queue.

If you find yourself waiting a long time for the assembly process to run, you can also start an idev session and try running some of the velveth and velvetg commands interactively. Each one takes a few minutes to complete.

Velvet Output

Explore each output directory that was created by Velvet.

Checking the tail of the Log files in each of the output folders, we see lines like the following:

 The results...
Single Set
Median coverage depth = 2.657895
Final graph has 9748 nodes and n50 of 191, max 1427, total 1865207, using 281499/2314900 reads
Set with one group of reads at 50 coverage
Median coverage depth = 11.131337
Final graph has 265 nodes and n50 of 127102, max 397974, total 4558511, using 1464201/2314900 reads
Set with 2 groups of reads both at 25 coverage each
Median coverage depth = 11.109244
Final graph has 203 nodes and n50 of 698134, max 1032531, total 4585717, using 1465818/2314900 reads
Set with 3 groups of reads all at 20 coverage each
Median coverage depth = 13.353287
Final graph has 202 nodes and n50 of 698626, max 1139610, total 4602729, using 1758595/2777880 reads

With better read pairs that link more distant locations in the genome, there are fewer contigs, and contigs are are longer, giving us a more complete picture of linkage across the g