Overview
SPAdes is a De Bruijn graph assembler which has become the preferred assembler in numerous labs and workflows. In this tutorial we will use SPAdes to assemble an E. coli genome from simulated Illumina reads. Genome assembly is quite difficult (though if Oxford Nanopore lowers its error rate, assembly will likely get much easier and involve new tools). Genome assembly should only be used when you cannot find a reference genome that is close to your own, when you are engaged in metagenomic projects where you don't know what organisms may be present, and in situations where you believe you may have novel sequence insertions into a genome of interest. (Note that in this last case you would actually want to grab the reads that do not map to your reference genome (and their pairs, in the case of paired-end and mate-pair sequencing) and assemble those, rather than assembling all of the fastq files you get from the raw sequencing.)
Learning Objectives
- Run SPAdes to perform de novo assembly on fragment, paired-end, and mate-paired data.
- Use contig_stats.pl to display assembly statistics.
- Find proteins of interest in an assembly using Blast.
Installing SPAdes
Unfortunately, SPAdes does not exist as a module for loading on TACC, nor is it available in the BioITeam materials. However, it is available as binaries through the SPAdes website, is well supported, and doesn't require complex dependencies, which makes it easy to install.
Expand |
---|
title | If SPAdes is so common a tool, why doesn't the BioITeam install it for everyone? |
---|
|
In my opinion there are a few reasons:
- Generally speaking, while SPAdes is commonly used for assemblies, assemblies themselves are not very common as once you have an assembled genome, you use that genome for future analysis rather than redoing the assembly.
- Since it is easily installed, it doesn't save people much work to install it for them.
|
First, navigate to the SPAdes home page http://cab.spbu.ru/software/spades/ and download the Linux binary distribution, either directly to TACC using wget, or by first downloading it to your laptop and then transferring it to TACC using SCP. While you could put the file anywhere on lonestar (and can easily move it around on lonestar with the mv command once it is there), I suggest downloading or transferring the file to a 'src' folder on $WORK.
Code Block |
---|
language | bash |
---|
title | Making a DIRectory named SRC in $WORK (the capital letters are your clues) |
---|
collapse | true |
---|
|
mkdir $WORK/src |
Try to use 'wget -h' before clicking below. When using wget it is often helpful to right click on a link and select 'copy link address' when the file you want is available through a download link.
Code Block |
---|
language | bash |
---|
title | How to use wget to download directly to TACC |
---|
collapse | true |
---|
|
cd $WORK/src
wget http://cab.spbu.ru/files/release3.13.0/SPAdes-3.13.0-Linux.tar.gz |
Code Block |
---|
language | bash |
---|
title | How to use SCP to transfer the downloaded file to TACC from your laptop |
---|
collapse | true |
---|
|
|
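For the SCP route, one possible sketch is shown here. It is meant to be run on your laptop, from the folder that contains the downloaded file. The user name, host name, and remote path are placeholders (assumptions, not values from this tutorial), and because $WORK is only defined on TACC, the remote path has to be written out in full (you can find it by running 'echo $WORK/src' on TACC). The snippet only prints the command so you can check it before running it yourself.

```bash
# Placeholders (assumptions): replace all three with your own values.
TACC_USER="your_username"
TACC_HOST="your.tacc.hostname"
REMOTE_DIR="/path/to/your/work/src"   # full path printed by 'echo $WORK/src' on TACC

# File name taken from the wget example in this tutorial.
TARBALL="SPAdes-3.13.0-Linux.tar.gz"

# Build the scp command as text instead of running it right away.
build_scp_cmd() {
    echo "scp ${TARBALL} ${TACC_USER}@${TACC_HOST}:${REMOTE_DIR}/"
}

# Print the command; if it looks right, copy and paste it to run the transfer.
build_scp_cmd
```

Printing the command first is a deliberate choice: a typo in the remote path is much easier to spot before the transfer than after it.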
Data
This tutorial assumes that you are on an idev node. If you are not sure, please ask for help.
Code Block |
---|
title | Move to scratch, copy the raw data, and change into this directory for the tutorial |
---|
|
cds
mkdir GVA_velvetSPAdes_tutorial
cp $BI/ngs_course/velvet/data/*/* GVA_velvetSPAdes_tutorial
cd GVA_velvetSPAdes_tutorial
|
Now we have a bunch of Illumina reads. These are simulated reads. If you'd ever like to simulate some on your own, you might try using Mason.
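If you want a quick look at how much data was copied, a minimal sketch like the following counts the reads in each file. It assumes the files are uncompressed, end in .fastq, and use the standard four lines per read; the count_reads function name is just an illustration and is not part of any tool used in this tutorial.

```bash
# Count the reads in one FASTQ file.
# Assumes an uncompressed file with exactly 4 lines per read.
count_reads() {
    lines=$(wc -l < "$1")
    echo $(( lines / 4 ))
}

# Report the read count for every FASTQ file in the current directory.
# The *.fastq pattern is an assumption about how the files are named.
for fq in *.fastq; do
    [ -e "$fq" ] || continue   # skip the loop body if nothing matched
    printf '%s\t%s\n' "$fq" "$(count_reads "$fq")"
done
```

Paired-end and mate-pair files that belong together should report the same number of reads; a mismatch usually means a copy was interrupted.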
...