Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

Overview

SPAdes is a De Bruijn graph assembler which has become the preferred assembler in numerous labs and workflows. In this tutorial we will use SPAdes to assemble an E. coli genome from simulated Illumina reads. Genome assembly is quite difficult (though as Oxford Nanopore lowers its error rate and tools using both its long reads and illumina short reads the difficulty falls, while the accuracy increases). Genome assembly should only be used:

...

Note
titleA note about read preprocessing

While not explicitly covered here, the presence of adapter sequences on reads when trying to assemble them can significantly complicate assembly and decrease the accuracy. If using this tutorial on your own samples make sure you are working with the best data possible ... reads lacking adapters in this case with the largest insert sizes possible.

For those looking for a real challenge, go through the multiqc tutorial and the fastp tutorial, and use the information provided here to compare assemblies of some of the same samples in both cases.


Learning Objectives

  • Run SPAdes to perform de novo assembly on fragment, paired-end, and mate-paired data.
  • Use contig_stats.pl to display assembly statistics.
  • Find proteins of interest in an assembly using Blast.


Installing SPAdes

As genome assembly is important part of analysis but is building a reference file that will be used many times, it makes more sense to install it its own environment. Other potential tools to have in the same environment would be read preprocessing tools, in particular adapter removal tools such as fastp. Supporting the suggestion made in the fastp tutorial that if environments are to be grouped together based on task, read pre-processing is a good environment

Code Block
languagebash
conda create --name GVA-Assembly -c bioconda spades -c conda-forge


Testing SPAdes installation

SPAdes comes with a self test option to make sure that the program is correctly installed. While this is not true of most programs, it is always a good idea to run whatever test data a program makes available rather than jumping straight into your own data as knowing there is an error in the program rather than your data makes troubleshooting very different. 

...

Warning
titleIf the end of the spades test gives different output do not continue.

Get my attention on zoom and we'll figure out what is going on.

Set 1: Plasmid SPAdes

Tip

Unlike other times in the class where we are concerned about being good TACC citizens and not hurting other people by the programs we run, assembly programs are exceptionally memory intensive and attempting to run on the head node may result in the program returning a memory error rather than useable results. When it comes time to assemble your own reference genome, remember to give each sample its own compute node rather than having multiple samples split a single node. If you still run into memory problems, consider moving onto the 'large-mem' queue rather than the 'normal' queue which has more memory, and also downsampling your data.

Assembling even small bacterial genomes can be incredibly time intensive (as well as memory intensive as highlighted above). Fortunately for this class, we can make use of the plasmid spades option to assemble and even smaller plasmid genome that is ~2000 bp long in only a few minutes. I suggest analyzing this data on an idev node and then submitting the other data analysis for the bacterial genomes as a job to run overnight.

Data

Download the paired end fastq files which have had their adapters trimmed from the $BI/gva_course/Assembly/ directory.

Code Block
languagebash
titleYou should know the copy command by now, try to get it on your own before checking your answer here
collapsetrue
cp $BI/gva_course/Assembly/*.fastq.gz $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial

SPAdes Assembly

Now let's use SPAdes to assemble the reads. As always its a good idea to get a look at what kind of options the program accepts using the -h option. SPAdes is actually written in python and the base script name is "spades.py". There are additional scripts that change many of the default options such as metaspades.py, plasmidspades.py, and rnaspades.py or these options can be set from the main spades.py script with the flags --meta, --plasmid, --rna respectively. For this tutorial lets use plasmidspades.py

...

Expand
titleUnexpected error like command not found?


Code Block
titleDid you remember to activate your environment after you launched your idev session?
conda activate GVA-Assembly


Evaluating the output

As you can see from listing the contents of the output 'plasmid' directory, several new files have been generated. There are two files that I consider to be the most important. 1. contigs.fasta as this is the actual result of all the different contigs that were created. For circular chromosomes (such as plasmids) the goal would be that there is a single contig meaning that all of the reads were able to close the circle. 2. spades.log as it has the information about the completed run that you can use to compare different samples or conditions in the event that you are interested trying to optimize the command options, as would likely be the case if you were trying to assemble the best reference possible. Interestingly, the spades.log file is equivalent to if you had redirected the error and screen printing to a log file yourself (ie using &> as was done in the fastp tutorial).

...


  1. Expand
    titleHow many contigs were generated?

    just 1 (its a fasta file so you focus on the > symbols to identify each different contig that is present)




  2. Expand
    titleHow how long is each the contig?

    2096bp



  3. Expand
    titleHow deep is the coverage of this plasmid?

    In this case the answer is ~57. This value can be particularly useful when you are trying to determine if novel DNA is present as a multi copy plasmid, or as something that has inserted into the chromosome. If it is inserted, you would expect the coverage to be similar to that of the chromosome, if it is a plasmid, it could be significantly higher.


Visualizing the aseemebly

Another file that maybe of interest is (especially if you are going to try to manually make improvements to the assembly or take a targeted approach to improving the assembly) the assembly_graph.fastg. I would recommend opening this file with the bandage program. https://rrwick.github.io/Bandage/ it is lightweight and easily installed on all systems and while it is pretty intuitive it does have robust documentation https://github.com/rrwick/Bandage/wiki. Viewing this plasmid in bandage will effectively just show you a circle as it is completely closed. The good news is that bandage is powerful enough to support larger genomes which may be of help or interest in the simulated data set.

Set 2: Whole Genome Simulated Data

Here we will look at 4 sets of data with library preparation conditions to evaluate how wet lab decisions influence outcomes on the computer. Some of the text here is very similar or identical to that in set 1 incase people choose to skip directly to it.

Data

Code Block
titleMove to scratch, copy the raw data, and change into this directory for the tutorial
mkdir  $SCRATCH/GVA_SPAdes_tutorial # you likely already did this when you ran the selftest
cp $BI/ngs_course/velvet/data/*/* $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial

...

Notice how the pairs of reads are denoted by the /1 and /2 at the end of the first line in the 4 line fastq block. More often (and everywhere else in this course) your read pairs will be "separate" with the corresponding paired reads at the same index in two different files (each with exactly the same number of reads).

SPAdes Assembly

Now let's use SPAdes to assemble the reads. As always its a good idea to get a look at what kind of options the program accepts using the -h option. SPAdes is actually written in python and the base script name is "spades.py". There are additional scripts that change many of the default options such as metaspades.py, plasmidspades.py, and rnaspades.py or these options can be set from the main spades.py script with the flags --meta, --plasmid, --rna respectively.

...

Warning
titleA warning on memory usage

SPAdes (and most/all other assemblers) usually take large amounts of RAM to complete. Running these 4 commands on a single node at the same time will likely use more RAM than is available on a single node so it's necessary to run them sequentially or on their own node. This should also underscore to you that you should not run this on the head node. If you are assembling large genomes or have high coverage depth data in the future, you will probably need to submit your jobs to the "largemem" queue rather than the "normal" que and may need to downsample your data.

Submitting the job

Once you have decided on the combinations you want to evaluate, use the 'wc -l' command to verify that your spades_commands file has 4 commands as you expect.

...

Code Block
languagebash
titlesubmit the job to run on the que
sbatch spades.slurm

Evaluating the output

Explore each output directory that was created for each set of reads you interrogated. The actionable information is in the contigs.fasta file. The contig file is a fasta file ordered based on the length of the individual contig in decreasing order. The names of each individual contig lists the number of the contig (largest contig being named NODE_1 next largest being named NODE_2 and so on) followed by the length of the contig, and the coverage (labeled as cov on the line). Generally, the lower number of total contigs and the larger the length of each are regarded as better assemblies, but the number of chromosomes present in the organism is an important factor as well.

...

Expand
titleanswers

There are 7 nearly identical ribosomal RNA operons in E. coli spaced throughout the chromosome. Since each is >3000 bases, contigs cannot be connected across them using this data. For bacteria there is an interesting observation that the majority of chromosomes require fragments of ~7kb to be fully closed.

Visualizing the assembly 

As mentioned above the bandage program can (https://rrwick.github.io/Bandage/) can be quite helpful in understanding what is going on.

What comes next when working with your own data?

  1. Look for things: If you're just after a few homologs, an operon, etc. you're probably done. Think about what question you are trying to answer. 
  2. You can turn the contigs.fa into a blast database (formatdb or makeblastdb depending on which version of blast you have) or try multiple sequence alignments through NCBIs blast.
  3. If you built your contigs based on a normal/control sample you can map other reads to the contigs using bowtie2 to try to identify variants in other samples.
  4. If you don't think the contigs you have are "good enough"
    1. Verify you have trimmed your reads to the best they can be using fastq, multiqc, and fastp
    2. Try using Spades MismatchCorrector to see if you can improve the contigs you already have.
    3. Add additional sequencing libraries to try to connect some more contigs. Especially think about pacbio sequencing and oxford nanopore.

...