Overview
SPAdes is a De Bruijn graph assembler which has become the preferred assembler in numerous labs and workflows. In this tutorial we will use SPAdes to assemble an E. coli genome from simulated Illumina reads. Genome assembly is quite difficult (though if Oxford Nanopore lowers its error rate assembly will likely get much easier and involve new tools). Genome assembly should only be used when you can not find a reference genome that is close to your own, if you are engaged in metagenomic projects where you don't know what organisms may be present, and in situations where you believe you may have novel sequence insertions into a genome of interest (Note that in this case however you would actually want to grab reads that do not map to your reference genome (and their pair in the case of paired end and mate-pair sequencing) rather than performing these functions on the fastq files you get from the raw sequencing.
Learning Objectives
- Run SPAdes to perform de novo assembly on fragment, paired-end, and mate-paired data.
- Use contig_stats.pl to display assembly statistics.
- Find proteins of interest in an assembly using Blast.
Table of Contents
Installing SPAdes
Unfortunately, SPAdes does not exist as a module for loading on TACC nor is it available in the BioITeam materials. As it is available through the SPAdes website as binaries, is well supported, and doesn't require complex dependancies making it easy to install.
First, navigate to the SPAdes home page http://cab.spbu.ru/software/spades/ and download the linux binary distribution either directly to TACC using wget, or first downloading it to your laptop then transferring it to to TACC using SCP. While you could put the file anywhere on lonestar (and can easily move it around on lonestar with the mv command once it is there), I suggest downloading or transferring the file to a 'src' folder on $WORK.
Do one of the following, (or both if you want practice moving files around):
Try to use 'wget -h' before clicking below. When using wget it is often helpful to right click on a link and select 'copy link address' when the file you want is available through a download link.
Remember that scp has 2 parts after the command name just like the cp command: 1. the location the file currently is, and 2. the location you want to copy the file to. Most of the class has dealt with moving things from TACC to your computer, but in this case things will move the opposite direction, think about what needs to change about your SCP command to accomplish this.
Once the .tar.gz file has been placed in the $WORK/src folder using one of the above options, you need to extract the files.
Now that the files have been extracted you have a choice in how to use them: 1 option is to copy the binary files to a location that is already in your path (such as the $HOME/local/bin directory we set up for you in your .bashrc file), and the second option is to add the $WORK/src/SPAdes-3.13.0-Linux/bin folder to your path. This is a personal preference and I do not know how prevalent either choice is among researchers. I know that my preference is to copy executable to known locations in the path rather than add a ton of different directories to my path, but others may feel differently. Below I present both options:
Doing both of the following may cause unintended effects in the future (particularly if you attempt to update the version of SPAdes you are using) and I do not recommend it.
If you have modified your PATH variable, it is a good idea to log out of TACC and log back in before continuing.
Testing SPAdes installation
SPAdes comes with a self test option to make sure that the program is correctly installed. While this is not true of most programs, it is always a good idea to run whatever test data a program makes available rather than jumping straight into your own data as knowing there is an error in the program rather than your data makes troubleshooting very different.
mkdir $SCRATCH/GVA_SPAdes_tutorial cd $SCRATCH/GVA_SPAdes_tutorial spades.py --test
Assuming everything goes correctly, the last lines printed to the screen should be:
======= SPAdes pipeline finished. ========= TEST PASSED CORRECTLY. SPAdes log can be found here: <$SCRATCH>/GVA_SPAdes_tutorial/spades_test/spades.log Thank you for using SPAdes!
The lines immediately above this text list different output files and results from the assembly and will be true of all SPAdes runs and can be helpful for keeping track of where all your output ends up.
If the end of the spades test gives different output do not continue.
Set 1: Plasmid SPAdes
Unlike other times in the class where we are concerned about being good TACC citizens and not hurting other people by the programs we run, assembly programs are exceptionally memory intensive and attempting to run on the head node may result in the program returning a memory error rather than useable results. When it comes time to assemble your own reference genome, remember to give each sample its own compute node rather than having multiple samples split a single node.
Set 2: SPAdes example Data
This assembly may take more than 2 hours to complete and like the next data set it may be better to run this as a job overnight and analyze the data in tomorrow's class. If you choose to attempt to run this data as an interactive session remember that if your idev session runs out of time, the program will stop working and you will have to start over. If you want to submit it as a job, take the commands you want to run (ie the commands that start with "spades.py" and put them in a file named 'commands' using nano. You can include both commands for this data set and the next data set in the same commands file. Finally use the 'launcher_creator.py' command to generate a ".slurm" file and check with instructor that everything is set up correctly so the job will run overnight.
Data
SPAdes provides two example data sets and benchmarks for how long the pipeline might take to run http://spades.bioinf.spbau.ru/release3.11.1/manual.html#sec1.3. In this section we will download the data for the standard E. coli data and compare their results with ours.
mkdir $SCRATCH/GVA_SPAdes_tutorial # you likely already did this when you ran the selftest cd $SCRATCH/GVA_SPAdes_tutorial wget http://spades.bioinf.spbau.ru/spades_test_datasets/ecoli_mc/s_6_1.fastq.gz wget http://spades.bioinf.spbau.ru/spades_test_datasets/ecoli_mc/s_6_2.fastq.gz
SPAdes Assembly
Now let's use SPAdes to assemble the reads. As always its a good idea to get a look at what kind of options the program accepts using the -h option. SPAdes is actually written in python and the base script name is "spades.py". There are additional scripts that change many of the default options such as metaspades.py, plasmidspades.py, and rnaspades.py or these options can be set from the main spades.py script with the flags --meta, --plasmid, --rna respectively.
Once you have figured out what options you need to use see if you can come up with a command to run on the single end and have the output go into a new directory called single_end using all 48 threads that are available (-t 48). The following command is expected to take between 60 and 70 minutes.
Set 3: Whole Genome Simulated Data
Program self tests are typically safe to run on the head node, but the rest of the Tutorial assumes that you are on an idev node. If you are not sure please ask for help.
Data
mkdir $SCRATCH/GVA_SPAdes_tutorial # you likely already did this when you ran the selftest cp $BI/ngs_course/velvet/data/*/* $SCRATCH/GVA_SPAdes_tutorial cd $SCRATCH/GVA_SPAdes_tutorial
Now we have a bunch of Illumina reads. These are simulated reads. If you'd ever like to simulate some on your own, you might try using Mason.
paired_end_2x100_ins_1500_c_20.fastq paired_end_2x100_ins_400_c_20.fastq single_end_100_c_50.fastq paired_end_2x100_ins_3000_c_20.fastq paired_end_2x100_ins_400_c_25.fastq paired_end_2x100_ins_3000_c_25.fastq paired_end_2x100_ins_400_c_50.fastq
There are 4 sets of simulated reads:
Set 1 | Set 2 | Set 3 | Set 4 | |
---|---|---|---|---|
Read Size | 100 | 100 | 100 | 100 |
Paired/Single Reads | Single | Paired | Paired | Paired |
Gap Sizes | NA | 400 | 400, 3000 | 400, 3000, 1500 |
Coverage | 50 | 50 | 25 for each subset | 20 for each subset |
Number of Subsets | 1 | 1 | 2 | 3 |
Note that these fastq files are "interleaved", with each read pair together one-after-the-other in the file. The #/1 and #/2 in the read names indicate the pairs.
tacc:~$ head paired_end_2x100_ins_1500_c_20.fastq @READ-1/1 TTTCACCGTTGACCAGCACCCAGTTCAGCGCCGCGCGACCACGATATTTTGGTAACAGCGAACCATGCAGATTAAATGCACCTGCGGGAGCGAGCTGCAA + *@A+<at:var at:name="55G" />T@@I&+@A+@@<at:var at:name="II" />G@+++A++GG++@++I@+@+G&/+I+GD+II@++G@@I?<at:var at:name="I" />@<at:var at:name="IIGGI" /><at:var at:name="A4" />6@A,+AT=<at:var at:name="G" />+@AA+GAG++@ @READ-1/2 TTAACACCGGGCTATAAGTACAATCTGACCGATATTAACGCCGCGATTGCCCTGACACAGTTAGTCAAATTAGAGCACCTCAACACCCGTCGGCGCGAAA + I@@H+A+@G+&+@AG+I>G+I@+CAIA++$+T<at:var at:name="GG" />@+++1+<at:var at:name="GI" />+ICI+A+@<at:var at:name="I" />++A+@@A.@<G@@+)GCGC%I@IIAA++++G+A;@+++@@@@6
Often your read pairs will be "separate" with the corresponding paired reads at the same index in two different files (each with exactly the same number of reads).
SPAdes Assembly
Now let's use SPAdes to assemble the reads. As always its a good idea to get a look at what kind of options the program accepts using the -h option. SPAdes is actually written in python and the base script name is "spades.py". There are additional scripts that change many of the default options such as metaspades.py, plasmidspades.py, and rnaspades.py or these options can be set from the main spades.py script with the flags --meta, --plasmid, --rna respectively.
Once you have figured out what options you need to use see if you can come up with a command to run on the single end and have the output go into a new directory called single_end using all 48 threads that are available (-t 48). The following command is expected to take between 60 and 70 minutes.
If you are planning to run a job overnight, consider adding additional combinations of reads as individual commands to get an idea of how different insert sizes can play a role in final contig lenghts. Just remember that insert length should always increase:
spades.py -t 48 -o 400_1500_3000 --12 paired_end_2x100_ins_400_c_50.fastq --12 paired_end_2x100_ins_1500_c_20.fastq --12 paired_end_2x100_ins_3000_c_25.fastq spades.py -t 48 -o 400_and_1500 --12 paired_end_2x100_ins_400_c_50.fastq --12 paired_end_2x100_ins_1500_c_20.fastq spades.py -t 48 -o 400_only --12 paired_end_2x100_ins_400_c_50.fastq
A warning on memory usage
SPAdes (and most/all other assemblers) usually take large amounts of RAM to complete. Running these 3 commands on a single node at the same time will likely use more RAM than is available on a single node so it's necessary to run them sequentially or on their own node. This should also underscore to you that you should not run this on the head node. If you are assembling large genomes or have high coverage depth data in the future, you will probably need to submit your jobs to the "largemem" queue rather than the "normal" que.
Interrogating the output
Explore each output directory that was created for each set of reads you interrogated. The actionable information is in the contigs.fa file. The contig file is a fasta file ordered based on the length of the individual contig in decreasing order. The names of each individual contig lists the number of the contig (largest contig being named NODE_1 next largest being named NODE_2 and so on) followed by the length of the contig, and the coverage (labeled as cov on the line). Generally, the lower number of total contigs and the larger the length of each are regarded as better assemblies, but the number of chromosomes present in the organism is an important factor as well.
The grep command can be quite useful for isolating the names of the contigs with the information, especially when combined with the -c option to count the total number of contigs, or piping the reults to head/tail or both head and tail to isolate the top/bottom contigs.
# Count the total number of contigs: grep -c "^>" single_end/contigs.fa # Determine the length of the 5 largest contigs: grep "^>" single_end/contigs.fa | head -n 5 # Determine the length of the 20 smallest contigs: grep "^>" single_end/contigs.fa | tail -n 20 # Determine the length of the 100th through 110th contigs: grep "^>" single_end/contigs.fa | head -n 110 | tail -n 10
If you ran multiple different combinations of reads for the simulated data how did the insert size effect the number of contigs? the length of the largest contigs?
The complete E. coli genome is about 4.6 Mb. Why weren't we able to assemble it, even with the "perfect" simulated data?
What comes next when working with your own data?
- Look for things: If you're just after a few homologs, an operon, etc. you're probably done. Think about what question you are trying to answer.
- You can turn the contigs.fa into a blast database (
formatdb
ormakeblastdb
depending on which version of blast you have) or try multiple sequence alignments through NCBIs blast. - If you built your contigs based on a normal/control sample you can map other reads to the contigs using bowtie2 to try to identify variants in other samples.
- If you don't think the contigs you have are "good enough"
- Try using Spades MismatchCorrector to see if you can improve the contigs you already have.
- Add additional sequencing libraries to try to connect some more contigs. Especially think about pacbio sequencing and oxford nanopore.