Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Warning
titleIf the end of the spades test gives different output do not continue.


Set 1: Plasmid SPAdes

Set 2: SPAdes example Data

SPAdes provides two example data sets and benchmarks for how long the pipeline might take to run http://spades.bioinf.spbau.ru/release3.11.1/manual.html#sec1.3. In this section we will download the data for the standard E. coli data and compare their results with ours.

Code Block
languagebash
titleDownload data
wget http://spades.bioinf.spbau.ru/spades_test_datasets/ecoli_mc/s_6_1.fastq.gz
wget http://spades.bioinf.spbau.ru/spades_test_datasets/ecoli_mc/s_6_2.fastq.gz



Set 3: Whole Genome Simulated Data

Program self tests are typically safe to run on the head node, but the rest of the Tutorial assumes that you are on an idev node. If you are not sure please ask for help. 

Tip

Unlike other times in the class where we are concerned about being good TACC citizens and not hurting other people by the programs we run, assembly programs are exceptionally memory intensive and attempting to run on the head node will likely result in the program returning a memory error rather than useable results. When it comes time to assemble your own reference genome, remember to give each sample its own compute node rather than having multiple samples split a single node.

Data

Code Block
titleMove to scratch, copy the raw data, and change into this directory for the tutorial
mkdir  $SCRATCH/GVA_SPAdes_tutorial # you likely already did this when you ran the selftest
cp $BI/ngs_course/velvet/data/*/* $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial

...

Often your read pairs will be "separate" with the corresponding paired reads at the same index in two different files (each with exactly the same number of reads).

SPAdes Assembly

Now let's use SPAdes to assemble the reads. As always its a good idea to get a look at what kind of options the program accepts using the -h option. SPAdes is actually written in python and the base script name is "spades.py". There are additional scripts that change many of the default options such as metaspades.py, plasmidspades.py, and rnaspades.py or these options can be set from the main spades.py script with the flags --meta, --plasmid, --rna respectively.

...

Expand
titleanswers
  1. Sometimes errors in reads lead to dead-ends in the graphs that are trimmed when they should not be.
  2. There are 7 nearly identical ribosomal RNA operons in E. coli spaced throughout the chromosome. Since each is >3000 bases, contigs cannot be connected across them using this data.

What comes next when working with your own data?

  1. Look for things: If you're just after a few homologs, an operon, etc. you're probably done. Think about what question you are trying to answer. 
  2. You can turn the contigs.fa into a blast database (formatdb or makeblastdb depending on which version of blast you have) or try multiple sequence alignments through NCBIs blast.
  3. If you built your contigs based on a normal/control sample you can map other reads to the contigs using bowtie2 to try to identify variants in other samples.
  4. If you don't think the contigs you have are "good enough"
    1. Try using Spades MismatchCorrector to see if you can improve the contigs you already have.
    2. Add additional sequencing libraries to try to connect some more contigs. Especially think about pacbio sequencing and oxford nanopore.

...