Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Overview

SPAdes is a De Bruijn graph assembler which has become the preferred assembler in numerous labs and workflows. In this tutorial we will use SPAdes to assemble an E. coli genome from simulated Illumina reads. Genome assembly is quite difficult (though if Oxford Nanopore lowers its error rate assembly will likely get much easier and involve new tools). Genome assembly should only be used when you can not find a reference genome that is close to your own, if you are engaged in metagenomic projects where you don't know what organisms may be present, and in situations where you believe you may have novel sequence insertions into a genome of interest (Note that in this case however you would actually want to grab reads that do not map to your reference genome (and their pair in the case of paired end and mate-pair sequencing) rather than performing these functions on the fastq files you get from the raw sequencing.

...

Code Block
languagebash
titleMaking a DIRectory named SRC in $WORK (the capital letters are your clues)
collapsetrue
mkdir $WORK/src

Do one of the following, (or both if you want practice moving files around):

  1. Try to use 'wget -h' before clicking below. When using wget it is often helpful to right click on a link and select 'copy link address' when the file you want is available through a download link.

With better read pairs that link more distant locations in the genome, there are fewer contigs, and contigs are are longer, giving us a more complete picture of linkage across the genome.

The complete E. coli genome is about 4.6 Mb. Why weren't we able to assemble it, even with this "perfect" data?

...

  1. Sometimes errors in reads lead to dead-ends in the graphs that are trimmed when they should not be.
  2. There are 7 nearly identical ribosomal RNA operons in E. coli spaced throughout the chromosome. Since each is >3000 bases, contigs cannot be connected across them using this data.

More assembly statistics: contig_stats.pl

The output file stats.txt contains information about every contig in the assembly, but it isn't sorted and can be a bit overwhelming.

You can generate some summary stats and graphs about each assembly using the contig_stats.pl script in the $BI/bin. You probably want to change into the directory of results for a specific assembly before running this command, since it generates several output files in the current working directory. You will need to copy the PNG files back to your computer to view it using scp.

Expand
titleNeed a hint?

Since you are moved into each of the results directories, you need to give the relevant location

Code Block
languagebash
titleNot still not sure? click here for the solution
collapsetrue
contig_stats.pl -plot -tool Velvet contigs.fa

Transfer back several of the different outputs into their own direcotry and being comparing them to determine which library and set of parameters seemed to work best.

What do I do now?

Many choices:

  1. Get a better assembly: maybe add a different library size, or go into a detailed genome completion project (commonly called "finishing") using a sequence assembly editor like consed or gap4 or AMOS. (Be careful though, the amount of data in NGS data sets can be very difficult for these programs to deal with, since many were designed for Sanger sequencing reads.) You may have a lot of PCR products to make to close gaps and/or to order and orient scaffolds. consed in particular makes this pretty easy, but it may still consume a lot more time and money than the initial shotgun assembly. You can identify some misassemblies by mapping the original reads to the assembly and then viewing them in IGV to look for discordant mate pairs, for example.
    Code Block
    languagebash
    titleHow to use wget to download directly to TACC
    collapsetrue

...

cd $WORK/src
wget http://cab.spbu.ru/files/release3.13.0/SPAdes-3.13.0-Linux.tar.gz

...

languagebash
titleHow to use SCP to transfer the downloaded file to TACC from your laptop
collapsetrue

Data

Tutorial assumes that you are on an idev node. If you are not sure please ask for help.

Code Block
titleMove to scratch, copy the raw data, and change into this directory for the tutorial
cds
mkdir GVA_SPAdes_tutorial
cp $BI/ngs_course/velvet/data/*/* GVA_SPAdes_tutorial
cd GVA_SPAdes_tutorial

Now we have a bunch of Illumina reads. These are simulated reads. If you'd ever like to simulate some on your own, you might try using Mason.

Code Block
titleFiles in the tutorial directory
paired_end_2x100_ins_1500_c_20.fastq  paired_end_2x100_ins_400_c_20.fastq  single_end_100_c_50.fastq
paired_end_2x100_ins_3000_c_20.fastq  paired_end_2x100_ins_400_c_25.fastq
paired_end_2x100_ins_3000_c_25.fastq  paired_end_2x100_ins_400_c_50.fastq

There are 4 sets of simulated reads:

...

Set 1

...

Set 2

...

Set 3

...

Set 4

...

Read Size

...

100

...

100

...

100

...

100

...

Paired/Single Reads

...

Single

...

Paired

...

Paired

...

Paired

...

Gap Sizes

...

NA

...

400

...

400, 3000

...

400, 3000, 1500

...

Coverage

...

50

...

50

...

25 for each subset

...

20 for each subset

...

Number of Subsets

...

1

...

1

...

2

...

3

Note that these fastq files are "interleaved", with each read pair together one-after-the-other in the file. The #/1 and #/2 in the read names indicate the pairs.

Code Block
titleInterleaved fastq
tacc:~$ head paired_end_2x100_ins_1500_c_20.fastq
@READ-1/1
TTTCACCGTTGACCAGCACCCAGTTCAGCGCCGCGCGACCACGATATTTTGGTAACAGCGAACCATGCAGATTAAATGCACCTGCGGGAGCGAGCTGCAA
+
*@A+<at:var at:name="55G" />T@@I&+@A+@@<at:var at:name="II" />G@+++A++GG++@++I@+@+G&/+I+GD+II@++G@@I?<at:var at:name="I" />@<at:var at:name="IIGGI" /><at:var at:name="A4" />6@A,+AT=<at:var at:name="G" />+@AA+GAG++@
@READ-1/2
TTAACACCGGGCTATAAGTACAATCTGACCGATATTAACGCCGCGATTGCCCTGACACAGTTAGTCAAATTAGAGCACCTCAACACCCGTCGGCGCGAAA
+
I@@H+A+@G+&+@AG+I>G+I@+CAIA++$+T<at:var at:name="GG" />@+++1+<at:var at:name="GI" />+ICI+A+@<at:var at:name="I" />++A+@@A.@<G@@+)GCGC%I@IIAA++++G+A;@+++@@@@6

Often your read pairs will be "separate" with the corresponding paired reads at the same index in two different files (each with exactly the same number of reads).

Velvet Assembly

Now let's use Velvet to assemble the reads.

First, you will need to load the velvet module.

Code Block
titleLoad the velvet module
collapsetrue
 module load velvet

Using velvet consists of a sequence of two commands:

  1. velveth - analyzes kmers in the reads in preparation for assembly
  2. velvetg - constructs the assembly and filters contigs from the graph

Look at the help for each program.

The <hash_length> parameter of velveth is the kmer value that is key to the assembly process. Choosing it controls the tradeoff between sensitivity (lower hash_length, more reads included, longer contigs) and specificity (higher hash length, less chance of misassembly, more reads ignored, shorter contigs)). There is more discussion about choosing an appropriate kmer value in the Velvet manual and in this blog post.

Velvet has an option of specifying the insertion size of a paired read file (-ins_length). If no size is given, Velvet will guess the insertion size. We'll just have Velvet guess the size.

Velvet also has an option to specify the expected coverage of the genome, which helps it choose how to resolve repeated sequences (-exp_cov). We set this parameter to estimate this from the data. A common problem with using Velvet is that you have many very short contigs and the last line of output tells you that it used 0 of your reads. This is caused by not setting this parameter. The default is NOT auto.

The following commands will each take some time to run, while they run, read ahead to learn more about what you will be seeing:

Code Block
languagebash
titleRun 1 line at a time (4 lines total) on an idev node. If you copy and paste, be sure that you get both the velveth command and the velvetg command after the && symbol on the same line
velveth single_out 61 -fastq single_end_100_c_50.fastq && velvetg single_out -exp_cov auto -amos_file yes
velveth pairedc20_out 61 -fastq -shortPaired paired_end_2x100_ins_3000_c_20.fastq paired_end_2x100_ins_1500_c_20.fastq paired_end_2x100_ins_400_c_20.fastq && velvetg pairedc20_out -exp_cov auto -amos_file yes
velveth pairedc25_out 61 -fastq -shortPaired paired_end_2x100_ins_3000_c_25.fastq paired_end_2x100_ins_400_c_25.fastq && velvetg pairedc25_out -exp_cov auto -amos_file yes
velveth pairedc50_out 61 -fastq -shortPaired paired_end_2x100_ins_400_c_50.fastq && velvetg pairedc50_out -exp_cov auto -amos_file yes
Warning
titleA warning on memory usage

Velvet and other assemblers usually take large amounts of RAM to complete. Running these 4 commands on a single node at the same time will likely use more RAM than is available on a single node so it's necessary to run them sequentially. This should also underscore to you that you should not run this on the head node. If you are assembling large genomes or have high coverage depth data in the future, you will probably need to submit your jobs to the "largemem" queue rather than the standard que.

Tip
titleWhat does the && symbol mean?

As we discussed in our piping tutorial, things separated by a ';' mark will execute 1 after the other. In the case of &&, the second command will only execute if the first command finishes correctly (exit status zero to get technical). This can be useful to consider when building a pipeline to limit progression from 1 program to another when something has failed, BUT only if you understand exit states of each program.

Velvet Output

Explore each output directory that was created by Velvet.

Checking the tail of the Log files in each of the output folders, we see lines like the following:

...

Code Block
titleSingle Set
Median coverage depth = 2.657895
Final graph has 9748 nodes and n50 of 191, max 1427, total 1865207, using 281499/2314900 reads
Code Block
titleSet with one group of reads at 50 coverage
Median coverage depth = 11.131337
Final graph has 265 nodes and n50 of 127102, max 397974, total 4558511, using 1464201/2314900 reads
Code Block
titleSet with 2 groups of reads both at 25 coverage each
Median coverage depth = 11.109244
Final graph has 203 nodes and n50 of 698134, max 1032531, total 4585717, using 1465818/2314900 reads
Code Block
titleSet with 3 groups of reads all at 20 coverage each
Median coverage depth = 13.353287
Final graph has 202 nodes and n50 of 698626, max 1139610, total 4602729, using 1758595/2777880 reads
  1. #Note that idev nodes have a tendency to download files from the internet much slower than the same file would download from the head node. If you are already in an idev node, it is likely faster to logout of the idev node (with the 'logout' command), execute the wget command listed below on the head node, and then start a new idev node after the download is complete.
    cd $WORK/src
    wget http://cab.spbu.ru/files/release3.13.0/SPAdes-3.13.0-Linux.tar.gz
  2. Remember that scp has 2 parts after the command name just like the cp command: 1. the location the file currently is, and 2. the location you want to copy the file to. Most of the class has dealt with moving things from TACC to your computer, but in this case things will move the opposite direction, think about what needs to change about your SCP command to accomplish this.

    Code Block
    languagebash
    titleHow to use SCP to transfer the downloaded file to TACC from your laptop (MAC)
    collapsetrue
    In a terminal window of your laptop not LS5
    scp ~/Downloads/SPAdes-3.13.0-Linux.tar.gz <taccuserID>@ls5.tacc.utexas.edu:<$WORK pwd>/src # Note you need to replace $WORK with the output from the pwd command on TACC

Once the .tar.gz file has been placed in the $WORK/src folder using one of the above options, you need to extract the files.

Code Block
languagebash
titlehint eXtracting a .tar.gz file is the opposite of Creating one (hints are in the capital letters)
collapsetrue
cd $WORK
tar -xvzf SPAdes-3.13.0-Linux.tar.gz
# from the help file:
  # x = Extract
  # v = verbose
  # z = file is also gzipped
  # f = force 

Now that the files have been extracted you have a choice in how to use them: 1 option is to copy the binary files to a location that is already in your path (such as the $HOME/local/bin directory we set up for you in your .bashrc file), and the second option is to add the $WORK/src/SPAdes-3.13.0-Linux/bin folder to your path. This is a personal preference and I do not know how prevalent either choice is among researchers. I know that my preference is to copy executable to known locations in the path rather than add a ton of different directories to my path, but others may feel differently. Below I present both options:

Warning

Doing both of the following may cause unintended effects in the future (particularly if you attempt to update the version of SPAdes you are using) and I do not recommend it.

Code Block
languagebash
titleCopy executables to somewhere already in your path (IE $HOME/local/bin)
collapsetrue
cp $WORK/src/SPAdes-3.13.0-Linux/bin/* $HOME/local/bin  #Note that by specifying the full path all the files and the destination, this command can be run from anywhere on TACC.
Code Block
languagebash
titleSuggested line to add to your .bashrc file to directly access executables
collapsetrue
export PATH=$WORK/src/SPAdes-3.13.0-Linux/bin:$PATH
# This line must be added to the .bashrc file found in your $HOME directory, in section 2. I typically add all modifications 1 after the other so the most recent thing i have added is the last line in this section.

If you have modified your PATH variable, it is a good idea to log out of TACC and log back in before continuing.

Testing SPAdes installation

SPAdes comes with a self test option to make sure that the program is correctly installed. While this is not true of most programs, it is always a good idea to run whatever test data a program makes available rather than jumping straight into your own data as knowing there is an error in the program rather than your data makes troubleshooting very different. 

Code Block
languagebash
titleSPAdes self test
mkdir $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial
spades.py --test

Assuming everything goes correctly, the last lines printed to the screen should be:

Code Block
titleCorrect SPAdes output
linenumberstrue
======= SPAdes pipeline finished.

========= TEST PASSED CORRECTLY.

SPAdes log can be found here: <$SCRATCH>/GVA_SPAdes_tutorial/spades_test/spades.log

Thank you for using SPAdes!

The lines immediately above this text list different output files and results from the assembly and will be true of all SPAdes runs and can be helpful for keeping track of where all your output ends up.

Warning
titleIf the end of the spades test gives different output do not continue.


Set 1: Plasmid SPAdes

Tip

Unlike other times in the class where we are concerned about being good TACC citizens and not hurting other people by the programs we run, assembly programs are exceptionally memory intensive and attempting to run on the head node may result in the program returning a memory error rather than useable results. When it comes time to assemble your own reference genome, remember to give each sample its own compute node rather than having multiple samples split a single node.

Assembling even small bacterial genomes can be incredibly time intensive (as well as memory intensive as highlighted above). Fortunately for this class, we can make use of the plasmid spades option to assemble and even smaller plasmid genome that is ~2000 bp long in only a few minutes. I analyzing this data on an idev node and then submitting the other data analysis for the bacterial genomes as a job to run overnight.

Data

Download the paired end fastq files which have had their adapters trimmed from the $BI/gva_course/Assembly/ directory.

Code Block
languagebash
titleYou should know the copy command by now, try to get it on your own before checking your answer here
collapsetrue
cp $BI/gva_course/Assembly/*.fastq.gz $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial

SPAdes Assembly

Now let's use SPAdes to assemble the reads. As always its a good idea to get a look at what kind of options the program accepts using the -h option. SPAdes is actually written in python and the base script name is "spades.py". There are additional scripts that change many of the default options such as metaspades.py, plasmidspades.py, and rnaspades.py or these options can be set from the main spades.py script with the flags --meta, --plasmid, --rna respectively. For this tutorial lets use plasmidspades.py

Expand
titleUsing the -h option, can you determine what the only required option(s) for the spades program is/are?

The first option in the basic option is:

-o<output_dir>directory to store all the resulting files (required)

And we will need to supply the read files to the program. In this case we are looking for the following options:

-1<filename>file with forward paired-end reads

-2<filename>file with reverse paired-end reads

Once you have figured out what options you need to use see if you can come up with a command to run on the single end and have the output go into a new directory called single_end using all 48 threads that are available (-t 48). The following command is expected to take less than 2 minutes.


Code Block
languagebash
titleDid you come up with the same thing I did?
collapsetrue
plasmidspades.py -t 48 -o plasmid -1 SH1_1P.fastq.gz -2 SH1_2P.fastq.gz 

Set 2: SPAdes example Data

This assembly may take more than 2 hours to complete and like the next data set it may be better to run this as a job overnight and analyze the data in tomorrow's class. If you choose to attempt to run this data as an interactive session remember that if your idev session runs out of time, the program will stop working and you will have to start over. If you want to submit it as a job, take the commands you want to run (ie the commands that start with "spades.py" and put them in a file named 'commands' using nano. You can include both commands for this data set and the next data set in the same commands file. Finally use the 'launcher_creator.py' command to generate a ".slurm" file and check with instructor that everything is set up correctly so the job will run overnight.

Data

SPAdes provides two example data sets and benchmarks for how long the pipeline might take to run http://spades.bioinf.spbau.ru/release3.11.1/manual.html#sec1.3. In this section we will download the data for the standard E. coli data and compare their results with ours.

Code Block
languagebash
titleDownload data
mkdir  $SCRATCH/GVA_SPAdes_tutorial # you likely already did this when you ran the selftest
cd $SCRATCH/GVA_SPAdes_tutorial
wget http://spades.bioinf.spbau.ru/spades_test_datasets/ecoli_mc/s_6_1.fastq.gz
wget http://spades.bioinf.spbau.ru/spades_test_datasets/ecoli_mc/s_6_2.fastq.gz

SPAdes Assembly

Now let's use SPAdes to assemble the reads. As always its a good idea to get a look at what kind of options the program accepts using the -h option. SPAdes is actually written in python and the base script name is "spades.py". There are additional scripts that change many of the default options such as metaspades.py, plasmidspades.py, and rnaspades.py or these options can be set from the main spades.py script with the flags --meta, --plasmid, --rna respectively.

Expand
titleUsing the -h option, can you determine what the only required option(s) for the spades program is/are?

The first option in the basic option is:

-o<output_dir>directory to store all the resulting files (required)

And we will need to supply the read files to the program. In this case we are looking for the following options:

-1<filename>file with forward paired-end reads

-2<filename>file with reverse paired-end reads

Once you have figured out what options you need to use see if you can come up with a command to run on the single end and have the output go into a new directory called single_end using all 48 threads that are available (-t 48). The following command is expected to take between 60 and 70 minutes.


Code Block
languagebash
titleDid you come up with the same thing I did?
collapsetrue
spades.py -o example -t 48 -1 s_6_1.fastq.gz -2 s_6_2.fastq.gz

Set 3: Whole Genome Simulated Data

Program self tests are typically safe to run on the head node, but the rest of the Tutorial assumes that you are on an idev node. If you are not sure please ask for help. 

Data

Code Block
titleMove to scratch, copy the raw data, and change into this directory for the tutorial
mkdir  $SCRATCH/GVA_SPAdes_tutorial # you likely already did this when you ran the selftest
cp $BI/ngs_course/velvet/data/*/* $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial

Now we have a bunch of Illumina reads. These are simulated reads. If you'd ever like to simulate some on your own, you might try using Mason.

Code Block
titleFiles in the tutorial directory
paired_end_2x100_ins_1500_c_20.fastq  paired_end_2x100_ins_400_c_20.fastq  single_end_100_c_50.fastq
paired_end_2x100_ins_3000_c_20.fastq  paired_end_2x100_ins_400_c_25.fastq
paired_end_2x100_ins_3000_c_25.fastq  paired_end_2x100_ins_400_c_50.fastq

There are 4 sets of simulated reads:


Set 1

Set 2

Set 3

Set 4

Read Size

100

100

100

100

Paired/Single Reads

Single

Paired

Paired

Paired

Gap Sizes

NA

400

400, 3000

400, 3000, 1500

Coverage

50

50

25 for each subset

20 for each subset

Number of Subsets

1

1

2

3

Note that these fastq files are "interleaved", with each read pair together one-after-the-other in the file. The #/1 and #/2 in the read names indicate the pairs.

Code Block
titleInterleaved fastq
tacc:~$ head paired_end_2x100_ins_1500_c_20.fastq
@READ-1/1
TTTCACCGTTGACCAGCACCCAGTTCAGCGCCGCGCGACCACGATATTTTGGTAACAGCGAACCATGCAGATTAAATGCACCTGCGGGAGCGAGCTGCAA
+
*@A+<at:var at:name="55G" />T@@I&+@A+@@<at:var at:name="II" />G@+++A++GG++@++I@+@+G&/+I+GD+II@++G@@I?<at:var at:name="I" />@<at:var at:name="IIGGI" /><at:var at:name="A4" />6@A,+AT=<at:var at:name="G" />+@AA+GAG++@
@READ-1/2
TTAACACCGGGCTATAAGTACAATCTGACCGATATTAACGCCGCGATTGCCCTGACACAGTTAGTCAAATTAGAGCACCTCAACACCCGTCGGCGCGAAA
+
I@@H+A+@G+&+@AG+I>G+I@+CAIA++$+T<at:var at:name="GG" />@+++1+<at:var at:name="GI" />+ICI+A+@<at:var at:name="I" />++A+@@A.@<G@@+)GCGC%I@IIAA++++G+A;@+++@@@@6

Often your read pairs will be "separate" with the corresponding paired reads at the same index in two different files (each with exactly the same number of reads).

SPAdes Assembly

Now let's use SPAdes to assemble the reads. As always its a good idea to get a look at what kind of options the program accepts using the -h option. SPAdes is actually written in python and the base script name is "spades.py". There are additional scripts that change many of the default options such as metaspades.py, plasmidspades.py, and rnaspades.py or these options can be set from the main spades.py script with the flags --meta, --plasmid, --rna respectively.

Expand
titleUsing the -h option, can you determine what the only required option(s) for the spades program is/are?

The first option in the basic option is:

-o<output_dir>directory to store all the resulting files (required)

And we will need to supply the read files to the program. In our case we are looking for the following options:

--12<filename>file with interlaced forward and reverse paired-end reads

-s<filename>file with unpaired reads



It would be more common for us to be using -1 and -2 for each of the paired end reads in normal situations rather than the -12 option, but as mentioned above this data is supplied to you as interleaved which many/most programs will accept, but require you to specify them differently

Once you have figured out what options you need to use see if you can come up with a command to run on the single end and have the output go into a new directory called single_end using all 48 threads that are available (-t 48). The following command is expected to take between 60 and 70 minutes.

Code Block
languagebash
titleDid you come up with the same thing I did?
collapsetrue
spades.py -t 48 -o single_end -s single_end_100_c_50.fastq 


If you are planning to run a job overnight, consider adding additional combinations of reads as individual commands to get an idea of how different insert sizes can play a role in final contig lenghts. Just remember that insert length should always increase:

Code Block
languagebash
titlePossible other commands
linenumberstrue
spades.py -t 48 -o 400_1500_3000 --12 paired_end_2x100_ins_400_c_50.fastq --12 paired_end_2x100_ins_1500_c_20.fastq --12 paired_end_2x100_ins_3000_c_25.fastq
spades.py -t 48 -o 400_and_1500 --12 paired_end_2x100_ins_400_c_50.fastq --12 paired_end_2x100_ins_1500_c_20.fastq
spades.py -t 48 -o 400_only --12 paired_end_2x100_ins_400_c_50.fastq 
Warning
titleA warning on memory usage

SPAdes (and most/all other assemblers) usually take large amounts of RAM to complete. Running these 3 commands on a single node at the same time will likely use more RAM than is available on a single node so it's necessary to run them sequentially or on their own node. This should also underscore to you that you should not run this on the head node. If you are assembling large genomes or have high coverage depth data in the future, you will probably need to submit your jobs to the "largemem" queue rather than the "normal" que.


Interrogating the output

Explore each output directory that was created for each set of reads you interrogated. The actionable information is in the contigs.fasta file. The contig file is a fasta file ordered based on the length of the individual contig in decreasing order. The names of each individual contig lists the number of the contig (largest contig being named NODE_1 next largest being named NODE_2 and so on) followed by the length of the contig, and the coverage (labeled as cov on the line). Generally, the lower number of total contigs and the larger the length of each are regarded as better assemblies, but the number of chromosomes present in the organism is an important factor as well.

The grep command can be quite useful for isolating the names of the contigs with the information, especially when combined with the -c option to count the total number of contigs, or piping the reults to head/tail or both head and tail to isolate the top/bottom contigs.

Code Block
languagebash
titleExample grep commands
# Count the total number of contigs:
grep -c "^>" single_end/contigs.fasta

# Determine the length of the 5 largest contigs:
grep "^>" single_end/contigs.fasta | head -n 5

# Determine the length of the 20 smallest contigs:
grep "^>" single_end/contigs.fasta | tail -n 20

# Determine the length of the 100th through 110th contigs:
grep "^>" single_end/contigs.fasta | head -n 110 | tail -n 10

If you ran multiple different combinations of reads for the simulated data how did the insert size effect the number of contigs? the length of the largest contigs? Why might larger insert sizes not help things very much?

Expand
titleAnswer

The length of repetitive elements in the genome plays a large role in how easily it can be assembled as large repeats need even larger insert sizes to be spanned by single read pairs.

The complete E. coli genome is about 4.6 Mb. Why weren't we able to assemble it, even with the "perfect" simulated data?

Expand
titleanswers

There are 7 nearly identical ribosomal RNA operons in E. coli spaced throughout the chromosome. Since each is >3000 bases, contigs cannot be connected across them using this data.

What comes next when working with your own data?

  1. Look for things: If you're just after a few homologs, an operon, etc. you're probably done. Most assemblers will be able to take 2x100 data and give you full gene sequences since these are non-repetitive and so assemble well, and obviously 2x600 miSEQ runs will do even better. You can turn the contigs.fa into a blast database (formatdb or makeblastdb depending on which version of blast you have) and start blasting away.

Further Reading

...

  1. Think about what question you are trying to answer. 
  2. You can turn the contigs.fa into a blast database (formatdb or makeblastdb depending on which version of blast you have) or try multiple sequence alignments through NCBIs blast.
  3. If you built your contigs based on a normal/control sample you can map other reads to the contigs using bowtie2 to try to identify variants in other samples.
  4. If you don't think the contigs you have are "good enough"
    1. Try using Spades MismatchCorrector to see if you can improve the contigs you already have.
    2. Add additional sequencing libraries to try to connect some more contigs. Especially think about pacbio sequencing and oxford nanopore.

Original spades publication

Return to GVA2019