Page Comparison

...

If you have modified your PATH variable, it is a good idea to log out of TACC and log back in before continuing.

Data

Tutorial assumes that you are on an idev node. If you are not sure please ask for help.

...

title	Move to scratch, copy the raw data, and change into this directory for the tutorial

...

Testing SPAdes installation

SPAdes comes with a self test option to make sure that the program is correctly installed. While this is not true of most programs, it is always a good idea to run whatever test data a program makes available rather than jumping straight into your own data as knowing there is an error in the program rather than your data makes troubleshooting very different.

Code Block

language	bash
title	SPAdes self test

mkdir $SCRATCH/GVA_SPAdes_tutorial
cpcd $BI$SCRATCH/ngs_course/velvet/data/*/* GVA_SPAdes_tutorial
cd GVA_SPAdes_tutorial

Now we have a bunch of Illumina reads. These are simulated reads. If you'd ever like to simulate some on your own, you might try using Mason.

Code Block

title	Files in the tutorial directory

paired_end_2x100_ins_1500_c_20.fastq  paired_end_2x100_ins_400_c_20.fastq  single_end_100_c_50.fastq
paired_end_2x100_ins_3000_c_20.fastq  paired_end_2x100_ins_400_c_25.fastq
paired_end_2x100_ins_3000_c_25.fastq  paired_end_2x100_ins_400_c_50.fastq

There are 4 sets of simulated reads:

...

Set 1

...

Set 2

...

Set 3

...

Set 4

...

Read Size

...

100

...

100

...

100

...

100

...

Paired/Single Reads

...

Single

...

Paired

...

Paired

...

Paired

...

Gap Sizes

...

NA

...

400

...

400, 3000

...

400, 3000, 1500

...

Coverage

...

50

...

50

...

25 for each subset

...

20 for each subset

...

Number of Subsets

...

1

...

1

...

2

...

3

Note that these fastq files are "interleaved", with each read pair together one-after-the-other in the file. The #/1 and #/2 in the read names indicate the pairs.

Code Block

title	Interleaved fastq

tacc:~$ head paired_end_2x100_ins_1500_c_20.fastq
@READ-1/1
TTTCACCGTTGACCAGCACCCAGTTCAGCGCCGCGCGACCACGATATTTTGGTAACAGCGAACCATGCAGATTAAATGCACCTGCGGGAGCGAGCTGCAA
+
*@A+<at:var at:name="55G" />T@@I&+@A+@@<at:var at:name="II" />G@+++A++GG++@++I@+@+G&/+I+GD+II@++G@@I?<at:var at:name="I" />@<at:var at:name="IIGGI" /><at:var at:name="A4" />6@A,+AT=<at:var at:name="G" />+@AA+GAG++@
@READ-1/2
TTAACACCGGGCTATAAGTACAATCTGACCGATATTAACGCCGCGATTGCCCTGACACAGTTAGTCAAATTAGAGCACCTCAACACCCGTCGGCGCGAAA
+
I@@H+A+@G+&+@AG+I>G+I@+CAIA++$+T<at:var at:name="GG" />@+++1+<at:var at:name="GI" />+ICI+A+@<at:var at:name="I" />++A+@@A.@<G@@+)GCGC%I@IIAA++++G+A;@+++@@@@6

Often your read pairs will be "separate" with the corresponding paired reads at the same index in two different files (each with exactly the same number of reads).

Velvet Assembly

Now let's use Velvet to assemble the reads.

First, you will need to load the velvet module.

Code Block

title	Load the velvet module
collapse	true

 module load velvet

Using velvet consists of a sequence of two commands:

...

GVA_SPAdes_tutorial
spades.py --test

Assuming everything goes correctly, the last lines printed to the screen should be:

Code Block

title	Correct SPAdes output
linenumbers	true

======= SPAdes pipeline finished.

========= TEST PASSED CORRECTLY.

SPAdes log can be found here: <$SCRATCH>/GVA_SPAdes_tutorial/spades_test/spades.log

Thank you for using SPAdes!

The lines immediately above this text list different output files and results from the assembly and will be true of all SPAdes runs and can be helpful for keeping track of where all your output ends up.

Warning

title	If the end of the spades test gives different output do not continue.

Data

Program self tests are typically safe to run on the head node, but the rest of the Tutorial assumes that you are on an idev node. If you are not sure please ask for help.

Tip

Unlike other times in the class where we are concerned about being good TACC citizens and not hurting other people by the programs we run, assembly programs are exceptionally memory intensive and attempting to run on the head node will likely result in the program returning a memory error rather than useable results. When it comes time to assemble your own reference genome, remember to give each sample its own compute node rather than having multiple samples split a single node.

Code Block

title	Move to scratch, copy the raw data, and change into this directory for the tutorial

mkdir  $SCRATCH/GVA_SPAdes_tutorial # you likely already did this when you ran the selftest
cp $BI/ngs_course/velvet/data/*/* $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial

Now we have a bunch of Illumina reads. These are simulated reads. If you'd ever like to simulate some on your own, you might try using Mason.

Code Block

title	Files in the tutorial directory

paired_end_2x100_ins_1500_c_20.fastq  paired_end_2x100_ins_400_c_20.fastq  single_end_100_c_50.fastq
paired_end_2x100_ins_3000_c_20.fastq  paired_end_2x100_ins_400_c_25.fastq
paired_end_2x100_ins_3000_c_25.fastq  paired_end_2x100_ins_400_c_50.fastq

There are 4 sets of simulated reads:

	Set 1	Set 2	Set 3	Set 4
Read Size	100	100	100	100
Paired/Single Reads	Single	Paired	Paired	Paired
Gap Sizes	NA	400	400, 3000	400, 3000, 1500
Coverage	50	50	25 for each subset	20 for each subset
Number of Subsets	1	1	2	3

Note that these fastq files are "interleaved", with each read pair together one-after-the-other in the file. The #/1 and #/2 in the read names indicate the pairs.

Code Block

title	Interleaved fastq

tacc:~$ head paired_end_2x100_ins_1500_c_20.fastq
@READ-1/1
TTTCACCGTTGACCAGCACCCAGTTCAGCGCCGCGCGACCACGATATTTTGGTAACAGCGAACCATGCAGATTAAATGCACCTGCGGGAGCGAGCTGCAA
+
*@A+<at:var at:name="55G" />T@@I&+@A+@@<at:var at:name="II" />G@+++A++GG++@++I@+@+G&/+I+GD+II@++G@@I?<at:var at:name="I" />@<at:var at:name="IIGGI" /><at:var at:name="A4" />6@A,+AT=<at:var at:name="G" />+@AA+GAG++@
@READ-1/2
TTAACACCGGGCTATAAGTACAATCTGACCGATATTAACGCCGCGATTGCCCTGACACAGTTAGTCAAATTAGAGCACCTCAACACCCGTCGGCGCGAAA
+
I@@H+A+@G+&+@AG+I>G+I@+CAIA++$+T<at:var at:name="GG" />@+++1+<at:var at:name="GI" />+ICI+A+@<at:var at:name="I" />++A+@@A.@<G@@+)GCGC%I@IIAA++++G+A;@+++@@@@6

Often your read pairs will be "separate" with the corresponding paired reads at the same index in two different files (each with exactly the same number of reads).

SPAdes Assembly

Now let's use SPAdes to assemble the reads. As always its a good idea to get a look at what kind of options the program accepts using the -h option. SPAdes is actually written in python and the base script name is "spades.py". There are additional scripts that change many of the default options such as metaspades.py, plasmidspades.py, and rnaspades.py or these options can be set from the main spades.py script with the flags --meta, --plasmid, --rna respectively.

Expand

title	Using the -h option, can you determine what the only required option(s) for the spades program is/are?

The first option in the basic option is:

-o<output_dir>directory to store all the resulting files (required)

And we will need to supply the read files to the program. In our case we are looking for the following options:

--12<filename>file with interlaced forward and reverse paired-end reads

-s<filename>file with unpaired reads

It would be more common for us to be using -1 and -2 for each of the paired end reads in normal situations rather than the -12 option, but as mentioned above this data is supplied to you as interleaved which many/most programs will accept, but require you to specify them differently

Once you have figured out what options you need to use see if you can come up with a command to run on the single end and have the output go into a new directory called single_end

Code Block

language	bash
title	Did you come up with the same thing I did?
collapse	true

spades.py -s single_end_100_c_50.fastq -o single_end

Look at the help for each program.

...

Code Block

language	bash
title	Run 1 line at a time (4 lines total) on an idev node. If you copy and paste, be sure that you get both the velveth command and the velvetg command after the && symbol on the same lineCommand for running the first data set

velveth single_out 61 -fastq single_end_100_c_50.fastq && velvetg single_out -exp_cov auto -amos_file yes
velveth pairedc20_out 61 -fastq -shortPaired paired_end_2x100_ins_3000_c_20.fastq paired_end_2x100_ins_1500_c_20.fastq paired_end_2x100_ins_400_c_20.fastq && velvetg pairedc20_out -exp_cov auto -amos_file yes
velveth pairedc25_out 61 -fastq -shortPaired paired_end_2x100_ins_3000_c_25.fastq paired_end_2x100_ins_400_c_25.fastq && velvetg pairedc25_out -exp_cov auto -amos_file yes
velveth pairedc50_out 61 -fastq -shortPaired paired_end_2x100_ins_400_c_50.fastq && velvetg pairedc50_out -exp_cov auto -amos_file yes

...

Versions Compared

Old Version 5

New Version 6

Key

Data

Testing SPAdes installation

Velvet Assembly

Data

SPAdes Assembly