If you have modified your PATH variable, it is a good idea to log out of TACC and log back in before continuing.

Testing SPAdes installation

SPAdes comes with a self-test option to confirm that the program is installed correctly. Most programs do not offer this, but whenever a program makes test data available it is a good idea to run it before jumping straight into your own data: knowing whether an error comes from the program or from your data changes how you troubleshoot.

Code Block
languagebash
titleSPAdes self test
mkdir $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial
spades.py --test

Assuming everything goes correctly, the last lines printed to the screen should be:

Code Block
titleCorrect SPAdes output
linenumberstrue
======= SPAdes pipeline finished.

========= TEST PASSED CORRECTLY.

SPAdes log can be found here: <$SCRATCH>/GVA_SPAdes_tutorial/spades_test/spades.log

Thank you for using SPAdes!

The lines printed immediately above this text list the output files and results from the assembly. SPAdes prints this summary at the end of every run, which makes it helpful for keeping track of where all your output ends up.
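If you would rather not check the end of the output by eye, you can save the screen output and search it for the success line. The following is a minimal sketch, not part of SPAdes itself: the helper name `selftest_passed` and the file name `selftest_output.txt` are hypothetical, and a stand-in file is written to a temporary directory so the example runs without SPAdes installed.

```bash
# Hypothetical helper: succeeds only if the saved output contains the
# success line that the SPAdes self test prints.
selftest_passed() {
    grep -q "TEST PASSED CORRECTLY" "$1"
}

# Stand-in for output captured with something like:
#   spades.py --test | tee selftest_output.txt
demo_dir=$(mktemp -d)
printf '%s\n' "======= SPAdes pipeline finished." "" \
    "========= TEST PASSED CORRECTLY." > "$demo_dir/selftest_output.txt"

if selftest_passed "$demo_dir/selftest_output.txt"; then
    selftest_status="passed"
else
    selftest_status="failed: do not continue"
fi
echo "SPAdes self test $selftest_status"
```

On a real run you would point `selftest_passed` at the file you captured instead of the stand-in.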

Warning
titleIf the end of the SPAdes self test gives different output, do not continue.


Data

Program self tests are typically safe to run on the head node, but the rest of the tutorial assumes that you are on an idev node. If you are not sure, please ask for help.

Tip

Elsewhere in the class we stay off the head node in order to be good TACC citizens and not hurt other people with the programs we run. Here there is an additional, practical reason: assembly programs are exceptionally memory intensive, and attempting to run one on the head node will likely result in a memory error rather than usable results. When it comes time to assemble your own reference genome, remember to give each sample its own compute node rather than having multiple samples share a single node.
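If you are unsure which kind of node you are on, the hostname is a reasonable hint. The sketch below rests on an assumption that is not stated in this tutorial: that head (login) node hostnames begin with "login" and anything else is a compute node. The helper name `node_type` is hypothetical; when in doubt, ask for help rather than trusting the guess.

```bash
# Hypothetical helper: guess the node type from a hostname.
# Assumption: login (head) node hostnames start with "login".
node_type() {
    case "$1" in
        login*) echo "login node" ;;
        *)      echo "compute node" ;;
    esac
}

# uname -n prints the hostname of the machine you are currently on.
echo "You appear to be on a $(node_type "$(uname -n)")"
```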

Code Block
titleMove to scratch, copy the raw data, and change into this directory for the tutorial
mkdir -p $SCRATCH/GVA_SPAdes_tutorial  # -p: no error if you already made this directory for the self test
cp $BI/ngs_course/velvet/data/*/* $SCRATCH/GVA_SPAdes_tutorial  # copy the simulated reads
cd $SCRATCH/GVA_SPAdes_tutorial

Now we have a bunch of simulated Illumina reads. If you'd ever like to simulate some on your own, you might try using Mason.

Code Block
titleFiles in the tutorial directory
paired_end_2x100_ins_1500_c_20.fastq  paired_end_2x100_ins_400_c_20.fastq  single_end_100_c_50.fastq
paired_end_2x100_ins_3000_c_20.fastq  paired_end_2x100_ins_400_c_25.fastq
paired_end_2x100_ins_3000_c_25.fastq  paired_end_2x100_ins_400_c_50.fastq
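A quick sanity check after copying is to count the reads in each file. The sketch below assumes standard four-line FASTQ records (no wrapped sequence lines), so the read count is the line count divided by four. The helper `count_reads` and the tiny stand-in file are made up for illustration; on the real data you would loop over the `.fastq` files in the tutorial directory instead.

```bash
# Hypothetical helper: number of reads in a 4-line-per-record FASTQ file.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Tiny made-up FASTQ file (2 reads) so the example runs anywhere.
demo_dir=$(mktemp -d)
printf '%s\n' "@READ-1" "ACGT" "+" "IIII" \
              "@READ-2" "TTGA" "+" "IIII" > "$demo_dir/demo.fastq"

for fq in "$demo_dir"/*.fastq; do
    echo "$(basename "$fq"): $(count_reads "$fq") reads"
done
```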

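There are 4 sets of simulated reads:

|                     | Set 1  | Set 2  | Set 3              | Set 4              |
|---------------------|--------|--------|--------------------|--------------------|
| Read Size           | 100    | 100    | 100                | 100                |
| Paired/Single Reads | Single | Paired | Paired             | Paired             |
| Gap Sizes           | NA     | 400    | 400, 3000          | 400, 3000, 1500    |
| Coverage            | 50     | 50     | 25 for each subset | 20 for each subset |
| Number of Subsets   | 1      | 1      | 2                  | 3                  |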

Note that these fastq files are "interleaved", with each read pair together one-after-the-other in the file. The #/1 and #/2 in the read names indicate the pairs.

Code Block
titleInterleaved fastq
tacc:~$ head paired_end_2x100_ins_1500_c_20.fastq
@READ-1/1
TTTCACCGTTGACCAGCACCCAGTTCAGCGCCGCGCGACCACGATATTTTGGTAACAGCGAACCATGCAGATTAAATGCACCTGCGGGAGCGAGCTGCAA
+
*@A+$55GT@@I&+@A+@@$IIG@+++A++GG++@++I@+@+G&/+I+GD+II@++G@@I?$I@$IIGGI$A46@A,+AT=$G+@AA+GAG++@
@READ-1/2
TTAACACCGGGCTATAAGTACAATCTGACCGATATTAACGCCGCGATTGCCCTGACACAGTTAGTCAAATTAGAGCACCTCAACACCCGTCGGCGCGAAA
+
I@@H+A+@G+&+@AG+I>G+I@+CAIA++$+T$GG@+++1+$GI+ICI+A+@$I++A+@@A.@<G@@+)GCGC%I@IIAA++++G+A;@+++@@@@6
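To confirm that a file really is interleaved, you can check that every read name ending in /1 is immediately followed by the same name ending in /2. The sketch below is one possible way to do that with awk, assuming four-line FASTQ records and the /1 and /2 naming shown above. The helper name `is_interleaved` and the two tiny stand-in files are hypothetical.

```bash
# Hypothetical helper: exit status 0 if every /1 read is immediately
# followed by its /2 mate, non-zero otherwise.
is_interleaved() {
    awk 'NR % 4 == 1 {
             n++
             if (n % 2 == 1) {
                 first = $1
                 if (first !~ /\/1$/) bad = 1
             } else {
                 expected = first
                 sub(/\/1$/, "/2", expected)
                 if ($1 != expected) bad = 1
             }
         }
         END { if (bad || n % 2 == 1) exit 1 }' "$1"
}

# Made-up stand-in files: one correctly interleaved, one not.
demo_dir=$(mktemp -d)
printf '%s\n' "@READ-1/1" "ACGT" "+" "IIII" \
              "@READ-1/2" "TTGA" "+" "IIII" > "$demo_dir/good.fastq"
printf '%s\n' "@READ-1/1" "ACGT" "+" "IIII" \
              "@READ-2/1" "TTGA" "+" "IIII" > "$demo_dir/bad.fastq"

for fq in "$demo_dir/good.fastq" "$demo_dir/bad.fastq"; do
    if is_interleaved "$fq"; then
        echo "$(basename "$fq"): pairs are interleaved"
    else
        echo "$(basename "$fq"): pairs are NOT properly interleaved"
    fi
done
```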

Often your read pairs will instead be "separate": the two reads of each pair sit at the same position in two different files, each containing exactly the same number of reads.
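If you ever need to go from one layout to the other, an interleaved file can be split into two separate files by sending alternating four-line records to different outputs. This is a minimal sketch that assumes four-line FASTQ records; the helper name `deinterleave`, the `_R1`/`_R2` output naming, and the stand-in data are all hypothetical.

```bash
# Hypothetical helper: split an interleaved FASTQ file ($1) into
# <prefix>_R1.fastq and <prefix>_R2.fastq, where <prefix> is $2.
deinterleave() {
    awk -v r1="${2}_R1.fastq" -v r2="${2}_R2.fastq" '
        {
            record = int((NR - 1) / 4)      # 0-based record index
            if (record % 2 == 0) { print > r1 } else { print > r2 }
        }' "$1"
}

# Made-up interleaved file with two read pairs.
demo_dir=$(mktemp -d)
printf '%s\n' "@READ-1/1" "ACGT" "+" "IIII" \
              "@READ-1/2" "TTGA" "+" "IIII" \
              "@READ-2/1" "GGCA" "+" "IIII" \
              "@READ-2/2" "CATG" "+" "IIII" > "$demo_dir/demo.fastq"

deinterleave "$demo_dir/demo.fastq" "$demo_dir/demo"
grep '^@READ' "$demo_dir/demo_R1.fastq" "$demo_dir/demo_R2.fastq"
```

Each output file keeps the reads in their original order, so the reads of a pair stay at the same position in both files.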

SPAdes Assembly

Now let's use SPAdes to assemble the reads. As always, it's a good idea to look at what options the program accepts using the -h option. SPAdes is written in Python and the base script name is "spades.py". There are additional scripts that change many of the default options, such as metaspades.py, plasmidspades.py, and rnaspades.py; alternatively, these options can be set from the main spades.py script with the flags --meta, --plasmid, and --rna respectively.

Expand
titleUsing the -h option, can you determine what the only required option(s) for the spades program is/are?

The first option listed under the basic options is:

-o <output_dir>    directory to store all the resulting files (required)

We will also need to supply the read files to the program. In our case we are looking for the following options:

--12 <filename>    file with interlaced forward and reverse paired-end reads

-s <filename>    file with unpaired reads

In normal situations it is more common to use -1 and -2, one for each file of a set of paired-end reads, rather than the --12 option. As mentioned above, however, this data is supplied to you as interleaved, which many/most programs will accept but require you to specify differently.

Once you have figured out which options you need, see if you can come up with a command to run on the single-end data and have the output go into a new directory called single_end.

Code Block
languagebash
titleDid you come up with the same thing I did?
collapsetrue
spades.py -s single_end_100_c_50.fastq -o single_end
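The same pattern extends to an interleaved paired-end file by swapping -s for the --12 option described above. The sketch below only prints the commands (a dry run), so it can be checked without SPAdes installed. The helper name `spades_cmd` and the output directory name `paired_end_c50` are hypothetical, and only the two single-library data sets (Set 1 and Set 2) are covered.

```bash
# Print (do not run) a SPAdes command for one read file.
# $1 = read-type option (-s or --12), $2 = read file, $3 = output directory
spades_cmd() {
    echo "spades.py $1 $2 -o $3"
}

spades_cmd -s   single_end_100_c_50.fastq           single_end
spades_cmd --12 paired_end_2x100_ins_400_c_50.fastq paired_end_c50
```

On an idev node you would run the printed lines themselves rather than the helper.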

