If you have modified your PATH variable, it is a good idea to log out of TACC and log back in before continuing.
Testing SPAdes installation
SPAdes comes with a self-test option to confirm that the program is correctly installed. Most programs do not offer this, but whenever test data is available it is a good idea to run it before jumping straight into your own data: knowing whether an error comes from the program or from your data changes how you troubleshoot.
Code Block | ||||
---|---|---|---|---|
| ||||
mkdir $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial
spades.py --test |
Assuming everything goes correctly, the last lines printed to the screen should be:
Code Block | ||||
---|---|---|---|---|
| ||||
======= SPAdes pipeline finished.
========= TEST PASSED CORRECTLY.
SPAdes log can be found here: <$SCRATCH>/GVA_SPAdes_tutorial/spades_test/spades.log
Thank you for using SPAdes! |
The lines immediately above this text list the output files and results of the assembly. Every SPAdes run prints such a list, which is helpful for keeping track of where all your output ends up.
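If you would rather not scroll back through the terminal, one option is to save the self-test output to a file and search it for the success message. This is a minimal sketch that assumes the success line reads exactly as shown above; the helper name `selftest_passed` and the file name `spades_test_output.txt` are made up for illustration.

```shell
# Hypothetical helper: succeeds (exit status 0) if the saved output
# contains the success line printed by the SPAdes self test.
selftest_passed() {
    grep -q "TEST PASSED CORRECTLY" "$1"
}

# Example use on TACC (commented out so the snippet is safe to paste anywhere):
# spades.py --test 2>&1 | tee spades_test_output.txt
# selftest_passed spades_test_output.txt && echo "SPAdes self test passed"
```

Using `tee` keeps the output visible on screen while also writing it to the file.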
Warning | ||
---|---|---|
| ||
Data
Program self tests are typically safe to run on the head node, but the rest of the tutorial assumes that you are on an idev node. If you are not sure, please ask for help.
Tip |
---|
Elsewhere in the class we avoid the head node mainly to be good TACC citizens and not hurt other people with the programs we run. Here there is an additional reason: assembly programs are exceptionally memory intensive, and attempting to run one on the head node will likely result in a memory error rather than usable results. When it comes time to assemble your own reference genome, remember to give each sample its own compute node rather than having multiple samples split a single node. |
Code Block | ||
---|---|---|
| ||
mkdir -p $SCRATCH/GVA_SPAdes_tutorial # -p avoids an error if you already created this directory for the self test
cp $BI/ngs_course/velvet/data/*/* $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial
|
Now we have a bunch of Illumina reads. These are simulated reads. If you'd ever like to simulate some on your own, you might try using Mason.
Code Block | ||
---|---|---|
| ||
paired_end_2x100_ins_1500_c_20.fastq paired_end_2x100_ins_400_c_20.fastq single_end_100_c_50.fastq
paired_end_2x100_ins_3000_c_20.fastq paired_end_2x100_ins_400_c_25.fastq
paired_end_2x100_ins_3000_c_25.fastq paired_end_2x100_ins_400_c_50.fastq
|
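Before going further it can be worth confirming that all seven files in the listing above actually arrived. The following is a minimal sketch; the helper name `check_read_files` is hypothetical, and the file names are simply copied from the listing.

```shell
# Hypothetical helper: report every expected read file that is missing or
# empty in the given directory. Exit status is the number of problem files.
check_read_files() {
    dir="$1"
    missing=0
    for f in single_end_100_c_50.fastq \
             paired_end_2x100_ins_400_c_50.fastq \
             paired_end_2x100_ins_400_c_25.fastq \
             paired_end_2x100_ins_3000_c_25.fastq \
             paired_end_2x100_ins_400_c_20.fastq \
             paired_end_2x100_ins_1500_c_20.fastq \
             paired_end_2x100_ins_3000_c_20.fastq
    do
        # -s is true only if the file exists and is not empty
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Example use:
# check_read_files "$SCRATCH/GVA_SPAdes_tutorial" && echo "all 7 read files present"
```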
There are 4 sets of simulated reads:
Parameter | Set 1 | Set 2 | Set 3 | Set 4 |
---|---|---|---|---|
Read Size | 100 | 100 | 100 | 100 |
Paired/Single Reads | Single | Paired | Paired | Paired |
Gap Sizes | NA | 400 | 400, 3000 | 400, 3000, 1500 |
Coverage | 50 | 50 | 25 for each subset | 20 for each subset |
Number of Subsets | 1 | 1 | 2 | 3 |
Note that these FASTQ files are "interleaved": the two reads of each pair appear one after the other in the same file. The /1 and /2 suffixes on the read names mark the first and second read of each pair.
Code Block | ||
---|---|---|
| ||
tacc:~$ head paired_end_2x100_ins_1500_c_20.fastq
@READ-1/1
TTTCACCGTTGACCAGCACCCAGTTCAGCGCCGCGCGACCACGATATTTTGGTAACAGCGAACCATGCAGATTAAATGCACCTGCGGGAGCGAGCTGCAA
+
*@A+$55GT@@I&+@A+@@$IIG@+++A++GG++@++I@+@+G&/+I+GD+II@++G@@I?$I@$IIGGI$A46@A,+AT=$G+@AA+GAG++@
@READ-1/2
TTAACACCGGGCTATAAGTACAATCTGACCGATATTAACGCCGCGATTGCCCTGACACAGTTAGTCAAATTAGAGCACCTCAACACCCGTCGGCGCGAAA
+
I@@H+A+@G+&+@AG+I>G+I@+CAIA++$+T$GG@+++1+$GI+ICI+A+@$I++A+@@A.@<G@@+)GCGC%I@IIAA++++G+A;@+++@@@@6
|
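The interleaved layout shown above can be checked mechanically. The sketch below assumes each FASTQ record occupies exactly four lines (no wrapped sequences) and that read names end in /1 and /2 as in these files; the function names `count_reads` and `check_interleaved` are made up for illustration.

```shell
# Count reads: a FASTQ record is 4 lines (name, sequence, '+', qualities).
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Check that records alternate NAME/1, NAME/2 with matching names.
# Prints "interleaved" on success; otherwise prints the problem and fails.
check_interleaved() {
    awk 'NR % 4 == 1 {
             n++
             name = $1
             if (n % 2 == 1) {
                 # odd-numbered record: must be the first read of a pair
                 if (name !~ /\/1$/) { print "unexpected name at line " NR; bad = 1; exit }
                 first = substr(name, 1, length(name) - 2)
             } else {
                 # even-numbered record: must be the mate of the previous read
                 if (name != (first "/2")) { print "unexpected name at line " NR; bad = 1; exit }
             }
         }
         END {
             if (bad) exit 1
             if (n % 2 == 1) { print "odd number of reads"; exit 1 }
             print "interleaved"
         }' "$1"
}

# Example use:
# count_reads paired_end_2x100_ins_1500_c_20.fastq
# check_interleaved paired_end_2x100_ins_1500_c_20.fastq
```

Only the name lines are compared, so quality strings that happen to start with @ do not confuse the check.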
Often your read pairs will instead be "separate", with the two reads of each pair at the same position in two different files (each containing exactly the same number of reads).
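If a program only accepts the separate layout, an interleaved file can be split into two files. This is a minimal sketch under the same assumption of four-line FASTQ records; the function name `deinterleave` and the output file names are hypothetical.

```shell
# Hypothetical helper: split an interleaved FASTQ file into two per-mate files.
# Usage: deinterleave interleaved.fastq out_R1.fastq out_R2.fastq
deinterleave() {
    awk -v r1="$2" -v r2="$3" '{
        # lines 1-4 of every 8-line block are read 1, lines 5-8 are read 2
        if ((NR - 1) % 8 < 4) print > r1
        else                  print > r2
    }' "$1"
}

# Example use:
# deinterleave paired_end_2x100_ins_400_c_50.fastq c50_R1.fastq c50_R2.fastq
```

The two output files keep the reads in the same order, so the read at a given position in one file is the mate of the read at that position in the other.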
SPAdes Assembly
Now let's use SPAdes to assemble the reads. As always, it's a good idea to look at what options the program accepts using the -h option. SPAdes is written in Python and the base script name is "spades.py". Additional scripts such as metaspades.py, plasmidspades.py, and rnaspades.py change many of the default options; the same options can also be set from the main spades.py script with the flags --meta, --plasmid, and --rna respectively.
Expand | ||
---|---|---|
| ||
The first option listed under the basic options is:
-o <output_dir>    directory to store all the resulting files (required)
We will also need to supply the read files to the program. In our case we are looking for the following options:
--12 <filename>    file with interlaced forward and reverse paired-end reads
-s <filename>    file with unpaired reads
In normal situations it would be more common to use -1 and -2, one for each file of paired-end reads, rather than the --12 option. As mentioned above, though, this data is supplied to you as interleaved, which many programs will accept but require you to specify differently. |
Once you have figured out which options you need, see if you can come up with a command that assembles the single-end reads and writes its output to a new directory called single_end.
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
spades.py -s single_end_100_c_50.fastq -o single_end |
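Once the run finishes, SPAdes writes the assembled contigs to contigs.fasta inside the output directory. A quick way to get a feel for the result is to count the sequences and total bases in that file. This is a minimal sketch; the helper name `fasta_summary` is made up for illustration.

```shell
# Hypothetical helper: print the number of sequences and the total number
# of bases in a FASTA file.
fasta_summary() {
    awk '/^>/ { n++; next }              # header lines start with ">"
         { bases += length($0) }         # every other line is sequence
         END { printf "%d sequences, %d bases\n", n, bases }' "$1"
}

# Example use:
# fasta_summary single_end/contigs.fasta
```

Fewer, longer contigs generally indicate a more contiguous assembly, so this gives a simple way to compare runs on the different read sets.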