...
Find and download RNAseq data from run SRR390925, of experiment SRX112044, study SRP009873. Copy the file to your Scratch area on Stampede2 at TACC, then extract the data in FASTQ format.
A solution
- Go to the SRA search page: http://www.ncbi.nlm.nih.gov/sra
- Type in SRX112044, then click Search
- On the experiment summary page, click SRR390925
- this takes you to the Run browser, where you can see example reads
- Under "Download", "Run", click "ftp" under .sra
- save the file locally
- Open a Terminal window, change into the directory where the file was stored
- Copy from local machine to TACC
Code Block scp SRR390925.sra username@stampede2.tacc.utexas.edu:~/
- the colon ( : ) after the hostname indicates this is a remote destination
- the ~/ indicates your home directory
- On the Run browser page, select Reads; this tells you that you need the SRA Toolkit to fetch the Run data
Log in to Stampede2:
Code Block ssh username@stampede2.tacc.utexas.edu
Check that the file is in your Scratch area:
Code Block
login2$ cds
login2$ ls SRR390925.sra
If not, copy it from our course area:
Code Block language bash title Once on stampede2 (prompt: stamp2$)
Find the SRA toolkit module
Code Block
login2$ module load biocontainers
login2$ module spider sratoolkit

------------------------------------------------------------------------------------------------------
  sratoolkit: sratoolkit/2.8.2
------------------------------------------------------------------------------------------------------
    Description:
      The SRA Toolkit and SDK from NCBI is a collection of tools and libraries for using data in the INSDC Sequence Read Archives.

    This module can be loaded directly: module load sratoolkit/2.8.2

    Help:
      The sratoolkit module file defines the following environment variables:
       - TACC_SRATOOLKIT_DIR for the location of the sratoolkit distribution.
       - TACC_SRATOOLK_EXAMPLE - example files

      To improve download speed, the prefetch command has been aliased to always use aspera. We also suggest running
        $ scratch_cache
      to change your cache directory to use the scratch filesystem.

      Documentation can be found online at https://github.com/ncbi/sra-tools/wiki

      Version 2.8.2
Load the module
Code Block login2$ module load sratoolkit
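After loading the module, it can help to confirm that the tools are actually on your PATH before going further. The following is a minimal sketch of such a check; the helper name `have_tool` is hypothetical and is not something the sratoolkit module provides.

```shell
# Hypothetical helper: succeeds if the named command can be found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report whether fastq-dump is available in the current shell session.
if have_tool fastq-dump; then
    echo "fastq-dump found at: $(command -v fastq-dump)"
else
    echo "fastq-dump not on PATH; try: module load sratoolkit"
fi
```

If the second message appears, the module was probably not loaded in this shell session.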
Invoke fastq-dump with no arguments to see basic usage information.
Code Block
login2$ fastq-dump

Usage:
  /opt/apps/sratoolkit/2.1.9//fastq-dump [options] <path> [<path>...]
  /opt/apps/sratoolkit/2.1.9//fastq-dump [options] <accession>

Use option --help for more information

/opt/apps/sratoolkit/2.1.9//fastq-dump : 2.8.2
Extract to FASTQ
Code Block
login2$ $TACC_SRATOOLKIT_DIR/fastq-dump SRR390925.sra
# Output should look like this:
Written 1981132 spots for SRR390925.sra
Written 1981132 spots total
Look at some data
Code Block
login2$ ls
SRR390925.fastq  SRR390925.sra
login2$ head SRR390925.fastq
@SRR390925.1 ROCKFORD:1:1:0:1260 length=36
NCAACAAGTTTCTTTGGTTATTAACTACGACTTACC
+SRR390925.1 ROCKFORD:1:1:0:1260 length=36
#CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
@SRR390925.2 ROCKFORD:1:1:0:293 length=36
NAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+SRR390925.2 ROCKFORD:1:1:0:293 length=36
####################################
@SRR390925.3 ROCKFORD:1:1:0:330 length=36
NAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAA
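The head output shows the four-line FASTQ record layout: an @ header, the sequence, a + separator line, and a quality string as long as the sequence. As a rough sanity check, that layout can be tested with awk. This is only a sketch: the function name `check_fastq` and the two-read demo file are made up for illustration, and it assumes sequences are not wrapped across lines.

```shell
# Hypothetical helper: verify the 4-line FASTQ record layout of the file in $1.
# Prints a message for each problem found; exits non-zero if any were found.
check_fastq() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" {
            printf "line %d: header does not start with @\n", NR; bad = 1
        }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" {
            printf "line %d: separator does not start with +\n", NR; bad = 1
        }
        NR % 4 == 0 && length($0) != seqlen {
            printf "line %d: quality length %d != sequence length %d\n", NR, length($0), seqlen
            bad = 1
        }
        END {
            if (NR % 4 != 0) { print "file does not end on a record boundary"; bad = 1 }
            exit (bad ? 1 : 0)
        }
    ' "$1"
}

# Demo on a tiny made-up file (two invented reads), so the sketch runs anywhere.
demo=$(mktemp)
printf '@r1\nACGT\n+r1\nIIII\n@r2\nTTGA\n+r2\n####\n' > "$demo"
if check_fastq "$demo"; then
    echo "layout looks OK"
fi
rm -f "$demo"
```

On the real data this would be run as `check_fastq SRR390925.fastq`.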
Count lines and compute the number of reads (FASTQ has 4 lines per read)
Code Block
login2$ wc -l SRR390925.fastq
7924528 SRR390925.fastq
login2$ echo $((7924528 / 4))
1981132
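The line count and the division can be combined so the read count does not have to be typed in by hand. A minimal sketch, assuming every record is exactly four lines; the function name `count_reads` and the demo file are hypothetical.

```shell
# Hypothetical helper: print the number of reads in the FASTQ file given as $1,
# computed as (number of lines) / 4.
count_reads() {
    # tr strips the padding some wc implementations put before the number
    lines=$(wc -l < "$1" | tr -d ' ')
    if [ $((lines % 4)) -ne 0 ]; then
        echo "warning: $1 has $lines lines, not a multiple of 4" >&2
    fi
    echo $((lines / 4))
}

# Demo on a tiny made-up file containing two invented reads.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\n####\n' > "$demo"
count_reads "$demo"
rm -f "$demo"
```

Run as `count_reads SRR390925.fastq`, this should agree with the "spots" total reported by fastq-dump.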
...