...

Find and download the RNAseq data from run SRR390925 of experiment SRX112044, publication SRP009873. Copy the file to your home directory on Lonestar at TACC, then extract the data in fastq format.

A solution

  • Go to the SRA search page: http://www.ncbi.nlm.nih.gov/sra
  • Type in SRX112044, then click Search
  • On experiment summary page click SRR390925
    • takes you to the Run browser where you can see example reads
  • Under "Download", "Run", click "ftp" under .sra
    • save the file locally
  • Open a Terminal window, change into the directory where the file was stored
  • Copy from local machine to TACC
    Code Block
    
    scp SRR390925.sra username@lonestar.tacc.utexas.edu:~/
    
    • the colon ( : ) after the hostname indicates this is a remote destination
    • the ~/ indicates your home directory
  • Log in to Lonestar:
    Code Block
    
    ssh username@lonestar.tacc.utexas.edu
    
    • check that the file is in your home directory
      Code Block
      
      login2$ ls
      SRR390925.sra
      

  • Find the SRA toolkit module
    Code Block
    
    login2$ module spider sratoolkit
    
    ------------------------------------------------------------------
      sratoolkit: sratoolkit/2.1.9
    ------------------------------------------------------------------
        Description:
          The SRA Toolkit and SDK from NCBI is a collection of tools and
          libraries for using data in the INSDC Sequence Read Archives.
    
        This module can be loaded directly: module load sratoolkit/2.1.9
    
        Help:
          The sratoolkit module file defines the following environment variables:
          TACC_SRATOOLKIT_DIR for the location of the sratoolkit distribution.
    
          Version 2.1.9
    
  • Load the module
    Code Block
    
    login2$ module load sratoolkit
    
  • Invoke fastq-dump with no arguments to get basic usage information.
    Code Block
    
    login2$ fastq-dump
    
    Usage:
      /opt/apps/sratoolkit/2.1.9//fastq-dump [options] [ -A ] <accession>
      /opt/apps/sratoolkit/2.1.9//fastq-dump [options] <path [path...]>
    
    Use option --help for more information
    
    /opt/apps/sratoolkit/2.1.9//fastq-dump : 2.1.9
    
  • Extract to fastq
    Code Block
    
    login2$ $TACC_SRATOOLKIT_DIR/fastq-dump SRR390925.sra
    
    # Output should look like this:
    Written 1981132 spots for SRR390925.sra
    Written 1981132 spots total
    
  • Look at some data
    Code Block
    
    login2$ ls
    SRR390925.fastq  SRR390925.sra
    login2$ head SRR390925.fastq
    @SRR390925.1 ROCKFORD:1:1:0:1260 length=36
    NCAACAAGTTTCTTTGGTTATTAACTACGACTTACC
    \+SRR390925.1 ROCKFORD:1:1:0:1260 length=36
    \#CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
    @SRR390925.2 ROCKFORD:1:1:0:293 length=36
    NAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    +SRR390925.2 ROCKFORD:1:1:0:293 length=36
    ####################################
    @SRR390925.3 ROCKFORD:1:1:0:330 length=36
    NAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAA
    
  • Count lines and number of reads (fastq has 4 lines/read)
    Code Block
    
    login2$ wc -l SRR390925.fastq
    7924528 SRR390925.fastq
    login2$ echo $((7924528 / 4))
    1981132
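
The line-count arithmetic above can be wrapped in a small shell function. This is only a sketch: the name count_fastq_reads is hypothetical (it is not part of the SRA Toolkit), and it assumes an uncompressed fastq file with exactly 4 lines per read, which is what fastq-dump produced here.

```shell
# Hypothetical helper, not part of the SRA Toolkit.
# Assumes an uncompressed fastq file with exactly 4 lines per read.
count_fastq_reads() {
    cfr_lines=$(wc -l < "$1")
    # A line count that is not a multiple of 4 suggests a truncated file
    if [ $((cfr_lines % 4)) -ne 0 ]; then
        echo "warning: $1 has $cfr_lines lines, not a multiple of 4" >&2
        return 1
    fi
    echo $((cfr_lines / 4))
}
```

Calling count_fastq_reads SRR390925.fastq should then report the same number of reads as the "Written ... spots" message from fastq-dump.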
    
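
The 4-line record layout seen in the head output (an @ header, the sequence, a + line, then a quality string as long as the sequence) can also be checked mechanically. The function below is a rough sketch under those assumptions. The name check_fastq_records and its optional second argument (how many records to inspect, default 1000) are made up for illustration.

```shell
# Hypothetical helper: sanity-check the first N records of a fastq file.
# Assumes 4 lines per record: @header, sequence, +line, quality.
check_fastq_records() {
    awk -v max="${2:-1000}" '
        NR % 4 == 1 && substr($0, 1, 1) != "@" {
            print "line " NR ": header does not start with @"; bad = 1; exit
        }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" {
            print "line " NR ": separator does not start with +"; bad = 1; exit
        }
        NR % 4 == 0 {
            if (length($0) != seqlen) {
                print "line " NR ": quality length differs from sequence length"
                bad = 1; exit
            }
            if (NR / 4 >= max) exit   # stop after max records
        }
        END { exit bad }              # non-zero status if a problem was found
    ' "$1"
}
```

For example, check_fastq_records SRR390925.fastq 100 would inspect the first 100 reads and return a non-zero status at the first malformed record.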

...