SRA Toolkit Exercises

SRA Exercise 1

Find and download RNAseq data from run SRR390925, of experiment SRX112044, publication SRP009873. Copy the file to your Scratch area on Stampede2 at TACC then extract the data in FASTQ format.

A solution

  • SRA search page: http://www.ncbi.nlm.nih.gov/sra
  • Type in SRX112044, then click Search
  • On experiment summary page click SRR390925
    • takes you to the Run browser where you can see example reads
  • Under "Download" tab, select Reads
    • This tells you that you need the SRA Toolkit to fetch the Run data
  • Log in to stampede2:

    ssh username@stampede2.tacc.utexas.edu
    
    • check that the file is in your Scratch area

      # Once on stampede2 (prompt: stamp2$)
      cds    # TACC alias: change to your $SCRATCH directory
      ls
      SRR390925.sra
      
    • if not, copy it from our course area:

      Once on stampede2 (prompt: stamp2$)
       
  • Find the SRA toolkit module

    module load biocontainers
    module spider sratoolkit
    
    ------------------------------------------------------------------------------------------------------
      sratoolkit: sratoolkit/2.8.2
    ------------------------------------------------------------------------------------------------------
        Description:
          The SRA Toolkit and SDK from NCBI is a collection of tools and libraries for using data in the
          INSDC Sequence Read Archives
    
        This module can be loaded directly: module load sratoolkit/2.8.2
    
        Help:
          The sratoolkit module file defines the following environment variables:
    
           - TACC_SRATOOLK_DIR
           - TACC_SRATOOLK_EXAMPLE - example files
    
          To improve download speed, the prefetch command has been aliased to always
          use aspera. We also suggest running
    
          $ scratch_cache
    
          to change your cache directory to use the scratch filesystem.
    
          Documentation can be found online at https://github.com/ncbi/sra-tools/wiki
    
          Version 2.8.2
    
  • Load the module

    module load sratoolkit
    
  • Invoke fastq-dump with no arguments to see basic usage information.

    fastq-dump

    Usage:
      fastq-dump [options] <path> [<path>...]
      fastq-dump [options] <accession>
    
    Use option --help for more information
    
    fastq-dump : 2.8.2
    
  • Extract to FASTQ

    fastq-dump SRR390925.sra
    
    # Output should look like this:
    Written 1981132 spots for SRR390925.sra
    Written 1981132 spots total
    
  • Look at some data

    ls
    SRR390925.fastq  SRR390925.sra
    
    head SRR390925.fastq
    @SRR390925.1 ROCKFORD:1:1:0:1260 length=36
    NCAACAAGTTTCTTTGGTTATTAACTACGACTTACC
    +SRR390925.1 ROCKFORD:1:1:0:1260 length=36
    #CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
    @SRR390925.2 ROCKFORD:1:1:0:293 length=36
    NAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    +SRR390925.2 ROCKFORD:1:1:0:293 length=36
    ####################################
    @SRR390925.3 ROCKFORD:1:1:0:330 length=36
    NAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAA
    
  • Count lines and number of reads (fastq has 4 lines/read)

    login2$ wc -l SRR390925.fastq
    7924528 SRR390925.fastq
    login2$ echo $((7924528 / 4))
    1981132
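
The count above can be wrapped in a small helper so the line total does not have to be copied by hand. This is a minimal sketch, not part of the original exercise: the function name count_fastq_reads is hypothetical, and it assumes an uncompressed FASTQ file with exactly 4 lines per read, as stated above.

```shell
# Count reads in an uncompressed FASTQ file, assuming 4 lines per read.
# The function body runs in a subshell so its variables do not leak.
count_fastq_reads() (
    fastq=$1
    # Read from stdin so wc prints only the number, not the file name.
    lines=$(( $(wc -l < "$fastq") ))
    # Warn (on stderr) if the file does not look like 4-line FASTQ.
    if [ $((lines % 4)) -ne 0 ]; then
        echo "warning: $fastq has $lines lines, not a multiple of 4" >&2
    fi
    echo $((lines / 4))
)

# Example, run in the directory holding the extracted file:
#   count_fastq_reads SRR390925.fastq
```

For the file extracted in this exercise, the result should agree with the spot count reported by fastq-dump.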
    

UCSC Genome Browser Exercises

UCSC Exercise 1

Using the UCSC Genome Browser, determine whether Craig Venter or James Watson has a higher risk of Alzheimer's disease.

A Solution

Craig Venter has at least one SNP associated with Alzheimer's disease.

  • http://genome.ucsc.edu/ → Genome Browser → submit
  • type APOE in gene box → jump
  • under "Phenotype and Disease Association" change "GWAS Catalog" from "hide" to "squish" → refresh
  • under "Variation & Repeats" click on "Genome Variants" to see subtrack information
    • note both Venter and Watson have published their genotypes here
    • deselect "1000 Genomes Pilot" tracks (click '-')
    • change "Maximum display mode" from "hide" to "pack" → Submit
  • zoom in on rs429358, then click on rs429358 under "NHGRI Catalog... tracks".
    • note the association with Alzheimer's disease
  • back in display window, note that Venter has a variant for this SNP while Watson does not

UCSC Exercise 2

Using the UCSC Genome Browser, find and download a list of high-sequencing-depth regions in BED format.

A Solution

  • http://genome.ucsc.edu/cgi-bin/hgTables
  • clade: Mammal, genome: Human, assembly: hg19
  • group: Mapping and Sequencing Tracks, track: Hi Seq Depth
  • output format: BED - browser extensible data
  • filename: hi_seq_depth.bed
  • → get output
  • → get BED, save to local directory
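
Once the file is saved, a quick command-line check can confirm that the download looks like BED. The sketch below is an illustration only: the function name summarize_bed is hypothetical, and it assumes a tab-separated file whose first three columns are chrom, start, and end. BED intervals are 0-based and half-open, so each region spans end - start bases.

```shell
# Summarize a BED file: number of regions and total bases covered.
summarize_bed() {
    awk -F'\t' '
        # Skip optional header or comment lines.
        /^(track|browser|#)/ { next }
        # Half-open intervals: length is end minus start.
        NF >= 3 { n++; bases += $3 - $2 }
        END { printf "regions=%d bases=%d\n", n, bases }
    ' "$1"
}

# Example, run in the directory where the file was saved:
#   summarize_bed hi_seq_depth.bed
```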