Exercise solutions

SRA Toolkit Exercises

SRA Exercise 1

Find and download RNA-seq data from run SRR390925, of experiment SRX112044, publication SRP009873. Copy the file to your home directory on Stampede at TACC, then extract the data in FASTQ format.

A Solution

Go to the SRA search page: http://www.ncbi.nlm.nih.gov/sra
Type in SRX112044 → Search
On experiment summary page click SRR390925
- takes you to the Run browser where you can see example reads
Under "Download", "Run" click "ftp" under .sra
- save the file locally
Open a Terminal window, change into the directory where the file was stored
Copy from local machine to TACC
```
scp SRR390925.sra username@stampede.tacc.utexas.edu:~/
```
- the colon ( : ) after the hostname indicates this is a remote destination
- the ~/ indicates your home directory

Log in to Stampede:

ssh username@stampede.tacc.utexas.edu

Check that the file is in your home directory
```
stamp:~ ls
SRR390925.sra
```

Find the SRA toolkit module

stamp:~ module spider sratoolkit

  ----------------------------------------------------------------------------
  sratoolkit: sratoolkit/2.1.9
  ----------------------------------------------------------------------------
    Description:
      The SRA Toolkit and SDK from NCBI is a collection of tools and
      libraries for using data in the INSDC Sequence Read Archives.

    This module can be loaded directly: module load sratoolkit/2.1.9

    Help:
      The sratoolkit module file defines the following environment variables:
      TACC_SRATOOLKIT_DIR for the location of the sratoolkit distribution.

      Version 2.1.9

Load the module
```
stamp:~ module load sratoolkit
```

Invoke fastq-dump with no arguments to get basic usage

stamp:~ fastq-dump

Usage:
  /opt/apps/sratoolkit/2.1.9//fastq-dump [options] [ -A ] <accession>
  /opt/apps/sratoolkit/2.1.9//fastq-dump [options] <path [path...]>

Use option --help for more information

/opt/apps/sratoolkit/2.1.9//fastq-dump : 2.1.9

Extract to fastq

stamp:~ $TACC_SRATOOLKIT_DIR/fastq-dump SRR390925.sra
Written 1981132 spots for SRR390925.sra
Written 1981132 spots total

Look at some data

stamp:~ ls
SRR390925.fastq  SRR390925.sra

stamp:~ head SRR390925.fastq
@SRR390925.1 ROCKFORD:1:1:0:1260 length=36
NCAACAAGTTTCTTTGGTTATTAACTACGACTTACC
\+SRR390925.1 ROCKFORD:1:1:0:1260 length=36
\#CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
@SRR390925.2 ROCKFORD:1:1:0:293 length=36
NAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+SRR390925.2 ROCKFORD:1:1:0:293 length=36
####################################
@SRR390925.3 ROCKFORD:1:1:0:330 length=36
NAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAA
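
The record layout visible in the output above (an `@` header, the sequence, a `+` separator, then one quality character per base) can be checked mechanically. The following is a minimal sketch, assuming unwrapped 4-line records; the function name `check_fastq` and the demo read are made up for illustration and are not part of the SRA Toolkit.

```shell
# Check basic FASTQ structure, assuming exactly 4 lines per record.
# check_fastq is a hypothetical helper, not an SRA Toolkit command.
check_fastq() {
    awk '
        BEGIN { bad = 0 }
        NR % 4 == 1 && substr($0, 1, 1) != "@" {
            printf "line %d: header does not start with @\n", NR; bad = 1
        }
        NR % 4 == 2 { seqlen = length($0) }   # remember the sequence length
        NR % 4 == 3 && substr($0, 1, 1) != "+" {
            printf "line %d: separator does not start with +\n", NR; bad = 1
        }
        NR % 4 == 0 && length($0) != seqlen {
            printf "line %d: quality length differs from sequence length\n", NR; bad = 1
        }
        END {
            if (NR % 4 != 0) { print "file ends mid-record"; bad = 1 }
            exit bad                          # 0 when every check passed
        }
    ' "$1"
}

# One made-up read so the sketch runs without the real data.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n' > "$demo"
check_fastq "$demo" && echo "structure looks OK"
rm -f "$demo"
```

On the real file this would be run as `check_fastq SRR390925.fastq`; it only checks layout, not whether the quality encoding is sensible.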

Count lines and number of reads (fastq has 4 lines/read)

stamp:~ wc -l SRR390925.fastq
7924528 SRR390925.fastq
stamp:~ echo $((7924528 / 4))
1981132
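
The two steps above (count lines, divide by 4) can be wrapped into one small function. This is a sketch that assumes every record occupies exactly 4 lines; `count_reads` is a made-up name and the demo file contains invented reads.

```shell
# Count reads in a FASTQ file, assuming exactly 4 lines per record.
# count_reads is a hypothetical helper name.
count_reads() {
    # Reading from stdin makes wc print only the number, not the file name.
    nlines=$(wc -l < "$1")
    echo $(( nlines / 4 ))
}

# Two made-up reads so the sketch runs without the real data.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$demo"
count_reads "$demo"    # prints 2
rm -f "$demo"
```

Run as `count_reads SRR390925.fastq`, this should report the same 1981132 reads that fastq-dump listed as spots written.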

UCSC Genome Browser Exercises

UCSC Exercise 1

Using the UCSC Genome Browser, determine whether Craig Venter or James Watson has a higher risk of Alzheimer's disease.

A Solution

Craig Venter has at least one SNP associated with Alzheimer's disease.

http://genome.ucsc.edu/ → Genome Browser → submit
type APOE in gene box → jump
under "Phenotype and Disease Association" change "GWAS Catalog" from "hide" to "squish" → refresh
under "Variation & Repeats" click on "Genome Variants" to see subtrack information
- note both Venter and Watson have published their genotypes here
- deselect "1000 Genomes Pilot" tracks (click '-')
- change "Maximum display mode" from "hide" to "pack" → Submit
Zoom in on rs429358. Click on rs429358 under "NHGRI Catalog... tracks".
- note association w/Alzheimer's disease
Back in the display window, note that Venter has a variant at this SNP while Watson does not

UCSC Exercise 2

Using the UCSC Genome Browser, find and download a list of high-sequencing-depth regions in BED format.

A Solution

http://genome.ucsc.edu/cgi-bin/hgTables
clade: Mammal, genome: Human, assembly: hg19
group: Mapping and Sequencing tracks, track: Hi Seq Depth
output format: BED - browser extensible data
filename: hi_seq_depth.bed
→ get output
→ get BED, save to local directory
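
Once the BED file is saved, a quick sanity check is to count the regions and total the bases they span. The sketch below assumes a tab-separated file with chrom, start and end in the first three columns; `bed_summary` is a made-up helper name and the demo coordinates are invented, not taken from the Hi Seq Depth track.

```shell
# Summarize a BED file: number of regions and total bases spanned.
# BED intervals are 0-based and half-open, so each length is end - start.
# bed_summary is a hypothetical helper name.
bed_summary() {
    awk -F '\t' '
        /^(track|browser|#)/ { next }          # skip optional header lines
        NF >= 3 { n++; total += $3 - $2 }
        END { printf "%d regions, %d bases\n", n, total }
    ' "$1"
}

# Two made-up regions so the sketch runs without the downloaded file.
demo=$(mktemp)
printf 'chr1\t100\t250\nchr2\t0\t50\n' > "$demo"
bed_summary "$demo"    # prints: 2 regions, 200 bases
rm -f "$demo"
```

For the file from this exercise the call would be `bed_summary hi_seq_depth.bed`. Overlapping regions, if any, are counted once per region, so the total is an upper bound on unique bases covered.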