Data wrangling best practices

...

keep fastq files compressed

  • Most sequencing facilities will give you compressed sequencing data files
    • gzip format (.gz extension) for individual files
    • tar or zip format for directories of files
  • Even with compression it's easy to run out of storage space!

...

  • resist the temptation to gunzip!
  • nearly all modern bioinformatics tools are able to work on .gz files
  • compressed files can be streamed (for example with zcat or gzip -dc piped into another command) without ever writing an uncompressed copy to disk
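
As one illustration of working on a compressed file directly, the sketch below counts the reads in a gzipped fastq file by streaming it through Python's standard gzip module. The function name count_fastq_reads is made up for this example, and the code assumes the common layout of exactly four lines per read.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a gzip-compressed FASTQ file.

    The file is decompressed on the fly, one line at a time, so no
    uncompressed copy is ever written to disk.
    Assumption: every read occupies exactly 4 lines.
    """
    n_lines = 0
    with gzip.open(path, "rt") as handle:  # "rt" = read as text
        for _ in handle:
            n_lines += 1
    return n_lines // 4
```

A call such as count_fastq_reads("sample_R1.fastq.gz") (a placeholder file name) returns the read count while the file on disk stays compressed.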

arrange adequate storage space

  • Obtain an allocation on TACC's corral disk array (initial 5 TB are no-cost)
  • Stage your active projects on corral 
    • copy data to $WORK or $SCRATCH for analysis
    • copy important analysis products back to corral 
  • Periodically back up corral directories to ranch tape archive

back up analysis artifacts regularly

  • Obtain an allocation on TACC's ranch tape archive system
    • 10 TB is a good initial request
    • free, and under-utilized
  • Periodically back up your corral directories to ranch tape archive
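
Tape archives generally cope better with a few large files than with many small ones, so one possible approach is to bundle a directory into a single tar file and record a checksum before transferring it. The sketch below shows only that bundling step; the function name bundle_directory is hypothetical, and the transfer to ranch itself is not shown because it depends on your allocation and TACC's current instructions.

```python
import hashlib
import tarfile
from pathlib import Path


def bundle_directory(src_dir, out_tar):
    """Pack src_dir into one uncompressed tar file and return its SHA-256.

    Plain "w" mode is used because fastq.gz files are already compressed.
    Assumption: out_tar is located outside src_dir.
    """
    src = Path(src_dir)
    with tarfile.open(out_tar, "w") as tar:
        tar.add(src, arcname=src.name)  # store paths relative to the directory name

    # Hash the finished archive in 1 MB chunks to keep memory use small.
    digest = hashlib.sha256()
    with open(out_tar, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Keeping the returned checksum alongside the archive lets you verify the file after it has been copied to, or later retrieved from, the archive.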

distinguish between types of data

Artifacts from different stages of the analysis will have different archival requirements.

  • Original sequence data (fastq files)
    • must be backed up!
  • Alignments
    • usually larger than original fastqs
    • should be backed up once stable
  • Peak calling artifacts
  • Downstream analysis artifacts

While a project is active you will want to keep more intermediate artifacts for reference. Many of these can be deleted after publication.

track your analysis steps

Your analyses should be reproducible by others, so keep the equivalent of a lab notebook that documents your protocols.
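
One lightweight way to keep such a notebook is to record every command as it is run. The sketch below is only an illustration: the function name run_and_log and the default log file name analysis_log.txt are made up for this example, and shlex.join requires Python 3.8 or later.

```python
import datetime
import shlex
import subprocess


def run_and_log(cmd, notebook="analysis_log.txt"):
    """Run a command and append it to a plain-text log.

    cmd is a list of arguments, e.g. ["gzip", "-t", "reads.fastq.gz"].
    Each log line holds the start time, the command, and its exit status.
    """
    started = datetime.datetime.now().isoformat(timespec="seconds")
    result = subprocess.run(cmd)
    with open(notebook, "a") as log:
        log.write(f"{started}\t{shlex.join(cmd)}\texit={result.returncode}\n")
    return result.returncode
```

The resulting file is a tab-separated history of what was run and when, which can be stored next to the analysis products it describes.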

...