Data wrangling best practices
NGS is smack dab in the middle of the Big Data revolution. Initial NGS FASTQ files are big (100s of MB to GB), and they're just the start.
...
Arrange adequate storage space
- At TACC
- Obtain an allocation on TACC's work file system or corral disk array (initial 5 TB are no-cost)
- Stage your active projects on corral or $WORK
- Copy data to $SCRATCH for analysis
- Copy important analysis products back to corral or $WORK
- Periodically back up corral or $WORK directories to ranch tape archive
- On a UT Biomedical Research Computing Facility (BRCF) "POD"
- See https://wikis.utexas.edu/display/RCTFusers
- Home and Work areas on POD servers are automatically backed up weekly and archived to ranch every 4-6 months
- GSAF customers can obtain a no-cost 2 TB allocation on the GSAF POD
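The TACC workflow above (stage a project on corral or $WORK, copy it to $SCRATCH for analysis, then copy the important products back) could be scripted along these lines. This is only a minimal sketch, not TACC tooling: the function names, the `results` subdirectory, and the `my_project` path are hypothetical, and it assumes the `WORK` and `SCRATCH` environment variables are defined in your session.

```python
import shutil
from pathlib import Path


def stage_to_scratch(project_dir, scratch_root):
    """Copy a staged project directory into the scratch area for analysis."""
    project_dir = Path(project_dir)
    dest = Path(scratch_root) / project_dir.name
    # dirs_exist_ok=True (Python 3.8+) allows re-running the copy
    # when the destination directory already exists.
    shutil.copytree(project_dir, dest, dirs_exist_ok=True)
    return dest


def copy_products_back(scratch_project, staged_project, products_subdir="results"):
    """Copy analysis products from scratch back to the staged project.

    The "results" subdirectory name is an assumption made for this sketch.
    """
    src = Path(scratch_project) / products_subdir
    dest = Path(staged_project) / products_subdir
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest


# Hypothetical usage (paths are illustrative only):
#   import os
#   staged = Path(os.environ["WORK"]) / "my_project"
#   on_scratch = stage_to_scratch(staged, os.environ["SCRATCH"])
#   ... run the analysis, writing outputs under on_scratch / "results" ...
#   copy_products_back(on_scratch, staged)
```

Copying whole directories keeps the sketch short; for large projects a tool that transfers only changed files may be a better fit.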
Back up analysis artifacts regularly
...
Artifacts from different stages of the analysis will have different archival requirements.
- Original sequence data (FASTQ files)
- must be backed up!
- Alignments
- usually larger than the original FASTQ files
- can be backed up once stable
- Downstream analysis artifacts
- Reporting artifacts (plots, plotting code)
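Since the original FASTQ files must be backed up, it can help to record a checksum for each one so that a backup copy can later be compared against the original. The page itself does not prescribe this; the sketch below is one possible approach, and the function names, the `MANIFEST.md5` file name, and the file patterns are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1024 * 1024):
    """Compute a file's MD5 checksum, reading in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir, manifest_name="MANIFEST.md5",
                   patterns=("*.fastq", "*.fastq.gz")):
    """Write one 'checksum  filename' line per FASTQ file found in data_dir."""
    data_dir = Path(data_dir)
    files = sorted(p for pattern in patterns for p in data_dir.glob(pattern))
    manifest = data_dir / manifest_name
    with open(manifest, "w") as out:
        for path in files:
            # Checksum, two spaces, then the file name.
            out.write(f"{md5sum(path)}  {path.name}\n")
    return manifest
```

Running `write_manifest` on the original data directory before a backup, and comparing checksums on the copy afterwards, gives a simple integrity check. Only the top level of `data_dir` is scanned in this sketch.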
...
- Keep "work files" that detail analysis steps performed
- here's an example alignment work file
...