2020 Data wrangling best practices
NGS is smack dab in the middle of the Big Data revolution. Initial NGS FASTQ files are big (100s of MB to GB) – and they're just the start.
Organization and good practices are critical! Your data can get out of hand very quickly!
keep FASTQ files compressed
- Most sequencing facilities will give you compressed sequencing data files
- gzip format (.gz extension) for individual files
- tar or zip format for directories of files
- Even with compression it's easy to run out of storage space!
You may be tempted to un-compress your sequencing files to manipulate them more directly
- resist the temptation to gunzip!
- nearly all modern bioinformatics tools are able to work on .gz files
- there are techniques for working with compressed files without ever un-compressing them
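One such technique, sketched in Python: the standard-library gzip module streams a .gz file line by line, so nothing is ever written to disk uncompressed. The function names are illustrative, and the sketch assumes the common 4-lines-per-read FASTQ layout.

```python
import gzip
from itertools import islice


def count_fastq_reads(path):
    """Count reads in a gzip-compressed FASTQ file by streaming it.

    Assumes each read occupies exactly 4 lines (header, sequence, '+', qualities).
    """
    n_lines = 0
    with gzip.open(path, "rt") as handle:  # "rt" = decompress on the fly, as text
        for _ in handle:
            n_lines += 1
    return n_lines // 4


def head_fastq(path, n_reads=2):
    """Return the lines of the first n_reads reads without reading the rest."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, 4 * n_reads)]
```

Because the file is read as a stream, memory use stays small even for multi-GB files, and the compressed original is left untouched.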
arrange adequate storage space
- At TACC
- Obtain an allocation on TACC's corral disk array (initial 5 TB are no-cost)
- Stage your active projects on corral or $WORK
- copy data to $SCRATCH for analysis
- copy important analysis products back to corral or $WORK
- Periodically back up corral or $WORK directories to ranch tape archive
- On a UT Biomedical Research Computing Facility (BRCF) "POD"
- See https://utexas.atlassian.net/wiki/display/RCTFusers/
- Home and Work areas on POD servers are automatically backed up weekly
- and archived to ranch every 4-6 months
- GSAF customers can obtain a no-cost 2 TB allocation on the GSAF POD
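The TACC staging pattern described above (copy data to $SCRATCH for analysis, then copy important products back to corral or $WORK) could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, and it assumes $SCRATCH is set in your environment, as it is on TACC systems, unless you pass a scratch location explicitly.

```python
import os
import shutil
from pathlib import Path


def stage_to_scratch(project_dir, scratch_root=None):
    """Copy a project directory into scratch space and return the new location.

    scratch_root defaults to the $SCRATCH environment variable (an assumption
    about the environment; pass a path explicitly elsewhere).
    """
    scratch_root = Path(scratch_root or os.environ["SCRATCH"])
    src = Path(project_dir)
    dest = scratch_root / src.name
    # dirs_exist_ok (Python 3.8+) lets a re-run refresh an existing copy.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest


def copy_back(products, dest_dir):
    """Copy selected analysis products back to longer-term storage."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # copy2 preserves timestamps, which helps when comparing copies later.
    return [Path(shutil.copy2(p, dest_dir)) for p in products]
```

Copying back only a named list of products, rather than the whole scratch directory, keeps bulky intermediate files out of your longer-term storage.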
back up analysis artifacts regularly
- All TACC users automatically have a 2 TB allocation on TACC's ranch tape archive system
- larger allocations can be requested by project owners in the TACC User Portal
- free! and under-utilized
- Periodically back up your corral or $WORK directories to ranch tape archive
- large directories should be combined first using the tar program
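Combining a directory into a single tar file before archiving can be done with the tar program itself or, as in this minimal sketch, with Python's standard-library tarfile module. The function names are illustrative. The archive is written without extra compression, on the assumption that the bulky files inside (such as .gz FASTQs) are already compressed.

```python
import tarfile
from pathlib import Path


def archive_directory(directory, archive_path):
    """Bundle a directory tree into one uncompressed tar file."""
    directory = Path(directory)
    with tarfile.open(str(archive_path), "w") as tar:
        # arcname keeps paths inside the archive relative to the directory name.
        tar.add(str(directory), arcname=directory.name)
    return Path(archive_path)


def list_archive(archive_path):
    """List member names so the archive can be checked before transfer."""
    with tarfile.open(str(archive_path), "r") as tar:
        return sorted(tar.getnames())
```

Listing the archive contents before transferring it is a cheap sanity check that every file made it into the bundle.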
distinguish between types of data
Artifacts from different stages of the analysis will have different archival requirements.
- Original sequence data (FASTQ files)
- must be backed up!
- Alignments
- usually larger than original FASTQs
- can be backed up once stable
- Downstream analysis artifacts
- Reporting artifacts (plots, plotting code)
While a project is active you will want to keep more intermediate artifacts for reference. Many of these can be deleted after publication.
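To see how much space each type of artifact occupies, a small script can tally file sizes by category. The sketch below is illustrative only: the mapping from file suffix to category is an assumption (adjust it to the file types your project actually produces), and anything unrecognized is lumped into a "downstream/other" bucket.

```python
from collections import defaultdict
from pathlib import Path

# Assumed suffix-to-category mapping; not an authoritative list.
CATEGORY_SUFFIXES = {
    "original sequence data": (".fastq.gz", ".fq.gz", ".fastq", ".fq"),
    "alignments": (".bam", ".sam", ".cram"),
    "reporting": (".png", ".pdf", ".r", ".rmd"),
}


def categorize(filename):
    """Map a file name to an artifact category based on its suffix."""
    name = filename.lower()
    for category, suffixes in CATEGORY_SUFFIXES.items():
        if name.endswith(suffixes):  # str.endswith accepts a tuple of suffixes
            return category
    return "downstream/other"


def bytes_by_category(root):
    """Sum file sizes (in bytes) under root, grouped by artifact category."""
    totals = defaultdict(int)
    for path in Path(root).rglob("*"):
        if path.is_file():
            totals[categorize(path.name)] += path.stat().st_size
    return dict(totals)
```

A summary like this makes it easier to decide what must be backed up now and what can be deleted after publication.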
track your analysis steps
Your analyses should be reproducible by others, so keep the equivalent of a lab notebook to document your protocols.
- Keep "work files" that detail analysis steps performed
- here's an example: 2020 Example alignment work file
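One way to keep a work file current is to record each command at the moment it runs. The sketch below is a hypothetical helper, not the format of the example work file referenced above: it runs a shell command and appends the command, a timestamp, and the exit code to a plain-text work file. The default file name is an assumption.

```python
import subprocess
from datetime import datetime


def run_and_log(command, work_file="analysis.work.txt"):
    """Run a shell command and append it to a work file with a timestamp."""
    stamp = datetime.now().isoformat(timespec="seconds")
    # shell=True so the command is logged exactly as it would be typed.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    with open(work_file, "a") as log:
        log.write(f"# {stamp} (exit {result.returncode})\n{command}\n")
    return result
```

Since the work file is plain text, it can be annotated by hand with notes about why each step was run, which is the part of a lab notebook that a command history alone does not capture.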