...

Austin's own Advanced Micro Devices (AMD) has most generously donated a number of GPU-enabled servers to UT.

While AMD GPUs still support fewer 3rd-party applications than NVIDIA GPUs, they do support many popular Machine Learning (ML) applications, such as TensorFlow, PyTorch, and AlphaFold, and Molecular Dynamics (MD) applications, such as GROMACS, all of which are installed and ready for use.

Our recently announced AMD GPU pod is available for both research and instructional use to any UT Austin-affiliated PI. To request an allocation, ask your PI to contact us at rctf-support@utexas.edu and provide the UT EIDs of those who should be granted access.

...

The AlphaFold protein structure prediction software is available on all AMD GPU servers. The /stor/scratch/AlphaFold directory contains the large required database under the data.4 sub-directory. There is also an AMD example script, /stor/scratch/AlphaFold/alphafold_example_amd.sh, and an alphafold_example_nvidia.sh script for pods that also have NVIDIA GPUs (e.g. the Hopefog pod). Interestingly, our timing tests indicate that AlphaFold performance is quite similar across the AMD and NVIDIA GPU servers.

On AMD GPU servers, AlphaFold is implemented by a run_alphafold.py Python script inside a Docker image. See the run_alphafold_rocm.sh and run_multimer_rocm.sh scripts under /stor/scratch/AlphaFold for a complete list of that script's options.
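As a rough sketch of what such an invocation looks like, the snippet below assembles a run_alphafold.py command line. The --fasta_paths, --output_dir, and --data_dir flag names are standard run_alphafold.py options, but the input and output paths here are hypothetical; verify the exact flags against the run_alphafold_rocm.sh script on the server.

```python
import shlex

# Assemble a run_alphafold.py command line. Flag names are standard
# AlphaFold options, but confirm them against run_alphafold_rocm.sh.
args = {
    "--fasta_paths": "/stor/work/mylab/target.fasta",  # hypothetical input
    "--output_dir": "/stor/work/mylab/alphafold_out",  # hypothetical output
    "--data_dir": "/stor/scratch/AlphaFold/data.4",    # database path from above
}
cmd = ["python3", "run_alphafold.py"] + [f"{k}={v}" for k, v in args.items()]
print(shlex.join(cmd))
```

Building the command as a list and joining it with shlex avoids quoting mistakes if any path contains spaces.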

PyTorch and TensorFlow

Two Python scripts located in /stor/scratch/GPU_info can be used to verify that TensorFlow and PyTorch can access the server's GPUs. Run them from the command line under time to compare run times.

  • TensorFlow
    • time (python3 /stor/scratch/GPU_info/tensorflow_example.py )
      • should take ~30s or less with GPU (on an unloaded system), > 1 minute with CPUs only
      • this is a simple test; it uses multiple cores on CPU-only servers but only one GPU on GPU servers, which is one reason the times do not differ more
  • PyTorch
    • time (python3 /stor/scratch/GPU_info/pytorch_example.py )
      Note: this test script is not yet working on AMD servers
    • Model time should be ~30-45s with GPU on an unloaded system
    • You'll see this warning, which can be ignored:
      MIOpen(HIP): Warning [SQLiteBase] Missing system database file: gfx90878.kdb Performance
      may degrade. Please follow instructions to install:
      https://github.com/ROCmSoftwarePlatform/MIOpen#installing-miopen-kernels-package

If GPUs are available and accessible, the output generated will indicate they are being used.
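A minimal programmatic check can complement the timing scripts above. This sketch assumes PyTorch; its ROCm builds expose AMD GPUs through the same torch.cuda API used for NVIDIA devices, and the script degrades gracefully where PyTorch is not installed.

```python
# Minimal sketch of a GPU-visibility check via PyTorch. ROCm builds of
# PyTorch report AMD GPUs through the torch.cuda API.
try:
    import torch
    n_gpus = torch.cuda.device_count() if torch.cuda.is_available() else 0
    message = f"GPUs visible to PyTorch: {n_gpus}"
except ImportError:
    message = "PyTorch is not installed in this environment"
print(message)
```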

...

  • benchmarks/ - a set of MD benchmark files from https://www.mpinat.mpg.de/grubmueller/bench
  • gromacs_amd_example.sh - a simple GROMACS example script taking advantage of the GPU, running the benchMEM.tpr benchmark by default.
  • gromacs_cpu_example.sh - a GROMACS example script using the CPUs only.
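The example scripts above wrap a gmx mdrun call. The sketch below shows the general shape of a GPU-offloaded run: -nb gpu and -ntomp are standard gmx mdrun flags, but the TPR path here is hypothetical; see gromacs_amd_example.sh for the exact invocation used on these servers.

```shell
#!/bin/sh
# Hypothetical path; the real benchMEM.tpr lives under the benchmarks/
# directory described above.
TPR="benchmarks/benchMEM.tpr"

if command -v gmx >/dev/null 2>&1; then
    # -nb gpu offloads nonbonded interactions to the GPU; -ntomp sets the
    # number of OpenMP threads (both standard gmx mdrun flags).
    gmx mdrun -s "$TPR" -nb gpu -ntomp 8 -deffnm benchMEM
else
    echo "gmx not found; run this on one of the AMD GPU servers"
fi
```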

...

We have multiple versions of the ROCm framework installed in the /opt directory, designated by a version number extension (e.g. /opt/rocm-5.1.3, /opt/rocm-5.2.3). The default version is the one pointed to by the /opt/rocm symbolic link, which is generally the latest version.
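The default-version mechanism is just symbolic-link resolution. This sketch builds a stand-in layout in a temporary directory (since /opt/rocm only exists on the AMD servers) to show how a rocm link resolves to a versioned directory:

```python
import os
import tempfile

# Build a stand-in for the /opt layout: a versioned directory plus a
# "rocm" symlink pointing at it, mirroring /opt/rocm -> /opt/rocm-5.2.3.
root = tempfile.mkdtemp()
versioned = os.path.join(root, "rocm-5.2.3")
os.mkdir(versioned)
link = os.path.join(root, "rocm")
os.symlink(versioned, link)

# realpath follows the symlink to the versioned directory, which is how
# tools using the default path end up on a concrete ROCm version.
print(os.path.realpath(link))
```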

As of May 2024, the highest ROCm version installed (and the default) is rocm-5.7.2. This is the last minor version in the ROCm 5.x series. ROCm series 6.x versions have now been published, but we do not yet have them installed on the AMD compute servers.

To use a ROCm version other than the default, set the ROCM_HOME environment variable; for example:

...

...