Because GPU-enabled nodes on the Duke Compute Cluster are also allocated for special projects, they may be removed from the gpu-common partition, which means that jobs running on the shared GPU machines may be interrupted. GPUs are also useful in other computing environments, such as research using highly sensitive data or research with software that runs only on Windows. In these cases, common GPU machines can be allocated to those projects and will be unavailable to the cluster while they are in use elsewhere.
Scheduling of common GPUs is designed to maximize usage and provide service to as broad a community as possible, and ways of doing that are still being devised.
You can lessen the impact of a possible removal of a GPU in several ways, which are part of good computing practice in any case:
- Checkpoint your jobs. This is good practice for all jobs running on the cluster: checkpointing preserves the state of a process and makes it possible to pick up where you left off when the job is restarted. It is prudent for any job that runs longer than a few hours.
- Limit the time that your jobs run. The longer a job takes, the more likely it is to be interrupted. Be sure to checkpoint long-running jobs (see above).
- If your computation requires sustained use of GPU-enabled machines, acquire a GPU machine for your group’s use. The machine can reside in the cluster environment, and your group will have priority access to it, so it will not be part of the gpu-common shared partition.
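The checkpointing advice above can be sketched in a few lines. The following is a minimal, illustrative Python pattern, not a cluster-specific tool: the filename, state fields, and workload are all hypothetical stand-ins. The key ideas are saving progress periodically, writing the checkpoint file atomically so an interruption cannot corrupt it, and resuming from the saved state on restart.

```python
import json
import os

CHECKPOINT_FILE = "checkpoint.json"  # illustrative filename

def load_checkpoint():
    """Resume from a saved state if a checkpoint exists; otherwise start fresh."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"next_step": 0, "partial_sum": 0}

def save_checkpoint(state):
    """Write to a temp file, then rename, so a crash mid-write leaves the old checkpoint intact."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT_FILE)  # atomic on POSIX filesystems

def run(total_steps=100):
    state = load_checkpoint()
    # The loop starts at next_step, so a restarted job skips completed work.
    for step in range(state["next_step"], total_steps):
        state["partial_sum"] += step  # stand-in for the real computation
        state["next_step"] = step + 1
        if step % 10 == 0:  # checkpoint periodically, not on every step
            save_checkpoint(state)
    save_checkpoint(state)
    return state["partial_sum"]
```

If this job is interrupted and resubmitted, calling `run()` again continues from the last saved step rather than from the beginning. Real workloads would checkpoint larger state (for example, model weights or simulation arrays), but the save-atomically-and-resume structure is the same.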
If you need help or more information on these tactics, contact email@example.com.