The first level (`n2gpu`) describes the prefix of the hostnames of the nodes in which the corresponding accelerators are installed. The second level lists the ID used by Slurm, followed by the device ID (the bus ID of the accelerator).
How do you get this data? That depends on the accelerators. The following example is for a host with four NVIDIA A100 GPUs and should look similar on all hosts with NVIDIA GPUs:
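One way to obtain this mapping is to query `nvidia-smi` directly. This is only a sketch; the bus IDs shown are placeholders and will differ on your system:

```sh
# List GPU index and PCI bus ID; the second column holds the value
# that has to go into the configuration (bus IDs here are examples only).
nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader
# 0, 00000000:0E:00.0
# 1, 00000000:13:00.0
# 2, 00000000:49:00.0
# 3, 00000000:4F:00.0
```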
You will find the four GPUs identified by IDs starting at 0. The second column contains the bus ID, the identifier of the GPU. These are the values that have to be defined in the code example above. The mechanism in the background assumes that all nodes starting with this prefix share the same configuration and the same assignment of ID to bus ID. If you have nodes with a different configuration, add a new prefix that matches only the hosts with that configuration.
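Putting the two together, the accelerator section of `config.json` might look roughly like the sketch below. The surrounding key name and the bus IDs are illustrative assumptions; the exact layout is the one shown in the code example above:

```json
"accelerators": {
    "n2gpu": {
        "0": "00000000:0E:00.0",
        "1": "00000000:13:00.0",
        "2": "00000000:49:00.0",
        "3": "00000000:4F:00.0"
    }
}
```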
**node_regex**
This option is specific to every cluster system. The regex describes the syntax of the hostnames that are used as compute resources in jobs. Backslashes (`\`) have to be escaped.
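A minimal sketch, assuming hostnames such as `n2gpu0101` or `n2cpu0203` (the pattern itself is purely illustrative for this hypothetical naming scheme); note the doubled backslashes required inside JSON:

```json
"node_regex": "^(n2cpu|n2gpu)\\d{4}$"
```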
Simply run `slurm-clustercockpit-sync.py` inside the directory that contains the `config.json` file; example invocations are sketched below. A brief help is also available, covering the following options:
* `-c, --config` Use a different config file, e.g. for testing or other purposes. Otherwise the script uses `config.json` in the current directory.
* `-j, --jobid` Synchronize individual job IDs instead of all jobs, which can be useful in a test setup.
* `-l, --limit` Synchronize at most this number of jobs in the respective direction. Stopping a job can take a short while; if a massive number of jobs has to be stopped, the script may run for a long time and miss jobs that start and end within its execution time.
* `--direction` Mostly a debug option. Synchronize only starting or only stopping jobs. The default is both directions.
The script terminates after synchronization of all jobs.
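Typical invocations might look like the following sketch; the alternative config file name, the job ID, and the value passed to `--direction` are placeholders, not verified names:

```sh
# Synchronize all jobs in both directions, using config.json in the current directory
./slurm-clustercockpit-sync.py

# Test setup: synchronize a single (placeholder) job ID with a separate config file,
# restricted to starting jobs (the direction value "start" is assumed here)
./slurm-clustercockpit-sync.py --config ./config-test.json --jobid 123456 --direction start
```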
# Getting help
This script should be seen as an example implementation and may have to be adapted for other installations. I tried to keep the script as general as possible and to already account for some differences between clusters. If adjustments are necessary, I am happy to receive pull requests, or a notification through other channels, so that in the long run the implementation runs on as many systems as possible without adjustments.