cocogoat/AMD_FIX_SUMMARY.md
2026-02-05 23:18:26 +01:00

AMD GPU Monitoring Fix Summary

Issue

The AMDMonitor class was calling the pyrsmi API incorrectly. It used the low-level rocmsmi module, which requires explicit initialization, C-style function signatures, and manual error-code handling.

Solution

AMDMonitor now uses the high-level rocml API from pyrsmi, following the official example at: /anvme/workspace/ihpc125h-llm-profiles/pyrsmi/examples/llm_monitoring/monitor_llm_inference.py

Changes Made

1. Fixed AMDMonitor Class

Before (incorrect):

from pyrsmi import rocmsmi
ret = self.rocmsmi.rsmi_init(0)
power_uw = self.rocmsmi.rsmi_dev_power_ave_get(self.device_id)

After (correct):

from pyrsmi import rocml
self.rocml.smi_initialize()
power_watts = self.rocml.smi_get_device_average_power(self.device_id)

Key API Functions:

  • rocml.smi_initialize() - Initialize monitoring
  • rocml.smi_get_device_average_power(device_id) - Get power in Watts (not microwatts!)
  • rocml.smi_get_device_utilization(device_id) - Get GPU utilization %
  • rocml.smi_get_device_memory_used(device_id) - Get memory used in bytes
  • rocml.smi_get_device_memory_total(device_id) - Get total memory in bytes
  • rocml.smi_get_device_temperature(device_id) - Get temperature
  • rocml.smi_get_device_name(device_id) - Get device name
  • rocml.smi_shutdown() - Cleanup

2. Updated All SLURM Scripts for Apptainer

All GPU benchmark scripts now run inside the apptainer container:

A100, H100, H200 (NVIDIA):

APPTAINER_IMAGE="/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif"
apptainer exec --nv $APPTAINER_IMAGE python run_benchmark.py ...

MI300X (AMD):

APPTAINER_IMAGE="/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif"
apptainer exec --rocm $APPTAINER_IMAGE python run_benchmark.py ...

Note: use --nv on NVIDIA nodes and --rocm on AMD nodes. The flag is what exposes the host GPU driver stack to the container.
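
If the benchmark is ever launched from Python instead of a per-GPU SLURM script, the flag choice could be centralized in one place. This is only a sketch under that assumption: `GPU_FLAGS` and `build_apptainer_cmd` are hypothetical names that do not exist in the repository, and the vendor strings "nvidia" and "amd" are illustrative.

```python
from typing import List, Sequence

# Flags as stated in this summary: --nv for NVIDIA, --rocm for AMD.
GPU_FLAGS = {"nvidia": "--nv", "amd": "--rocm"}


def build_apptainer_cmd(vendor: str, image: str, args: Sequence[str]) -> List[str]:
    """Return an argv list for `apptainer exec` with the vendor's GPU flag."""
    try:
        flag = GPU_FLAGS[vendor.lower()]
    except KeyError:
        # Fail early instead of silently running without GPU access.
        raise ValueError(f"unknown GPU vendor: {vendor!r}") from None
    return ["apptainer", "exec", flag, image, *args]
```

The resulting list can be handed to `subprocess.run(..., check=True)`. Building an argv list rather than a shell string avoids quoting problems if the image path or benchmark arguments contain spaces.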

3. Updated Documentation

  • README.md now mentions apptainer usage
  • Updated setup instructions to use apptainer for model caching
  • Added notes about container flags (--nv vs --rocm)

Testing

To verify the AMD monitoring works:

# Run on an MI300X node; the check itself executes inside the container
apptainer exec --rocm pytorch_25.10_tilelang.sif python -c "
from utils.gpu_monitor import AMDMonitor
m = AMDMonitor(0)
print(f'GPU: {m.get_device_name()}')
metrics = m.get_metrics()
print(f'Power: {metrics.power_watts:.2f} W')
print(f'Utilization: {metrics.gpu_utilization_percent:.1f}%')
print(f'Memory: {metrics.memory_used_gb:.2f} / {metrics.memory_total_gb:.2f} GB')
m.cleanup()
"

Files Modified

  1. /anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/utils/gpu_monitor.py - Fixed AMDMonitor class
  2. /anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_a100.sh - Added apptainer
  3. /anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_h100.sh - Added apptainer
  4. /anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_h200.sh - Added apptainer
  5. /anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_mi300x.sh - Added apptainer with --rocm
  6. /anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/README.md - Updated documentation

Key Differences: rocml vs rocmsmi

| Feature        | rocml (High-level)  | rocmsmi (Low-level)        |
|----------------|---------------------|----------------------------|
| API Style      | Simple functions    | Complex C-style API        |
| Initialization | smi_initialize()    | rsmi_init(0) + error codes |
| Power          | Returns Watts       | Returns microwatts         |
| Memory         | Returns bytes       | Returns bytes via enums    |
| Error Handling | Returns -1 on error | Returns error codes        |
| Ease of Use    | Much easier         | Complex                    |

The rocml module is the recommended high-level Python API for pyrsmi.
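
The unit difference for power is the easiest detail to get wrong when porting code between the two modules. As a small illustration (the helper name `microwatts_to_watts` is hypothetical and not part of pyrsmi), a value read through the low-level API would need this conversion before it is comparable to what rocml reports:

```python
MICROWATTS_PER_WATT = 1_000_000


def microwatts_to_watts(power_uw: float) -> float:
    """Convert a microwatt reading (low-level API) to watts (rocml units)."""
    return power_uw / MICROWATTS_PER_WATT
```

With rocml no conversion is needed, so applying this helper to a rocml reading would understate power by a factor of one million.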