AMD GPU Monitoring Fix Summary
Issue
The AMDMonitor class was using incorrect pyrsmi API calls. The implementation attempted to use the low-level rocmsmi module, which has a complex initialization sequence and C-style function signatures.
Solution
Updated to use the correct rocml high-level API from pyrsmi, based on the official example at:
/anvme/workspace/ihpc125h-llm-profiles/pyrsmi/examples/llm_monitoring/monitor_llm_inference.py
Changes Made
1. Fixed AMDMonitor Class
Before (incorrect):
from pyrsmi import rocmsmi
ret = self.rocmsmi.rsmi_init(0)
power_uw = self.rocmsmi.rsmi_dev_power_ave_get(self.device_id)
After (correct):
from pyrsmi import rocml
self.rocml.smi_initialize()
power_watts = self.rocml.smi_get_device_average_power(self.device_id)
Key API Functions:
- rocml.smi_initialize() - Initialize monitoring
- rocml.smi_get_device_average_power(device_id) - Get power in Watts (not microwatts!)
- rocml.smi_get_device_utilization(device_id) - Get GPU utilization %
- rocml.smi_get_device_memory_used(device_id) - Get memory used in bytes
- rocml.smi_get_device_memory_total(device_id) - Get total memory in bytes
- rocml.smi_get_device_temperature(device_id) - Get temperature
- rocml.smi_get_device_name(device_id) - Get device name
- rocml.smi_shutdown() - Cleanup
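The calls listed above can be combined into one sampling helper. The following is a minimal sketch, not the actual AMDMonitor implementation: `GpuSample`, `sample_gpu`, and `_valid` are hypothetical names, the `rocml` argument is assumed to be the module obtained from `from pyrsmi import rocml` (already initialized with `smi_initialize()`), and the -1 error sentinel is taken from the comparison table in this summary. Treating "GB" as 1024^3 bytes is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional

BYTES_PER_GB = 1024 ** 3  # assumption: "GB" is interpreted as GiB


@dataclass
class GpuSample:
    """One snapshot of GPU metrics; None marks a failed reading."""
    power_watts: Optional[float]
    utilization_percent: Optional[float]
    memory_used_gb: Optional[float]
    memory_total_gb: Optional[float]


def _valid(value):
    """Map the -1 error sentinel (or any negative value) to None."""
    return None if value is None or value < 0 else value


def sample_gpu(rocml, device_id: int = 0) -> GpuSample:
    """Read power, utilization, and memory through the rocml-style API.

    `rocml` only needs to expose the smi_get_device_* functions named in
    this document, so a stub object can stand in for it during testing.
    """
    power = _valid(rocml.smi_get_device_average_power(device_id))
    util = _valid(rocml.smi_get_device_utilization(device_id))
    used = _valid(rocml.smi_get_device_memory_used(device_id))
    total = _valid(rocml.smi_get_device_memory_total(device_id))
    return GpuSample(
        power_watts=None if power is None else float(power),  # already Watts
        utilization_percent=None if util is None else float(util),
        memory_used_gb=None if used is None else used / BYTES_PER_GB,
        memory_total_gb=None if total is None else total / BYTES_PER_GB,
    )
```

On an MI300X node this would be called as `sample_gpu(rocml, 0)` between `rocml.smi_initialize()` and `rocml.smi_shutdown()`. Passing the module in as an argument keeps the helper testable on machines without an AMD GPU.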
2. Updated All SLURM Scripts for Apptainer
All GPU benchmark scripts now run inside the apptainer container:
A100, H100, H200 (NVIDIA):
APPTAINER_IMAGE="/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif"
apptainer exec --nv $APPTAINER_IMAGE python run_benchmark.py ...
MI300X (AMD):
APPTAINER_IMAGE="/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif"
apptainer exec --rocm $APPTAINER_IMAGE python run_benchmark.py ...
Note: pass --nv on NVIDIA nodes and --rocm on AMD nodes.
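If the vendor-specific flag needs to be selected programmatically rather than hard-coded per SLURM script, the mapping can be sketched as below. This is only an illustration: `CONTAINER_FLAGS` and `build_apptainer_cmd` are hypothetical names that do not exist in the repository, and the arguments to run_benchmark.py are left out because they are elided in the scripts above.

```python
import shlex

# GPU vendor -> Apptainer GPU flag, as described in this section.
CONTAINER_FLAGS = {"nvidia": "--nv", "amd": "--rocm"}


def build_apptainer_cmd(vendor, image, command):
    """Return the argv list for running `command` inside `image`."""
    try:
        flag = CONTAINER_FLAGS[vendor.lower()]
    except KeyError:
        raise ValueError(f"unknown GPU vendor: {vendor!r}") from None
    return ["apptainer", "exec", flag, image, *command]


image = "/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif"
cmd = build_apptainer_cmd("amd", image, ["python", "run_benchmark.py"])
print(shlex.join(cmd))
```

Building the command as a list and joining it with `shlex.join` only for display avoids quoting problems if the image path ever contains spaces.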
3. Updated Documentation
- README.md now mentions apptainer usage
- Updated setup instructions to use apptainer for model caching
- Added notes about container flags (--nv vs --rocm)
Testing
To verify the AMD monitoring works:
# On an MI300X node, run the check inside the container
apptainer exec --rocm pytorch_25.10_tilelang.sif python -c "
from utils.gpu_monitor import AMDMonitor
m = AMDMonitor(0)
print(f'GPU: {m.get_device_name()}')
metrics = m.get_metrics()
print(f'Power: {metrics.power_watts:.2f} W')
print(f'Utilization: {metrics.gpu_utilization_percent:.1f}%')
print(f'Memory: {metrics.memory_used_gb:.2f} / {metrics.memory_total_gb:.2f} GB')
m.cleanup()
"
Files Modified
- /anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/utils/gpu_monitor.py - Fixed AMDMonitor class
- /anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_a100.sh - Added apptainer
- /anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_h100.sh - Added apptainer
- /anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_h200.sh - Added apptainer
- /anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_mi300x.sh - Added apptainer with --rocm
- /anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/README.md - Updated documentation
Key Differences: rocml vs rocmsmi
| Feature | rocml (High-level) | rocmsmi (Low-level) |
|---|---|---|
| API Style | Simple functions | Complex C-style API |
| Initialization | smi_initialize() | rsmi_init(0) + error codes |
| Power | Returns Watts | Returns microwatts |
| Memory | Returns bytes | Returns bytes via enums |
| Error Handling | Returns -1 on error | Returns error codes |
| Ease of Use | Much easier | Complex |
The rocml module is the recommended high-level Python API for pyrsmi.
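The unit difference in the Power row is the easiest one to get wrong when moving between the two modules. The sketch below normalizes a raw power reading to Watts under the conventions stated in the table; `to_watts` and its `source` argument are hypothetical names used only for illustration, and the -1 sentinel handling applies the table's "Returns -1 on error" entry for rocml.

```python
MICROWATTS_PER_WATT = 1_000_000


def to_watts(raw, source):
    """Normalize a raw power reading to Watts.

    source="rocml":   value is already in Watts; -1 signals an error.
    source="rocmsmi": value is in microwatts (errors are reported
                      separately through return codes, not handled here).
    """
    if source == "rocml":
        return None if raw < 0 else float(raw)
    if source == "rocmsmi":
        return raw / MICROWATTS_PER_WATT
    raise ValueError(f"unknown source: {source!r}")


print(to_watts(350_000_000, "rocmsmi"))  # microwatts in, Watts out
print(to_watts(350.0, "rocml"))          # already in Watts
```

Both calls yield the same value in Watts, which is a quick way to confirm that a reading has not been scaled twice or left in microwatts.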