# AMD GPU Monitoring Fix Summary

## Issue

The AMDMonitor class was using incorrect pyrsmi API calls. The implementation attempted to use the low-level `rocmsmi` module, which has complex initialization and function signatures.

## Solution

Updated to use the correct `rocml` high-level API from pyrsmi, based on the official example at:
`/anvme/workspace/ihpc125h-llm-profiles/pyrsmi/examples/llm_monitoring/monitor_llm_inference.py`

## Changes Made

### 1. Fixed AMDMonitor Class

**Before** (incorrect):

```python
from pyrsmi import rocmsmi

ret = self.rocmsmi.rsmi_init(0)
power_uw = self.rocmsmi.rsmi_dev_power_ave_get(self.device_id)
```

**After** (correct):

```python
from pyrsmi import rocml

self.rocml.smi_initialize()
power_watts = self.rocml.smi_get_device_average_power(self.device_id)
```

**Key API Functions**:

- `rocml.smi_initialize()` - Initialize monitoring
- `rocml.smi_get_device_average_power(device_id)` - Get power in Watts (not microwatts!)
- `rocml.smi_get_device_utilization(device_id)` - Get GPU utilization %
- `rocml.smi_get_device_memory_used(device_id)` - Get memory used in bytes
- `rocml.smi_get_device_memory_total(device_id)` - Get total memory in bytes
- `rocml.smi_get_device_temperature(device_id)` - Get temperature
- `rocml.smi_get_device_name(device_id)` - Get device name
- `rocml.smi_shutdown()` - Clean up
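Putting these calls together, the sketch below shows roughly how a monitor built on `rocml` can collect a metrics snapshot. The class and dataclass names are illustrative, not the exact `gpu_monitor.py` implementation; only the `rocml` function names and units come from the list above.

```python
from dataclasses import dataclass

from pyrsmi import rocml


@dataclass
class GPUMetrics:
    # Illustrative container; field names mirror those used in the Testing section below.
    power_watts: float
    gpu_utilization_percent: float
    memory_used_gb: float
    memory_total_gb: float


class RocmlMonitorSketch:
    """Minimal rocml-based monitor sketch (names are illustrative)."""

    def __init__(self, device_id: int = 0):
        self.device_id = device_id
        rocml.smi_initialize()

    def get_device_name(self) -> str:
        return rocml.smi_get_device_name(self.device_id)

    def get_metrics(self) -> GPUMetrics:
        # rocml reports power directly in Watts and memory in bytes,
        # so only a bytes-to-GB conversion is needed here.
        used_bytes = rocml.smi_get_device_memory_used(self.device_id)
        total_bytes = rocml.smi_get_device_memory_total(self.device_id)
        return GPUMetrics(
            power_watts=rocml.smi_get_device_average_power(self.device_id),
            gpu_utilization_percent=rocml.smi_get_device_utilization(self.device_id),
            memory_used_gb=used_bytes / 1024**3,
            memory_total_gb=total_bytes / 1024**3,
        )

    def cleanup(self) -> None:
        rocml.smi_shutdown()
```

Because `rocml` already returns Watts, the microwatt-to-Watt conversion that `rocmsmi` required is no longer needed.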
### 2. Updated All SLURM Scripts for Apptainer

All GPU benchmark scripts now run inside the apptainer container:

**A100, H100, H200** (NVIDIA):

```bash
APPTAINER_IMAGE="/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif"
apptainer exec --nv $APPTAINER_IMAGE python run_benchmark.py ...
```

**MI300X** (AMD):

```bash
APPTAINER_IMAGE="/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif"
apptainer exec --rocm $APPTAINER_IMAGE python run_benchmark.py ...
```

Note: `--nv` for NVIDIA, `--rocm` for AMD

### 3. Updated Documentation

- README.md now mentions apptainer usage
- Updated setup instructions to use apptainer for model caching
- Added notes about container flags (`--nv` vs `--rocm`)

## Testing

To verify that the AMD monitoring works:

```bash
# Inside apptainer on MI300X node
apptainer exec --rocm pytorch_25.10_tilelang.sif python -c "
from utils.gpu_monitor import AMDMonitor
m = AMDMonitor(0)
print(f'GPU: {m.get_device_name()}')
metrics = m.get_metrics()
print(f'Power: {metrics.power_watts:.2f} W')
print(f'Utilization: {metrics.gpu_utilization_percent:.1f}%')
print(f'Memory: {metrics.memory_used_gb:.2f} / {metrics.memory_total_gb:.2f} GB')
m.cleanup()
"
```

## Files Modified

1. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/utils/gpu_monitor.py` - Fixed AMDMonitor class
2. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_a100.sh` - Added apptainer
3. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_h100.sh` - Added apptainer
4. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_h200.sh` - Added apptainer
5. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_mi300x.sh` - Added apptainer with `--rocm`
6. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/README.md` - Updated documentation

## Key Differences: rocml vs rocmsmi

| Feature | rocml (High-level) | rocmsmi (Low-level) |
|---------|--------------------|---------------------|
| API Style | Simple functions | Complex C-style API |
| Initialization | `smi_initialize()` | `rsmi_init(0)` + error codes |
| Power | Returns Watts | Returns microwatts |
| Memory | Returns bytes | Returns bytes via enums |
| Error Handling | Returns -1 on error | Returns error codes |
| Ease of Use | Much easier | Complex |

The `rocml` module is the recommended high-level Python API for pyrsmi.
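As a usage note on the error-handling row, a polling loop built on `rocml` only needs to guard against the -1 sentinel rather than unpack C-style return codes. The helper below is an illustrative sketch; the function name and sampling parameters are made up for the example.

```python
import time

from pyrsmi import rocml


def sample_power(device_id: int = 0, interval_s: float = 1.0, samples: int = 5):
    """Poll average power a few times, skipping failed (-1) readings."""
    rocml.smi_initialize()
    try:
        readings = []
        for _ in range(samples):
            power = rocml.smi_get_device_average_power(device_id)
            if power != -1:  # per the table above, rocml signals errors with -1
                readings.append(power)
            time.sleep(interval_s)
        return readings
    finally:
        rocml.smi_shutdown()


if __name__ == "__main__":
    print(sample_power())
```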