Initial commit

2026-02-05 23:18:26 +01:00
commit 747c92ac6b
31 changed files with 4220 additions and 0 deletions
--- a/AMD_FIX_SUMMARY.md
+++ b/AMD_FIX_SUMMARY.md
@@ -0,0 +1,100 @@
+# AMD GPU Monitoring Fix Summary
+
+## Issue
+The AMDMonitor class was using incorrect pyrsmi API calls. The implementation attempted to use the low-level `rocmsmi` module, which has complex initialization and function signatures.
+
+## Solution
+Updated `AMDMonitor` to use the high-level `rocml` API from pyrsmi, following the official example at:
+`/anvme/workspace/ihpc125h-llm-profiles/pyrsmi/examples/llm_monitoring/monitor_llm_inference.py`
+
+## Changes Made
+
+### 1. Fixed AMDMonitor Class
+
+**Before** (incorrect):
+```python
+from pyrsmi import rocmsmi
+ret = self.rocmsmi.rsmi_init(0)
+power_uw = self.rocmsmi.rsmi_dev_power_ave_get(self.device_id)
+```
+
+**After** (correct):
+```python
+from pyrsmi import rocml
+self.rocml.smi_initialize()
+power_watts = self.rocml.smi_get_device_average_power(self.device_id)
+```
+
+**Key API Functions**:
+- `rocml.smi_initialize()` - Initialize monitoring
+- `rocml.smi_get_device_average_power(device_id)` - Get power in Watts (not microwatts!)
+- `rocml.smi_get_device_utilization(device_id)` - Get GPU utilization %
+- `rocml.smi_get_device_memory_used(device_id)` - Get memory used in bytes
+- `rocml.smi_get_device_memory_total(device_id)` - Get total memory in bytes
+- `rocml.smi_get_device_temperature(device_id)` - Get temperature
+- `rocml.smi_get_device_name(device_id)` - Get device name
+- `rocml.smi_shutdown()` - Cleanup
+
+### 2. Updated All SLURM Scripts for Apptainer
+
+All GPU benchmark scripts now run inside the apptainer container:
+
+**A100, H100, H200** (NVIDIA):
+```bash
+APPTAINER_IMAGE="/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif"
+apptainer exec --nv "$APPTAINER_IMAGE" python run_benchmark.py ...
+```
+
+**MI300X** (AMD):
+```bash
+APPTAINER_IMAGE="/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif"
+apptainer exec --rocm "$APPTAINER_IMAGE" python run_benchmark.py ...
+```
+
+Note: pass `--nv` on NVIDIA nodes and `--rocm` on AMD nodes so the container can access the GPUs.
+
+### 3. Updated Documentation
+
+- README.md now mentions apptainer usage
+- Updated setup instructions to use apptainer for model caching
+- Added notes about container flags (`--nv` vs `--rocm`)
+
+## Testing
+
+To verify the AMD monitoring works:
+
+```bash
+# Run on an MI300X node; the Python snippet executes inside the apptainer container
+apptainer exec --rocm pytorch_25.10_tilelang.sif python -c "
+from utils.gpu_monitor import AMDMonitor
+m = AMDMonitor(0)
+print(f'GPU: {m.get_device_name()}')
+metrics = m.get_metrics()
+print(f'Power: {metrics.power_watts:.2f} W')
+print(f'Utilization: {metrics.gpu_utilization_percent:.1f}%')
+print(f'Memory: {metrics.memory_used_gb:.2f} / {metrics.memory_total_gb:.2f} GB')
+m.cleanup()
+"
+```
+
+## Files Modified
+
+1. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/utils/gpu_monitor.py` - Fixed AMDMonitor class
+2. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_a100.sh` - Added apptainer
+3. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_h100.sh` - Added apptainer
+4. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_h200.sh` - Added apptainer
+5. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_mi300x.sh` - Added apptainer with `--rocm`
+6. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/README.md` - Updated documentation
+
+## Key Differences: rocml vs rocmsmi
+
+| Feature | rocml (High-level) | rocmsmi (Low-level) |
+|---------|-------------------|---------------------|
+| API Style | Simple functions | Complex C-style API |
+| Initialization | `smi_initialize()` | `rsmi_init(0)` + error codes |
+| Power | Returns Watts | Returns microwatts |
+| Memory | Returns bytes | Returns bytes; memory type selected via enum |
+| Error Handling | Returns -1 on error | Returns error codes |
+| Ease of Use | Much easier | Complex |
+
+The `rocml` module is the recommended high-level Python API for pyrsmi.
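The `rocml` calls listed in the summary could be combined into a single metrics read roughly as follows. This is a minimal sketch, not the actual `AMDMonitor` code: `GpuMetrics`, `read_metrics`, and `FakeRocml` are hypothetical names introduced for illustration. The field names mirror those printed in the Testing snippet, the `-1` error sentinel is taken from the comparison table, and treating "GB" as 1024³ bytes is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Optional

BYTES_PER_GB = 1024 ** 3  # assumption: "GB" in the summary means 1024^3 bytes


@dataclass
class GpuMetrics:
    """Hypothetical container; field names follow the Testing snippet."""
    power_watts: Optional[float]
    gpu_utilization_percent: Optional[float]
    memory_used_gb: Optional[float]
    memory_total_gb: Optional[float]


def _valid(value: Any) -> Optional[float]:
    """Map the -1 error sentinel (per the comparison table) to None."""
    return None if value is None or value < 0 else float(value)


def read_metrics(rocml: Any, device_id: int = 0) -> GpuMetrics:
    """Read one sample through the high-level rocml functions from the summary."""
    used = _valid(rocml.smi_get_device_memory_used(device_id))    # bytes
    total = _valid(rocml.smi_get_device_memory_total(device_id))  # bytes
    return GpuMetrics(
        power_watts=_valid(rocml.smi_get_device_average_power(device_id)),  # Watts
        gpu_utilization_percent=_valid(rocml.smi_get_device_utilization(device_id)),
        memory_used_gb=None if used is None else used / BYTES_PER_GB,
        memory_total_gb=None if total is None else total / BYTES_PER_GB,
    )


class FakeRocml:
    """Hypothetical stand-in so the sketch runs without an AMD GPU."""

    def smi_get_device_average_power(self, device_id):
        return 312.5

    def smi_get_device_utilization(self, device_id):
        return 87

    def smi_get_device_memory_used(self, device_id):
        return 48 * BYTES_PER_GB

    def smi_get_device_memory_total(self, device_id):
        return -1  # simulated failure


metrics = read_metrics(FakeRocml(), device_id=0)
print(metrics)
```

On an MI300X node the stand-in would be replaced by the real module: `from pyrsmi import rocml`, then `rocml.smi_initialize()` before the read and `rocml.smi_shutdown()` afterwards. Passing the module in as an argument keeps the unit conversion and error handling testable on machines without ROCm.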