# AMD GPU Monitoring Fix Summary

## Issue

The `AMDMonitor` class was calling the pyrsmi API incorrectly. It used the low-level `rocmsmi` module, which has complex initialization and function signatures.

## Solution

Updated `AMDMonitor` to use the high-level `rocml` API from pyrsmi, based on the official example at:

`/anvme/workspace/ihpc125h-llm-profiles/pyrsmi/examples/llm_monitoring/monitor_llm_inference.py`

## Changes Made

### 1. Fixed AMDMonitor Class

**Before** (incorrect):

```python
from pyrsmi import rocmsmi

# Low-level API: explicit init flags, return codes, power in microwatts
ret = self.rocmsmi.rsmi_init(0)
power_uw = self.rocmsmi.rsmi_dev_power_ave_get(self.device_id)
```

**After** (correct):

```python
from pyrsmi import rocml

# High-level API: no init flags, power returned directly in Watts
self.rocml.smi_initialize()
power_watts = self.rocml.smi_get_device_average_power(self.device_id)
```

**Key API Functions**:

- `rocml.smi_initialize()` - Initialize monitoring
- `rocml.smi_get_device_average_power(device_id)` - Get power in Watts (not microwatts!)
- `rocml.smi_get_device_utilization(device_id)` - Get GPU utilization %
- `rocml.smi_get_device_memory_used(device_id)` - Get memory used in bytes
- `rocml.smi_get_device_memory_total(device_id)` - Get total memory in bytes
- `rocml.smi_get_device_temperature(device_id)` - Get temperature
- `rocml.smi_get_device_name(device_id)` - Get device name
- `rocml.smi_shutdown()` - Cleanup

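The functions listed above can be combined into a single polling helper. The following is a minimal sketch, not the actual `AMDMonitor` implementation: `GpuSnapshot`, `read_gpu_metrics`, and the bytes-to-GiB conversion are assumptions made for illustration. The `rocml` module is passed in as an argument so the helper can also be exercised with a stub on machines without pyrsmi.

```python
from dataclasses import dataclass
from typing import Any

BYTES_PER_GIB = 1024 ** 3  # assumption: the "GB" fields are reported in GiB


@dataclass
class GpuSnapshot:
    """Hypothetical container for one reading; not part of pyrsmi."""
    name: str
    power_watts: float
    gpu_utilization_percent: float
    memory_used_gb: float
    memory_total_gb: float


def read_gpu_metrics(rocml: Any, device_id: int) -> GpuSnapshot:
    """Take one reading using only the rocml calls listed above.

    `rocml` is expected to be `pyrsmi.rocml` after `smi_initialize()`,
    or any object exposing the same functions.
    """
    return GpuSnapshot(
        name=rocml.smi_get_device_name(device_id),
        # Already in Watts, so no microwatt conversion is applied here.
        power_watts=float(rocml.smi_get_device_average_power(device_id)),
        gpu_utilization_percent=float(rocml.smi_get_device_utilization(device_id)),
        # Memory is reported in bytes; convert for display.
        memory_used_gb=rocml.smi_get_device_memory_used(device_id) / BYTES_PER_GIB,
        memory_total_gb=rocml.smi_get_device_memory_total(device_id) / BYTES_PER_GIB,
    )
```

With pyrsmi installed, this would be called as `read_gpu_metrics(rocml, 0)` between `rocml.smi_initialize()` and `rocml.smi_shutdown()`.
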
### 2. Updated All SLURM Scripts for Apptainer

All GPU benchmark scripts now run inside the Apptainer container:

**A100, H100, H200** (NVIDIA):

```bash
APPTAINER_IMAGE="/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif"
apptainer exec --nv "$APPTAINER_IMAGE" python run_benchmark.py ...
```

**MI300X** (AMD):

```bash
APPTAINER_IMAGE="/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif"
apptainer exec --rocm "$APPTAINER_IMAGE" python run_benchmark.py ...
```

Note: use `--nv` on NVIDIA nodes and `--rocm` on AMD nodes.

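If the launch command is ever assembled from Python rather than written out per SLURM script, the flag choice can be centralized. This is a sketch under that assumption: `GPU_FLAGS` and `build_apptainer_cmd` are hypothetical names and are not part of the benchmark repository.

```python
from typing import List, Sequence

# Flags as documented in this section: --nv for NVIDIA, --rocm for AMD.
GPU_FLAGS = {"nvidia": "--nv", "amd": "--rocm"}


def build_apptainer_cmd(vendor: str, image: str, script_args: Sequence[str]) -> List[str]:
    """Return the argv list for `apptainer exec` with the vendor's GPU flag."""
    try:
        flag = GPU_FLAGS[vendor.lower()]
    except KeyError:
        raise ValueError(f"unknown GPU vendor: {vendor!r}") from None
    return ["apptainer", "exec", flag, image, "python", *script_args]
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with the image path.
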
### 3. Updated Documentation

- README.md now mentions Apptainer usage
- Updated setup instructions to use Apptainer for model caching
- Added notes about container flags (`--nv` vs `--rocm`)

## Testing

To verify that AMD monitoring works:

```bash
# On an MI300X node, run the check inside the Apptainer container
apptainer exec --rocm pytorch_25.10_tilelang.sif python -c "
from utils.gpu_monitor import AMDMonitor
m = AMDMonitor(0)
print(f'GPU: {m.get_device_name()}')
metrics = m.get_metrics()
print(f'Power: {metrics.power_watts:.2f} W')
print(f'Utilization: {metrics.gpu_utilization_percent:.1f}%')
print(f'Memory: {metrics.memory_used_gb:.2f} / {metrics.memory_total_gb:.2f} GB')
m.cleanup()
"
```

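The printed values can also be checked programmatically. The sketch below assumes only the four metric attributes used in the test command; `check_metrics` and its plausibility bounds are hypothetical and are not part of `utils.gpu_monitor`.

```python
from typing import Any, List


def check_metrics(metrics: Any) -> List[str]:
    """Return a list of problems; an empty list means the reading looks sane."""
    problems = []
    # A non-positive value (including an error sentinel such as -1) is suspect.
    if metrics.power_watts <= 0:
        problems.append(f"power_watts not positive: {metrics.power_watts}")
    if not 0.0 <= metrics.gpu_utilization_percent <= 100.0:
        problems.append(
            f"gpu_utilization_percent out of range: {metrics.gpu_utilization_percent}"
        )
    if metrics.memory_total_gb <= 0:
        problems.append(f"memory_total_gb not positive: {metrics.memory_total_gb}")
    elif not 0.0 <= metrics.memory_used_gb <= metrics.memory_total_gb:
        problems.append(
            f"memory_used_gb outside [0, total]: {metrics.memory_used_gb}"
        )
    return problems
```

Calling `check_metrics(m.get_metrics())` in the test command and printing the result would flag a reading where a query failed instead of silently showing a negative number.
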
## Files Modified

1. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/utils/gpu_monitor.py` - Fixed AMDMonitor class
2. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_a100.sh` - Added apptainer
3. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_h100.sh` - Added apptainer
4. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_h200.sh` - Added apptainer
5. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_mi300x.sh` - Added apptainer with `--rocm`
6. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/README.md` - Updated documentation

## Key Differences: rocml vs rocmsmi

| Feature        | rocml (High-level)  | rocmsmi (Low-level)          |
|----------------|---------------------|------------------------------|
| API Style      | Simple functions    | Complex C-style API          |
| Initialization | `smi_initialize()`  | `rsmi_init(0)` + error codes |
| Power          | Returns Watts       | Returns microwatts           |
| Memory         | Returns bytes       | Returns bytes via enums      |
| Error Handling | Returns -1 on error | Returns error codes          |
| Ease of Use    | Much easier         | Complex                      |

The `rocml` module is the recommended high-level Python API for pyrsmi.
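
The unit and error-handling rows of the table are the ones most likely to cause silent bugs when switching modules. As a hedged illustration, the two helpers below (hypothetical names, not pyrsmi functions) convert a low-level microwatt reading to Watts and map the `-1` error value described in the table to `None`.

```python
from typing import Optional

MICROWATTS_PER_WATT = 1_000_000


def rocmsmi_power_to_watts(power_uw: int) -> float:
    """Convert a microwatt reading (low-level API) to Watts."""
    return power_uw / MICROWATTS_PER_WATT


def rocml_value_or_none(value: float) -> Optional[float]:
    """Map the -1 error value (per the table above) to None."""
    return None if value == -1 else float(value)
```

Treating `-1` as `None` keeps a failed query from being averaged into power or utilization statistics as if it were a real sample.
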