Ignore markdown files

2026-02-05 23:20:23 +01:00
parent 747c92ac6b
commit c4acf721e7
2 changed files with 0 additions and 411 deletions


@@ -1,100 +0,0 @@
# AMD GPU Monitoring Fix Summary
## Issue
The `AMDMonitor` class was using incorrect pyrsmi API calls: the implementation attempted to use the low-level `rocmsmi` module, which has complex initialization and C-style function signatures.
## Solution
Updated to use the correct `rocml` high-level API from pyrsmi, based on the official example at:
`/anvme/workspace/ihpc125h-llm-profiles/pyrsmi/examples/llm_monitoring/monitor_llm_inference.py`
## Changes Made
### 1. Fixed AMDMonitor Class
**Before** (incorrect):
```python
from pyrsmi import rocmsmi
ret = self.rocmsmi.rsmi_init(0)
power_uw = self.rocmsmi.rsmi_dev_power_ave_get(self.device_id)
```
**After** (correct):
```python
from pyrsmi import rocml
self.rocml.smi_initialize()
power_watts = self.rocml.smi_get_device_average_power(self.device_id)
```
**Key API Functions** (a minimal usage sketch follows this list):
- `rocml.smi_initialize()` - Initialize monitoring
- `rocml.smi_get_device_average_power(device_id)` - Get power in Watts (not microwatts!)
- `rocml.smi_get_device_utilization(device_id)` - Get GPU utilization %
- `rocml.smi_get_device_memory_used(device_id)` - Get memory used in bytes
- `rocml.smi_get_device_memory_total(device_id)` - Get total memory in bytes
- `rocml.smi_get_device_temperature(device_id)` - Get temperature
- `rocml.smi_get_device_name(device_id)` - Get device name
- `rocml.smi_shutdown()` - Cleanup
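As a rough illustration, a one-shot sampler built on these calls could look like the sketch below (`GPUSample` is a hypothetical container for this example, not the class used in `gpu_monitor.py`; a real monitor would call `smi_initialize()` once at startup, not per sample):
```python
from dataclasses import dataclass
from pyrsmi import rocml

@dataclass
class GPUSample:
    # Hypothetical container; fields mirror the rocml calls listed above
    power_watts: float
    utilization_percent: float
    memory_used_gb: float
    memory_total_gb: float

def sample_device(device_id: int = 0) -> GPUSample:
    """Read one snapshot of power, utilization, and memory via rocml."""
    rocml.smi_initialize()
    try:
        return GPUSample(
            power_watts=rocml.smi_get_device_average_power(device_id),
            utilization_percent=rocml.smi_get_device_utilization(device_id),
            # Memory calls return bytes; convert to GB for reporting
            memory_used_gb=rocml.smi_get_device_memory_used(device_id) / 1e9,
            memory_total_gb=rocml.smi_get_device_memory_total(device_id) / 1e9,
        )
    finally:
        rocml.smi_shutdown()
```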
### 2. Updated All SLURM Scripts for Apptainer
All GPU benchmark scripts now run inside the Apptainer container:
**A100, H100, H200** (NVIDIA):
```bash
APPTAINER_IMAGE="/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif"
apptainer exec --nv $APPTAINER_IMAGE python run_benchmark.py ...
```
**MI300X** (AMD):
```bash
APPTAINER_IMAGE="/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif"
apptainer exec --rocm $APPTAINER_IMAGE python run_benchmark.py ...
```
Note: `--nv` for NVIDIA, `--rocm` for AMD
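For reference, a stripped-down MI300X job script following this pattern might look like the sketch below; the SLURM directives are placeholders, so check the actual `slurm_mi300x.sh` for the real partition, resource, and walltime settings:
```bash
#!/bin/bash
#SBATCH --job-name=llm-bench-mi300x   # placeholder job name
#SBATCH --gres=gpu:1                  # placeholder resource request
#SBATCH --time=01:00:00               # placeholder walltime

APPTAINER_IMAGE="/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif"

# --rocm binds the host ROCm stack into the container; NVIDIA scripts use --nv instead
apptainer exec --rocm "$APPTAINER_IMAGE" \
    python run_benchmark.py --mode both --model-path ./model_cache
```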
### 3. Updated Documentation
- README.md now mentions Apptainer usage
- Updated setup instructions to use Apptainer for model caching
- Added notes about container flags (`--nv` vs `--rocm`)
## Testing
To verify the AMD monitoring works:
```bash
# Inside apptainer on MI300X node
apptainer exec --rocm pytorch_25.10_tilelang.sif python -c "
from utils.gpu_monitor import AMDMonitor
m = AMDMonitor(0)
print(f'GPU: {m.get_device_name()}')
metrics = m.get_metrics()
print(f'Power: {metrics.power_watts:.2f} W')
print(f'Utilization: {metrics.gpu_utilization_percent:.1f}%')
print(f'Memory: {metrics.memory_used_gb:.2f} / {metrics.memory_total_gb:.2f} GB')
m.cleanup()
"
```
## Files Modified
1. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/utils/gpu_monitor.py` - Fixed AMDMonitor class
2. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_a100.sh` - Added apptainer
3. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_h100.sh` - Added apptainer
4. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_h200.sh` - Added apptainer
5. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_mi300x.sh` - Added apptainer with --rocm
6. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/README.md` - Updated documentation
## Key Differences: rocml vs rocmsmi
| Feature | rocml (High-level) | rocmsmi (Low-level) |
|---------|-------------------|---------------------|
| API Style | Simple functions | Complex C-style API |
| Initialization | `smi_initialize()` | `rsmi_init(0)` + error codes |
| Power | Returns Watts | Returns microwatts |
| Memory | Returns bytes | Returns bytes via enums |
| Error Handling | Returns -1 on error | Returns error codes |
| Ease of Use | Much easier | Complex |
The `rocml` module is the recommended high-level Python API in pyrsmi.
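Because `rocml` signals failure with a `-1` return value rather than raising an exception, callers may want a small guard; a sketch (not code from `gpu_monitor.py`):
```python
from pyrsmi import rocml

def safe_average_power(device_id):
    """Return power in Watts, or None if rocml reports an error (-1)."""
    watts = rocml.smi_get_device_average_power(device_id)
    return None if watts == -1 else watts
```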

README.md

@@ -1,311 +0,0 @@
# LLM Benchmark Suite
A comprehensive benchmarking suite for comparing LLM performance (Qwen3-4B) across different GPU architectures: **MI300X**, **A100 80GB**, **H100**, and **H200**.
## Features
- **Pretraining Benchmarks**: Separate metrics for forward, backward, and optimizer stages
- **Inference Benchmarks**: Separate metrics for prefill (TTFT) and decode (ITL) stages
- **Energy Monitoring**: GPU-specific energy and power measurement
- NVIDIA: pynvml
- AMD: pyrsmi
- **Attention Implementations**:
- FlashAttention-2 (A100, MI300X)
- FlashAttention-3 Hopper (H100, H200)
- Configurable via CLI
- **Comprehensive Metrics**:
- Tokens per second
- Energy per token (see the sketch after this list)
- Time to First Token (TTFT)
- Inter-Token Latency (ITL)
- End-to-End Request Latency
- GPU utilization and memory usage
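The energy figures are derived from sampled GPU power draw. One plausible derivation, shown as a sketch (the actual `utils/metrics.py` may differ), integrates power over the stage duration and normalizes by token count:
```python
def energy_joules(timestamps_s, power_watts):
    """Trapezoidal integration of sampled power: E = integral of P dt."""
    return sum(
        0.5 * (power_watts[i] + power_watts[i + 1])
        * (timestamps_s[i + 1] - timestamps_s[i])
        for i in range(len(timestamps_s) - 1)
    )

def energy_per_token_mj(energy_j, num_tokens):
    """Energy efficiency in millijoules per token."""
    return energy_j / num_tokens * 1e3
```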
## Directory Structure
```
llm-benchmark/
├── cache_model.py # Model caching script
├── benchmark_pretrain.py # Pretraining benchmark
├── benchmark_inference.py # Inference benchmark
├── run_benchmark.py # Main orchestration script
├── requirements.txt # Python dependencies
├── utils/
│ ├── gpu_monitor.py # GPU monitoring (NVIDIA & AMD)
│ ├── metrics.py # Metrics collection and reporting
│ └── attention.py # Attention implementation helpers
├── configs/
│ ├── a100.yaml
│ ├── h100.yaml
│ ├── h200.yaml
│ └── mi300x.yaml
└── results/ # Benchmark results (JSON)
```
## Setup
### 1. Container Environment
All benchmarks should be run inside the Apptainer container:
```bash
# Container is located at:
/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif
```
### 2. Install Dependencies (if not using Apptainer)
To run directly on the host without Apptainer:
```bash
# Install Python dependencies
pip install -r requirements.txt
# For AMD GPUs, ensure ROCm and pyrsmi are installed
# For NVIDIA GPUs, ensure CUDA and pynvml are installed
```
### 3. Cache Model (Run on Head Node)
**IMPORTANT**: Run this on the head node BEFORE allocating compute nodes, as compute nodes typically have no internet access.
```bash
# Using apptainer (recommended)
apptainer exec --nv pytorch_25.10_tilelang.sif python cache_model.py \
--model-name Qwen/Qwen3-4B \
--cache-dir ./model_cache
# Or directly (if dependencies installed)
python cache_model.py --model-name Qwen/Qwen3-4B --cache-dir ./model_cache
```
The model will be cached to `./model_cache` in the current directory (avoiding the slow NFS-backed `$HOME`).
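Under the hood, caching amounts to downloading the model snapshot into a local directory. A minimal equivalent is sketched below using `huggingface_hub`; the real `cache_model.py` may add tokenizer handling and argument parsing on top:
```python
from huggingface_hub import snapshot_download

# Download all files of the model repo into ./model_cache
# (sketch of what cache_model.py presumably does)
snapshot_download(repo_id="Qwen/Qwen3-4B", cache_dir="./model_cache")
```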
## Usage
### Quick Start
```bash
# Run both pretraining and inference benchmarks
python run_benchmark.py --mode both --model-path ./model_cache
# Run only pretraining
python run_benchmark.py --mode pretrain --num-steps 20
# Run only inference
python run_benchmark.py --mode inference --num-requests 20
```
### Detailed Usage
#### List Available GPUs
```bash
python run_benchmark.py --list-gpus
```
#### Pretraining Benchmark
```bash
python benchmark_pretrain.py \
--model-path ./model_cache \
--model-name Qwen/Qwen3-4B \
--attn-implementation auto \
--batch-size 8 \
--sequence-length 8192 \
--num-steps 10 \
--warmup-steps 3 \
--output-dir ./results
```
**Metrics Reported** (per stage: forward, backward, optimizer):
- Duration (ms)
- Tokens processed
- Throughput (tokens/s)
- Energy (J)
- Energy per token (J/token)
- Average power (W)
- Peak memory (GB)
- GPU utilization (%)
#### Inference Benchmark
```bash
python benchmark_inference.py \
--model-path ./model_cache \
--model-name Qwen/Qwen3-4B \
--attn-implementation auto \
--num-requests 10 \
--prompt-length 512 \
--generation-length 100 \
--warmup-requests 2 \
--output-dir ./results
```
**Metrics Reported** (a timing sketch follows this list):
- **Prefill**: TTFT, throughput, energy per token
- **Decode**: ITL, throughput, energy per token
- **End-to-End**: Request latency, total throughput, total energy
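TTFT and ITL can be separated by timestamping the first generated token. A sketch, assuming a token-by-token generation loop (`generate_one_token` is a placeholder callable; `benchmark_inference.py` may measure differently):
```python
import time

def time_generation(generate_one_token, num_new_tokens):
    """Return (ttft_s, mean_itl_s) for a token-by-token generation loop."""
    start = time.perf_counter()
    generate_one_token()                          # prefill + first token
    ttft = time.perf_counter() - start
    decode_start = time.perf_counter()
    for _ in range(num_new_tokens - 1):           # remaining decode steps
        generate_one_token()
    mean_itl = (time.perf_counter() - decode_start) / max(num_new_tokens - 1, 1)
    return ttft, mean_itl
```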
### Attention Implementations
The benchmark automatically selects the optimal attention implementation for the detected GPU (a selection sketch appears at the end of this section):
- **A100, MI300X**: `flash_attention_2`
- **H100, H200**: `flash_attention_3_hopper`
Override with `--attn-implementation`:
```bash
# Force FlashAttention-3 Hopper on H100
python run_benchmark.py --attn-implementation flash_attention_3_hopper
# Use SDPA instead
python run_benchmark.py --attn-implementation sdpa
```
Available options:
- `auto` - Auto-detect based on GPU
- `flash_attention_2` - FlashAttention-2 (all GPUs)
- `flash_attention_3_hopper` - FlashAttention-3 for H100/H200
- `sdpa` - PyTorch Scaled Dot Product Attention
- `eager` - Standard PyTorch attention
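With `auto`, the choice reduces to a lookup on the detected GPU name. A hedged sketch of such logic (illustrative only; `utils/attention.py` is the authoritative implementation):
```python
def select_attn_implementation(gpu_name: str) -> str:
    """Map a GPU name to the preferred attention backend (sketch)."""
    name = gpu_name.upper()
    if "H100" in name or "H200" in name:
        return "flash_attention_3_hopper"   # Hopper-only FA3 build
    if "A100" in name or "MI300X" in name:
        return "flash_attention_2"
    return "sdpa"                           # safe fallback elsewhere
```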
## Running on SLURM
All SLURM scripts are configured to run inside the Apptainer container. First cache the model on the head node:
```bash
# On head node (with internet access)
apptainer exec --nv pytorch_25.10_tilelang.sif python cache_model.py \
--model-name Qwen/Qwen3-4B \
--cache-dir ./model_cache
```
Then submit jobs:
```bash
# A100
sbatch slurm_a100.sh
# H100
sbatch slurm_h100.sh
# H200
sbatch slurm_h200.sh
# MI300X
sbatch slurm_mi300x.sh
```
**Note**:
- NVIDIA GPUs use the `--nv` flag
- AMD GPUs use the `--rocm` flag
## Output
Results are saved to the `--output-dir` directory (default: `./results/`):
- `pretrain_<GPU>_<ATTENTION>.json` - Pretraining metrics
- `inference_<GPU>_<ATTENTION>.json` - Inference metrics
Example output:
```
===============================================================================
PRETRAINING BENCHMARK RESULTS
===============================================================================
Model: Qwen/Qwen3-4B
GPU: NVIDIA A100 80GB
Attention: flash_attention_2
Batch Size: 8
Sequence Length: 8192
Training Steps: 10
-------------------------------------------------------------------------------
STAGE BREAKDOWN
-------------------------------------------------------------------------------
[1] FORWARD PASS
Duration: 1005.23 ms
Tokens: 163,840
Throughput: 163,012.45 tokens/s
Energy: 253.0 J
Energy per Token: 1.5443 mJ/token
[2] BACKWARD PASS
Duration: 2052.11 ms
Tokens: 163,840
Throughput: 79,857.23 tokens/s
Energy: 516.2 J
Energy per Token: 3.1513 mJ/token
[3] OPTIMIZER STEP
Duration: 153.42 ms
Tokens: 163,840
Throughput: 1,068,012.34 tokens/s
Energy: 38.4 J
Energy per Token: 0.2344 mJ/token
-------------------------------------------------------------------------------
OVERALL METRICS
-------------------------------------------------------------------------------
Total Duration: 3210.76 ms
Total Tokens: 163,840
Throughput: 51,012.45 tokens/s
Total Energy: 807.6 J
Energy per Token: 4.9300 mJ/token
===============================================================================
```
## Key Metrics Reference
### Pretraining
- **Forward**: Input processing and loss calculation
- **Backward**: Gradient computation
- **Optimizer**: Weight updates
### Inference
- **TTFT (Time to First Token)**: Prefill latency
- **ITL (Inter-Token Latency)**: Average decode time per token
- **E2E Latency**: Total request time (prefill + decode)
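Approximately, for a request that generates N tokens, these relate as E2E ≈ TTFT + (N − 1) × ITL, with scheduling and sampling overhead adding a small margin on top.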
### Energy
- **Energy (J)**: Total energy consumed
- **Energy per Token (mJ/token)**: Energy efficiency metric
- **Average Power (W)**: Power consumption during stage
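As a sanity check against the example output above: 807.6 J over 163,840 tokens is 807.6 / 163,840 ≈ 0.00493 J/token, i.e. the reported 4.93 mJ/token.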
## Troubleshooting
### Model Not Found
Ensure you've cached the model first:
```bash
python cache_model.py --model-name Qwen/Qwen3-4B --cache-dir ./model_cache
```
### GPU Monitoring Errors
- **NVIDIA**: Install pynvml: `pip install pynvml`
- **AMD**: Install pyrsmi: `pip install pyrsmi`
### FlashAttention-3 Not Found
For H100/H200, ensure FlashAttention-3 is installed. If it is not available, fall back to FlashAttention-2:
```bash
python run_benchmark.py --attn-implementation flash_attention_2
```
### Out of Memory
Reduce batch size or sequence length:
```bash
python run_benchmark.py --batch-size 4 --sequence-length 1024
```
## Citation
If you use this benchmark suite, please cite:
- [FlashAttention-2](https://github.com/Dao-AILab/flash-attention)
- [FlashAttention-3](https://github.com/Dao-AILab/flash-attention) (for Hopper)
- [Qwen Models](https://huggingface.co/Qwen)
## License
MIT License - see LICENSE file for details