# LLM Benchmark Suite

A comprehensive benchmarking suite for comparing LLM performance (Qwen3-4B) across different GPU architectures: **MI300X**, **A100 80G**, **H100**, and **H200**.

## Features

- **Pretraining Benchmarks**: Separate metrics for forward, backward, and optimizer stages
- **Inference Benchmarks**: Separate metrics for prefill (TTFT) and decode (ITL) stages
- **Energy Monitoring**: GPU-specific energy and power measurement (see the sketch after this list)
  - NVIDIA: pynvml
  - AMD: pyrsmi
- **Attention Implementations**:
  - FlashAttention-2 (A100, MI300X)
  - FlashAttention-3 Hopper (H100, H200)
  - Configurable via CLI
- **Comprehensive Metrics**:
  - Tokens per second
  - Energy per token
  - Time to First Token (TTFT)
  - Inter-Token Latency (ITL)
  - End-to-End Request Latency
  - GPU utilization and memory usage

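The energy and power numbers are read from vendor telemetry. As a rough sketch (not the actual `utils/gpu_monitor.py` code), energy on NVIDIA GPUs can be sampled with pynvml as below; the AMD path uses pyrsmi analogously:

```python
# Minimal sketch (assumption): sampling NVIDIA energy/power with pynvml.
# The suite's real monitoring lives in utils/gpu_monitor.py.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Cumulative energy since the driver loaded, in millijoules (supported on recent GPUs).
start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)

# ... run one benchmark stage here ...

end_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # milliwatts -> watts

print(f"stage energy: {(end_mj - start_mj) / 1000.0:.1f} J, power: {power_w:.0f} W")
pynvml.nvmlShutdown()
```

Energy for a stage is the difference between two counter readings; energy per token divides that by the tokens processed in the stage.
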
## Directory Structure

```
llm-benchmark/
├── cache_model.py           # Model caching script
├── benchmark_pretrain.py    # Pretraining benchmark
├── benchmark_inference.py   # Inference benchmark
├── run_benchmark.py         # Main orchestration script
├── requirements.txt         # Python dependencies
├── utils/
│   ├── gpu_monitor.py       # GPU monitoring (NVIDIA & AMD)
│   ├── metrics.py           # Metrics collection and reporting
│   └── attention.py         # Attention implementation helpers
├── configs/
│   ├── a100.yaml
│   ├── h100.yaml
│   ├── h200.yaml
│   └── mi300x.yaml
└── results/                 # Benchmark results (JSON)
```

## Setup
### 1. Container Environment

All benchmarks should be run inside the apptainer container:

```bash
# Container is located at:
/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif
```

### 2. Install Dependencies (if not using apptainer)

If you want to run directly without apptainer:

```bash
# Install Python dependencies
pip install -r requirements.txt

# For AMD GPUs, ensure ROCm and pyrsmi are installed
# For NVIDIA GPUs, ensure CUDA and pynvml are installed
```

### 3. Cache Model (Run on Head Node)

**IMPORTANT**: Run this on the head node BEFORE allocating compute nodes, as compute nodes typically have no internet access.

```bash
# Using apptainer (recommended)
apptainer exec --nv pytorch_25.10_tilelang.sif python cache_model.py \
    --model-name Qwen/Qwen3-4B \
    --cache-dir ./model_cache

# Or directly (if dependencies installed)
python cache_model.py --model-name Qwen/Qwen3-4B --cache-dir ./model_cache
```

The model is cached to `./model_cache` in the current directory, avoiding the slow NFS `$HOME`.

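As a point of reference, caching boils down to downloading the tokenizer and weights into a local directory once, so that offline compute nodes can load them later. The sketch below uses standard `transformers` calls and is only an illustration; `cache_model.py` may differ in details:

```python
# Minimal sketch (assumption): pre-download tokenizer and weights into ./model_cache.
# cache_model.py may differ; these are standard transformers calls.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"
cache_dir = "./model_cache"

AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir)
```
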
## Usage
### Quick Start

```bash
# Run both pretraining and inference benchmarks
python run_benchmark.py --mode both --model-path ./model_cache

# Run only pretraining
python run_benchmark.py --mode pretrain --num-steps 20

# Run only inference
python run_benchmark.py --mode inference --num-requests 20
```

### Detailed Usage
#### List Available GPUs

```bash
python run_benchmark.py --list-gpus
```

#### Pretraining Benchmark

```bash
python benchmark_pretrain.py \
    --model-path ./model_cache \
    --model-name Qwen/Qwen3-4B \
    --attn-implementation auto \
    --batch-size 8 \
    --sequence-length 8192 \
    --num-steps 10 \
    --warmup-steps 3 \
    --output-dir ./results
```

**Metrics Reported** (per stage: forward, backward, optimizer; see the timing sketch after this list):

- Duration (ms)
- Tokens processed
- Throughput (tokens/s)
- Energy (J)
- Energy per token (mJ/token)
- Average power (W)
- Peak memory (GB)
- GPU utilization (%)

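The per-stage numbers come from timing the forward, backward, and optimizer phases separately. The sketch below shows one common way to do that with CUDA events; it illustrates the general pattern, not the exact code in `benchmark_pretrain.py`:

```python
# Minimal sketch (assumption): timing each training phase separately with CUDA events.
import torch

def timed(fn):
    """Run fn and return (result, elapsed milliseconds on the GPU)."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()
    return result, start.elapsed_time(end)

# Inside one training step (model, batch, optimizer defined elsewhere):
#   loss, fwd_ms = timed(lambda: model(**batch).loss)
#   _, bwd_ms    = timed(lambda: loss.backward())
#   _, opt_ms    = timed(lambda: (optimizer.step(), optimizer.zero_grad()))
```
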
#### Inference Benchmark

```bash
python benchmark_inference.py \
    --model-path ./model_cache \
    --model-name Qwen/Qwen3-4B \
    --attn-implementation auto \
    --num-requests 10 \
    --prompt-length 512 \
    --generation-length 100 \
    --warmup-requests 2 \
    --output-dir ./results
```

**Metrics Reported** (see the latency sketch after this list):

- **Prefill**: TTFT, throughput, energy per token
- **Decode**: ITL, throughput, energy per token
- **End-to-End**: Request latency, total throughput, total energy

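TTFT and ITL follow their usual definitions: TTFT is the time from the start of a request to the first generated token, and ITL is the average gap between consecutive generated tokens. A minimal sketch of computing both from per-token timestamps (illustrative only, not the exact logic in `benchmark_inference.py`):

```python
# Minimal sketch (assumption): deriving TTFT, ITL, and E2E latency from token timestamps.

def latency_metrics(token_times, request_start):
    """token_times: wall-clock times (s) at which each generated token became available."""
    ttft = token_times[0] - request_start                 # prefill latency
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0          # mean decode gap
    e2e = token_times[-1] - request_start                 # total request latency
    return ttft, itl, e2e

# Example: first token after 120 ms, then roughly 30 ms per token.
print(latency_metrics([0.12, 0.15, 0.18, 0.21], 0.0))     # ttft=0.12, itl~0.03, e2e=0.21
```
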
### Attention Implementations

The benchmark automatically selects an attention implementation appropriate for the detected GPU:

- **A100, MI300X**: `flash_attention_2`
- **H100, H200**: `flash_attention_3_hopper`

Override with `--attn-implementation`:

```bash
# Force FlashAttention-3 Hopper on H100
python run_benchmark.py --attn-implementation flash_attention_3_hopper

# Use SDPA instead
python run_benchmark.py --attn-implementation sdpa
```

Available options:

- `auto` - Auto-detect based on GPU
- `flash_attention_2` - FlashAttention-2 (all GPUs)
- `flash_attention_3_hopper` - FlashAttention-3 for H100/H200
- `sdpa` - PyTorch scaled dot-product attention
- `eager` - Standard PyTorch attention

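For the stock options, the choice ultimately reaches the model through the `attn_implementation` argument when the model is loaded with `transformers`; `flash_attention_3_hopper` is presumably mapped by this suite's `utils/attention.py` helpers rather than being a standard `transformers` value. A minimal loading sketch (illustrative, with an assumed bfloat16 dtype):

```python
# Minimal sketch (assumption): passing an attention backend to transformers at load time.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B",
    cache_dir="./model_cache",
    torch_dtype=torch.bfloat16,               # assumed dtype, not specified by this README
    attn_implementation="flash_attention_2",  # or "sdpa", "eager"
)
```
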
## Running on SLURM

All SLURM scripts are configured to run inside the apptainer container. First cache the model on the head node:

```bash
# On head node (with internet access)
apptainer exec --nv pytorch_25.10_tilelang.sif python cache_model.py \
    --model-name Qwen/Qwen3-4B \
    --cache-dir ./model_cache
```

Then submit jobs:

```bash
# A100
sbatch slurm_a100.sh

# H100
sbatch slurm_h100.sh

# H200
sbatch slurm_h200.sh

# MI300X
sbatch slurm_mi300x.sh
```

**Note**:

- NVIDIA GPUs use the `--nv` flag
- AMD GPUs use the `--rocm` flag

## Output

Results are saved to the `--output-dir` directory (default: `./results/`):

- `pretrain_<GPU>_<ATTENTION>.json` - Pretraining metrics
- `inference_<GPU>_<ATTENTION>.json` - Inference metrics

Example output:

```
===============================================================================
PRETRAINING BENCHMARK RESULTS
===============================================================================

Model: Qwen/Qwen3-4B
GPU: NVIDIA A100 80GB
Attention: flash_attention_2
Batch Size: 8
Sequence Length: 8192
Training Steps: 10

-------------------------------------------------------------------------------
STAGE BREAKDOWN
-------------------------------------------------------------------------------

[1] FORWARD PASS
    Duration: 1005.23 ms
    Tokens: 163,840
    Throughput: 163,012.45 tokens/s
    Energy: 253.0 J
    Energy per Token: 1.5443 mJ/token

[2] BACKWARD PASS
    Duration: 2052.11 ms
    Tokens: 163,840
    Throughput: 79,857.23 tokens/s
    Energy: 516.2 J
    Energy per Token: 3.1513 mJ/token

[3] OPTIMIZER STEP
    Duration: 153.42 ms
    Tokens: 163,840
    Throughput: 1,068,012.34 tokens/s
    Energy: 38.4 J
    Energy per Token: 0.2344 mJ/token

-------------------------------------------------------------------------------
OVERALL METRICS
-------------------------------------------------------------------------------
Total Duration: 3210.76 ms
Total Tokens: 163,840
Throughput: 51,012.45 tokens/s
Total Energy: 807.6 J
Energy per Token: 4.9300 mJ/token
===============================================================================
```

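Since every run also writes its metrics as JSON, results from different GPUs can be compared with a few lines of Python. The sketch below only illustrates the idea; the JSON field names are not documented here, so the key in the example is hypothetical:

```python
# Minimal sketch (assumption): gathering result files for a cross-GPU comparison.
# The file-name pattern is documented above; the JSON key below is hypothetical.
import glob
import json

for path in sorted(glob.glob("results/pretrain_*.json")):
    with open(path) as f:
        result = json.load(f)
    print(path, result.get("energy_per_token_mj", "n/a"))
```
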
## Key Metrics Reference
### Pretraining

- **Forward**: Input processing and loss calculation
- **Backward**: Gradient computation
- **Optimizer**: Weight updates

### Inference

- **TTFT (Time to First Token)**: Prefill latency
- **ITL (Inter-Token Latency)**: Average decode time per token
- **E2E Latency**: Total request time (prefill + decode)

### Energy

- **Energy (J)**: Total energy consumed during the stage or request
- **Energy per Token (mJ/token)**: Energy efficiency metric: total energy divided by tokens processed (in the sample output above, 807.6 J over 163,840 tokens gives roughly 4.93 mJ/token)
- **Average Power (W)**: Power consumption during the stage

## Troubleshooting
### Model Not Found

Ensure you've cached the model first:

```bash
python cache_model.py --model-name Qwen/Qwen3-4B --cache-dir ./model_cache
```

### GPU Monitoring Errors

- **NVIDIA**: Install pynvml: `pip install pynvml`
- **AMD**: Install pyrsmi: `pip install pyrsmi`

### FlashAttention-3 Not Found

For H100/H200, ensure FlashAttention-3 is installed. If it is not available, use:

```bash
python run_benchmark.py --attn-implementation flash_attention_2
```

### Out of Memory

Reduce the batch size or sequence length:

```bash
python run_benchmark.py --batch-size 4 --sequence-length 1024
```

## Citation

If you use this benchmark suite, please cite:

- [FlashAttention-2](https://github.com/Dao-AILab/flash-attention)
- [FlashAttention-3](https://github.com/Dao-AILab/flash-attention) (for Hopper)
- [Qwen Models](https://huggingface.co/Qwen)

## License

MIT License - see LICENSE file for details