LLM Benchmark Suite
A comprehensive benchmarking suite for comparing LLM (Qwen3-4B) performance across different GPU architectures: MI300X, A100 80GB, H100, and H200.
Features
- Pretraining Benchmarks: Separate metrics for the forward, backward, and optimizer stages
- Inference Benchmarks: Separate metrics for the prefill (TTFT) and decode (ITL) stages
- Energy Monitoring: GPU-specific energy and power measurement (see the sketch after this list)
  - NVIDIA: pynvml
  - AMD: pyrsmi
- Attention Implementations:
  - FlashAttention-2 (A100, MI300X)
  - FlashAttention-3 Hopper (H100, H200)
  - Configurable via CLI
- Comprehensive Metrics:
  - Tokens per second
  - Energy per token
  - Time to First Token (TTFT)
  - Inter-Token Latency (ITL)
  - End-to-End Request Latency
  - GPU utilization and memory usage
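The energy figures are obtained by sampling GPU power while a stage runs and integrating over time. Below is a minimal sketch of that idea for the NVIDIA path; the sampling interval, threading layout, and function name are illustrative assumptions rather than the suite's actual utils/gpu_monitor.py, and on AMD the pyrsmi package exposes analogous power queries.

```python
# Minimal sketch (assumption, not the suite's gpu_monitor.py): sample GPU power
# with pynvml while a workload runs, then integrate the samples into energy (J).
import threading
import time

import pynvml

def run_with_energy(workload, device_index=0, interval_s=0.1):
    """Run workload() while sampling power; return (result, energy_joules)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples = []
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            # nvmlDeviceGetPowerUsage reports milliwatts
            samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)
            time.sleep(interval_s)

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        result = workload()
    finally:
        stop.set()
        thread.join()
        pynvml.nvmlShutdown()

    # Riemann-sum approximation: each power sample covers one sampling interval
    energy_j = sum(samples) * interval_s
    return result, energy_j
```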
Directory Structure
llm-benchmark/
├── cache_model.py # Model caching script
├── benchmark_pretrain.py # Pretraining benchmark
├── benchmark_inference.py # Inference benchmark
├── run_benchmark.py # Main orchestration script
├── requirements.txt # Python dependencies
├── utils/
│ ├── gpu_monitor.py # GPU monitoring (NVIDIA & AMD)
│ ├── metrics.py # Metrics collection and reporting
│ └── attention.py # Attention implementation helpers
├── configs/
│ ├── a100.yaml
│ ├── h100.yaml
│ ├── h200.yaml
│ └── mi300x.yaml
└── results/ # Benchmark results (JSON)
Setup
1. Container Environment
All benchmarks should be run inside the apptainer container:
# Container is located at:
/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif
2. Install Dependencies (if not using apptainer)
If you want to run directly without apptainer:
# Install Python dependencies
pip install -r requirements.txt
# For AMD GPUs, ensure ROCm and pyrsmi are installed
# For NVIDIA GPUs, ensure CUDA and pynvml are installed
3. Cache Model (Run on Head Node)
IMPORTANT: Run this on the head node BEFORE allocating compute nodes, as compute nodes are typically offline.
# Using apptainer (recommended)
apptainer exec --nv pytorch_25.10_tilelang.sif python cache_model.py \
--model-name Qwen/Qwen3-4B \
--cache-dir ./model_cache
# Or directly (if dependencies installed)
python cache_model.py --model-name Qwen/Qwen3-4B --cache-dir ./model_cache
The model will be cached to ./model_cache in the current directory (avoiding slow NFS $HOME).
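For context, caching a model like this usually boils down to a Hugging Face snapshot download into a local directory; the following is a hedged sketch of that idea, not the actual contents of cache_model.py.

```python
# Hypothetical sketch (not the actual cache_model.py): download the model
# snapshot to local disk so offline compute nodes can load it later.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen3-4B",    # corresponds to --model-name
    local_dir="./model_cache",  # corresponds to --cache-dir (local disk, not NFS $HOME)
)
```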
Usage
Quick Start
# Run both pretraining and inference benchmarks
python run_benchmark.py --mode both --model-path ./model_cache
# Run only pretraining
python run_benchmark.py --mode pretrain --num-steps 20
# Run only inference
python run_benchmark.py --mode inference --num-requests 20
Detailed Usage
List Available GPUs
python run_benchmark.py --list-gpus
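GPU enumeration for a listing like this can be done with plain PyTorch calls, which also cover ROCm builds; the snippet below is a simplified sketch, not the suite's detection logic.

```python
# Sketch: enumerate visible GPUs by name (ROCm devices are also exposed
# through torch.cuda in AMD builds of PyTorch).
import torch

for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
```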
Pretraining Benchmark
python benchmark_pretrain.py \
--model-path ./model_cache \
--model-name Qwen/Qwen3-4B \
--attn-implementation auto \
--batch-size 8 \
--sequence-length 8192 \
--num-steps 10 \
--warmup-steps 3 \
--output-dir ./results
Metrics Reported (per stage: forward, backward, optimizer):
- Duration (ms)
- Tokens processed
- Throughput (tokens/s)
- Energy (J)
- Energy per token (J/token)
- Average power (W)
- Peak memory (GB)
- GPU utilization (%)
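The stage breakdown above requires timing each phase of a training step separately, with a device synchronization before and after each phase so measurements are not skewed by asynchronous kernel launches. A minimal sketch follows; the function and variable names are illustrative, not the benchmark's actual code.

```python
# Sketch: time the forward / backward / optimizer phases of one training step.
# model, batch, and optimizer are passed in; names are illustrative.
import time

import torch

def timed(fn):
    """Run fn() with synchronization and return (result, milliseconds)."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn()
    torch.cuda.synchronize()
    return result, (time.perf_counter() - start) * 1000.0

def train_step_timed(model, batch, optimizer):
    outputs, fwd_ms = timed(lambda: model(**batch))        # forward pass + loss
    _, bwd_ms = timed(lambda: outputs.loss.backward())     # gradient computation
    _, opt_ms = timed(lambda: optimizer.step())            # weight update
    optimizer.zero_grad(set_to_none=True)
    return fwd_ms, bwd_ms, opt_ms
```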
Inference Benchmark
python benchmark_inference.py \
--model-path ./model_cache \
--model-name Qwen/Qwen3-4B \
--attn-implementation auto \
--num-requests 10 \
--prompt-length 512 \
--generation-length 100 \
--warmup-requests 2 \
--output-dir ./results
Metrics Reported:
- Prefill: TTFT, throughput, energy per token
- Decode: ITL, throughput, energy per token
- End-to-End: Request latency, total throughput, total energy
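One simple way to separate these stages is to time a one-token generation (prefill plus first token, i.e. TTFT) and a full generation (end-to-end latency), then derive the average decode latency from the difference. The sketch below illustrates that idea with transformers' generate; it is a simplification, not the benchmark's implementation.

```python
# Sketch: derive TTFT, ITL, and end-to-end latency for a single request.
# model and tokenizer are assumed to be loaded; names are illustrative.
import time

import torch

def time_request(model, tokenizer, prompt, gen_len=100):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1, do_sample=False)        # prefill + 1 token
    torch.cuda.synchronize()
    ttft_s = time.perf_counter() - start

    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=gen_len, do_sample=False)  # full request
    torch.cuda.synchronize()
    e2e_s = time.perf_counter() - start

    # Both runs pay the same prefill cost, so the difference is decode time
    itl_ms = 1000.0 * (e2e_s - ttft_s) / (gen_len - 1)
    return ttft_s * 1000.0, itl_ms, e2e_s * 1000.0
```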
Attention Implementations
The benchmark automatically selects an attention implementation based on the detected GPU:
- A100, MI300X: flash_attention_2
- H100, H200: flash_attention_3_hopper
Override with --attn-implementation:
# Force FlashAttention-3 Hopper on H100
python run_benchmark.py --attn-implementation flash_attention_3_hopper
# Use SDPA instead
python run_benchmark.py --attn-implementation sdpa
Available options:
- auto - Auto-detect based on GPU
- flash_attention_2 - FlashAttention-2 (all GPUs)
- flash_attention_3_hopper - FlashAttention-3 for H100/H200
- sdpa - PyTorch Scaled Dot Product Attention
- eager - Standard PyTorch attention
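For reference, Hugging Face transformers selects the attention backend at load time through the attn_implementation argument; flash_attention_2, sdpa, and eager are standard values there, while flash_attention_3_hopper is presumably resolved by this suite's own loading code. A hedged sketch of the standard path:

```python
# Sketch: pick the attention backend when loading the model.
# "flash_attention_2", "sdpa", and "eager" are standard transformers values;
# how this suite wires up "flash_attention_3_hopper" is not shown here.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./model_cache",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
```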
Running on SLURM
All SLURM scripts are configured to run inside the apptainer container. First cache the model on the head node:
# On head node (with internet access)
apptainer exec --nv pytorch_25.10_tilelang.sif python cache_model.py \
--model-name Qwen/Qwen3-4B \
--cache-dir ./model_cache
Then submit jobs:
# A100
sbatch slurm_a100.sh
# H100
sbatch slurm_h100.sh
# H200
sbatch slurm_h200.sh
# MI300X
sbatch slurm_mi300x.sh
Note:
- NVIDIA GPUs use the --nv flag
- AMD GPUs use the --rocm flag
Output
Results are saved to the --output-dir directory (default: ./results/):
- pretrain_<GPU>_<ATTENTION>.json - Pretraining metrics
- inference_<GPU>_<ATTENTION>.json - Inference metrics
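Because every run produces one JSON file per GPU/attention combination, cross-GPU comparisons are easy to script; the field name used in the sketch below is an assumption about the schema, not a documented key.

```python
# Sketch: collect pretraining results across runs. The "throughput_tokens_per_s"
# key is an assumed field name, not the suite's documented schema.
import json
from pathlib import Path

for path in sorted(Path("./results").glob("pretrain_*.json")):
    data = json.loads(path.read_text())
    print(path.name, data.get("throughput_tokens_per_s", "n/a"))
```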
Example output:
===============================================================================
PRETRAINING BENCHMARK RESULTS
===============================================================================
Model: Qwen/Qwen3-4B
GPU: NVIDIA A100 80GB
Attention: flash_attention_2
Batch Size: 8
Sequence Length: 8192
Training Steps: 10
-------------------------------------------------------------------------------
STAGE BREAKDOWN
-------------------------------------------------------------------------------
[1] FORWARD PASS
Duration: 1005.23 ms
Tokens: 163,840
Throughput: 163,012.45 tokens/s
Energy: 253.0 J
Energy per Token: 1.5443 mJ/token
[2] BACKWARD PASS
Duration: 2052.11 ms
Tokens: 163,840
Throughput: 79,857.23 tokens/s
Energy: 516.2 J
Energy per Token: 3.1513 mJ/token
[3] OPTIMIZER STEP
Duration: 153.42 ms
Tokens: 163,840
Throughput: 1,068,012.34 tokens/s
Energy: 38.4 J
Energy per Token: 0.2344 mJ/token
-------------------------------------------------------------------------------
OVERALL METRICS
-------------------------------------------------------------------------------
Total Duration: 3210.76 ms
Total Tokens: 163,840
Throughput: 51,012.45 tokens/s
Total Energy: 807.6 J
Energy per Token: 4.9300 mJ/token
===============================================================================
Key Metrics Reference
Pretraining
- Forward: Input processing and loss calculation
- Backward: Gradient computation
- Optimizer: Weight updates
Inference
- TTFT (Time to First Token): Prefill latency
- ITL (Inter-Token Latency): Average decode time per token
- E2E Latency: Total request time (prefill + decode)
Energy
- Energy (J): Total energy consumed
- Energy per Token (mJ/token): Energy efficiency metric
- Average Power (W): Power consumption during stage
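Energy per token is simply the stage energy divided by the tokens processed in that stage; the forward-pass figures from the sample output above reproduce the reported value (up to rounding):

```python
# Energy per token from the sample forward-pass figures above.
energy_j = 253.0     # forward-pass energy (J)
tokens = 163_840     # tokens processed in the stage
print(f"{energy_j / tokens * 1000.0:.4f} mJ/token")  # ~1.5442 mJ/token (table: 1.5443)
```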
Troubleshooting
Model Not Found
Ensure you've cached the model first:
python cache_model.py --model-name Qwen/Qwen3-4B --cache-dir ./model_cache
GPU Monitoring Errors
- NVIDIA: Install pynvml: pip install pynvml
- AMD: Install pyrsmi: pip install pyrsmi
FlashAttention-3 Not Found
For H100/H200, ensure FlashAttention-3 is installed. If not available, use:
python run_benchmark.py --attn-implementation flash_attention_2
Out of Memory
Reduce batch size or sequence length:
python run_benchmark.py --batch-size 4 --sequence-length 1024
Citation
If you use this benchmark suite, please cite:
- FlashAttention-2
- FlashAttention-3 (for Hopper)
- Qwen Models
License
MIT License - see LICENSE file for details