LLM Benchmark Suite
A comprehensive benchmarking suite for comparing LLM (Qwen3-4B) performance across different GPU architectures: MI300X, A100 80GB, H100, and H200.
Features
- Pretraining Benchmarks: Separate metrics for the forward, backward, and optimizer stages
- Inference Benchmarks: Separate metrics for the prefill (TTFT) and decode (ITL) stages
- Energy Monitoring: GPU-specific energy and power measurement (see the sketch after this list)
  - NVIDIA: pynvml
  - AMD: pyrsmi
- Attention Implementations:
  - FlashAttention-2 (A100, MI300X)
  - FlashAttention-3 Hopper (H100, H200)
  - Configurable via CLI
- Comprehensive Metrics:
  - Tokens per second
  - Energy per token
  - Time to First Token (TTFT)
  - Inter-Token Latency (ITL)
  - End-to-End Request Latency
  - GPU utilization and memory usage
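The energy figures are obtained by sampling GPU power while a stage runs and integrating over time. Below is a minimal sketch of that idea for the NVIDIA path; the sampling interval, threading layout, and function name are illustrative assumptions rather than the suite's actual utils/gpu_monitor.py, and on AMD the pyrsmi package exposes analogous power queries.

```python
# Minimal sketch (assumption, not the suite's gpu_monitor.py): sample GPU power
# with pynvml while a workload runs, then integrate the samples into energy (J).
import threading
import time

import pynvml

def run_with_energy(workload, device_index=0, interval_s=0.1):
    """Run workload() while sampling power; return (result, energy_joules)."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples = []
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            # nvmlDeviceGetPowerUsage reports milliwatts
            samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)
            time.sleep(interval_s)

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        result = workload()
    finally:
        stop.set()
        thread.join()
        pynvml.nvmlShutdown()

    # Riemann-sum approximation: each power sample covers one sampling interval
    energy_j = sum(samples) * interval_s
    return result, energy_j
```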
Directory Structure
llm-benchmark/
├── cache_model.py # Model caching script
├── benchmark_pretrain.py # Pretraining benchmark
├── benchmark_inference.py # Inference benchmark
├── run_benchmark.py # Main orchestration script
├── requirements.txt # Python dependencies
├── utils/
│ ├── gpu_monitor.py # GPU monitoring (NVIDIA & AMD)
│ ├── metrics.py # Metrics collection and reporting
│ └── attention.py # Attention implementation helpers
├── configs/
│ ├── a100.yaml
│ ├── h100.yaml
│ ├── h200.yaml
│ └── mi300x.yaml
└── results/ # Benchmark results (JSON)
Setup
1. Container Environment
All benchmarks should be run inside the apptainer container:
# Container is located at:
/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif
2. Install Dependencies (if not using apptainer)
If you want to run directly without apptainer:
# Install Python dependencies
pip install -r requirements.txt
# For AMD GPUs, ensure ROCm and pyrsmi are installed
# For NVIDIA GPUs, ensure CUDA and pynvml are installed
3. Cache Model (Run on Head Node)
IMPORTANT: Run this on the head node BEFORE allocating compute nodes, as compute nodes are typically offline.
# Using apptainer (recommended)
apptainer exec --nv pytorch_25.10_tilelang.sif python cache_model.py \
--model-name Qwen/Qwen3-4B \
--cache-dir ./model_cache
# Or directly (if dependencies installed)
python cache_model.py --model-name Qwen/Qwen3-4B --cache-dir ./model_cache
The model will be cached to ./model_cache in the current directory (avoiding slow NFS $HOME).
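For context, caching a model like this usually boils down to a Hugging Face snapshot download into a local directory; the following is a hedged sketch of that idea, not the actual contents of cache_model.py.

```python
# Hypothetical sketch (not the actual cache_model.py): download the model
# snapshot to local disk so offline compute nodes can load it later.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen3-4B",    # corresponds to --model-name
    local_dir="./model_cache",  # corresponds to --cache-dir (local disk, not NFS $HOME)
)
```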
Usage
Quick Start
# Run both pretraining and inference benchmarks
python run_benchmark.py --mode both --model-path ./model_cache
# Run only pretraining
python run_benchmark.py --mode pretrain --num-steps 20
# Run only inference
python run_benchmark.py --mode inference --num-requests 20
Detailed Usage
List Available GPUs
python run_benchmark.py --list-gpus
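GPU enumeration for a listing like this can be done with plain PyTorch calls, which also cover ROCm builds; the snippet below is a simplified sketch, not the suite's detection logic.

```python
# Sketch: enumerate visible GPUs by name (ROCm devices are also exposed
# through torch.cuda in AMD builds of PyTorch).
import torch

for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
```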
Pretraining Benchmark
python benchmark_pretrain.py \
--model-path ./model_cache \
--model-name Qwen/Qwen3-4B \
--attn-implementation auto \
--batch-size 8 \
--sequence-length 8192 \
--num-steps 10 \
--warmup-steps 3 \
--output-dir ./results
Metrics Reported (per stage: forward, backward, optimizer):
- Duration (ms)
- Tokens processed
- Throughput (tokens/s)
- Energy (J)
- Energy per token (J/token)
- Average power (W)
- Peak memory (GB)
- GPU utilization (%)
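The stage breakdown above requires timing each phase of a training step separately, with a device synchronization before and after each phase so measurements are not skewed by asynchronous kernel launches. A minimal sketch follows; the function and variable names are illustrative, not the benchmark's actual code.

```python
# Sketch: time the forward / backward / optimizer phases of one training step.
# model, batch, and optimizer are passed in; names are illustrative.
import time

import torch

def timed(fn):
    """Run fn() with synchronization and return (result, milliseconds)."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn()
    torch.cuda.synchronize()
    return result, (time.perf_counter() - start) * 1000.0

def train_step_timed(model, batch, optimizer):
    outputs, fwd_ms = timed(lambda: model(**batch))        # forward pass + loss
    _, bwd_ms = timed(lambda: outputs.loss.backward())     # gradient computation
    _, opt_ms = timed(lambda: optimizer.step())            # weight update
    optimizer.zero_grad(set_to_none=True)
    return fwd_ms, bwd_ms, opt_ms
```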
Inference Benchmark
python benchmark_inference.py \
--model-path ./model_cache \
--model-name Qwen/Qwen3-4B \
--attn-implementation auto \
--num-requests 10 \
--prompt-length 512 \
--generation-length 100 \
--warmup-requests 2 \
--output-dir ./results
Metrics Reported:
- Prefill: TTFT, throughput, energy per token
- Decode: ITL, throughput, energy per token
- End-to-End: Request latency, total throughput, total energy
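One simple way to separate these stages is to time a one-token generation (prefill plus first token, i.e. TTFT) and a full generation (end-to-end latency), then derive the average decode latency from the difference. The sketch below illustrates that idea with transformers' generate; it is a simplification, not the benchmark's implementation.

```python
# Sketch: derive TTFT, ITL, and end-to-end latency for a single request.
# model and tokenizer are assumed to be loaded; names are illustrative.
import time

import torch

def time_request(model, tokenizer, prompt, gen_len=100):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1, do_sample=False)        # prefill + 1 token
    torch.cuda.synchronize()
    ttft_s = time.perf_counter() - start

    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=gen_len, do_sample=False)  # full request
    torch.cuda.synchronize()
    e2e_s = time.perf_counter() - start

    # Both runs pay the same prefill cost, so the difference is decode time
    itl_ms = 1000.0 * (e2e_s - ttft_s) / (gen_len - 1)
    return ttft_s * 1000.0, itl_ms, e2e_s * 1000.0
```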
Attention Implementations
The benchmark automatically selects an attention implementation based on the detected GPU:
- A100, MI300X: flash_attention_2
- H100, H200: flash_attention_3_hopper
Override with --attn-implementation:
# Force FlashAttention-3 Hopper on H100
python run_benchmark.py --attn-implementation flash_attention_3_hopper
# Use SDPA instead
python run_benchmark.py --attn-implementation sdpa
Available options:
- auto - Auto-detect based on GPU
- flash_attention_2 - FlashAttention-2 (all GPUs)
- flash_attention_3_hopper - FlashAttention-3 for H100/H200
- sdpa - PyTorch Scaled Dot Product Attention
- eager - Standard PyTorch attention
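For reference, Hugging Face transformers selects the attention backend at load time through the attn_implementation argument; flash_attention_2, sdpa, and eager are standard values there, while flash_attention_3_hopper is presumably resolved by this suite's own loading code. A hedged sketch of the standard path:

```python
# Sketch: pick the attention backend when loading the model.
# "flash_attention_2", "sdpa", and "eager" are standard transformers values;
# how this suite wires up "flash_attention_3_hopper" is not shown here.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./model_cache",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
```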
Running on SLURM
All SLURM scripts are configured to run inside the apptainer container. First cache the model on the head node:
# On head node (with internet access)
apptainer exec --nv pytorch_25.10_tilelang.sif python cache_model.py \
--model-name Qwen/Qwen3-4B \
--cache-dir ./model_cache
Then submit jobs:
# A100
sbatch slurm_a100.sh
# H100
sbatch slurm_h100.sh
# H200
sbatch slurm_h200.sh
# MI300X
sbatch slurm_mi300x.sh
Note:
- NVIDIA GPUs use the --nv flag
- AMD GPUs use the --rocm flag
Output
Results are saved to the --output-dir directory (default: ./results/):
- pretrain_<GPU>_<ATTENTION>.json - Pretraining metrics
- inference_<GPU>_<ATTENTION>.json - Inference metrics
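Because every run produces one JSON file per GPU/attention combination, cross-GPU comparisons are easy to script; the field name used in the sketch below is an assumption about the schema, not a documented key.

```python
# Sketch: collect pretraining results across runs. The "throughput_tokens_per_s"
# key is an assumed field name, not the suite's documented schema.
import json
from pathlib import Path

for path in sorted(Path("./results").glob("pretrain_*.json")):
    data = json.loads(path.read_text())
    print(path.name, data.get("throughput_tokens_per_s", "n/a"))
```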
Example output:
===============================================================================
PRETRAINING BENCHMARK RESULTS
===============================================================================
Model: Qwen/Qwen3-4B
GPU: NVIDIA A100 80GB
Attention: flash_attention_2
Batch Size: 8
Sequence Length: 8192
Training Steps: 10
-------------------------------------------------------------------------------
STAGE BREAKDOWN
-------------------------------------------------------------------------------
[1] FORWARD PASS
Duration: 1005.23 ms
Tokens: 163,840
Throughput: 163,012.45 tokens/s
Energy: 253.0 J
Energy per Token: 1.5443 mJ/token
[2] BACKWARD PASS
Duration: 2052.11 ms
Tokens: 163,840
Throughput: 79,857.23 tokens/s
Energy: 516.2 J
Energy per Token: 3.1513 mJ/token
[3] OPTIMIZER STEP
Duration: 153.42 ms
Tokens: 163,840
Throughput: 1,068,012.34 tokens/s
Energy: 38.4 J
Energy per Token: 0.2344 mJ/token
-------------------------------------------------------------------------------
OVERALL METRICS
-------------------------------------------------------------------------------
Total Duration: 3210.76 ms
Total Tokens: 163,840
Throughput: 51,012.45 tokens/s
Total Energy: 807.6 J
Energy per Token: 4.9300 mJ/token
===============================================================================
Key Metrics Reference
Pretraining
- Forward: Input processing and loss calculation
- Backward: Gradient computation
- Optimizer: Weight updates
Inference
- TTFT (Time to First Token): Prefill latency
- ITL (Inter-Token Latency): Average decode time per token
- E2E Latency: Total request time (prefill + decode)
Energy
- Energy (J): Total energy consumed
- Energy per Token (mJ/token): Energy efficiency metric
- Average Power (W): Power consumption during stage
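Energy per token is simply the stage energy divided by the tokens processed in that stage; the forward-pass figures from the sample output above reproduce the reported value (up to rounding):

```python
# Energy per token from the sample forward-pass figures above.
energy_j = 253.0     # forward-pass energy (J)
tokens = 163_840     # tokens processed in the stage
print(f"{energy_j / tokens * 1000.0:.4f} mJ/token")  # ~1.5442 mJ/token (table: 1.5443)
```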
Troubleshooting
Model Not Found
Ensure you've cached the model first:
python cache_model.py --model-name Qwen/Qwen3-4B --cache-dir ./model_cache
GPU Monitoring Errors
- NVIDIA: Install pynvml: pip install pynvml
- AMD: Install pyrsmi: pip install pyrsmi
FlashAttention-3 Not Found
For H100/H200, ensure FlashAttention-3 is installed. If not available, use:
python run_benchmark.py --attn-implementation flash_attention_2
Out of Memory
Reduce batch size or sequence length:
python run_benchmark.py --batch-size 4 --sequence-length 1024
Citation
If you use this benchmark suite, please cite:
- FlashAttention-2
- FlashAttention-3 (for Hopper)
- Qwen Models
License
MIT License - see LICENSE file for details