LLM Benchmark Suite

A benchmarking suite for comparing the pretraining and inference performance of an LLM (Qwen3-4B) across different GPU architectures: MI300X, A100 80GB, H100, and H200.

Features

  • Pretraining Benchmarks: Separate metrics for forward, backward, and optimizer stages
  • Inference Benchmarks: Separate metrics for prefill (TTFT) and decode (ITL) stages
  • Energy Monitoring: GPU-specific energy and power measurement (see the sampling sketch after this list)
    • NVIDIA: pynvml
    • AMD: pyrsmi
  • Attention Implementations:
    • FlashAttention-2 (A100, MI300X)
    • FlashAttention-3 Hopper (H100, H200)
    • Configurable via CLI
  • Comprehensive Metrics:
    • Tokens per second
    • Energy per token
    • Time to First Token (TTFT)
    • Inter-Token Latency (ITL)
    • End-to-End Request Latency
    • GPU utilization and memory usage
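
The energy figures come from sampling GPU power while a stage runs and integrating it over time. A minimal sketch of the NVIDIA path, assuming a background-thread sampler (the suite's utils/gpu_monitor.py may instead use NVML's total-energy counter, and the AMD path uses pyrsmi rather than pynvml):

import threading
import time

import pynvml

class PowerSampler:
    """Polls GPU power draw in a background thread and integrates it to energy (J)."""

    def __init__(self, device_index=0, interval_s=0.05):
        pynvml.nvmlInit()
        self.handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        self.interval_s = interval_s
        self.energy_j = 0.0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        last = time.monotonic()
        while not self._stop.is_set():
            power_w = pynvml.nvmlDeviceGetPowerUsage(self.handle) / 1000.0  # mW -> W
            now = time.monotonic()
            self.energy_j += power_w * (now - last)  # integrate P * dt
            last = now
            time.sleep(self.interval_s)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

Used around a stage, energy_j / tokens then gives the energy-per-token figure reported by the benchmarks.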

Directory Structure

llm-benchmark/
├── cache_model.py           # Model caching script
├── benchmark_pretrain.py    # Pretraining benchmark
├── benchmark_inference.py   # Inference benchmark
├── run_benchmark.py         # Main orchestration script
├── requirements.txt         # Python dependencies
├── utils/
│   ├── gpu_monitor.py       # GPU monitoring (NVIDIA & AMD)
│   ├── metrics.py           # Metrics collection and reporting
│   └── attention.py         # Attention implementation helpers
├── configs/
│   ├── a100.yaml
│   ├── h100.yaml
│   ├── h200.yaml
│   └── mi300x.yaml
└── results/                 # Benchmark results (JSON)

Setup

1. Container Environment

All benchmarks should be run inside the apptainer container:

# Container is located at:
/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif
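
A quick, optional sanity check that the container can see the GPU (this snippet is illustrative and not part of the suite; run it via apptainer exec --nv on NVIDIA nodes or --rocm on the MI300X node):

import torch

print("torch", torch.__version__)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))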

2. Install Dependencies (if not using apptainer)

If you want to run directly without apptainer:

# Install Python dependencies
pip install -r requirements.txt

# For AMD GPUs, ensure ROCm and pyrsmi are installed
# For NVIDIA GPUs, ensure CUDA and pynvml are installed

3. Cache Model (Run on Head Node)

IMPORTANT: Run this on the head node BEFORE allocating compute nodes, as compute nodes typically have no internet access and cannot download the model.

# Using apptainer (recommended)
apptainer exec --nv pytorch_25.10_tilelang.sif python cache_model.py \
  --model-name Qwen/Qwen3-4B \
  --cache-dir ./model_cache

# Or directly (if dependencies installed)
python cache_model.py --model-name Qwen/Qwen3-4B --cache-dir ./model_cache

The model will be cached to ./model_cache in the current directory (avoiding slow NFS $HOME).
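
Under the hood the caching step just downloads the weights and tokenizer into the local directory. A rough sketch of what cache_model.py amounts to, assuming it goes through transformers (the actual script may use huggingface_hub.snapshot_download instead):

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen3-4B"
CACHE_DIR = "./model_cache"

# Downloading once on the head node populates CACHE_DIR so that offline
# compute nodes can later load the model from the same cache directory.
tokenizer = AutoTokenizer.from_pretrained(MODEL, cache_dir=CACHE_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL, cache_dir=CACHE_DIR)

On the compute nodes the benchmark scripts then point at the same cache; setting HF_HUB_OFFLINE=1 additionally guarantees that no network access is attempted.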

Usage

Quick Start

# Run both pretraining and inference benchmarks
python run_benchmark.py --mode both --model-path ./model_cache

# Run only pretraining
python run_benchmark.py --mode pretrain --num-steps 20

# Run only inference
python run_benchmark.py --mode inference --num-requests 20

Detailed Usage

List Available GPUs

python run_benchmark.py --list-gpus

Pretraining Benchmark

python benchmark_pretrain.py \
  --model-path ./model_cache \
  --model-name Qwen/Qwen3-4B \
  --attn-implementation auto \
  --batch-size 8 \
  --sequence-length 8192 \
  --num-steps 10 \
  --warmup-steps 3 \
  --output-dir ./results

Metrics Reported (per stage: forward, backward, optimizer):

  • Duration (ms)
  • Tokens processed
  • Throughput (tokens/s)
  • Energy (J)
  • Energy per token (J/token)
  • Average power (W)
  • Peak memory (GB)
  • GPU utilization (%)
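
The per-stage split comes from timing each phase of the training step separately, synchronizing the GPU at every boundary so CUDA's asynchronous execution does not blur the stages. A simplified sketch of the idea (illustrative only; the real benchmark also records energy and memory per stage):

import time

import torch

def timed(fn):
    """Run a GPU-side operation and return (result, duration in ms)."""
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn()
    torch.cuda.synchronize()
    return out, (time.perf_counter() - start) * 1000.0

def training_step(model, batch, optimizer):
    outputs, fwd_ms = timed(lambda: model(**batch))                       # forward pass + loss
    _, bwd_ms = timed(lambda: outputs.loss.backward())                    # backward pass
    _, opt_ms = timed(lambda: (optimizer.step(), optimizer.zero_grad()))  # optimizer step
    return fwd_ms, bwd_ms, opt_ms

Per step, tokens = batch_size * sequence_length, and each stage's throughput is that token count divided by its duration in seconds.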

Inference Benchmark

python benchmark_inference.py \
  --model-path ./model_cache \
  --model-name Qwen/Qwen3-4B \
  --attn-implementation auto \
  --num-requests 10 \
  --prompt-length 512 \
  --generation-length 100 \
  --warmup-requests 2 \
  --output-dir ./results

Metrics Reported:

  • Prefill: TTFT, throughput, energy per token
  • Decode: ITL, throughput, energy per token
  • End-to-End: Request latency, total throughput, total energy
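
TTFT and ITL are obtained by separating the prefill (prompt processing plus the first generated token) from the subsequent single-token decode steps. A simplified greedy-decoding sketch of the measurement (illustrative; the real benchmark may instrument the generation loop differently):

import time

import torch

@torch.no_grad()
def measure_request(model, input_ids, gen_len=100):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = model(input_ids, use_cache=True)            # prefill over the full prompt
    next_tok = out.logits[:, -1:].argmax(dim=-1)      # first generated token (greedy)
    torch.cuda.synchronize()
    ttft = time.perf_counter() - t0

    past = out.past_key_values
    t1 = time.perf_counter()
    for _ in range(gen_len - 1):                      # decode one token at a time
        out = model(next_tok, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_tok = out.logits[:, -1:].argmax(dim=-1)
    torch.cuda.synchronize()
    decode_time = time.perf_counter() - t1

    itl = decode_time / (gen_len - 1)                 # average inter-token latency
    return ttft, itl, ttft + decode_time              # TTFT, ITL, end-to-end latency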

Attention Implementations

The benchmark automatically selects the optimal attention implementation based on GPU:

  • A100, MI300X: flash_attention_2
  • H100, H200: flash_attention_3_hopper

Override with --attn-implementation:

# Force FlashAttention-3 Hopper on H100
python run_benchmark.py --attn-implementation flash_attention_3_hopper

# Use SDPA instead
python run_benchmark.py --attn-implementation sdpa

Available options:

  • auto - Auto-detect based on GPU
  • flash_attention_2 - FlashAttention-2 (all GPUs)
  • flash_attention_3_hopper - FlashAttention-3 for H100/H200
  • sdpa - PyTorch Scaled Dot Product Attention
  • eager - Standard PyTorch attention
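
In auto mode the choice boils down to inspecting the device name. A sketch of the kind of logic utils/attention.py might implement (the exact mapping and strings in the suite may differ):

import torch

def pick_attention_implementation():
    """Map the detected GPU to an attention backend, falling back to SDPA."""
    if not torch.cuda.is_available():
        return "eager"
    name = torch.cuda.get_device_name(0).upper()
    if "H100" in name or "H200" in name:
        return "flash_attention_3_hopper"
    if "A100" in name or "MI300X" in name:
        return "flash_attention_2"
    return "sdpa"

The selected string is then passed to the model loader. Stock transformers understands attn_implementation values such as "flash_attention_2", "sdpa", and "eager"; the FlashAttention-3 Hopper path presumably needs the suite's own wiring on top of that.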

Running on SLURM

All SLURM scripts are configured to run inside the apptainer container. First cache the model on the head node:

# On head node (with internet access)
apptainer exec --nv pytorch_25.10_tilelang.sif python cache_model.py \
  --model-name Qwen/Qwen3-4B \
  --cache-dir ./model_cache

Then submit jobs:

# A100
sbatch slurm_a100.sh

# H100
sbatch slurm_h100.sh

# H200
sbatch slurm_h200.sh

# MI300X
sbatch slurm_mi300x.sh

Note:

  • NVIDIA GPUs use apptainer's --nv flag
  • AMD GPUs use apptainer's --rocm flag

Output

Results are saved to the --output-dir directory (default: ./results/):

  • pretrain_<GPU>_<ATTENTION>.json - Pretraining metrics
  • inference_<GPU>_<ATTENTION>.json - Inference metrics
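
Each file is plain JSON, so cross-GPU comparisons can be scripted directly. A minimal example (the key names used here are assumptions; inspect one file or utils/metrics.py for the actual schema):

import json
from pathlib import Path

for path in sorted(Path("results").glob("pretrain_*.json")):
    data = json.loads(path.read_text())
    # "overall" and "tokens_per_second" are hypothetical key names.
    print(path.name, data.get("overall", {}).get("tokens_per_second"))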

Example output:

===============================================================================
PRETRAINING BENCHMARK RESULTS
===============================================================================

Model: Qwen/Qwen3-4B
GPU: NVIDIA A100 80GB
Attention: flash_attention_2
Batch Size: 8
Sequence Length: 8192
Training Steps: 10

-------------------------------------------------------------------------------
STAGE BREAKDOWN
-------------------------------------------------------------------------------

[1] FORWARD PASS
  Duration:              1005.23 ms
  Tokens:                 163,840
  Throughput:           163,012.45 tokens/s
  Energy:                    253.0 J
  Energy per Token:       1.5443 mJ/token

[2] BACKWARD PASS
  Duration:              2052.11 ms
  Tokens:                 163,840
  Throughput:            79,857.23 tokens/s
  Energy:                    516.2 J
  Energy per Token:       3.1513 mJ/token

[3] OPTIMIZER STEP
  Duration:                153.42 ms
  Tokens:                 163,840
  Throughput:         1,068,012.34 tokens/s
  Energy:                     38.4 J
  Energy per Token:       0.2344 mJ/token

-------------------------------------------------------------------------------
OVERALL METRICS
-------------------------------------------------------------------------------
  Total Duration:        3210.76 ms
  Total Tokens:          163,840
  Throughput:           51,012.45 tokens/s
  Total Energy:             807.6 J
  Energy per Token:       4.9300 mJ/token
===============================================================================

Key Metrics Reference

Pretraining

  • Forward: Input processing and loss calculation
  • Backward: Gradient computation
  • Optimizer: Weight updates

Inference

  • TTFT (Time to First Token): Prefill latency
  • ITL (Inter-Token Latency): Average decode time per token
  • E2E Latency: Total request time (prefill + decode)

Energy

  • Energy (J): Total energy consumed
  • Energy per Token (mJ/token): Energy efficiency metric
  • Average Power (W): Power consumption during stage
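
The derived metrics follow directly from the raw measurements. A worked example using the forward-pass numbers from the sample output above:

energy_j = 253.0        # joules consumed during the forward pass
duration_ms = 1005.23   # stage duration in milliseconds
tokens = 163_840        # tokens processed in the stage

energy_per_token_mj = energy_j / tokens * 1000.0       # ~1.544 mJ/token
average_power_w = energy_j / (duration_ms / 1000.0)     # ~252 W
throughput_tok_s = tokens / (duration_ms / 1000.0)      # ~163,000 tokens/s

# For inference, ITL is computed analogously:
# itl = (e2e_latency - ttft) / (generated_tokens - 1)

print(energy_per_token_mj, average_power_w, throughput_tok_s)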

Troubleshooting

Model Not Found

Ensure you've cached the model first:

python cache_model.py --model-name Qwen/Qwen3-4B --cache-dir ./model_cache

GPU Monitoring Errors

  • NVIDIA: Install pynvml: pip install pynvml
  • AMD: Install pyrsmi: pip install pyrsmi

FlashAttention-3 Not Found

For H100/H200, ensure FlashAttention-3 is installed. If it is not available, fall back to FlashAttention-2:

python run_benchmark.py --attn-implementation flash_attention_2
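
A quick way to check whether the FlashAttention-2 kernels are importable before submitting a job (assumption: FlashAttention-2 ships as the flash_attn package; the FlashAttention-3 Hopper build may expose a different module, so adjust the import for that case):

try:
    import flash_attn
    print("flash_attn", flash_attn.__version__, "is available")
except ImportError:
    print("FlashAttention not importable; use --attn-implementation sdpa instead")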

Out of Memory

Reduce batch size or sequence length:

python run_benchmark.py --batch-size 4 --sequence-length 1024

Citation

If you use this benchmark suite, please cite:

License

MIT License - see LICENSE file for details
