# LLM Benchmark Suite

A comprehensive benchmarking suite for comparing LLM performance (Qwen3-4B) across different GPU architectures: **MI300X**, **A100 80G**, **H100**, and **H200**.

## Features

- **Pretraining Benchmarks**: Separate metrics for forward, backward, and optimizer stages
- **Inference Benchmarks**: Separate metrics for prefill (TTFT) and decode (ITL) stages
- **Energy Monitoring**: GPU-specific energy and power measurement
  - NVIDIA: pynvml
  - AMD: pyrsmi
- **Attention Implementations**:
  - FlashAttention-2 (A100, MI300X)
  - FlashAttention-3 Hopper (H100, H200)
  - Configurable via CLI
- **Comprehensive Metrics**:
  - Tokens per second
  - Energy per token
  - Time to First Token (TTFT)
  - Inter-Token Latency (ITL)
  - End-to-End Request Latency
  - GPU utilization and memory usage

## Directory Structure

```
llm-benchmark/
├── cache_model.py           # Model caching script
├── benchmark_pretrain.py    # Pretraining benchmark
├── benchmark_inference.py   # Inference benchmark
├── run_benchmark.py         # Main orchestration script
├── requirements.txt         # Python dependencies
├── utils/
│   ├── gpu_monitor.py       # GPU monitoring (NVIDIA & AMD)
│   ├── metrics.py           # Metrics collection and reporting
│   └── attention.py         # Attention implementation helpers
├── configs/
│   ├── a100.yaml
│   ├── h100.yaml
│   ├── h200.yaml
│   └── mi300x.yaml
└── results/                 # Benchmark results (JSON)
```

## Setup

### 1. Container Environment

All benchmarks should be run inside the apptainer container:

```bash
# Container is located at: /anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif
```

### 2. Install Dependencies (if not using apptainer)

If you want to run directly without apptainer:

```bash
# Install Python dependencies
pip install -r requirements.txt

# For AMD GPUs, ensure ROCm and pyrsmi are installed
# For NVIDIA GPUs, ensure CUDA and pynvml are installed
```

### 3. Cache Model (Run on Head Node)

**IMPORTANT**: Run this on the head node BEFORE allocating compute nodes, as compute nodes are typically offline.

```bash
# Using apptainer (recommended)
apptainer exec --nv pytorch_25.10_tilelang.sif python cache_model.py \
    --model-name Qwen/Qwen3-4B \
    --cache-dir ./model_cache

# Or directly (if dependencies installed)
python cache_model.py --model-name Qwen/Qwen3-4B --cache-dir ./model_cache
```

The model will be cached to `./model_cache` in the current directory (avoiding the slow NFS `$HOME`).
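For reference, a minimal sketch of the kind of logic a caching script like `cache_model.py` might contain (an illustration only, not the repository's actual implementation; it assumes `huggingface_hub` is installed and that the script exposes the `--model-name`/`--cache-dir` flags shown above):

```python
# Hypothetical sketch of a model-caching script (the real cache_model.py may differ).
import argparse
from huggingface_hub import snapshot_download

def main() -> None:
    parser = argparse.ArgumentParser(description="Pre-download model files to a local cache")
    parser.add_argument("--model-name", default="Qwen/Qwen3-4B")
    parser.add_argument("--cache-dir", default="./model_cache")
    args = parser.parse_args()

    # Downloads config, tokenizer files, and weight shards into cache_dir so that
    # offline compute nodes can later load the model without network access.
    path = snapshot_download(repo_id=args.model_name, cache_dir=args.cache_dir)
    print(f"Model snapshot cached at: {path}")

if __name__ == "__main__":
    main()
```

Once cached, compute nodes can load the model from `./model_cache` without network access (e.g. with `HF_HUB_OFFLINE=1` set).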
## Usage

### Quick Start

```bash
# Run both pretraining and inference benchmarks
python run_benchmark.py --mode both --model-path ./model_cache

# Run only pretraining
python run_benchmark.py --mode pretrain --num-steps 20

# Run only inference
python run_benchmark.py --mode inference --num-requests 20
```

### Detailed Usage

#### List Available GPUs

```bash
python run_benchmark.py --list-gpus
```

#### Pretraining Benchmark

```bash
python benchmark_pretrain.py \
    --model-path ./model_cache \
    --model-name Qwen/Qwen3-4B \
    --attn-implementation auto \
    --batch-size 8 \
    --sequence-length 8192 \
    --num-steps 10 \
    --warmup-steps 3 \
    --output-dir ./results
```

**Metrics Reported** (per stage: forward, backward, optimizer):

- Duration (ms)
- Tokens processed
- Throughput (tokens/s)
- Energy (J)
- Energy per token (J/token)
- Average power (W)
- Peak memory (GB)
- GPU utilization (%)

#### Inference Benchmark

```bash
python benchmark_inference.py \
    --model-path ./model_cache \
    --model-name Qwen/Qwen3-4B \
    --attn-implementation auto \
    --num-requests 10 \
    --prompt-length 512 \
    --generation-length 100 \
    --warmup-requests 2 \
    --output-dir ./results
```

**Metrics Reported**:

- **Prefill**: TTFT, throughput, energy per token
- **Decode**: ITL, throughput, energy per token
- **End-to-End**: Request latency, total throughput, total energy

### Attention Implementations

The benchmark automatically selects the optimal attention implementation based on the detected GPU:

- **A100, MI300X**: `flash_attention_2`
- **H100, H200**: `flash_attention_3_hopper`

Override with `--attn-implementation`:

```bash
# Force FlashAttention-3 Hopper on H100
python run_benchmark.py --attn-implementation flash_attention_3_hopper

# Use SDPA instead
python run_benchmark.py --attn-implementation sdpa
```

Available options:

- `auto` - Auto-detect based on GPU
- `flash_attention_2` - FlashAttention-2 (all GPUs)
- `flash_attention_3_hopper` - FlashAttention-3 for H100/H200
- `sdpa` - PyTorch Scaled Dot Product Attention
- `eager` - Standard PyTorch attention

## Running on SLURM

All SLURM scripts are configured to run inside the apptainer container.
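The `slurm_*.sh` scripts shipped with the repository follow this pattern; a minimal sketch of what such a script might look like is shown below (partition, account, and job name are placeholders; adapt them to your cluster):

```bash
#!/bin/bash
#SBATCH --job-name=llm-bench-h100
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
# Placeholder partition/account -- adjust for your cluster
#SBATCH --partition=h100
#SBATCH --account=ihpc125h

CONTAINER=/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif

# --nv for NVIDIA GPUs; use --rocm instead on MI300X nodes
apptainer exec --nv "$CONTAINER" \
    python run_benchmark.py \
        --mode both \
        --model-path ./model_cache \
        --output-dir ./results
```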
First cache the model on the head node:

```bash
# On head node (with internet access)
apptainer exec --nv pytorch_25.10_tilelang.sif python cache_model.py \
    --model-name Qwen/Qwen3-4B \
    --cache-dir ./model_cache
```

Then submit jobs:

```bash
# A100
sbatch slurm_a100.sh

# H100
sbatch slurm_h100.sh

# H200
sbatch slurm_h200.sh

# MI300X
sbatch slurm_mi300x.sh
```

**Note**:

- NVIDIA GPUs use the `--nv` flag
- AMD GPUs use the `--rocm` flag

## Output

Results are saved to the `--output-dir` directory (default: `./results/`):

- `pretrain_<gpu>_<timestamp>.json` - Pretraining metrics
- `inference_<gpu>_<timestamp>.json` - Inference metrics

Example output:

```
===============================================================================
PRETRAINING BENCHMARK RESULTS
===============================================================================
Model: Qwen/Qwen3-4B
GPU: NVIDIA A100 80GB
Attention: flash_attention_2
Batch Size: 8
Sequence Length: 8192
Training Steps: 10
-------------------------------------------------------------------------------
STAGE BREAKDOWN
-------------------------------------------------------------------------------
[1] FORWARD PASS
    Duration:          1005.23 ms
    Tokens:            163,840
    Throughput:        163,012.45 tokens/s
    Energy:            253.0 J
    Energy per Token:  1.5443 mJ/token

[2] BACKWARD PASS
    Duration:          2052.11 ms
    Tokens:            163,840
    Throughput:        79,857.23 tokens/s
    Energy:            516.2 J
    Energy per Token:  3.1513 mJ/token

[3] OPTIMIZER STEP
    Duration:          153.42 ms
    Tokens:            163,840
    Throughput:        1,068,012.34 tokens/s
    Energy:            38.4 J
    Energy per Token:  0.2344 mJ/token
-------------------------------------------------------------------------------
OVERALL METRICS
-------------------------------------------------------------------------------
Total Duration:    3210.76 ms
Total Tokens:      163,840
Throughput:        51,012.45 tokens/s
Total Energy:      807.6 J
Energy per Token:  4.9300 mJ/token
===============================================================================
```

## Key Metrics Reference

### Pretraining

- **Forward**: Input processing and loss calculation
- **Backward**: Gradient computation
- **Optimizer**: Weight updates

### Inference

- **TTFT (Time to First Token)**: Prefill latency
- **ITL (Inter-Token Latency)**: Average decode time per token
- **E2E Latency**: Total request time (prefill + decode)

### Energy

- **Energy (J)**: Total energy consumed
- **Energy per Token (mJ/token)**: Energy efficiency metric
- **Average Power (W)**: Power consumption during the stage

## Troubleshooting

### Model Not Found

Ensure you've cached the model first:

```bash
python cache_model.py --model-name Qwen/Qwen3-4B --cache-dir ./model_cache
```

### GPU Monitoring Errors

- **NVIDIA**: Install pynvml: `pip install pynvml`
- **AMD**: Install pyrsmi: `pip install pyrsmi`

### FlashAttention-3 Not Found

For H100/H200, ensure FlashAttention-3 is installed. If it is not available, fall back to FlashAttention-2:

```bash
python run_benchmark.py --attn-implementation flash_attention_2
```

### Out of Memory

Reduce the batch size or sequence length:

```bash
python run_benchmark.py --batch-size 4 --sequence-length 1024
```

## Citation

If you use this benchmark suite, please cite:

- [FlashAttention-2](https://github.com/Dao-AILab/flash-attention)
- [FlashAttention-3](https://github.com/Dao-AILab/flash-attention) (for Hopper)
- [Qwen Models](https://huggingface.co/Qwen)

## License

MIT License - see LICENSE file for details.