# LLM Benchmark Suite

A comprehensive benchmarking suite for comparing LLM performance (Qwen3-4B) across different GPU architectures: **MI300X**, **A100 80G**, **H100**, and **H200**.

## Features

- **Pretraining Benchmarks**: Separate metrics for forward, backward, and optimizer stages
- **Inference Benchmarks**: Separate metrics for prefill (TTFT) and decode (ITL) stages
- **Energy Monitoring**: GPU-specific energy and power measurement (see the sketch after this list)
  - NVIDIA: pynvml
  - AMD: pyrsmi
- **Attention Implementations**:
  - FlashAttention-2 (A100, MI300X)
  - FlashAttention-3 Hopper (H100, H200)
  - Configurable via CLI
- **Comprehensive Metrics**:
  - Tokens per second
  - Energy per token
  - Time to First Token (TTFT)
  - Inter-Token Latency (ITL)
  - End-to-End Request Latency
  - GPU utilization and memory usage

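The energy and power numbers are read from vendor telemetry. As a rough sketch (not the actual `utils/gpu_monitor.py` code), energy on NVIDIA GPUs can be sampled with pynvml as below; the AMD path uses pyrsmi analogously:

```python
# Minimal sketch (assumption): sampling NVIDIA energy/power with pynvml.
# The suite's real monitoring lives in utils/gpu_monitor.py.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Cumulative energy since the driver loaded, in millijoules (supported on recent GPUs).
start_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)

# ... run one benchmark stage here ...

end_mj = pynvml.nvmlDeviceGetTotalEnergyConsumption(handle)
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # milliwatts -> watts

print(f"stage energy: {(end_mj - start_mj) / 1000.0:.1f} J, power: {power_w:.0f} W")
pynvml.nvmlShutdown()
```

Energy for a stage is the difference between two counter readings; energy per token divides that by the tokens processed in the stage.
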
## Directory Structure

```
llm-benchmark/
├── cache_model.py           # Model caching script
├── benchmark_pretrain.py    # Pretraining benchmark
├── benchmark_inference.py   # Inference benchmark
├── run_benchmark.py         # Main orchestration script
├── requirements.txt         # Python dependencies
├── utils/
│   ├── gpu_monitor.py       # GPU monitoring (NVIDIA & AMD)
│   ├── metrics.py           # Metrics collection and reporting
│   └── attention.py         # Attention implementation helpers
├── configs/
│   ├── a100.yaml
│   ├── h100.yaml
│   ├── h200.yaml
│   └── mi300x.yaml
└── results/                 # Benchmark results (JSON)
```

## Setup
### 1. Container Environment

All benchmarks should be run inside the apptainer container:

```bash
# Container is located at:
/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif
```

### 2. Install Dependencies (if not using apptainer)

If you want to run directly without apptainer:

```bash
# Install Python dependencies
pip install -r requirements.txt

# For AMD GPUs, ensure ROCm and pyrsmi are installed
# For NVIDIA GPUs, ensure CUDA and pynvml are installed
```

### 3. Cache Model (Run on Head Node)

**IMPORTANT**: Run this on the head node BEFORE allocating compute nodes, as compute nodes typically have no internet access.

```bash
# Using apptainer (recommended)
apptainer exec --nv pytorch_25.10_tilelang.sif python cache_model.py \
    --model-name Qwen/Qwen3-4B \
    --cache-dir ./model_cache

# Or directly (if dependencies installed)
python cache_model.py --model-name Qwen/Qwen3-4B --cache-dir ./model_cache
```

The model is cached to `./model_cache` in the current directory, avoiding the slow NFS `$HOME`.

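As a point of reference, caching boils down to downloading the tokenizer and weights into a local directory once, so that offline compute nodes can load them later. The sketch below uses standard `transformers` calls and is only an illustration; `cache_model.py` may differ in details:

```python
# Minimal sketch (assumption): pre-download tokenizer and weights into ./model_cache.
# cache_model.py may differ; these are standard transformers calls.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"
cache_dir = "./model_cache"

AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir)
```
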
## Usage
### Quick Start

```bash
# Run both pretraining and inference benchmarks
python run_benchmark.py --mode both --model-path ./model_cache

# Run only pretraining
python run_benchmark.py --mode pretrain --num-steps 20

# Run only inference
python run_benchmark.py --mode inference --num-requests 20
```

### Detailed Usage
#### List Available GPUs

```bash
python run_benchmark.py --list-gpus
```

#### Pretraining Benchmark

```bash
python benchmark_pretrain.py \
    --model-path ./model_cache \
    --model-name Qwen/Qwen3-4B \
    --attn-implementation auto \
    --batch-size 8 \
    --sequence-length 8192 \
    --num-steps 10 \
    --warmup-steps 3 \
    --output-dir ./results
```

**Metrics Reported** (per stage: forward, backward, optimizer; see the timing sketch after this list):

- Duration (ms)
- Tokens processed
- Throughput (tokens/s)
- Energy (J)
- Energy per token (mJ/token)
- Average power (W)
- Peak memory (GB)
- GPU utilization (%)

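The per-stage numbers come from timing the forward, backward, and optimizer phases separately. The sketch below shows one common way to do that with CUDA events; it illustrates the general pattern, not the exact code in `benchmark_pretrain.py`:

```python
# Minimal sketch (assumption): timing each training phase separately with CUDA events.
import torch

def timed(fn):
    """Run fn and return (result, elapsed milliseconds on the GPU)."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()
    return result, start.elapsed_time(end)

# Inside one training step (model, batch, optimizer defined elsewhere):
#   loss, fwd_ms = timed(lambda: model(**batch).loss)
#   _, bwd_ms    = timed(lambda: loss.backward())
#   _, opt_ms    = timed(lambda: (optimizer.step(), optimizer.zero_grad()))
```
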
#### Inference Benchmark

```bash
python benchmark_inference.py \
    --model-path ./model_cache \
    --model-name Qwen/Qwen3-4B \
    --attn-implementation auto \
    --num-requests 10 \
    --prompt-length 512 \
    --generation-length 100 \
    --warmup-requests 2 \
    --output-dir ./results
```

**Metrics Reported** (see the latency sketch after this list):

- **Prefill**: TTFT, throughput, energy per token
- **Decode**: ITL, throughput, energy per token
- **End-to-End**: Request latency, total throughput, total energy

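TTFT and ITL follow their usual definitions: TTFT is the time from the start of a request to the first generated token, and ITL is the average gap between consecutive generated tokens. A minimal sketch of computing both from per-token timestamps (illustrative only, not the exact logic in `benchmark_inference.py`):

```python
# Minimal sketch (assumption): deriving TTFT, ITL, and E2E latency from token timestamps.

def latency_metrics(token_times, request_start):
    """token_times: wall-clock times (s) at which each generated token became available."""
    ttft = token_times[0] - request_start                 # prefill latency
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0          # mean decode gap
    e2e = token_times[-1] - request_start                 # total request latency
    return ttft, itl, e2e

# Example: first token after 120 ms, then roughly 30 ms per token.
print(latency_metrics([0.12, 0.15, 0.18, 0.21], 0.0))     # ttft=0.12, itl~0.03, e2e=0.21
```
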
### Attention Implementations

The benchmark automatically selects an attention implementation appropriate for the detected GPU:

- **A100, MI300X**: `flash_attention_2`
- **H100, H200**: `flash_attention_3_hopper`

Override with `--attn-implementation`:

```bash
# Force FlashAttention-3 Hopper on H100
python run_benchmark.py --attn-implementation flash_attention_3_hopper

# Use SDPA instead
python run_benchmark.py --attn-implementation sdpa
```

Available options:

- `auto` - Auto-detect based on GPU
- `flash_attention_2` - FlashAttention-2 (all GPUs)
- `flash_attention_3_hopper` - FlashAttention-3 for H100/H200
- `sdpa` - PyTorch scaled dot-product attention
- `eager` - Standard PyTorch attention

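For the stock options, the choice ultimately reaches the model through the `attn_implementation` argument when the model is loaded with `transformers`; `flash_attention_3_hopper` is presumably mapped by this suite's `utils/attention.py` helpers rather than being a standard `transformers` value. A minimal loading sketch (illustrative, with an assumed bfloat16 dtype):

```python
# Minimal sketch (assumption): passing an attention backend to transformers at load time.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B",
    cache_dir="./model_cache",
    torch_dtype=torch.bfloat16,               # assumed dtype, not specified by this README
    attn_implementation="flash_attention_2",  # or "sdpa", "eager"
)
```
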
## Running on SLURM

All SLURM scripts are configured to run inside the apptainer container. First cache the model on the head node:

```bash
# On head node (with internet access)
apptainer exec --nv pytorch_25.10_tilelang.sif python cache_model.py \
    --model-name Qwen/Qwen3-4B \
    --cache-dir ./model_cache
```

Then submit jobs:

```bash
# A100
sbatch slurm_a100.sh

# H100
sbatch slurm_h100.sh

# H200
sbatch slurm_h200.sh

# MI300X
sbatch slurm_mi300x.sh
```

**Note**:

- NVIDIA GPUs use the `--nv` flag
- AMD GPUs use the `--rocm` flag

## Output

Results are saved to the `--output-dir` directory (default: `./results/`):

- `pretrain_<GPU>_<ATTENTION>.json` - Pretraining metrics
- `inference_<GPU>_<ATTENTION>.json` - Inference metrics

Example output:

```
===============================================================================
PRETRAINING BENCHMARK RESULTS
===============================================================================

Model: Qwen/Qwen3-4B
GPU: NVIDIA A100 80GB
Attention: flash_attention_2
Batch Size: 8
Sequence Length: 8192
Training Steps: 10

-------------------------------------------------------------------------------
STAGE BREAKDOWN
-------------------------------------------------------------------------------

[1] FORWARD PASS
    Duration: 1005.23 ms
    Tokens: 163,840
    Throughput: 163,012.45 tokens/s
    Energy: 253.0 J
    Energy per Token: 1.5443 mJ/token

[2] BACKWARD PASS
    Duration: 2052.11 ms
    Tokens: 163,840
    Throughput: 79,857.23 tokens/s
    Energy: 516.2 J
    Energy per Token: 3.1513 mJ/token

[3] OPTIMIZER STEP
    Duration: 153.42 ms
    Tokens: 163,840
    Throughput: 1,068,012.34 tokens/s
    Energy: 38.4 J
    Energy per Token: 0.2344 mJ/token

-------------------------------------------------------------------------------
OVERALL METRICS
-------------------------------------------------------------------------------
Total Duration: 3210.76 ms
Total Tokens: 163,840
Throughput: 51,012.45 tokens/s
Total Energy: 807.6 J
Energy per Token: 4.9300 mJ/token
===============================================================================
```

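Since every run also writes its metrics as JSON, results from different GPUs can be compared with a few lines of Python. The sketch below only illustrates the idea; the JSON field names are not documented here, so the key in the example is hypothetical:

```python
# Minimal sketch (assumption): gathering result files for a cross-GPU comparison.
# The file-name pattern is documented above; the JSON key below is hypothetical.
import glob
import json

for path in sorted(glob.glob("results/pretrain_*.json")):
    with open(path) as f:
        result = json.load(f)
    print(path, result.get("energy_per_token_mj", "n/a"))
```
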
## Key Metrics Reference
### Pretraining

- **Forward**: Input processing and loss calculation
- **Backward**: Gradient computation
- **Optimizer**: Weight updates

### Inference

- **TTFT (Time to First Token)**: Prefill latency
- **ITL (Inter-Token Latency)**: Average decode time per token
- **E2E Latency**: Total request time (prefill + decode)

### Energy

- **Energy (J)**: Total energy consumed during the stage or request
- **Energy per Token (mJ/token)**: Energy efficiency metric: total energy divided by tokens processed (in the sample output above, 807.6 J over 163,840 tokens gives roughly 4.93 mJ/token)
- **Average Power (W)**: Power consumption during the stage

## Troubleshooting
### Model Not Found

Ensure you've cached the model first:

```bash
python cache_model.py --model-name Qwen/Qwen3-4B --cache-dir ./model_cache
```

### GPU Monitoring Errors

- **NVIDIA**: Install pynvml: `pip install pynvml`
- **AMD**: Install pyrsmi: `pip install pyrsmi`

### FlashAttention-3 Not Found

For H100/H200, ensure FlashAttention-3 is installed. If it is not available, use:

```bash
python run_benchmark.py --attn-implementation flash_attention_2
```

### Out of Memory

Reduce the batch size or sequence length:

```bash
python run_benchmark.py --batch-size 4 --sequence-length 1024
```

## Citation

If you use this benchmark suite, please cite:

- [FlashAttention-2](https://github.com/Dao-AILab/flash-attention)
- [FlashAttention-3](https://github.com/Dao-AILab/flash-attention) (for Hopper)
- [Qwen Models](https://huggingface.co/Qwen)

## License

MIT License - see LICENSE file for details