From c4acf721e7317e6227e68c4c9e59bcdb2daac9c1 Mon Sep 17 00:00:00 2001
From: Bole Ma
Date: Thu, 5 Feb 2026 23:20:23 +0100
Subject: [PATCH] Ignore markdown files

---
 AMD_FIX_SUMMARY.md | 100 ---------------
 README.md          | 311 ---------------------------------------------
 2 files changed, 411 deletions(-)
 delete mode 100644 AMD_FIX_SUMMARY.md
 delete mode 100644 README.md

diff --git a/AMD_FIX_SUMMARY.md b/AMD_FIX_SUMMARY.md
deleted file mode 100644
index b5fbca1..0000000
--- a/AMD_FIX_SUMMARY.md
+++ /dev/null
@@ -1,100 +0,0 @@
-# AMD GPU Monitoring Fix Summary
-
-## Issue
-The AMDMonitor class was using incorrect pyrsmi API calls. The implementation attempted to use the low-level `rocmsmi` module, which has complex initialization and function signatures.
-
-## Solution
-Updated to use the correct `rocml` high-level API from pyrsmi, based on the official example at:
-`/anvme/workspace/ihpc125h-llm-profiles/pyrsmi/examples/llm_monitoring/monitor_llm_inference.py`
-
-## Changes Made
-
-### 1. Fixed AMDMonitor Class
-
-**Before** (incorrect):
-```python
-from pyrsmi import rocmsmi
-ret = self.rocmsmi.rsmi_init(0)
-power_uw = self.rocmsmi.rsmi_dev_power_ave_get(self.device_id)
-```
-
-**After** (correct):
-```python
-from pyrsmi import rocml
-self.rocml.smi_initialize()
-power_watts = self.rocml.smi_get_device_average_power(self.device_id)
-```
-
-**Key API Functions**:
-- `rocml.smi_initialize()` - Initialize monitoring
-- `rocml.smi_get_device_average_power(device_id)` - Get power in Watts (not microwatts!)
-- `rocml.smi_get_device_utilization(device_id)` - Get GPU utilization %
-- `rocml.smi_get_device_memory_used(device_id)` - Get memory used in bytes
-- `rocml.smi_get_device_memory_total(device_id)` - Get total memory in bytes
-- `rocml.smi_get_device_temperature(device_id)` - Get temperature
-- `rocml.smi_get_device_name(device_id)` - Get device name
-- `rocml.smi_shutdown()` - Cleanup
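-
-For reference, a minimal monitor built on these calls could look like the sketch below (illustrative only; the actual `AMDMonitor` in `utils/gpu_monitor.py` may differ in structure and error handling):
-
-```python
-# Illustrative sketch only, not the real AMDMonitor implementation.
-from pyrsmi import rocml
-
-class MinimalAMDMonitor:
-    def __init__(self, device_id=0):
-        self.device_id = device_id
-        rocml.smi_initialize()
-
-    def read(self):
-        # rocml reports power directly in Watts and memory in bytes
-        return {
-            "name": rocml.smi_get_device_name(self.device_id),
-            "power_watts": rocml.smi_get_device_average_power(self.device_id),
-            "gpu_utilization_percent": rocml.smi_get_device_utilization(self.device_id),
-            "memory_used_gb": rocml.smi_get_device_memory_used(self.device_id) / 1e9,
-            "memory_total_gb": rocml.smi_get_device_memory_total(self.device_id) / 1e9,
-        }
-
-    def cleanup(self):
-        rocml.smi_shutdown()
-```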
-
-### 2. Updated All SLURM Scripts for Apptainer
-
-All GPU benchmark scripts now run inside the apptainer container:
-
-**A100, H100, H200** (NVIDIA):
-```bash
-APPTAINER_IMAGE="/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif"
-apptainer exec --nv $APPTAINER_IMAGE python run_benchmark.py ...
-```
-
-**MI300X** (AMD):
-```bash
-APPTAINER_IMAGE="/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif"
-apptainer exec --rocm $APPTAINER_IMAGE python run_benchmark.py ...
-```
-
-Note: `--nv` for NVIDIA, `--rocm` for AMD
-
-### 3. Updated Documentation
-
-- README.md now mentions apptainer usage
-- Updated setup instructions to use apptainer for model caching
-- Added notes about container flags (--nv vs --rocm)
-
-## Testing
-
-To verify the AMD monitoring works:
-
-```bash
-# Inside apptainer on MI300X node
-apptainer exec --rocm pytorch_25.10_tilelang.sif python -c "
-from utils.gpu_monitor import AMDMonitor
-m = AMDMonitor(0)
-print(f'GPU: {m.get_device_name()}')
-metrics = m.get_metrics()
-print(f'Power: {metrics.power_watts:.2f} W')
-print(f'Utilization: {metrics.gpu_utilization_percent:.1f}%')
-print(f'Memory: {metrics.memory_used_gb:.2f} / {metrics.memory_total_gb:.2f} GB')
-m.cleanup()
-"
-```
-
-## Files Modified
-
-1. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/utils/gpu_monitor.py` - Fixed AMDMonitor class
-2. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_a100.sh` - Added apptainer
-3. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_h100.sh` - Added apptainer
-4. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_h200.sh` - Added apptainer
-5. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_mi300x.sh` - Added apptainer with --rocm
-6. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/README.md` - Updated documentation
-
-## Key Differences: rocml vs rocmsmi
-
-| Feature | rocml (High-level) | rocmsmi (Low-level) |
-|---------|-------------------|---------------------|
-| API Style | Simple functions | Complex C-style API |
-| Initialization | `smi_initialize()` | `rsmi_init(0)` + error codes |
-| Power | Returns Watts | Returns microwatts |
-| Memory | Returns bytes | Returns bytes via enums |
-| Error Handling | Returns -1 on error | Returns error codes |
-| Ease of Use | Much easier | Complex |
-
-The `rocml` module is the recommended high-level Python API for pyrsmi.
diff --git a/README.md b/README.md
deleted file mode 100644
index a98c03c..0000000
--- a/README.md
+++ /dev/null
@@ -1,311 +0,0 @@
-# LLM Benchmark Suite
-
-A comprehensive benchmarking suite for comparing LLM performance (Qwen3-4B) across different GPU architectures: **MI300X**, **A100 80G**, **H100**, and **H200**.
-
-## Features
-
-- **Pretraining Benchmarks**: Separate metrics for forward, backward, and optimizer stages
-- **Inference Benchmarks**: Separate metrics for prefill (TTFT) and decode (ITL) stages
-- **Energy Monitoring**: GPU-specific energy and power measurement
-  - NVIDIA: pynvml
-  - AMD: pyrsmi
-- **Attention Implementations**:
-  - FlashAttention-2 (A100, MI300X)
-  - FlashAttention-3 Hopper (H100, H200)
-  - Configurable via CLI
-- **Comprehensive Metrics**:
-  - Tokens per second
-  - Energy per token
-  - Time to First Token (TTFT)
-  - Inter-Token Latency (ITL)
-  - End-to-End Request Latency
-  - GPU utilization and memory usage
-
-## Directory Structure
-
-```
-llm-benchmark/
-├── cache_model.py           # Model caching script
-├── benchmark_pretrain.py    # Pretraining benchmark
-├── benchmark_inference.py   # Inference benchmark
-├── run_benchmark.py         # Main orchestration script
-├── requirements.txt         # Python dependencies
-├── utils/
-│   ├── gpu_monitor.py       # GPU monitoring (NVIDIA & AMD)
-│   ├── metrics.py           # Metrics collection and reporting
-│   └── attention.py         # Attention implementation helpers
-├── configs/
-│   ├── a100.yaml
-│   ├── h100.yaml
-│   ├── h200.yaml
-│   └── mi300x.yaml
-└── results/                 # Benchmark results (JSON)
-```
-
-## Setup
-
-### 1. Container Environment
-
-All benchmarks should be run inside the apptainer container:
-
-```bash
-# Container is located at:
-/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif
-```
-
-### 2. Install Dependencies (if not using apptainer)
-
-If you want to run directly without apptainer:
-
-```bash
-# Install Python dependencies
-pip install -r requirements.txt

-# For AMD GPUs, ensure ROCm and pyrsmi are installed
-# For NVIDIA GPUs, ensure CUDA and pynvml are installed
-```
-
-### 3. Cache Model (Run on Head Node)
-
-**IMPORTANT**: Run this on the head node BEFORE allocating compute nodes, as compute nodes are typically offline.
-
-```bash
-# Using apptainer (recommended)
-apptainer exec --nv pytorch_25.10_tilelang.sif python cache_model.py \
-    --model-name Qwen/Qwen3-4B \
-    --cache-dir ./model_cache
-
-# Or directly (if dependencies installed)
-python cache_model.py --model-name Qwen/Qwen3-4B --cache-dir ./model_cache
-```
-
-The model will be cached to `./model_cache` in the current directory (avoiding slow NFS $HOME).
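-
-For reference, the caching step boils down to a few lines of Python; a minimal sketch of what a script like `cache_model.py` could do (assuming the standard `transformers` download API; the actual script may differ):
-
-```python
-# Minimal caching sketch; run once on the head node with internet access.
-import argparse
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-parser = argparse.ArgumentParser()
-parser.add_argument("--model-name", default="Qwen/Qwen3-4B")
-parser.add_argument("--cache-dir", default="./model_cache")
-args = parser.parse_args()
-
-# Downloading once populates the local cache for the offline compute nodes
-AutoTokenizer.from_pretrained(args.model_name, cache_dir=args.cache_dir)
-AutoModelForCausalLM.from_pretrained(args.model_name, cache_dir=args.cache_dir)
-```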
-
-## Usage
-
-### Quick Start
-
-```bash
-# Run both pretraining and inference benchmarks
-python run_benchmark.py --mode both --model-path ./model_cache
-
-# Run only pretraining
-python run_benchmark.py --mode pretrain --num-steps 20
-
-# Run only inference
-python run_benchmark.py --mode inference --num-requests 20
-```
-
-### Detailed Usage
-
-#### List Available GPUs
-
-```bash
-python run_benchmark.py --list-gpus
-```
-
-#### Pretraining Benchmark
-
-```bash
-python benchmark_pretrain.py \
-    --model-path ./model_cache \
-    --model-name Qwen/Qwen3-4B \
-    --attn-implementation auto \
-    --batch-size 8 \
-    --sequence-length 8192 \
-    --num-steps 10 \
-    --warmup-steps 3 \
-    --output-dir ./results
-```
-
-**Metrics Reported** (per stage: forward, backward, optimizer):
-- Duration (ms)
-- Tokens processed
-- Throughput (tokens/s)
-- Energy (J)
-- Energy per token (J/token)
-- Average power (W)
-- Peak memory (GB)
-- GPU utilization (%)
-
-#### Inference Benchmark
-
-```bash
-python benchmark_inference.py \
-    --model-path ./model_cache \
-    --model-name Qwen/Qwen3-4B \
-    --attn-implementation auto \
-    --num-requests 10 \
-    --prompt-length 512 \
-    --generation-length 100 \
-    --warmup-requests 2 \
-    --output-dir ./results
-```
-
-**Metrics Reported**:
-- **Prefill**: TTFT, throughput, energy per token
-- **Decode**: ITL, throughput, energy per token
-- **End-to-End**: Request latency, total throughput, total energy
-
-### Attention Implementations
-
-The benchmark automatically selects the optimal attention implementation based on GPU:
-- **A100, MI300X**: `flash_attention_2`
-- **H100, H200**: `flash_attention_3_hopper`
-
-Override with `--attn-implementation`:
-
-```bash
-# Force FlashAttention-3 Hopper on H100
-python run_benchmark.py --attn-implementation flash_attention_3_hopper
-
-# Use SDPA instead
-python run_benchmark.py --attn-implementation sdpa
-```
-
-Available options:
-- `auto` - Auto-detect based on GPU
-- `flash_attention_2` - FlashAttention-2 (all GPUs)
-- `flash_attention_3_hopper` - FlashAttention-3 for H100/H200
-- `sdpa` - PyTorch Scaled Dot Product Attention
-- `eager` - Standard PyTorch attention
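-
-Under `auto`, the choice reduces to a GPU-name lookup; a rough sketch of that kind of logic (illustrative only; the real helper in `utils/attention.py` may differ):
-
-```python
-import torch
-
-def pick_attn_implementation(requested: str = "auto") -> str:
-    # Sketch: map the visible GPU to a default attention implementation.
-    if requested != "auto":
-        return requested
-    name = torch.cuda.get_device_name(0)
-    # Hopper parts (H100/H200) use FlashAttention-3; A100/MI300X use FlashAttention-2
-    if "H100" in name or "H200" in name:
-        return "flash_attention_3_hopper"
-    return "flash_attention_2"
-```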
-
-## Running on SLURM
-
-All SLURM scripts are configured to run inside the apptainer container. First cache the model on the head node:
-
-```bash
-# On head node (with internet access)
-apptainer exec --nv pytorch_25.10_tilelang.sif python cache_model.py \
-    --model-name Qwen/Qwen3-4B \
-    --cache-dir ./model_cache
-```
-
-Then submit jobs:
-
-```bash
-# A100
-sbatch slurm_a100.sh
-
-# H100
-sbatch slurm_h100.sh
-
-# H200
-sbatch slurm_h200.sh
-
-# MI300X
-sbatch slurm_mi300x.sh
-```
-
-**Note**:
-- NVIDIA GPUs use `--nv` flag
-- AMD GPUs use `--rocm` flag
-
-## Output
-
-Results are saved to the `--output-dir` directory (default: `./results/`):
-
-- `pretrain__.json` - Pretraining metrics
-- `inference__.json` - Inference metrics
-
-Example output:
-
-```
-===============================================================================
-PRETRAINING BENCHMARK RESULTS
-===============================================================================
-
-Model: Qwen/Qwen3-4B
-GPU: NVIDIA A100 80GB
-Attention: flash_attention_2
-Batch Size: 8
-Sequence Length: 8192
-Training Steps: 10
-
--------------------------------------------------------------------------------
-STAGE BREAKDOWN
--------------------------------------------------------------------------------
-
-[1] FORWARD PASS
-    Duration:           1005.23 ms
-    Tokens:             163,840
-    Throughput:         163,012.45 tokens/s
-    Energy:             253.0 J
-    Energy per Token:   1.5443 mJ/token
-
-[2] BACKWARD PASS
-    Duration:           2052.11 ms
-    Tokens:             163,840
-    Throughput:         79,857.23 tokens/s
-    Energy:             516.2 J
-    Energy per Token:   3.1513 mJ/token
-
-[3] OPTIMIZER STEP
-    Duration:           153.42 ms
-    Tokens:             163,840
-    Throughput:         1,068,012.34 tokens/s
-    Energy:             38.4 J
-    Energy per Token:   0.2344 mJ/token
-
--------------------------------------------------------------------------------
-OVERALL METRICS
--------------------------------------------------------------------------------
-    Total Duration:     3210.76 ms
-    Total Tokens:       163,840
-    Throughput:         51,012.45 tokens/s
-    Total Energy:       807.6 J
-    Energy per Token:   4.9300 mJ/token
-===============================================================================
-```
-
-## Key Metrics Reference
-
-### Pretraining
-- **Forward**: Input processing and loss calculation
-- **Backward**: Gradient computation
-- **Optimizer**: Weight updates
-
-### Inference
-- **TTFT (Time to First Token)**: Prefill latency
-- **ITL (Inter-Token Latency)**: Average decode time per token
-- **E2E Latency**: Total request time (prefill + decode)
-
-### Energy
-- **Energy (J)**: Total energy consumed
-- **Energy per Token (mJ/token)**: Energy efficiency metric
-- **Average Power (W)**: Power consumption during stage
-
-## Troubleshooting
-
-### Model Not Found
-Ensure you've cached the model first:
-```bash
-python cache_model.py --model-name Qwen/Qwen3-4B --cache-dir ./model_cache
-```
-
-### GPU Monitoring Errors
-- **NVIDIA**: Install pynvml: `pip install pynvml`
-- **AMD**: Install pyrsmi: `pip install pyrsmi`
-
-### FlashAttention-3 Not Found
-For H100/H200, ensure FlashAttention-3 is installed. If not available, use:
-```bash
-python run_benchmark.py --attn-implementation flash_attention_2
-```
-
-### Out of Memory
-Reduce batch size or sequence length:
-```bash
-python run_benchmark.py --batch-size 4 --sequence-length 1024
-```
-
-## Citation
-
-If you use this benchmark suite, please cite:
-- [FlashAttention-2](https://github.com/Dao-AILab/flash-attention)
-- [FlashAttention-3](https://github.com/Dao-AILab/flash-attention) (for Hopper)
-- [Qwen Models](https://huggingface.co/Qwen)
-
-## License
-
-MIT License - see LICENSE file for details