From c4acf721e7317e6227e68c4c9e59bcdb2daac9c1 Mon Sep 17 00:00:00 2001
From: Bole Ma
Date: Thu, 5 Feb 2026 23:20:23 +0100
Subject: [PATCH] Ignore markdown files

---
 AMD_FIX_SUMMARY.md | 100 ---------------
 README.md          | 311 ---------------------------------------------
 2 files changed, 411 deletions(-)
 delete mode 100644 AMD_FIX_SUMMARY.md
 delete mode 100644 README.md

diff --git a/AMD_FIX_SUMMARY.md b/AMD_FIX_SUMMARY.md
deleted file mode 100644
index b5fbca1..0000000
--- a/AMD_FIX_SUMMARY.md
+++ /dev/null
@@ -1,100 +0,0 @@
-# AMD GPU Monitoring Fix Summary
-
-## Issue
-The AMDMonitor class was using incorrect pyrsmi API calls. The implementation attempted to use the low-level `rocmsmi` module, which has complex initialization and function signatures.
-
-## Solution
-Updated to use the correct `rocml` high-level API from pyrsmi, based on the official example at:
-`/anvme/workspace/ihpc125h-llm-profiles/pyrsmi/examples/llm_monitoring/monitor_llm_inference.py`
-
-## Changes Made
-
-### 1. Fixed AMDMonitor Class
-
-**Before** (incorrect):
-```python
-from pyrsmi import rocmsmi
-ret = self.rocmsmi.rsmi_init(0)
-power_uw = self.rocmsmi.rsmi_dev_power_ave_get(self.device_id)
-```
-
-**After** (correct):
-```python
-from pyrsmi import rocml
-self.rocml.smi_initialize()
-power_watts = self.rocml.smi_get_device_average_power(self.device_id)
-```
-
-**Key API Functions**:
-- `rocml.smi_initialize()` - Initialize monitoring
-- `rocml.smi_get_device_average_power(device_id)` - Get power in Watts (not microwatts!)
-- `rocml.smi_get_device_utilization(device_id)` - Get GPU utilization %
-- `rocml.smi_get_device_memory_used(device_id)` - Get memory used in bytes
-- `rocml.smi_get_device_memory_total(device_id)` - Get total memory in bytes
-- `rocml.smi_get_device_temperature(device_id)` - Get temperature
-- `rocml.smi_get_device_name(device_id)` - Get device name
-- `rocml.smi_shutdown()` - Cleanup
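-
-For reference, a minimal monitor built on these calls could look like the sketch below (illustrative only; the actual `AMDMonitor` in `utils/gpu_monitor.py` may differ in structure and error handling):
-
-```python
-# Illustrative sketch only, not the real AMDMonitor implementation.
-from pyrsmi import rocml
-
-class MinimalAMDMonitor:
-    def __init__(self, device_id=0):
-        self.device_id = device_id
-        rocml.smi_initialize()
-
-    def read(self):
-        # rocml reports power directly in Watts and memory in bytes
-        return {
-            "name": rocml.smi_get_device_name(self.device_id),
-            "power_watts": rocml.smi_get_device_average_power(self.device_id),
-            "gpu_utilization_percent": rocml.smi_get_device_utilization(self.device_id),
-            "memory_used_gb": rocml.smi_get_device_memory_used(self.device_id) / 1e9,
-            "memory_total_gb": rocml.smi_get_device_memory_total(self.device_id) / 1e9,
-        }
-
-    def cleanup(self):
-        rocml.smi_shutdown()
-```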
-
-### 2. Updated All SLURM Scripts for Apptainer
-
-All GPU benchmark scripts now run inside the apptainer container:
-
-**A100, H100, H200** (NVIDIA):
-```bash
-APPTAINER_IMAGE="/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif"
-apptainer exec --nv $APPTAINER_IMAGE python run_benchmark.py ...
-```
-
-**MI300X** (AMD):
-```bash
-APPTAINER_IMAGE="/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif"
-apptainer exec --rocm $APPTAINER_IMAGE python run_benchmark.py ...
-```
-
-Note: `--nv` for NVIDIA, `--rocm` for AMD
-
-### 3. Updated Documentation
-
-- README.md now mentions apptainer usage
-- Updated setup instructions to use apptainer for model caching
-- Added notes about container flags (--nv vs --rocm)
-
-## Testing
-
-To verify the AMD monitoring works:
-
-```bash
-# Inside apptainer on MI300X node
-apptainer exec --rocm pytorch_25.10_tilelang.sif python -c "
-from utils.gpu_monitor import AMDMonitor
-m = AMDMonitor(0)
-print(f'GPU: {m.get_device_name()}')
-metrics = m.get_metrics()
-print(f'Power: {metrics.power_watts:.2f} W')
-print(f'Utilization: {metrics.gpu_utilization_percent:.1f}%')
-print(f'Memory: {metrics.memory_used_gb:.2f} / {metrics.memory_total_gb:.2f} GB')
-m.cleanup()
-"
-```
-
-## Files Modified
-
-1. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/utils/gpu_monitor.py` - Fixed AMDMonitor class
-2. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_a100.sh` - Added apptainer
-3. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_h100.sh` - Added apptainer
-4. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_h200.sh` - Added apptainer
-5. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/slurm_mi300x.sh` - Added apptainer with --rocm
-6. `/anvme/workspace/ihpc125h-llm-profiles/llm-benchmark/README.md` - Updated documentation
-
-## Key Differences: rocml vs rocmsmi
-
-| Feature | rocml (High-level) | rocmsmi (Low-level) |
-|---------|-------------------|---------------------|
-| API Style | Simple functions | Complex C-style API |
-| Initialization | `smi_initialize()` | `rsmi_init(0)` + error codes |
-| Power | Returns Watts | Returns microwatts |
-| Memory | Returns bytes | Returns bytes via enums |
-| Error Handling | Returns -1 on error | Returns error codes |
-| Ease of Use | Much easier | Complex |
-
-The `rocml` module is the recommended high-level Python API for pyrsmi.
diff --git a/README.md b/README.md
deleted file mode 100644
index a98c03c..0000000
--- a/README.md
+++ /dev/null
@@ -1,311 +0,0 @@
-# LLM Benchmark Suite
-
-A comprehensive benchmarking suite for comparing LLM performance (Qwen3-4B) across different GPU architectures: **MI300X**, **A100 80G**, **H100**, and **H200**.
-
-## Features
-
-- **Pretraining Benchmarks**: Separate metrics for forward, backward, and optimizer stages
-- **Inference Benchmarks**: Separate metrics for prefill (TTFT) and decode (ITL) stages
-- **Energy Monitoring**: GPU-specific energy and power measurement
-  - NVIDIA: pynvml
-  - AMD: pyrsmi
-- **Attention Implementations**:
-  - FlashAttention-2 (A100, MI300X)
-  - FlashAttention-3 Hopper (H100, H200)
-  - Configurable via CLI
-- **Comprehensive Metrics**:
-  - Tokens per second
-  - Energy per token
-  - Time to First Token (TTFT)
-  - Inter-Token Latency (ITL)
-  - End-to-End Request Latency
-  - GPU utilization and memory usage
-
-## Directory Structure
-
-```
-llm-benchmark/
-├── cache_model.py           # Model caching script
-├── benchmark_pretrain.py    # Pretraining benchmark
-├── benchmark_inference.py   # Inference benchmark
-├── run_benchmark.py         # Main orchestration script
-├── requirements.txt         # Python dependencies
-├── utils/
-│   ├── gpu_monitor.py       # GPU monitoring (NVIDIA & AMD)
-│   ├── metrics.py           # Metrics collection and reporting
-│   └── attention.py         # Attention implementation helpers
-├── configs/
-│   ├── a100.yaml
-│   ├── h100.yaml
-│   ├── h200.yaml
-│   └── mi300x.yaml
-└── results/                 # Benchmark results (JSON)
-```
-
-## Setup
-
-### 1. Container Environment
-
-All benchmarks should be run inside the apptainer container:
-
-```bash
-# Container is located at:
-/anvme/workspace/ihpc125h-llm-profiles/pytorch_25.10_tilelang.sif
-```
-
-### 2. Install Dependencies (if not using apptainer)
-
-If you want to run directly without apptainer:
-
-```bash
-# Install Python dependencies
-pip install -r requirements.txt

-# For AMD GPUs, ensure ROCm and pyrsmi are installed
-# For NVIDIA GPUs, ensure CUDA and pynvml are installed
-```
-
-### 3. Cache Model (Run on Head Node)
-
-**IMPORTANT**: Run this on the head node BEFORE allocating compute nodes, as compute nodes are typically offline.
-
-```bash
-# Using apptainer (recommended)
-apptainer exec --nv pytorch_25.10_tilelang.sif python cache_model.py \
-    --model-name Qwen/Qwen3-4B \
-    --cache-dir ./model_cache
-
-# Or directly (if dependencies installed)
-python cache_model.py --model-name Qwen/Qwen3-4B --cache-dir ./model_cache
-```
-
-The model will be cached to `./model_cache` in the current directory (avoiding slow NFS $HOME).
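-
-For reference, the caching step boils down to a few lines of Python; a minimal sketch of what a script like `cache_model.py` could do (assuming the standard `transformers` download API; the actual script may differ):
-
-```python
-# Minimal caching sketch; run once on the head node with internet access.
-import argparse
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-parser = argparse.ArgumentParser()
-parser.add_argument("--model-name", default="Qwen/Qwen3-4B")
-parser.add_argument("--cache-dir", default="./model_cache")
-args = parser.parse_args()
-
-# Downloading once populates the local cache for the offline compute nodes
-AutoTokenizer.from_pretrained(args.model_name, cache_dir=args.cache_dir)
-AutoModelForCausalLM.from_pretrained(args.model_name, cache_dir=args.cache_dir)
-```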
-
-## Usage
-
-### Quick Start
-
-```bash
-# Run both pretraining and inference benchmarks
-python run_benchmark.py --mode both --model-path ./model_cache
-
-# Run only pretraining
-python run_benchmark.py --mode pretrain --num-steps 20
-
-# Run only inference
-python run_benchmark.py --mode inference --num-requests 20
-```
-
-### Detailed Usage
-
-#### List Available GPUs
-
-```bash
-python run_benchmark.py --list-gpus
-```
-
-#### Pretraining Benchmark
-
-```bash
-python benchmark_pretrain.py \
-    --model-path ./model_cache \
-    --model-name Qwen/Qwen3-4B \
-    --attn-implementation auto \
-    --batch-size 8 \
-    --sequence-length 8192 \
-    --num-steps 10 \
-    --warmup-steps 3 \
-    --output-dir ./results
-```
-
-**Metrics Reported** (per stage: forward, backward, optimizer):
-- Duration (ms)
-- Tokens processed
-- Throughput (tokens/s)
-- Energy (J)
-- Energy per token (J/token)
-- Average power (W)
-- Peak memory (GB)
-- GPU utilization (%)
-
-#### Inference Benchmark
-
-```bash
-python benchmark_inference.py \
-    --model-path ./model_cache \
-    --model-name Qwen/Qwen3-4B \
-    --attn-implementation auto \
-    --num-requests 10 \
-    --prompt-length 512 \
-    --generation-length 100 \
-    --warmup-requests 2 \
-    --output-dir ./results
-```
-
-**Metrics Reported**:
-- **Prefill**: TTFT, throughput, energy per token
-- **Decode**: ITL, throughput, energy per token
-- **End-to-End**: Request latency, total throughput, total energy
-
-### Attention Implementations
-
-The benchmark automatically selects the optimal attention implementation based on GPU:
-- **A100, MI300X**: `flash_attention_2`
-- **H100, H200**: `flash_attention_3_hopper`
-
-Override with `--attn-implementation`:
-
-```bash
-# Force FlashAttention-3 Hopper on H100
-python run_benchmark.py --attn-implementation flash_attention_3_hopper
-
-# Use SDPA instead
-python run_benchmark.py --attn-implementation sdpa
-```
-
-Available options:
-- `auto` - Auto-detect based on GPU
-- `flash_attention_2` - FlashAttention-2 (all GPUs)
-- `flash_attention_3_hopper` - FlashAttention-3 for H100/H200
-- `sdpa` - PyTorch Scaled Dot Product Attention
-- `eager` - Standard PyTorch attention
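-
-Under `auto`, the choice reduces to a GPU-name lookup; a rough sketch of that kind of logic (illustrative only; the real helper in `utils/attention.py` may differ):
-
-```python
-import torch
-
-def pick_attn_implementation(requested: str = "auto") -> str:
-    # Sketch: map the visible GPU to a default attention implementation.
-    if requested != "auto":
-        return requested
-    name = torch.cuda.get_device_name(0)
-    # Hopper parts (H100/H200) use FlashAttention-3; A100/MI300X use FlashAttention-2
-    if "H100" in name or "H200" in name:
-        return "flash_attention_3_hopper"
-    return "flash_attention_2"
-```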
-
-## Running on SLURM
-
-All SLURM scripts are configured to run inside the apptainer container. First cache the model on the head node:
-
-```bash
-# On head node (with internet access)
-apptainer exec --nv pytorch_25.10_tilelang.sif python cache_model.py \
-    --model-name Qwen/Qwen3-4B \
-    --cache-dir ./model_cache
-```
-
-Then submit jobs:
-
-```bash
-# A100
-sbatch slurm_a100.sh
-
-# H100
-sbatch slurm_h100.sh
-
-# H200
-sbatch slurm_h200.sh
-
-# MI300X
-sbatch slurm_mi300x.sh
-```
-
-**Note**:
-- NVIDIA GPUs use `--nv` flag
-- AMD GPUs use `--rocm` flag
-
-## Output
-
-Results are saved to the `--output-dir` directory (default: `./results/`):
-
-- `pretrain__.json` - Pretraining metrics
-- `inference__.json` - Inference metrics
-
-Example output:
-
-```
-===============================================================================
-PRETRAINING BENCHMARK RESULTS
-===============================================================================
-
-Model: Qwen/Qwen3-4B
-GPU: NVIDIA A100 80GB
-Attention: flash_attention_2
-Batch Size: 8
-Sequence Length: 8192
-Training Steps: 10
-
--------------------------------------------------------------------------------
-STAGE BREAKDOWN
--------------------------------------------------------------------------------
-
-[1] FORWARD PASS
-    Duration:           1005.23 ms
-    Tokens:             163,840
-    Throughput:         163,012.45 tokens/s
-    Energy:             253.0 J
-    Energy per Token:   1.5443 mJ/token
-
-[2] BACKWARD PASS
-    Duration:           2052.11 ms
-    Tokens:             163,840
-    Throughput:         79,857.23 tokens/s
-    Energy:             516.2 J
-    Energy per Token:   3.1513 mJ/token
-
-[3] OPTIMIZER STEP
-    Duration:           153.42 ms
-    Tokens:             163,840
-    Throughput:         1,068,012.34 tokens/s
-    Energy:             38.4 J
-    Energy per Token:   0.2344 mJ/token
-
--------------------------------------------------------------------------------
-OVERALL METRICS
--------------------------------------------------------------------------------
-    Total Duration:     3210.76 ms
-    Total Tokens:       163,840
-    Throughput:         51,012.45 tokens/s
-    Total Energy:       807.6 J
-    Energy per Token:   4.9300 mJ/token
-===============================================================================
-```
-
-## Key Metrics Reference
-
-### Pretraining
-- **Forward**: Input processing and loss calculation
-- **Backward**: Gradient computation
-- **Optimizer**: Weight updates
-
-### Inference
-- **TTFT (Time to First Token)**: Prefill latency
-- **ITL (Inter-Token Latency)**: Average decode time per token
-- **E2E Latency**: Total request time (prefill + decode)
-
-### Energy
-- **Energy (J)**: Total energy consumed
-- **Energy per Token (mJ/token)**: Energy efficiency metric
-- **Average Power (W)**: Power consumption during stage
-
-## Troubleshooting
-
-### Model Not Found
-Ensure you've cached the model first:
-```bash
-python cache_model.py --model-name Qwen/Qwen3-4B --cache-dir ./model_cache
-```
-
-### GPU Monitoring Errors
-- **NVIDIA**: Install pynvml: `pip install pynvml`
-- **AMD**: Install pyrsmi: `pip install pyrsmi`
-
-### FlashAttention-3 Not Found
-For H100/H200, ensure FlashAttention-3 is installed. If not available, use:
-```bash
-python run_benchmark.py --attn-implementation flash_attention_2
-```
-
-### Out of Memory
-Reduce batch size or sequence length:
-```bash
-python run_benchmark.py --batch-size 4 --sequence-length 1024
-```
-
-## Citation
-
-If you use this benchmark suite, please cite:
-- [FlashAttention-2](https://github.com/Dao-AILab/flash-attention)
-- [FlashAttention-3](https://github.com/Dao-AILab/flash-attention) (for Hopper)
-- [Qwen Models](https://huggingface.co/Qwen)
-
-## License
-
-MIT License - see LICENSE file for details