Add README for tagging. Enable tagging by flag without configuration option

2026-03-03 22:57:29 +01:00 · 2026-01-13 08:32:32 +01:00
parent 42809e3f75
commit a9366d14c6
2 changed files with 421 additions and 0 deletions
--- a/configs/tagger/README.md
+++ b/configs/tagger/README.md
@@ -0,0 +1,419 @@
+# Job Tagging Configuration
+
+ClusterCockpit provides automatic job tagging functionality to classify and
+categorize jobs based on configurable rules. The tagging system consists of two
+main components:
+
+1. **Application Detection** - Identifies which application a job is running
+2. **Job Classification** - Analyzes job performance characteristics and applies classification tags
+
+## Directory Structure
+
+```
+configs/tagger/
+├── apps/              # Application detection patterns
+│   ├── vasp.txt
+│   ├── gromacs.txt
+│   └── ...
+└── jobclasses/        # Job classification rules
+    ├── parameters.json
+    ├── lowUtilization.json
+    ├── highload.json
+    └── ...
+```
+
+## Activating Tagger Rules
+
+### Step 1: Copy Configuration Files
+
+To activate tagging, review, adapt, and copy the configuration files from
+`configs/tagger/` to `var/tagger/`:
+
+```bash
+# From the cc-backend root directory
+mkdir -p var/tagger
+cp -r configs/tagger/apps var/tagger/
+cp -r configs/tagger/jobclasses var/tagger/
+```
+
+### Step 2: Enable Tagging in Configuration
+
+Add or set the following configuration key in the `main` section of your `config.json`:
+
+```json
+{
+  "enable-job-taggers": true
+}
+```
+
+**Important**: Automatic tagging is disabled by default. You must explicitly
+enable it by setting `enable-job-taggers: true` in the main configuration file.
+
+### Step 3: Restart cc-backend
+
+The tagger system automatically loads configuration from `./var/tagger/` at
+startup. After copying the files and enabling the feature, restart cc-backend:
+
+```bash
+./cc-backend -server
+```
+
+### Step 4: Verify Configuration Loaded
+
+Check the logs for messages indicating successful configuration loading:
+
+```
+[INFO] Setup file watch for ./var/tagger/apps
+[INFO] Setup file watch for ./var/tagger/jobclasses
+```
+
+## How Tagging Works
+
+### Automatic Tagging
+
+When `enable-job-taggers` is set to `true` in the configuration, tags are
+automatically applied when:
+
+- **Job Start**: Application detection runs immediately when a job starts
+- **Job Stop**: Job classification runs when a job completes
+
+The system analyzes job metadata and metrics to determine appropriate tags.
+
+**Note**: Automatic tagging only works for jobs that start or stop after the
+feature is enabled. Existing jobs are not automatically retagged.
+
+### Manual Tagging (Retroactive)
+
+To apply tags to existing jobs in the database, use the `-apply-tags` command
+line option:
+
+```bash
+./cc-backend -apply-tags
+```
+
+This processes all jobs in the database and applies current tagging rules. This
+is useful when:
+
+- You have existing jobs that were created before tagging was enabled
+- You've added new tagging rules and want to apply them to historical data
+- You've modified existing rules and want to re-evaluate all jobs
+
+### Hot Reload
+
+The tagger system watches the configuration directories for changes. You can
+modify or add rules without restarting `cc-backend`:
+
+- Changes to `var/tagger/apps/*` are detected automatically
+- Changes to `var/tagger/jobclasses/*` are detected automatically
+
+## Application Detection
+
+Application detection identifies which software a job is running by matching
+patterns in the job script.
+
+### Configuration Format
+
+Application patterns are stored in text files under `var/tagger/apps/`. Each
+file contains one or more regular expression patterns (one per line) that match
+against the job script.
+
+**Example: `apps/vasp.txt`**
+
+```
+vasp
+VASP
+```
+
+### How It Works
+
+1. When a job starts, the system retrieves the job script from metadata
+2. Each line in the app files is treated as a regex pattern
+3. Patterns are matched case-insensitively against the lowercased job script
+4. If a match is found, a tag of type `app` with the filename (without extension) is applied
+5. Only the first matching application is tagged
+
+### Adding New Applications
+
+1. Create a new file in `var/tagger/apps/` (e.g., `tensorflow.txt`)
+2. Add regex patterns, one per line:
+
+   ```
+   tensorflow
+   tf\.keras
+   import tensorflow
+   ```
+
+3. The file is automatically detected and loaded
+
+**Note**: The tag name will be the filename without the `.txt` extension (e.g., `tensorflow`).
+
+## Job Classification
+
+Job classification analyzes completed jobs based on their metrics and properties
+to identify performance issues or characteristics.
+
+### Configuration Format
+
+Job classification rules are defined in JSON files under
+`var/tagger/jobclasses/`. Each rule file defines:
+
+- **Metrics required**: Which job metrics to analyze
+- **Requirements**: Pre-conditions that must be met
+- **Variables**: Computed values used in the rule
+- **Rule expression**: Boolean expression that determines if the rule matches
+- **Hint template**: Message displayed when the rule matches
+
+### Parameters File
+
+`jobclasses/parameters.json` defines shared threshold values used across multiple rules:
+
+```json
+{
+  "lowcpuload_threshold_factor": 0.9,
+  "highmemoryusage_threshold_factor": 0.9,
+  "job_min_duration_seconds": 600.0,
+  "sampling_interval_seconds": 30.0
+}
+```
+
+### Rule File Structure
+
+**Example: `jobclasses/lowUtilization.json`**
+
+```json
+{
+  "name": "Low resource utilization",
+  "tag": "lowutilization",
+  "parameters": ["job_min_duration_seconds"],
+  "metrics": ["flops_any", "mem_bw"],
+  "requirements": [
+    "job.shared == \"none\"",
+    "job.duration > job_min_duration_seconds"
+  ],
+  "variables": [
+    {
+      "name": "mem_bw_perc",
+      "expr": "1.0 - (mem_bw.avg / mem_bw.limits.peak)"
+    }
+  ],
+  "rule": "flops_any.avg < flops_any.limits.alert",
+  "hint": "Average flop rate {{.flops_any.avg}} falls below threshold {{.flops_any.limits.alert}}"
+}
+```
+
+#### Field Descriptions
+
+| Field          | Description                                                                   |
+| -------------- | ----------------------------------------------------------------------------- |
+| `name`         | Human-readable description of the rule                                        |
+| `tag`          | Tag identifier applied when the rule matches                                  |
+| `parameters`   | List of parameter names from `parameters.json` to include in rule environment |
+| `metrics`      | List of metrics required for evaluation (must be present in job data)         |
+| `requirements` | Boolean expressions that must all be true for the rule to be evaluated        |
+| `variables`    | Named expressions computed before evaluating the main rule                    |
+| `rule`         | Boolean expression that determines if the job matches this classification     |
+| `hint`         | Go template string for generating a user-visible message                      |
+
+### Expression Environment
+
+Expressions in `requirements`, `variables`, and `rule` have access to:
+
+**Job Properties:**
+
+- `job.shared` - Shared node allocation type
+- `job.duration` - Job runtime in seconds
+- `job.numCores` - Number of CPU cores
+- `job.numNodes` - Number of nodes
+- `job.jobState` - Job completion state
+- `job.numAcc` - Number of accelerators
+- `job.smt` - SMT setting
+
+**Metric Statistics (for each metric in `metrics`):**
+
+- `<metric>.min` - Minimum value
+- `<metric>.max` - Maximum value
+- `<metric>.avg` - Average value
+- `<metric>.limits.peak` - Peak limit from cluster config
+- `<metric>.limits.normal` - Normal threshold
+- `<metric>.limits.caution` - Caution threshold
+- `<metric>.limits.alert` - Alert threshold
+
+**Parameters:**
+
+- All parameters listed in the `parameters` field
+
+**Variables:**
+
+- All variables defined in the `variables` array
+
+### Expression Language
+
+Rules use the [expr](https://github.com/expr-lang/expr) language for expressions. Supported operations:
+
+- **Arithmetic**: `+`, `-`, `*`, `/`, `%`, `^`
+- **Comparison**: `==`, `!=`, `<`, `<=`, `>`, `>=`
+- **Logical**: `&&`, `||`, `!`
+- **Functions**: Standard math functions (see expr documentation)
+
+### Hint Templates
+
+Hints use Go's `text/template` syntax. Variables from the evaluation environment are accessible:
+
+```
+{{.flops_any.avg}}          # Access metric average
+{{.job.duration}}            # Access job property
+{{.my_variable}}             # Access computed variable
+```
+
+### Adding New Classification Rules
+
+1. Create a new JSON file in `var/tagger/jobclasses/` (e.g., `memoryLeak.json`)
+2. Define the rule structure:
+
+   ```json
+   {
+     "name": "Memory Leak Detection",
+     "tag": "memory_leak",
+     "parameters": ["memory_leak_slope_threshold"],
+     "metrics": ["mem_used"],
+     "requirements": ["job.duration > 3600"],
+     "variables": [
+       {
+         "name": "mem_growth",
+         "expr": "(mem_used.max - mem_used.min) / job.duration"
+       }
+     ],
+     "rule": "mem_growth > memory_leak_slope_threshold",
+     "hint": "Memory usage grew by {{.mem_growth}} per second"
+   }
+   ```
+
+3. Add any new parameters to `parameters.json`
+4. The file is automatically detected and loaded
+
+## Configuration Paths
+
+The tagger system reads from these paths (relative to cc-backend working directory):
+
+- **Application patterns**: `./var/tagger/apps/`
+- **Job classification rules**: `./var/tagger/jobclasses/`
+
+These paths are defined as constants in the source code and cannot be changed without recompiling.
+
+## Troubleshooting
+
+### Tags Not Applied
+
+1. **Check tagging is enabled**: Verify `enable-job-taggers: true` is set in `config.json`
+
+2. **Check configuration exists**:
+
+   ```bash
+   ls -la var/tagger/apps
+   ls -la var/tagger/jobclasses
+   ```
+
+3. **Check logs for errors**:
+
+   ```bash
+   ./cc-backend -server -loglevel debug
+   ```
+
+4. **Verify file permissions**: Ensure cc-backend can read the configuration files
+
+5. **For existing jobs**: Use `./cc-backend -apply-tags` to retroactively tag jobs
+
+### Rules Not Matching
+
+1. **Enable debug logging**: Set `loglevel: debug` to see detailed rule evaluation
+2. **Check requirements**: Ensure all requirements in the rule are satisfied
+3. **Verify metrics exist**: Classification rules require job metrics to be available
+4. **Check metric names**: Ensure metric names match those in your cluster configuration
+
+### File Watch Not Working
+
+If changes to configuration files aren't detected:
+
+1. Restart cc-backend to reload all configuration
+2. Check filesystem supports file watching (network filesystems may not)
+3. Check logs for file watch setup messages
+
+## Best Practices
+
+1. **Start Simple**: Begin with basic rules and refine based on results
+2. **Use Requirements**: Filter out irrelevant jobs early with requirements
+3. **Test Incrementally**: Add one rule at a time and verify behavior
+4. **Document Rules**: Use descriptive names and clear hint messages
+5. **Share Parameters**: Define common thresholds in `parameters.json` for consistency
+6. **Version Control**: Keep your `var/tagger/` configuration in version control
+7. **Backup Before Changes**: Test new rules on a copy before deploying to production
+
+## Examples
+
+### Simple Application Detection
+
+**File: `var/tagger/apps/python.txt`**
+
+```
+python
+python3
+\.py
+```
+
+This detects jobs running Python scripts.
+
+### Complex Classification Rule
+
+**File: `var/tagger/jobclasses/cpuImbalance.json`**
+
+```json
+{
+  "name": "CPU Load Imbalance",
+  "tag": "cpu_imbalance",
+  "parameters": ["core_load_imbalance_threshold_factor"],
+  "metrics": ["cpu_load"],
+  "requirements": ["job.numCores > 1", "job.duration > 600"],
+  "variables": [
+    {
+      "name": "load_variance",
+      "expr": "(cpu_load.max - cpu_load.min) / cpu_load.avg"
+    }
+  ],
+  "rule": "load_variance > core_load_imbalance_threshold_factor",
+  "hint": "CPU load varies by {{printf \"%.1f%%\" (load_variance * 100)}} across cores"
+}
+```
+
+This detects jobs where CPU load is unevenly distributed across cores.
+
+## Reference
+
+### Configuration Options
+
+**Main Configuration (`config.json`)**:
+
+- `enable-job-taggers` (boolean, default: `false`) - Enables automatic job tagging system
+  - Must be set to `true` to activate automatic tagging on job start/stop events
+  - Does not affect the `-apply-tags` command line option
+
+**Command Line Options**:
+
+- `-apply-tags` - Apply all tagging rules to existing jobs in the database
+  - Works independently of `enable-job-taggers` configuration
+  - Useful for retroactively tagging jobs or re-evaluating with updated rules
+
+### Default Configuration Location
+
+The example configurations are provided in:
+
+- `configs/tagger/apps/` - Example application patterns (16 applications)
+- `configs/tagger/jobclasses/` - Example classification rules (3 rules)
+
+Copy these to `var/tagger/` and customize for your environment.
+
+### Tag Types
+
+- `app` - Application tags (e.g., "vasp", "gromacs")
+- `jobClass` - Classification tags (e.g., "lowutilization", "highload")
+
+Tags can be queried and filtered in the ClusterCockpit UI and API.