Add README for tagging. Enable tagging by flag without configuration option

2026-03-03 22:57:29 +01:00 · 2026-01-13 08:32:32 +01:00
parent 42809e3f75
commit a9366d14c6
2 changed files with 421 additions and 0 deletions
--- a/cmd/cc-backend/main.go
+++ b/cmd/cc-backend/main.go
@@ -302,6 +302,8 @@ func initSubsystems() error {
 	// Apply tags if requested
 	if flagApplyTags {
 		tagger.Init()
 		if err := tagger.RunTaggers(); err != nil {
 			return fmt.Errorf("running job taggers: %w", err)
 		}
--- a/configs/tagger/README.md
+++ b/configs/tagger/README.md
@@ -0,0 +1,419 @@
 # Job Tagging Configuration
 ClusterCockpit provides automatic job tagging functionality to classify and
 categorize jobs based on configurable rules. The tagging system consists of two
 main components:
 1. **Application Detection** - Identifies which application a job is running
 2. **Job Classification** - Analyzes job performance characteristics and applies classification tags
 ## Directory Structure
 ```
 configs/tagger/
 ├── apps/              # Application detection patterns
 │   ├── vasp.txt
 │   ├── gromacs.txt
 │   └── ...
 └── jobclasses/        # Job classification rules
     ├── parameters.json
     ├── lowUtilization.json
     ├── highload.json
     └── ...
 ```
 ## Activating Tagger Rules
 ### Step 1: Copy Configuration Files
 To activate tagging, review, adapt, and copy the configuration files from
 `configs/tagger/` to `var/tagger/`:
 ```bash
 # From the cc-backend root directory
 mkdir -p var/tagger
 cp -r configs/tagger/apps var/tagger/
 cp -r configs/tagger/jobclasses var/tagger/
 ```
 ### Step 2: Enable Tagging in Configuration
 Add or set the following configuration key in the `main` section of your `config.json`:
 ```json
 {
  "enable-job-taggers": true
 }
 ```
 **Important**: Automatic tagging is disabled by default. You must explicitly
 enable it by setting `enable-job-taggers: true` in the main configuration file.
 ### Step 3: Restart cc-backend
 The tagger system automatically loads configuration from `./var/tagger/` at
 startup. After copying the files and enabling the feature, restart cc-backend:
 ```bash
 ./cc-backend -server
 ```
 ### Step 4: Verify Configuration Loaded
 Check the logs for messages indicating successful configuration loading:
 ```
 [INFO] Setup file watch for ./var/tagger/apps
 [INFO] Setup file watch for ./var/tagger/jobclasses
 ```
 ## How Tagging Works
 ### Automatic Tagging
 When `enable-job-taggers` is set to `true` in the configuration, tags are
 automatically applied when:
 - **Job Start**: Application detection runs immediately when a job starts
 - **Job Stop**: Job classification runs when a job completes
 The system analyzes job metadata and metrics to determine appropriate tags.
 **Note**: Automatic tagging only works for jobs that start or stop after the
 feature is enabled. Existing jobs are not automatically retagged.
 ### Manual Tagging (Retroactive)
 To apply tags to existing jobs in the database, use the `-apply-tags` command
 line option:
 ```bash
 ./cc-backend -apply-tags
 ```
 This processes all jobs in the database and applies current tagging rules. This
 is useful when:
 - You have existing jobs that were created before tagging was enabled
 - You've added new tagging rules and want to apply them to historical data
 - You've modified existing rules and want to re-evaluate all jobs
 ### Hot Reload
 The tagger system watches the configuration directories for changes. You can
 modify or add rules without restarting `cc-backend`:
 - Changes to `var/tagger/apps/*` are detected automatically
 - Changes to `var/tagger/jobclasses/*` are detected automatically
 ## Application Detection
 Application detection identifies which software a job is running by matching
 patterns in the job script.
 ### Configuration Format
 Application patterns are stored in text files under `var/tagger/apps/`. Each
 file contains one or more regular expression patterns (one per line) that match
 against the job script.
 **Example: `apps/vasp.txt`**
 ```
 vasp
 VASP
 ```
 ### How It Works
 1. When a job starts, the system retrieves the job script from metadata
 2. Each line in the app files is treated as a regex pattern
 3. Patterns are matched case-insensitively against the lowercased job script
 4. If a match is found, a tag of type `app` with the filename (without extension) is applied
 5. Only the first matching application is tagged
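 The matching steps above can be sketched in Go as follows. This is a minimal illustration, not the cc-backend implementation: the `appPatterns` type and the `compileApp` and `detectApp` helpers are hypothetical names, the patterns are passed in as string slices instead of being read from `var/tagger/apps/`, and the job script is a made-up sample.
 ```go
 package main

 import (
 	"fmt"
 	"regexp"
 	"strings"
 )

 // appPatterns is a hypothetical in-memory form of one file in var/tagger/apps/:
 // the tag name (file name without extension) and its compiled patterns.
 type appPatterns struct {
 	tag      string
 	patterns []*regexp.Regexp
 }

 // compileApp treats every non-empty line as a regex and compiles it with the
 // (?i) flag so that matching is case-insensitive.
 func compileApp(tag string, lines []string) (appPatterns, error) {
 	app := appPatterns{tag: tag}
 	for _, line := range lines {
 		line = strings.TrimSpace(line)
 		if line == "" {
 			continue
 		}
 		re, err := regexp.Compile("(?i)" + line)
 		if err != nil {
 			return appPatterns{}, fmt.Errorf("app %q: pattern %q: %w", tag, line, err)
 		}
 		app.patterns = append(app.patterns, re)
 	}
 	return app, nil
 }

 // detectApp lowercases the job script and returns the tag of the first
 // application that has a matching pattern.
 func detectApp(apps []appPatterns, jobScript string) (string, bool) {
 	script := strings.ToLower(jobScript)
 	for _, app := range apps {
 		for _, re := range app.patterns {
 			if re.MatchString(script) {
 				return app.tag, true
 			}
 		}
 	}
 	return "", false
 }

 func main() {
 	vasp, err := compileApp("vasp", []string{"vasp", "VASP"})
 	if err != nil {
 		panic(err)
 	}
 	tf, err := compileApp("tensorflow", []string{"tensorflow", `tf\.keras`, "import tensorflow"})
 	if err != nil {
 		panic(err)
 	}
 	script := "#!/bin/bash\nmodule load vasp\nsrun VASP_std\n" // made-up sample script
 	tag, ok := detectApp([]appPatterns{vasp, tf}, script)
 	fmt.Println(tag, ok)
 }
 ```
 Because only the first match is tagged, the order in which applications are checked matters when a script mentions several of them. The slice order used in this sketch is an assumption.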
 ### Adding New Applications
 1. Create a new file in `var/tagger/apps/` (e.g., `tensorflow.txt`)
 2. Add regex patterns, one per line:
    ```
    tensorflow
    tf\.keras
    import tensorflow
    ```
 3. The file is automatically detected and loaded
 **Note**: The tag name will be the filename without the `.txt` extension (e.g., `tensorflow`).
 ## Job Classification
 Job classification analyzes completed jobs based on their metrics and properties
 to identify performance issues or characteristics.
 ### Configuration Format
 Job classification rules are defined in JSON files under
 `var/tagger/jobclasses/`. Each rule file defines:
 - **Metrics required**: Which job metrics to analyze
 - **Requirements**: Pre-conditions that must be met
 - **Variables**: Computed values used in the rule
 - **Rule expression**: Boolean expression that determines if the rule matches
 - **Hint template**: Message displayed when the rule matches
 ### Parameters File
 `jobclasses/parameters.json` defines shared threshold values used across multiple rules:
 ```json
 {
  "lowcpuload_threshold_factor": 0.9,
  "highmemoryusage_threshold_factor": 0.9,
  "job_min_duration_seconds": 600.0,
  "sampling_interval_seconds": 30.0
 }
 ```
 ### Rule File Structure
 **Example: `jobclasses/lowUtilization.json`**
 ```json
 {
  "name": "Low resource utilization",
  "tag": "lowutilization",
  "parameters": ["job_min_duration_seconds"],
  "metrics": ["flops_any", "mem_bw"],
  "requirements": [
    "job.shared == \"none\"",
    "job.duration > job_min_duration_seconds"
  ],
  "variables": [
    {
      "name": "mem_bw_perc",
      "expr": "1.0 - (mem_bw.avg / mem_bw.limits.peak)"
    }
  ],
  "rule": "flops_any.avg < flops_any.limits.alert",
  "hint": "Average flop rate {{.flops_any.avg}} falls below threshold {{.flops_any.limits.alert}}"
 }
 ```
 #### Field Descriptions
 | Field          | Description                                                                   |
 | -------------- | ----------------------------------------------------------------------------- |
 | `name`         | Human-readable description of the rule                                        |
 | `tag`          | Tag identifier applied when the rule matches                                  |
 | `parameters`   | List of parameter names from `parameters.json` to include in rule environment |
 | `metrics`      | List of metrics required for evaluation (must be present in job data)         |
 | `requirements` | Boolean expressions that must all be true for the rule to be evaluated        |
 | `variables`    | Named expressions computed before evaluating the main rule                    |
 | `rule`         | Boolean expression that determines if the job matches this classification     |
 | `hint`         | Go template string for generating a user-visible message                      |
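 To check a rule file before copying it into `var/tagger/jobclasses/`, the fields in the table can be decoded into a Go struct. The sketch below is an assumption-laden illustration: `ruleVariable`, `ruleFile`, and `parseRule` are hypothetical names that mirror the table, not types from the cc-backend source, and the requirement that `tag` and `rule` be non-empty is a guess at a sensible minimum.
 ```go
 package main

 import (
 	"bytes"
 	"encoding/json"
 	"errors"
 	"fmt"
 )

 // ruleVariable and ruleFile are hypothetical types that mirror the fields in
 // the table above.
 type ruleVariable struct {
 	Name string `json:"name"`
 	Expr string `json:"expr"`
 }

 type ruleFile struct {
 	Name         string         `json:"name"`
 	Tag          string         `json:"tag"`
 	Parameters   []string       `json:"parameters"`
 	Metrics      []string       `json:"metrics"`
 	Requirements []string       `json:"requirements"`
 	Variables    []ruleVariable `json:"variables"`
 	Rule         string         `json:"rule"`
 	Hint         string         `json:"hint"`
 }

 // parseRule decodes one rule file, rejects unknown keys (to catch typos such
 // as "requirement"), and checks that the tag and rule expression are set.
 func parseRule(data []byte) (ruleFile, error) {
 	var r ruleFile
 	dec := json.NewDecoder(bytes.NewReader(data))
 	dec.DisallowUnknownFields()
 	if err := dec.Decode(&r); err != nil {
 		return ruleFile{}, fmt.Errorf("decoding rule: %w", err)
 	}
 	if r.Tag == "" || r.Rule == "" {
 		return ruleFile{}, errors.New(`rule needs non-empty "tag" and "rule" fields`)
 	}
 	return r, nil
 }

 func main() {
 	data := []byte(`{"name": "Low resource utilization", "tag": "lowutilization",
 		"metrics": ["flops_any", "mem_bw"],
 		"rule": "flops_any.avg < flops_any.limits.alert"}`)
 	r, err := parseRule(data)
 	if err != nil {
 		panic(err)
 	}
 	fmt.Println(r.Tag, r.Metrics)
 }
 ```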
 ### Expression Environment
 Expressions in `requirements`, `variables`, and `rule` have access to:
 **Job Properties:**
 - `job.shared` - Shared node allocation type
 - `job.duration` - Job runtime in seconds
 - `job.numCores` - Number of CPU cores
 - `job.numNodes` - Number of nodes
 - `job.jobState` - Job completion state
 - `job.numAcc` - Number of accelerators
 - `job.smt` - SMT setting
 **Metric Statistics (for each metric in `metrics`):**
 - `<metric>.min` - Minimum value
 - `<metric>.max` - Maximum value
 - `<metric>.avg` - Average value
 - `<metric>.limits.peak` - Peak limit from cluster config
 - `<metric>.limits.normal` - Normal threshold
 - `<metric>.limits.caution` - Caution threshold
 - `<metric>.limits.alert` - Alert threshold
 **Parameters:**
 - All parameters listed in the `parameters` field
 **Variables:**
 - All variables defined in the `variables` array
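 To make the environment concrete, the following sketch builds one for the `lowUtilization.json` example and evaluates it in the order implied by the field descriptions: requirements first, then variables, then the rule. Go closures stand in for the expr expressions, all type and function names (`ruleEnv`, `rule`, `evaluate`, and so on) are hypothetical, and the metric values are invented; cc-backend's actual data structures may differ.
 ```go
 package main

 import "fmt"

 // metricStats holds the per-metric values listed above.
 type metricStats struct {
 	Min, Max, Avg                float64
 	Peak, Normal, Caution, Alert float64 // the <metric>.limits.* values
 }

 // ruleEnv is a hypothetical, typed stand-in for the expression environment.
 type ruleEnv struct {
 	Shared   string                 // job.shared
 	Duration float64                // job.duration in seconds
 	Metrics  map[string]metricStats // one entry per name in "metrics"
 	Params   map[string]float64     // values from parameters.json
 	Vars     map[string]float64     // filled from "variables"
 }

 type namedVar struct {
 	name string
 	eval func(*ruleEnv) float64
 }

 type rule struct {
 	requirements []func(*ruleEnv) bool
 	variables    []namedVar
 	match        func(*ruleEnv) bool
 }

 // evaluate checks all requirements, then computes the variables, then
 // evaluates the rule itself.
 func evaluate(r rule, e *ruleEnv) bool {
 	for _, req := range r.requirements {
 		if !req(e) {
 			return false // a failed requirement skips the rule entirely
 		}
 	}
 	for _, v := range r.variables {
 		e.Vars[v.name] = v.eval(e)
 	}
 	return r.match(e)
 }

 // lowUtilization mirrors jobclasses/lowUtilization.json with closures.
 func lowUtilization() rule {
 	return rule{
 		requirements: []func(*ruleEnv) bool{
 			func(e *ruleEnv) bool { return e.Shared == "none" },
 			func(e *ruleEnv) bool { return e.Duration > e.Params["job_min_duration_seconds"] },
 		},
 		variables: []namedVar{{
 			name: "mem_bw_perc",
 			eval: func(e *ruleEnv) float64 {
 				m := e.Metrics["mem_bw"]
 				return 1.0 - m.Avg/m.Peak
 			},
 		}},
 		match: func(e *ruleEnv) bool {
 			f := e.Metrics["flops_any"]
 			return f.Avg < f.Alert
 		},
 	}
 }

 func main() {
 	e := &ruleEnv{
 		Shared:   "none",
 		Duration: 7200,
 		Metrics: map[string]metricStats{
 			"flops_any": {Avg: 2, Alert: 10},
 			"mem_bw":    {Avg: 20, Peak: 200},
 		},
 		Params: map[string]float64{"job_min_duration_seconds": 600},
 		Vars:   map[string]float64{},
 	}
 	fmt.Println(evaluate(lowUtilization(), e), e.Vars["mem_bw_perc"])
 }
 ```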
 ### Expression Language
 Rules use the [expr](https://github.com/expr-lang/expr) language for expressions. Supported operations:
 - **Arithmetic**: `+`, `-`, `*`, `/`, `%`, `^`
 - **Comparison**: `==`, `!=`, `<`, `<=`, `>`, `>=`
 - **Logical**: `&&`, `||`, `!`
 - **Functions**: Standard math functions (see expr documentation)
 ### Hint Templates
 Hints use Go's `text/template` syntax. Variables from the evaluation environment are accessible:
 ```
 {{.flops_any.avg}}          # Access metric average
 {{.job.duration}}            # Access job property
 {{.my_variable}}             # Access computed variable
 ```
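 A hint can be tried out in isolation with Go's `text/template` package. The sketch below is only an approximation of what happens inside cc-backend: the `renderHint` helper and the nested-map layout of the environment are assumptions, and the values are invented.
 ```go
 package main

 import (
 	"fmt"
 	"strings"
 	"text/template"
 )

 // renderHint executes a hint template against a nested map. The map layout
 // is an assumption made for this sketch.
 func renderHint(hint string, env map[string]any) (string, error) {
 	// missingkey=error turns a misspelled variable into an error instead of
 	// the text "<no value>".
 	tmpl, err := template.New("hint").Option("missingkey=error").Parse(hint)
 	if err != nil {
 		return "", err
 	}
 	var out strings.Builder
 	if err := tmpl.Execute(&out, env); err != nil {
 		return "", err
 	}
 	return out.String(), nil
 }

 func main() {
 	env := map[string]any{
 		"flops_any": map[string]any{
 			"avg":    1.5,
 			"limits": map[string]any{"alert": 10.0},
 		},
 		"mem_growth": 0.25,
 	}
 	for _, hint := range []string{
 		"Average flop rate {{.flops_any.avg}} falls below threshold {{.flops_any.limits.alert}}",
 		`Memory usage grew by {{printf "%.2f" .mem_growth}} per second`,
 	} {
 		msg, err := renderHint(hint, env)
 		if err != nil {
 			panic(err)
 		}
 		fmt.Println(msg)
 	}
 }
 ```
 `text/template` has no arithmetic operators, so scaling or combining values (for example, converting a ratio to a percentage) has to happen in a `variables` expression. The template can then format the result with `printf`.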
 ### Adding New Classification Rules
 1. Create a new JSON file in `var/tagger/jobclasses/` (e.g., `memoryLeak.json`)
 2. Define the rule structure:
    ```json
    {
      "name": "Memory Leak Detection",
      "tag": "memory_leak",
      "parameters": ["memory_leak_slope_threshold"],
      "metrics": ["mem_used"],
      "requirements": ["job.duration > 3600"],
      "variables": [
        {
          "name": "mem_growth",
          "expr": "(mem_used.max - mem_used.min) / job.duration"
        }
      ],
      "rule": "mem_growth > memory_leak_slope_threshold",
      "hint": "Memory usage grew by {{.mem_growth}} per second"
    }
    ```
 3. Add any new parameters to `parameters.json`
 4. The file is automatically detected and loaded
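 Step 3 is easy to forget. A small check along the following lines can list the parameters that a rule references but `parameters.json` does not define. The helper name `missingParameters` is hypothetical, and the sketch assumes that all parameter values are numbers, as in the example `parameters.json`.
 ```go
 package main

 import (
 	"encoding/json"
 	"fmt"
 )

 // missingParameters returns the names listed in a rule's "parameters" field
 // that have no entry in the parsed parameters.json content.
 func missingParameters(ruleJSON, paramsJSON []byte) ([]string, error) {
 	var rule struct {
 		Parameters []string `json:"parameters"`
 	}
 	if err := json.Unmarshal(ruleJSON, &rule); err != nil {
 		return nil, fmt.Errorf("parsing rule: %w", err)
 	}
 	var params map[string]float64
 	if err := json.Unmarshal(paramsJSON, &params); err != nil {
 		return nil, fmt.Errorf("parsing parameters: %w", err)
 	}
 	var missing []string
 	for _, name := range rule.Parameters {
 		if _, ok := params[name]; !ok {
 			missing = append(missing, name)
 		}
 	}
 	return missing, nil
 }

 func main() {
 	rule := []byte(`{"tag": "memory_leak", "parameters": ["memory_leak_slope_threshold"]}`)
 	params := []byte(`{"job_min_duration_seconds": 600.0, "sampling_interval_seconds": 30.0}`)
 	missing, err := missingParameters(rule, params)
 	if err != nil {
 		panic(err)
 	}
 	fmt.Println("missing parameters:", missing)
 }
 ```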
 ## Configuration Paths
 The tagger system reads from these paths (relative to cc-backend working directory):
 - **Application patterns**: `./var/tagger/apps/`
 - **Job classification rules**: `./var/tagger/jobclasses/`
 These paths are defined as constants in the source code and cannot be changed without recompiling.
 ## Troubleshooting
 ### Tags Not Applied
 1. **Check tagging is enabled**: Verify `enable-job-taggers: true` is set in `config.json`
 2. **Check configuration exists**:
    ```bash
    ls -la var/tagger/apps
    ls -la var/tagger/jobclasses
    ```
 3. **Check logs for errors**:
    ```bash
    ./cc-backend -server -loglevel debug
    ```
 4. **Verify file permissions**: Ensure cc-backend can read the configuration files
 5. **For existing jobs**: Use `./cc-backend -apply-tags` to retroactively tag jobs
 ### Rules Not Matching
 1. **Enable debug logging**: Start cc-backend with `-loglevel debug` to see detailed rule evaluation
 2. **Check requirements**: Ensure all requirements in the rule are satisfied
 3. **Verify metrics exist**: Classification rules require job metrics to be available
 4. **Check metric names**: Ensure metric names match those in your cluster configuration
 ### File Watch Not Working
 If changes to configuration files aren't detected:
 1. Restart cc-backend to reload all configuration
 2. Check that the filesystem supports file watching (network filesystems may not)
 3. Check logs for file watch setup messages
 ## Best Practices
 1. **Start Simple**: Begin with basic rules and refine based on results
 2. **Use Requirements**: Filter out irrelevant jobs early with requirements
 3. **Test Incrementally**: Add one rule at a time and verify behavior
 4. **Document Rules**: Use descriptive names and clear hint messages
 5. **Share Parameters**: Define common thresholds in `parameters.json` for consistency
 6. **Version Control**: Keep your `var/tagger/` configuration in version control
 7. **Backup Before Changes**: Test new rules on a copy before deploying to production
 ## Examples
 ### Simple Application Detection
 **File: `var/tagger/apps/python.txt`**
 ```
 python
 python3
 \.py
 ```
 This detects jobs running Python scripts.
 ### Complex Classification Rule
 **File: `var/tagger/jobclasses/cpuImbalance.json`**
 ```json
 {
  "name": "CPU Load Imbalance",
  "tag": "cpu_imbalance",
  "parameters": ["core_load_imbalance_threshold_factor"],
  "metrics": ["cpu_load"],
  "requirements": ["job.numCores > 1", "job.duration > 600"],
  "variables": [
    {
      "name": "load_variance",
      "expr": "(cpu_load.max - cpu_load.min) / cpu_load.avg"
    }
  ],
  "rule": "load_variance > core_load_imbalance_threshold_factor",
  "hint": "CPU load spread relative to its average is {{printf \"%.2f\" .load_variance}}"
 }
 ```
 This detects jobs where CPU load is unevenly distributed across cores.
 ## Reference
 ### Configuration Options
 **Main Configuration (`config.json`)**:
 - `enable-job-taggers` (boolean, default: `false`) - Enables automatic job tagging system
   - Must be set to `true` to activate automatic tagging on job start/stop events
   - Does not affect the `-apply-tags` command line option
 **Command Line Options**:
 - `-apply-tags` - Apply all tagging rules to existing jobs in the database
   - Works independently of the `enable-job-taggers` configuration option
   - Useful for retroactively tagging jobs or re-evaluating with updated rules
 ### Default Configuration Location
 The example configurations are provided in:
 - `configs/tagger/apps/` - Example application patterns (16 applications)
 - `configs/tagger/jobclasses/` - Example classification rules (3 rules)
 Copy these to `var/tagger/` and customize for your environment.
 ### Tag Types
 - `app` - Application tags (e.g., "vasp", "gromacs")
 - `jobClass` - Classification tags (e.g., "lowutilization", "highload")
 Tags can be queried and filtered in the ClusterCockpit UI and API.