mirror of https://github.com/ClusterCockpit/cc-backend synced 2026-01-15 17:21:46 +01:00

Jan Eitzinger a9366d14c6 Add README for tagging. Enable tagging by flag without configuration option

2026-01-13 08:32:32 +01:00

Job Tagging Configuration

ClusterCockpit can automatically tag jobs according to configurable rules. The tagging system consists of two main components:

Application Detection - Identifies which application a job is running
Job Classification - Analyzes job performance characteristics and applies classification tags

Directory Structure

configs/tagger/
├── apps/              # Application detection patterns
│   ├── vasp.txt
│   ├── gromacs.txt
│   └── ...
└── jobclasses/        # Job classification rules
    ├── parameters.json
    ├── lowUtilization.json
    ├── highload.json
    └── ...

Activating Tagger Rules

Step 1: Copy Configuration Files

To activate tagging, review, adapt, and copy the configuration files from configs/tagger/ to var/tagger/:

# From the cc-backend root directory
mkdir -p var/tagger
cp -r configs/tagger/apps var/tagger/
cp -r configs/tagger/jobclasses var/tagger/

Step 2: Enable Tagging in Configuration

Add or set the following configuration key inside the main section of your config.json (the snippet shows only the key itself, not the surrounding section):

{
  "enable-job-taggers": true
}

Important: Automatic tagging is disabled by default. You must explicitly enable it by setting enable-job-taggers: true in the main configuration file.

Step 3: Restart cc-backend

The tagger system automatically loads configuration from ./var/tagger/ at startup. After copying the files and enabling the feature, restart cc-backend:

./cc-backend -server

Step 4: Verify Configuration Loaded

Check the logs for messages indicating successful configuration loading:

[INFO] Setup file watch for ./var/tagger/apps
[INFO] Setup file watch for ./var/tagger/jobclasses

How Tagging Works

Automatic Tagging

When enable-job-taggers is set to true in the configuration, tags are applied automatically at two points in a job's lifecycle:

Job Start: Application detection runs immediately when a job starts
Job Stop: Job classification runs when a job completes

The system analyzes job metadata and metrics to determine appropriate tags.

Note: Automatic tagging only works for jobs that start or stop after the feature is enabled. Existing jobs are not automatically retagged.

Manual Tagging (Retroactive)

To apply tags to existing jobs in the database, use the -apply-tags command line option:

./cc-backend -apply-tags

This processes all jobs in the database and applies current tagging rules. This is useful when:

You have existing jobs that were created before tagging was enabled
You've added new tagging rules and want to apply them to historical data
You've modified existing rules and want to re-evaluate all jobs

Hot Reload

The tagger system watches the configuration directories for changes. You can modify or add rules without restarting cc-backend:

Changes to var/tagger/apps/* are detected automatically
Changes to var/tagger/jobclasses/* are detected automatically

Application Detection

Application detection identifies which software a job is running by matching patterns in the job script.

Configuration Format

Application patterns are stored in text files under var/tagger/apps/. Each file contains one or more regular expression patterns (one per line) that match against the job script.

Example: apps/vasp.txt

vasp
VASP

How It Works

When a job starts, the system retrieves the job script from metadata
Each line in the app files is treated as a regex pattern
The job script is lowercased and patterns are matched against it case-insensitively, so the letter case of a pattern does not matter
If a match is found, a tag of type app with the filename (without extension) is applied
Only the first matching application is tagged
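
The steps above can be sketched in a few lines of Python. This illustrates the matching logic as described here and is not cc-backend's implementation (which is written in Go). The function `detect_app`, the in-memory `APP_PATTERNS` table, and the gromacs patterns are hypothetical, and the order in which app files are tried is an assumption.

```python
import re
from typing import Optional

# Hypothetical stand-in for the files in var/tagger/apps/:
# key = filename without ".txt" (the tag name), value = the lines of that file.
APP_PATTERNS = {
    "vasp": ["vasp", "VASP"],
    "gromacs": ["gromacs", "gmx_mpi"],  # assumed patterns, for illustration only
}

def detect_app(job_script: str, app_patterns: dict[str, list[str]]) -> Optional[str]:
    """Return the tag name of the first application with a matching pattern."""
    script = job_script.lower()
    for tag, patterns in app_patterns.items():
        for pattern in patterns:
            # Each line is a regex, searched anywhere in the lowercased script.
            if re.search(pattern, script, flags=re.IGNORECASE):
                return tag  # only the first matching application is tagged
    return None

script = "#!/bin/bash\nmodule load VASP/6.4\nsrun vasp_std\n"
print(detect_app(script, APP_PATTERNS))  # prints: vasp
```

Because matching stops at the first hit, a script that mentions two applications receives only one app tag in this sketch.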

Adding New Applications

Create a new file in var/tagger/apps/ (e.g., tensorflow.txt)
Add regex patterns, one per line:
```
tensorflow
tf\.keras
import tensorflow
```
The file is automatically detected and loaded

Note: The tag name will be the filename without the .txt extension (e.g., tensorflow).
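
Before dropping a new file into var/tagger/apps/, it can help to check that every line compiles as a regular expression. The sketch below is a hypothetical helper (`load_app_patterns` is not part of cc-backend) that also shows how the tag name follows from the filename. It assumes blank lines are ignored, and it uses Python's `re` module, whose syntax differs in places from Go's regexp package, so treat the check as approximate.

```python
import re
import tempfile
from pathlib import Path

def load_app_patterns(directory: Path) -> dict[str, list[str]]:
    """Map tag name (filename without .txt) to the patterns in that file."""
    apps = {}
    for path in sorted(directory.glob("*.txt")):
        lines = [line.strip() for line in path.read_text().splitlines()]
        patterns = [line for line in lines if line]  # assumption: blank lines are skipped
        for pattern in patterns:
            re.compile(pattern)  # raises re.error on a malformed pattern
        apps[path.stem] = patterns
    return apps

# Demo in a throwaway directory so nothing under var/tagger/ is touched.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "tensorflow.txt").write_text("tensorflow\ntf\\.keras\nimport tensorflow\n")
    loaded = load_app_patterns(Path(tmp))

print(loaded)
```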

Job Classification

Job classification analyzes completed jobs based on their metrics and properties to identify performance issues or characteristics.

Configuration Format

Job classification rules are defined in JSON files under var/tagger/jobclasses/. Each rule file defines:

Metrics required: Which job metrics to analyze
Requirements: Pre-conditions that must be met
Variables: Computed values used in the rule
Rule expression: Boolean expression that determines if the rule matches
Hint template: Message displayed when the rule matches

Parameters File

jobclasses/parameters.json defines shared threshold values used across multiple rules:

{
  "lowcpuload_threshold_factor": 0.9,
  "highmemoryusage_threshold_factor": 0.9,
  "job_min_duration_seconds": 600.0,
  "sampling_interval_seconds": 30.0
}
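
Each rule names the parameters it needs, and only those are made available to its expressions. A minimal Python sketch of that selection step follows. The function `select_parameters` is hypothetical, and raising an error for an unknown name is an assumption about sensible behavior, not a statement about what cc-backend does.

```python
import json

# Contents of jobclasses/parameters.json as shown above.
PARAMETERS = json.loads("""
{
  "lowcpuload_threshold_factor": 0.9,
  "highmemoryusage_threshold_factor": 0.9,
  "job_min_duration_seconds": 600.0,
  "sampling_interval_seconds": 30.0
}
""")

def select_parameters(requested: list[str], available: dict) -> dict:
    """Return only the parameters a rule asks for; fail loudly on unknown names."""
    unknown = [name for name in requested if name not in available]
    if unknown:
        raise KeyError(f"not defined in parameters.json: {unknown}")
    return {name: available[name] for name in requested}

print(select_parameters(["job_min_duration_seconds"], PARAMETERS))
```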

Rule File Structure

Example: jobclasses/lowUtilization.json

{
  "name": "Low resource utilization",
  "tag": "lowutilization",
  "parameters": ["job_min_duration_seconds"],
  "metrics": ["flops_any", "mem_bw"],
  "requirements": [
    "job.shared == \"none\"",
    "job.duration > job_min_duration_seconds"
  ],
  "variables": [
    {
      "name": "mem_bw_perc",
      "expr": "1.0 - (mem_bw.avg / mem_bw.limits.peak)"
    }
  ],
  "rule": "flops_any.avg < flops_any.limits.alert",
  "hint": "Average flop rate {{.flops_any.avg}} falls below threshold {{.flops_any.limits.alert}}"
}

Field Descriptions

| Field | Description |
|-------|-------------|
| `name` | Human-readable description of the rule |
| `tag` | Tag identifier applied when the rule matches |
| `parameters` | List of parameter names from `parameters.json` to include in rule environment |
| `metrics` | List of metrics required for evaluation (must be present in job data) |
| `requirements` | Boolean expressions that must all be true for the rule to be evaluated |
| `variables` | Named expressions computed before evaluating the main rule |
| `rule` | Boolean expression that determines if the job matches this classification |
| `hint` | Go template string for generating a user-visible message |
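
A quick structural check of a rule file against the fields described above might look like the following sketch. The helper `check_rule` is hypothetical, and it treats every field as required, which is an assumption: this document does not say whether cc-backend accepts rules that omit some of them.

```python
# Expected JSON type for each field: (Python type, JSON type name).
EXPECTED_FIELDS = {
    "name": (str, "string"),
    "tag": (str, "string"),
    "parameters": (list, "array"),
    "metrics": (list, "array"),
    "requirements": (list, "array"),
    "variables": (list, "array"),
    "rule": (str, "string"),
    "hint": (str, "string"),
}

def check_rule(rule: dict) -> list[str]:
    """Return a list of structural problems; an empty list means none were found."""
    problems = []
    for field, (py_type, json_name) in EXPECTED_FIELDS.items():
        if field not in rule:
            problems.append(f"missing field: {field}")
        elif not isinstance(rule[field], py_type):
            problems.append(f"{field} should be a JSON {json_name}")
    variables = rule.get("variables")
    if isinstance(variables, list):
        for index, var in enumerate(variables):
            if not (isinstance(var, dict) and {"name", "expr"} <= var.keys()):
                problems.append(f"variables[{index}] needs 'name' and 'expr'")
    return problems

# The lowUtilization example from above, written as a Python dict.
low_utilization = {
    "name": "Low resource utilization",
    "tag": "lowutilization",
    "parameters": ["job_min_duration_seconds"],
    "metrics": ["flops_any", "mem_bw"],
    "requirements": ['job.shared == "none"', "job.duration > job_min_duration_seconds"],
    "variables": [{"name": "mem_bw_perc", "expr": "1.0 - (mem_bw.avg / mem_bw.limits.peak)"}],
    "rule": "flops_any.avg < flops_any.limits.alert",
    "hint": "Average flop rate {{.flops_any.avg}} falls below threshold {{.flops_any.limits.alert}}",
}

print(check_rule(low_utilization))  # prints: []
```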

Expression Environment

Expressions in requirements, variables, and rule have access to:

Job Properties:

job.shared - Shared node allocation type
job.duration - Job runtime in seconds
job.numCores - Number of CPU cores
job.numNodes - Number of nodes
job.jobState - Job completion state
job.numAcc - Number of accelerators
job.smt - SMT setting

Metric Statistics (for each metric in metrics):

<metric>.min - Minimum value
<metric>.max - Maximum value
<metric>.avg - Average value
<metric>.limits.peak - Peak limit from cluster config
<metric>.limits.normal - Normal threshold
<metric>.limits.caution - Caution threshold
<metric>.limits.alert - Alert threshold

Parameters:

All parameters listed in the parameters field

Variables:

All variables defined in the variables array
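
To make the environment concrete, the sketch below evaluates the lowUtilization example by hand in Python rather than with the expr language. All numbers are invented for illustration, the `jobState` and `smt` values are assumptions, and the order shown (requirements, then variables, then rule) is one plausible reading of the field descriptions, not a confirmed description of cc-backend internals.

```python
from types import SimpleNamespace as NS

# Hypothetical values for one finished job; none come from a real cluster.
job = NS(shared="none", duration=7200, numCores=72, numNodes=1,
         jobState="completed", numAcc=0, smt=1)

# Metric statistics, one object per entry in the rule's "metrics" list.
flops_any = NS(min=0.5, max=12.0, avg=3.2,
               limits=NS(peak=100.0, normal=50.0, caution=20.0, alert=10.0))
mem_bw = NS(min=1.0, max=40.0, avg=22.0,
            limits=NS(peak=200.0, normal=100.0, caution=40.0, alert=20.0))

# Parameter pulled in through the rule's "parameters" list.
job_min_duration_seconds = 600.0

# 1. Requirements: all must hold, otherwise the rule is not evaluated.
requirements_met = job.shared == "none" and job.duration > job_min_duration_seconds

# 2. Variables: named expressions computed before the main rule.
mem_bw_perc = 1.0 - (mem_bw.avg / mem_bw.limits.peak)

# 3. Rule: decides whether the tag "lowutilization" is applied.
matches = requirements_met and flops_any.avg < flops_any.limits.alert

print(requirements_met, round(mem_bw_perc, 2), matches)  # prints: True 0.89 True
```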

Expression Language

Rules use the expr language for expressions. Supported operations:

Arithmetic: +, -, *, /, %, ^
Comparison: ==, !=, <, <=, >, >=
Logical: &&, ||, !
Functions: Standard math functions (see expr documentation)

Hint Templates

Hints use Go's text/template syntax. Variables from the evaluation environment are accessible:

{{.flops_any.avg}}          # Access metric average
{{.job.duration}}            # Access job property
{{.my_variable}}             # Access computed variable
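
The dotted paths in a hint resolve against the same environment the expressions see. The Python sketch below mimics that lookup for plain field references only. It is a simplified stand-in for Go's text/template, not a reimplementation: template functions such as printf are not handled, and number formatting differs from Go's. The function `render_hint` and the sample values are hypothetical.

```python
import re

def render_hint(template: str, env: dict) -> str:
    """Replace each {{.a.b.c}} with the value found by walking nested dicts."""
    def lookup(match: re.Match) -> str:
        value = env
        for key in match.group(1).split("."):
            value = value[key]  # KeyError here means the path is not in the environment
        return str(value)
    return re.sub(r"\{\{\s*\.([A-Za-z_][\w.]*)\s*\}\}", lookup, template)

env = {"flops_any": {"avg": 3.2, "limits": {"alert": 10.0}}}
hint = "Average flop rate {{.flops_any.avg}} falls below threshold {{.flops_any.limits.alert}}"

print(render_hint(hint, env))
# prints: Average flop rate 3.2 falls below threshold 10.0
```

Note the leading dot in every reference: a name without it is treated by Go templates as a function call, not as a lookup in the environment.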

Adding New Classification Rules

Create a new JSON file in var/tagger/jobclasses/ (e.g., memoryLeak.json)

Define the rule structure:

{
  "name": "Memory Leak Detection",
  "tag": "memory_leak",
  "parameters": ["memory_leak_slope_threshold"],
  "metrics": ["mem_used"],
  "requirements": ["job.duration > 3600"],
  "variables": [
    {
      "name": "mem_growth",
      "expr": "(mem_used.max - mem_used.min) / job.duration"
    }
  ],
  "rule": "mem_growth > memory_leak_slope_threshold",
  "hint": "Memory usage grew by {{.mem_growth}} per second"
}

Add any new parameters to parameters.json
The file is automatically detected and loaded
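
A worked example shows what the memoryLeak rule computes. The job values, the unit (GB), and the threshold are all hypothetical; `memory_leak_slope_threshold` would have to be added to parameters.json with a value that suits your cluster.

```python
# Hypothetical job: memory in GB, duration in seconds.
mem_used_min = 4.0
mem_used_max = 58.0
job_duration = 7200
memory_leak_slope_threshold = 0.005  # assumed value, GB per second

# "requirements": ["job.duration > 3600"]
requirement_met = job_duration > 3600

# "expr": "(mem_used.max - mem_used.min) / job.duration"
mem_growth = (mem_used_max - mem_used_min) / job_duration

# "rule": "mem_growth > memory_leak_slope_threshold"
tagged = requirement_met and mem_growth > memory_leak_slope_threshold

print(f"mem_growth = {mem_growth:.4f} GB/s, tagged = {tagged}")
# prints: mem_growth = 0.0075 GB/s, tagged = True
```

The expression measures the range of memory usage divided by the runtime, so a job that peaks early and then frees memory yields the same value as one that grows steadily. Keep that in mind when choosing the threshold.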

Configuration Paths

The tagger system reads from these paths (relative to cc-backend working directory):

Application patterns: ./var/tagger/apps/
Job classification rules: ./var/tagger/jobclasses/

These paths are defined as constants in the source code and cannot be changed without recompiling.

Troubleshooting

Tags Not Applied

Check tagging is enabled: Verify enable-job-taggers: true is set in config.json

Check configuration exists:

ls -la var/tagger/apps
ls -la var/tagger/jobclasses

Check logs for errors:
```
./cc-backend -server -loglevel debug
```
Verify file permissions: Ensure cc-backend can read the configuration files
For existing jobs: Use ./cc-backend -apply-tags to retroactively tag jobs
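
The file checks above can be automated. The following Python sketch is a hypothetical helper, not a cc-backend tool: it verifies that both directories exist, that each file is readable by the current user, and that every .json file parses. It does not validate rule semantics or regex syntax.

```python
import json
import os
import tempfile
from pathlib import Path

def check_tagger_config(root: Path) -> list[str]:
    """Return human-readable problems; an empty list means nothing obvious is wrong."""
    problems = []
    for sub in ("apps", "jobclasses"):
        directory = root / sub
        if not directory.is_dir():
            problems.append(f"missing directory: {directory}")
            continue
        for path in sorted(directory.iterdir()):
            if not path.is_file():
                continue
            if not os.access(path, os.R_OK):
                problems.append(f"not readable: {path}")
            elif path.suffix == ".json":
                try:
                    json.loads(path.read_text())
                except json.JSONDecodeError as err:
                    problems.append(f"invalid JSON in {path}: {err}")
    return problems

# Demo on a throwaway tree; use Path("var/tagger") to check a real installation.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "apps").mkdir()
    (root / "apps" / "vasp.txt").write_text("vasp\n")
    (root / "jobclasses").mkdir()
    (root / "jobclasses" / "broken.json").write_text('{"name": ')
    demo_problems = check_tagger_config(root)

for problem in demo_problems:
    print(problem)
```

Run it as the same user that runs cc-backend, since readability depends on who is asking.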

Rules Not Matching

Enable debug logging: Run cc-backend with -loglevel debug to see detailed rule evaluation
Check requirements: Ensure all requirements in the rule are satisfied
Verify metrics exist: Classification rules require job metrics to be available
Check metric names: Ensure metric names match those in your cluster configuration

File Watch Not Working

If changes to configuration files aren't detected:

Restart cc-backend to reload all configuration
Check that the filesystem supports file watching (network filesystems may not)
Check logs for file watch setup messages

Best Practices

Start Simple: Begin with basic rules and refine based on results
Use Requirements: Filter out irrelevant jobs early with requirements
Test Incrementally: Add one rule at a time and verify behavior
Document Rules: Use descriptive names and clear hint messages
Share Parameters: Define common thresholds in parameters.json for consistency
Version Control: Keep your var/tagger/ configuration in version control
Backup Before Changes: Test new rules on a copy before deploying to production

Examples

Simple Application Detection

File: var/tagger/apps/python.txt

python
python3
\.py

This detects jobs running Python scripts.
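
It is worth checking which scripts such patterns actually catch. The sketch below uses Python's `re` as an approximation of the matching described under Application Detection and assumes an unanchored search. The function `first_matching_pattern` and the sample job scripts are hypothetical.

```python
import re
from typing import Optional

PATTERNS = ["python", "python3", r"\.py"]  # the lines of python.txt above

def first_matching_pattern(job_script: str) -> Optional[str]:
    """Return the first pattern found anywhere in the lowercased script."""
    script = job_script.lower()
    for pattern in PATTERNS:
        if re.search(pattern, script):
            return pattern
    return None

samples = [
    "srun python3 train.py",        # caught by "python" before "python3" is tried
    "srun ./postprocess.py --all",  # caught by "\.py"
    "tar xf mesh.pyramid.tar",      # also caught by "\.py", although no Python runs
    "srun ./a.out",                 # no match
]
for sample in samples:
    print(f"{sample!r} -> {first_matching_pattern(sample)!r}")
```

Under these assumptions the python3 line never contributes a match of its own, and the `\.py` pattern can tag jobs that merely mention a name containing ".py". A tighter pattern such as `\.py\b` reduces such false positives in this sketch.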

Complex Classification Rule

File: var/tagger/jobclasses/cpuImbalance.json

{
  "name": "CPU Load Imbalance",
  "tag": "cpu_imbalance",
  "parameters": ["core_load_imbalance_threshold_factor"],
  "metrics": ["cpu_load"],
  "requirements": ["job.numCores > 1", "job.duration > 600"],
  "variables": [
    {
      "name": "load_variance",
      "expr": "(cpu_load.max - cpu_load.min) / cpu_load.avg"
    }
  ],
  "rule": "load_variance > core_load_imbalance_threshold_factor",
  "hint": "CPU load spread (max - min) across cores is {{printf \"%.2f\" .load_variance}} times the average load"
}

This detects jobs where CPU load is unevenly distributed across cores.

Reference

Configuration Options

Main Configuration (config.json):

enable-job-taggers (boolean, default: false) - Enables automatic job tagging system
- Must be set to true to activate automatic tagging on job start/stop events
- Does not affect the -apply-tags command line option

Command Line Options:

-apply-tags - Apply all tagging rules to existing jobs in the database
- Works independently of enable-job-taggers configuration
- Useful for retroactively tagging jobs or re-evaluating with updated rules

Default Configuration Location

The example configurations are provided in:

configs/tagger/apps/ - Example application patterns (16 applications)
configs/tagger/jobclasses/ - Example classification rules (3 rules)

Copy these to var/tagger/ and customize for your environment.

Tag Types

app - Application tags (e.g., "vasp", "gromacs")
jobClass - Classification tags (e.g., "lowutilization", "highload")

Tags can be queried and filtered in the ClusterCockpit UI and API.
