Job Tagging Configuration
ClusterCockpit provides automatic job tagging functionality to classify and categorize jobs based on configurable rules. The tagging system consists of two main components:
- Application Detection - Identifies which application a job is running
- Job Classification - Analyzes job performance characteristics and applies classification tags
Directory Structure
configs/tagger/
├── apps/                 # Application detection patterns
│   ├── vasp.txt
│   ├── gromacs.txt
│   └── ...
└── jobclasses/           # Job classification rules
    ├── parameters.json
    ├── lowUtilization.json
    ├── highload.json
    └── ...
Activating Tagger Rules
Step 1: Copy Configuration Files
To activate tagging, review, adapt, and copy the configuration files from
configs/tagger/ to var/tagger/:
# From the cc-backend root directory
mkdir -p var/tagger
cp -r configs/tagger/apps var/tagger/
cp -r configs/tagger/jobclasses var/tagger/
Step 2: Enable Tagging in Configuration
Add or set the following configuration key in the main section of your config.json:
{
"enable-job-taggers": true
}
Important: Automatic tagging is disabled by default. You must explicitly
enable it by setting enable-job-taggers: true in the main configuration file.
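Before restarting, it can help to read the flag back from the configuration file. The following is a minimal sketch, not part of cc-backend: the helper name `taggers_enabled` is hypothetical, and it assumes the key sits either at the top level of the JSON document or inside a `"main"` object, which may differ between cc-backend versions.

```python
import json

def taggers_enabled(config_text: str) -> bool:
    """Return True if enable-job-taggers is the JSON value true.

    Assumption: the key lives at the top level or inside a "main" object.
    """
    config = json.loads(config_text)
    main = config.get("main")
    sections = [config] + ([main] if isinstance(main, dict) else [])
    # Only a real JSON boolean counts; the string "true" does not.
    return any(section.get("enable-job-taggers") is True for section in sections)

# Inline example; for a real check, pass the contents of your config.json.
print(taggers_enabled('{"main": {"enable-job-taggers": true}}'))
```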
Step 3: Restart cc-backend
The tagger system automatically loads configuration from ./var/tagger/ at
startup. After copying the files and enabling the feature, restart cc-backend:
./cc-backend -server
Step 4: Verify Configuration Loaded
Check the logs for messages indicating successful configuration loading:
[INFO] Setup file watch for ./var/tagger/apps
[INFO] Setup file watch for ./var/tagger/jobclasses
How Tagging Works
Automatic Tagging
When enable-job-taggers is set to true in the configuration, tags are
automatically applied when:
- Job Start: Application detection runs immediately when a job starts
- Job Stop: Job classification runs when a job completes
The system analyzes job metadata and metrics to determine appropriate tags.
Note: Automatic tagging only works for jobs that start or stop after the feature is enabled. Existing jobs are not automatically retagged.
Manual Tagging (Retroactive)
To apply tags to existing jobs in the database, use the -apply-tags command
line option:
./cc-backend -apply-tags
This processes all jobs in the database and applies current tagging rules. This is useful when:
- You have existing jobs that were created before tagging was enabled
- You've added new tagging rules and want to apply them to historical data
- You've modified existing rules and want to re-evaluate all jobs
Hot Reload
The tagger system watches the configuration directories for changes. You can
modify or add rules without restarting cc-backend:
- Changes to var/tagger/apps/* are detected automatically
- Changes to var/tagger/jobclasses/* are detected automatically
Application Detection
Application detection identifies which software a job is running by matching patterns in the job script.
Configuration Format
Application patterns are stored in text files under var/tagger/apps/. Each
file contains one or more regular expression patterns (one per line) that match
against the job script.
Example: apps/vasp.txt
vasp
VASP
How It Works
- When a job starts, the system retrieves the job script from metadata
- Each line in the app files is treated as a regex pattern
- Patterns are matched case-insensitively against the lowercased job script
- If a match is found, a tag of type app with the filename (without extension) is applied
- Only the first matching application is tagged
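The steps above can be sketched in a few lines of Python. This is an illustration only, not the cc-backend implementation: the function name `detect_app` and the sample gromacs patterns are hypothetical, Python's `re` module stands in for the regex engine cc-backend uses, and trying applications in sorted name order is an assumption about what "first match" means.

```python
import re
from typing import Dict, List, Optional

def detect_app(job_script: str, app_patterns: Dict[str, List[str]]) -> Optional[str]:
    """Return the name of the first application whose patterns match, else None.

    app_patterns maps an app name (file name without .txt) to its pattern lines.
    Assumption: apps are tried in sorted name order; the real order may differ.
    """
    script = job_script.lower()
    for app in sorted(app_patterns):
        for line in app_patterns[app]:
            pattern = line.strip()
            if not pattern:
                continue  # skip blank lines
            # Unanchored, case-insensitive search anywhere in the script.
            if re.search(pattern, script, re.IGNORECASE):
                return app  # only the first matching application is reported
    return None

# Illustrative patterns; the gromacs entries are made up for this example.
patterns = {"vasp": ["vasp", "VASP"], "gromacs": ["gmx", "gromacs"]}
script = "#!/bin/bash\nmodule load VASP\nsrun vasp_std\n"
print(detect_app(script, patterns))
```

Because the search is unanchored in this sketch, a short pattern such as `vasp` also matches inside longer words like `vasp_std`; more specific patterns reduce false positives.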
Adding New Applications
- Create a new file in var/tagger/apps/ (e.g., tensorflow.txt)
- Add regex patterns, one per line:
tensorflow
tf\.keras
import tensorflow
- The file is automatically detected and loaded
Note: The tag name will be the filename without the .txt extension (e.g., tensorflow).
Job Classification
Job classification analyzes completed jobs based on their metrics and properties to identify performance issues or characteristics.
Configuration Format
Job classification rules are defined in JSON files under
var/tagger/jobclasses/. Each rule file defines:
- Metrics required: Which job metrics to analyze
- Requirements: Pre-conditions that must be met
- Variables: Computed values used in the rule
- Rule expression: Boolean expression that determines if the rule matches
- Hint template: Message displayed when the rule matches
Parameters File
jobclasses/parameters.json defines shared threshold values used across multiple rules:
{
"lowcpuload_threshold_factor": 0.9,
"highmemoryusage_threshold_factor": 0.9,
"job_min_duration_seconds": 600.0,
"sampling_interval_seconds": 30.0
}
Rule File Structure
Example: jobclasses/lowUtilization.json
{
"name": "Low resource utilization",
"tag": "lowutilization",
"parameters": ["job_min_duration_seconds"],
"metrics": ["flops_any", "mem_bw"],
"requirements": [
"job.shared == \"none\"",
"job.duration > job_min_duration_seconds"
],
"variables": [
{
"name": "mem_bw_perc",
"expr": "1.0 - (mem_bw.avg / mem_bw.limits.peak)"
}
],
"rule": "flops_any.avg < flops_any.limits.alert",
"hint": "Average flop rate {{.flops_any.avg}} falls below threshold {{.flops_any.limits.alert}}"
}
Field Descriptions
| Field | Description |
|---|---|
| name | Human-readable description of the rule |
| tag | Tag identifier applied when the rule matches |
| parameters | List of parameter names from parameters.json to include in rule environment |
| metrics | List of metrics required for evaluation (must be present in job data) |
| requirements | Boolean expressions that must all be true for the rule to be evaluated |
| variables | Named expressions computed before evaluating the main rule |
| rule | Boolean expression that determines if the job matches this classification |
| hint | Go template string for generating a user-visible message |
Expression Environment
Expressions in requirements, variables, and rule have access to:
Job Properties:
- job.shared - Shared node allocation type
- job.duration - Job runtime in seconds
- job.numCores - Number of CPU cores
- job.numNodes - Number of nodes
- job.jobState - Job completion state
- job.numAcc - Number of accelerators
- job.smt - SMT setting
Metric Statistics (for each metric in metrics):
- <metric>.min - Minimum value
- <metric>.max - Maximum value
- <metric>.avg - Average value
- <metric>.limits.peak - Peak limit from cluster config
- <metric>.limits.normal - Normal threshold
- <metric>.limits.caution - Caution threshold
- <metric>.limits.alert - Alert threshold
Parameters:
- All parameters listed in the parameters field
Variables:
- All variables defined in the variables array
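To make the evaluation order concrete, here is a rough Python sketch of how such an environment could be assembled and used: requirements first, then variables, then the rule. It is not the cc-backend code. The function names are hypothetical, Python's `eval` is only a stand-in for the expr language (so it handles just the expressions that happen to be valid in both, not `&&` or `||`), treating a missing metric as "no match" is an assumption, and the metric values are invented.

```python
from types import SimpleNamespace

def to_namespace(value):
    """Recursively turn dicts into objects so that job.duration style access works."""
    if isinstance(value, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in value.items()})
    return value

def evaluate_rule(rule, job, metrics, parameters):
    """Return True if the rule matches the job."""
    if any(name not in metrics for name in rule["metrics"]):
        return False  # assumption: a missing required metric means no match
    env = {"job": to_namespace(job)}
    env.update({name: to_namespace(metrics[name]) for name in rule["metrics"]})
    env.update({name: parameters[name] for name in rule["parameters"]})

    def ev(expression):
        # Stand-in evaluator: Python's eval with builtins disabled, not expr.
        return eval(expression, {"__builtins__": {}}, env)

    if not all(ev(req) for req in rule["requirements"]):
        return False  # a requirement failed, so the rule itself is not evaluated
    for variable in rule["variables"]:
        env[variable["name"]] = ev(variable["expr"])  # computed before the rule
    return bool(ev(rule["rule"]))

# The lowUtilization rule from above, with invented job and metric values.
rule = {
    "parameters": ["job_min_duration_seconds"],
    "metrics": ["flops_any", "mem_bw"],
    "requirements": ['job.shared == "none"', "job.duration > job_min_duration_seconds"],
    "variables": [{"name": "mem_bw_perc", "expr": "1.0 - (mem_bw.avg / mem_bw.limits.peak)"}],
    "rule": "flops_any.avg < flops_any.limits.alert",
}
job = {"shared": "none", "duration": 7200}
metrics = {
    "flops_any": {"avg": 2.0, "limits": {"alert": 5.0}},
    "mem_bw": {"avg": 10.0, "limits": {"peak": 40.0}},
}
print(evaluate_rule(rule, job, metrics, {"job_min_duration_seconds": 600.0}))
```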
Expression Language
Rules use the expr language for expressions. Supported operations:
- Arithmetic: +, -, *, /, %, ^
- Comparison: ==, !=, <, <=, >, >=
- Logical: &&, ||, !
- Functions: Standard math functions (see expr documentation)
Hint Templates
Hints use Go's text/template syntax. Variables from the evaluation environment are accessible:
{{.flops_any.avg}} # Access metric average
{{.job.duration}} # Access job property
{{.my_variable}} # Access computed variable
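The dotted paths in a hint are resolved against the same environment as the rule. The Python sketch below mimics only that lookup for plain `{{.a.b}}` placeholders. It is a hypothetical stand-in, not Go's text/template: it supports no functions or pipelines, and number formatting can differ from what Go prints.

```python
import re

def render_hint(template: str, env: dict) -> str:
    """Replace each {{.a.b.c}} placeholder with the value found by walking env."""
    def lookup(match):
        value = env
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError if the path does not exist
        return str(value)
    return re.sub(r"\{\{\s*\.([A-Za-z_][\w.]*)\s*\}\}", lookup, template)

env = {"flops_any": {"avg": 2.5, "limits": {"alert": 5.0}}}
hint = "Average flop rate {{.flops_any.avg}} falls below threshold {{.flops_any.limits.alert}}"
print(render_hint(hint, env))
```

A placeholder that names a variable or metric missing from the environment fails in this sketch, which is a useful reminder to list every metric a hint uses in the rule's metrics field.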
Adding New Classification Rules
- Create a new JSON file in var/tagger/jobclasses/ (e.g., memoryLeak.json)
- Define the rule structure:
{
"name": "Memory Leak Detection",
"tag": "memory_leak",
"parameters": ["memory_leak_slope_threshold"],
"metrics": ["mem_used"],
"requirements": ["job.duration > 3600"],
"variables": [
{
"name": "mem_growth",
"expr": "(mem_used.max - mem_used.min) / job.duration"
}
],
"rule": "mem_growth > memory_leak_slope_threshold",
"hint": "Memory usage grew by {{.mem_growth}} per second"
}
- Add any new parameters to parameters.json
- The file is automatically detected and loaded
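A new rule file is easy to get subtly wrong, for example by referencing a parameter that was never added to parameters.json. The sketch below shows one way to catch that before deploying. The helper `check_rule` is hypothetical, and treating all eight fields from the table above as mandatory is an assumption; cc-backend may accept rules that omit some of them.

```python
import json

# Assumption: every field from the field description table is required.
REQUIRED_FIELDS = ["name", "tag", "parameters", "metrics",
                   "requirements", "variables", "rule", "hint"]

def check_rule(rule_text: str, parameters_text: str) -> list:
    """Return a list of problems found in a rule file; empty means none found."""
    rule = json.loads(rule_text)
    known_parameters = json.loads(parameters_text)
    problems = [f"missing field: {field}" for field in REQUIRED_FIELDS if field not in rule]
    for name in rule.get("parameters", []):
        if name not in known_parameters:
            problems.append(f"parameter not in parameters.json: {name}")
    return problems

rule_text = """
{
  "name": "Memory Leak Detection",
  "tag": "memory_leak",
  "parameters": ["memory_leak_slope_threshold"],
  "metrics": ["mem_used"],
  "requirements": ["job.duration > 3600"],
  "variables": [{"name": "mem_growth", "expr": "(mem_used.max - mem_used.min) / job.duration"}],
  "rule": "mem_growth > memory_leak_slope_threshold",
  "hint": "Memory usage grew by {{.mem_growth}} per second"
}
"""
parameters_text = '{"job_min_duration_seconds": 600.0}'
print(check_rule(rule_text, parameters_text))
```

In this example the check reports the parameter that still has to be added to parameters.json.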
Configuration Paths
The tagger system reads from these paths (relative to cc-backend working directory):
- Application patterns: ./var/tagger/apps/
- Job classification rules: ./var/tagger/jobclasses/
These paths are defined as constants in the source code and cannot be changed without recompiling.
Troubleshooting
Tags Not Applied
- Check tagging is enabled: Verify enable-job-taggers: true is set in config.json
- Check configuration exists:
ls -la var/tagger/apps
ls -la var/tagger/jobclasses
- Check logs for errors:
./cc-backend -server -loglevel debug
- Verify file permissions: Ensure cc-backend can read the configuration files
- For existing jobs: Use ./cc-backend -apply-tags to retroactively tag jobs
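The existence and permission checks can also be scripted. The following sketch is a hypothetical helper, not a cc-backend tool. It reports readability for the user who runs the script, so it should be run as the same user as cc-backend and from the cc-backend working directory.

```python
import os

def check_tagger_dirs(base="var/tagger"):
    """Report, per tagger directory, whether it exists and which files are unreadable."""
    report = {}
    for sub in ("apps", "jobclasses"):
        path = os.path.join(base, sub)
        if not os.path.isdir(path):
            report[sub] = "missing"
            continue
        unreadable = [name for name in sorted(os.listdir(path))
                      if not os.access(os.path.join(path, name), os.R_OK)]
        report[sub] = ("unreadable: " + ", ".join(unreadable)) if unreadable else "ok"
    return report

print(check_tagger_dirs())
```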
Rules Not Matching
- Enable debug logging: Set loglevel: debug to see detailed rule evaluation
- Check requirements: Ensure all requirements in the rule are satisfied
- Verify metrics exist: Classification rules require job metrics to be available
- Check metric names: Ensure metric names match those in your cluster configuration
File Watch Not Working
If changes to configuration files aren't detected:
- Restart cc-backend to reload all configuration
- Check filesystem supports file watching (network filesystems may not)
- Check logs for file watch setup messages
Best Practices
- Start Simple: Begin with basic rules and refine based on results
- Use Requirements: Filter out irrelevant jobs early with requirements
- Test Incrementally: Add one rule at a time and verify behavior
- Document Rules: Use descriptive names and clear hint messages
- Share Parameters: Define common thresholds in parameters.json for consistency
- Version Control: Keep your var/tagger/ configuration in version control
- Backup Before Changes: Test new rules on a copy before deploying to production
Examples
Simple Application Detection
File: var/tagger/apps/python.txt
python
python3
\.py
This detects jobs running Python scripts.
Complex Classification Rule
File: var/tagger/jobclasses/cpuImbalance.json
{
"name": "CPU Load Imbalance",
"tag": "cpu_imbalance",
"parameters": ["core_load_imbalance_threshold_factor"],
"metrics": ["cpu_load"],
"requirements": ["job.numCores > 1", "job.duration > 600"],
"variables": [
{
"name": "load_variance",
"expr": "(cpu_load.max - cpu_load.min) / cpu_load.avg"
}
],
"rule": "load_variance > core_load_imbalance_threshold_factor",
"hint": "CPU load spread (max - min) is {{printf \"%.2f\" .load_variance}} times the average load"
}
This detects jobs where CPU load is unevenly distributed across cores.
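A short worked example shows how the variable and the threshold interact. The numbers are invented, and 0.5 is only a hypothetical value for core_load_imbalance_threshold_factor, which would have to be defined in parameters.json.

```python
def load_variance(cpu_min: float, cpu_max: float, cpu_avg: float) -> float:
    """Spread of the CPU load relative to its average: (max - min) / avg."""
    return (cpu_max - cpu_min) / cpu_avg

threshold = 0.5  # hypothetical value for core_load_imbalance_threshold_factor
variance = load_variance(cpu_min=2.0, cpu_max=8.0, cpu_avg=5.0)
# (8.0 - 2.0) / 5.0 = 1.2, which exceeds the threshold, so the rule would match.
print(variance, variance > threshold)
```

Note that the expression divides by cpu_load.avg, so a requirement that excludes jobs with zero average load may be worth adding.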
Reference
Configuration Options
Main Configuration (config.json):
- enable-job-taggers (boolean, default: false) - Enables automatic job tagging system
  - Must be set to true to activate automatic tagging on job start/stop events
  - Does not affect the -apply-tags command line option
Command Line Options:
- -apply-tags - Apply all tagging rules to existing jobs in the database
  - Works independently of enable-job-taggers configuration
  - Useful for retroactively tagging jobs or re-evaluating with updated rules
Default Configuration Location
The example configurations are provided in:
- configs/tagger/apps/ - Example application patterns (16 applications)
- configs/tagger/jobclasses/ - Example classification rules (3 rules)
Copy these to var/tagger/ and customize for your environment.
Tag Types
- app - Application tags (e.g., "vasp", "gromacs")
- jobClass - Classification tags (e.g., "lowutilization", "highload")
Tags can be queried and filtered in the ClusterCockpit UI and API.