Mirror of https://github.com/ClusterCockpit/cc-backend, synced 2026-01-15 17:21:46 +01:00
Add README for tagging. Enable tagging by flag without configuration option
@@ -302,6 +302,8 @@ func initSubsystems() error {

 	// Apply tags if requested
 	if flagApplyTags {
+		tagger.Init()
+
 		if err := tagger.RunTaggers(); err != nil {
 			return fmt.Errorf("running job taggers: %w", err)
 		}
@@ -0,0 +1,419 @@
# Job Tagging Configuration

ClusterCockpit provides automatic job tagging functionality to classify and
categorize jobs based on configurable rules. The tagging system consists of two
main components:

1. **Application Detection** - Identifies which application a job is running
2. **Job Classification** - Analyzes job performance characteristics and applies classification tags

## Directory Structure

```
configs/tagger/
├── apps/                 # Application detection patterns
│   ├── vasp.txt
│   ├── gromacs.txt
│   └── ...
└── jobclasses/           # Job classification rules
    ├── parameters.json
    ├── lowUtilization.json
    ├── highload.json
    └── ...
```

## Activating Tagger Rules

### Step 1: Copy Configuration Files

To activate tagging, review, adapt, and copy the configuration files from
`configs/tagger/` to `var/tagger/`:

```bash
# From the cc-backend root directory
mkdir -p var/tagger
cp -r configs/tagger/apps var/tagger/
cp -r configs/tagger/jobclasses var/tagger/
```

### Step 2: Enable Tagging in Configuration

Add or set the following configuration key in the `main` section of your `config.json`:

```json
{
  "enable-job-taggers": true
}
```

**Important**: Automatic tagging is disabled by default. You must explicitly
enable it by setting `enable-job-taggers: true` in the main configuration file.

### Step 3: Restart cc-backend

The tagger system automatically loads configuration from `./var/tagger/` at
startup. After copying the files and enabling the feature, restart cc-backend:

```bash
./cc-backend -server
```

### Step 4: Verify Configuration Loaded

Check the logs for messages indicating successful configuration loading:

```
[INFO] Setup file watch for ./var/tagger/apps
[INFO] Setup file watch for ./var/tagger/jobclasses
```

## How Tagging Works

### Automatic Tagging

When `enable-job-taggers` is set to `true` in the configuration, tags are
automatically applied when:

- **Job Start**: Application detection runs immediately when a job starts
- **Job Stop**: Job classification runs when a job completes

The system analyzes job metadata and metrics to determine appropriate tags.

**Note**: Automatic tagging only works for jobs that start or stop after the
feature is enabled. Existing jobs are not automatically retagged.

### Manual Tagging (Retroactive)

To apply tags to existing jobs in the database, use the `-apply-tags` command
line option:

```bash
./cc-backend -apply-tags
```

This processes all jobs in the database and applies current tagging rules. This
is useful when:

- You have existing jobs that were created before tagging was enabled
- You've added new tagging rules and want to apply them to historical data
- You've modified existing rules and want to re-evaluate all jobs

### Hot Reload

The tagger system watches the configuration directories for changes. You can
modify or add rules without restarting `cc-backend`:

- Changes to `var/tagger/apps/*` are detected automatically
- Changes to `var/tagger/jobclasses/*` are detected automatically

## Application Detection

Application detection identifies which software a job is running by matching
patterns in the job script.

### Configuration Format

Application patterns are stored in text files under `var/tagger/apps/`. Each
file contains one or more regular expression patterns (one per line) that match
against the job script.

**Example: `apps/vasp.txt`**

```
vasp
VASP
```

### How It Works

1. When a job starts, the system retrieves the job script from metadata
2. Each line in the app files is treated as a regex pattern
3. Patterns are matched case-insensitively against the lowercased job script
4. If a match is found, a tag of type `app` with the filename (without extension) is applied
5. Only the first matching application is tagged

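The matching steps can be sketched in Go as follows. This is an illustration under stated assumptions, not the cc-backend implementation: the type `appPatterns` and the functions `compileApp` and `detectApp` are hypothetical, case-insensitivity is modelled with the `(?i)` regex flag, and the `gromacs` patterns in `main` are made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// appPatterns holds the compiled patterns of one application file.
// The tag is the file name without its extension (hypothetical structure).
type appPatterns struct {
	tag      string
	patterns []*regexp.Regexp
}

// compileApp compiles one case-insensitive pattern per non-empty line.
func compileApp(tag, fileContent string) (appPatterns, error) {
	app := appPatterns{tag: tag}
	for _, line := range strings.Split(fileContent, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		re, err := regexp.Compile("(?i)" + line)
		if err != nil {
			return app, fmt.Errorf("compiling pattern %q for %s: %w", line, tag, err)
		}
		app.patterns = append(app.patterns, re)
	}
	return app, nil
}

// detectApp lowercases the job script and returns the tag of the first
// application with a matching pattern; later applications are not checked.
func detectApp(apps []appPatterns, jobScript string) (string, bool) {
	script := strings.ToLower(jobScript)
	for _, app := range apps {
		for _, re := range app.patterns {
			if re.MatchString(script) {
				return app.tag, true
			}
		}
	}
	return "", false
}

func main() {
	vasp, _ := compileApp("vasp", "vasp\nVASP\n")
	gromacs, _ := compileApp("gromacs", "gmx\ngromacs\n") // assumed patterns
	tag, ok := detectApp([]appPatterns{vasp, gromacs}, "#!/bin/bash\nsrun VASP_std\n")
	fmt.Println(tag, ok) // prints "vasp true"
}
```

Because only the first match is tagged, the order in which applications are checked decides the result when a script matches several pattern files.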
### Adding New Applications

1. Create a new file in `var/tagger/apps/` (e.g., `tensorflow.txt`)
2. Add regex patterns, one per line:

   ```
   tensorflow
   tf\.keras
   import tensorflow
   ```

3. The file is automatically detected and loaded

**Note**: The tag name will be the filename without the `.txt` extension (e.g., `tensorflow`).

## Job Classification

Job classification analyzes completed jobs based on their metrics and properties
to identify performance issues or characteristics.

### Configuration Format

Job classification rules are defined in JSON files under
`var/tagger/jobclasses/`. Each rule file defines:

- **Metrics required**: Which job metrics to analyze
- **Requirements**: Pre-conditions that must be met
- **Variables**: Computed values used in the rule
- **Rule expression**: Boolean expression that determines if the rule matches
- **Hint template**: Message displayed when the rule matches

### Parameters File

`jobclasses/parameters.json` defines shared threshold values used across multiple rules:

```json
{
  "lowcpuload_threshold_factor": 0.9,
  "highmemoryusage_threshold_factor": 0.9,
  "job_min_duration_seconds": 600.0,
  "sampling_interval_seconds": 30.0
}
```

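As a rough illustration of how shared parameters could feed a rule, the Go sketch below decodes a parameters file and picks out the names a rule lists. The function names are hypothetical, and rejecting an undefined parameter name is an assumption of this sketch rather than documented cc-backend behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseParameters decodes the contents of a parameters file into a map
// from threshold name to numeric value.
func parseParameters(data []byte) (map[string]float64, error) {
	params := make(map[string]float64)
	if err := json.Unmarshal(data, &params); err != nil {
		return nil, fmt.Errorf("parsing parameters: %w", err)
	}
	return params, nil
}

// ruleParameters selects the parameters named in a rule's "parameters"
// field. Failing on an unknown name is an assumption made for this sketch.
func ruleParameters(all map[string]float64, names []string) (map[string]float64, error) {
	selected := make(map[string]float64, len(names))
	for _, name := range names {
		value, ok := all[name]
		if !ok {
			return nil, fmt.Errorf("parameter %q is not defined", name)
		}
		selected[name] = value
	}
	return selected, nil
}

func main() {
	data := []byte(`{
		"lowcpuload_threshold_factor": 0.9,
		"job_min_duration_seconds": 600.0
	}`)
	all, err := parseParameters(data)
	if err != nil {
		fmt.Println(err)
		return
	}
	env, err := ruleParameters(all, []string{"job_min_duration_seconds"})
	fmt.Println(env, err)
}
```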
### Rule File Structure

**Example: `jobclasses/lowUtilization.json`**

```json
{
  "name": "Low resource utilization",
  "tag": "lowutilization",
  "parameters": ["job_min_duration_seconds"],
  "metrics": ["flops_any", "mem_bw"],
  "requirements": [
    "job.shared == \"none\"",
    "job.duration > job_min_duration_seconds"
  ],
  "variables": [
    {
      "name": "mem_bw_perc",
      "expr": "1.0 - (mem_bw.avg / mem_bw.limits.peak)"
    }
  ],
  "rule": "flops_any.avg < flops_any.limits.alert",
  "hint": "Average flop rate {{.flops_any.avg}} falls below threshold {{.flops_any.limits.alert}}"
}
```

#### Field Descriptions

| Field          | Description                                                                   |
| -------------- | ----------------------------------------------------------------------------- |
| `name`         | Human-readable description of the rule                                        |
| `tag`          | Tag identifier applied when the rule matches                                  |
| `parameters`   | List of parameter names from `parameters.json` to include in rule environment |
| `metrics`      | List of metrics required for evaluation (must be present in job data)         |
| `requirements` | Boolean expressions that must all be true for the rule to be evaluated        |
| `variables`    | Named expressions computed before evaluating the main rule                    |
| `rule`         | Boolean expression that determines if the job matches this classification     |
| `hint`         | Go template string for generating a user-visible message                      |

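The fields in the table map naturally onto a Go struct. The sketch below decodes a shortened rule file with `encoding/json`; the type names `ruleFile` and `ruleVariable` and the check for a non-empty `tag` and `rule` are assumptions for illustration, not the types cc-backend defines.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ruleVariable is one entry of the "variables" array.
type ruleVariable struct {
	Name string `json:"name"`
	Expr string `json:"expr"`
}

// ruleFile mirrors the documented fields of a classification rule file.
type ruleFile struct {
	Name         string         `json:"name"`
	Tag          string         `json:"tag"`
	Parameters   []string       `json:"parameters"`
	Metrics      []string       `json:"metrics"`
	Requirements []string       `json:"requirements"`
	Variables    []ruleVariable `json:"variables"`
	Rule         string         `json:"rule"`
	Hint         string         `json:"hint"`
}

// parseRule decodes a rule file and rejects rules that lack a tag or a
// rule expression (a sanity check assumed for this sketch).
func parseRule(data []byte) (ruleFile, error) {
	var r ruleFile
	if err := json.Unmarshal(data, &r); err != nil {
		return r, fmt.Errorf("parsing rule: %w", err)
	}
	if r.Tag == "" || r.Rule == "" {
		return r, fmt.Errorf("rule %q needs both a tag and a rule expression", r.Name)
	}
	return r, nil
}

func main() {
	data := []byte(`{
		"name": "Low resource utilization",
		"tag": "lowutilization",
		"metrics": ["flops_any", "mem_bw"],
		"requirements": ["job.shared == \"none\""],
		"variables": [{"name": "mem_bw_perc", "expr": "1.0 - (mem_bw.avg / mem_bw.limits.peak)"}],
		"rule": "flops_any.avg < flops_any.limits.alert"
	}`)
	r, err := parseRule(data)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(r.Tag, r.Metrics, len(r.Variables))
}
```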
### Expression Environment

Expressions in `requirements`, `variables`, and `rule` have access to:

**Job Properties:**

- `job.shared` - Shared node allocation type
- `job.duration` - Job runtime in seconds
- `job.numCores` - Number of CPU cores
- `job.numNodes` - Number of nodes
- `job.jobState` - Job completion state
- `job.numAcc` - Number of accelerators
- `job.smt` - SMT setting

**Metric Statistics (for each metric in `metrics`):**

- `<metric>.min` - Minimum value
- `<metric>.max` - Maximum value
- `<metric>.avg` - Average value
- `<metric>.limits.peak` - Peak limit from cluster config
- `<metric>.limits.normal` - Normal threshold
- `<metric>.limits.caution` - Caution threshold
- `<metric>.limits.alert` - Alert threshold

**Parameters:**

- All parameters listed in the `parameters` field

**Variables:**

- All variables defined in the `variables` array

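The order in which a rule consumes this environment (required metrics, then requirements, then variables, then the rule itself) can be sketched in plain Go. Everything here is a simplified stand-in: Go closures replace the expr expressions, the environment is a flat map of numbers keyed by dotted names, string properties such as `job.shared` are left out, and the numeric values are invented.

```go
package main

import "fmt"

// env holds the numeric values an expression can read, keyed by the dotted
// names used in rule files (for example "flops_any.avg").
type env map[string]float64

// variable is a named value computed before the rule is evaluated.
type variable struct {
	name string
	expr func(env) float64
}

// rule is a closure-based stand-in for a parsed rule file.
type rule struct {
	tag          string
	metrics      []string
	requirements []func(env) bool
	variables    []variable
	match        func(env) bool
}

// classify returns the rule's tag if the job matches. It stops early when a
// required metric is missing or a requirement is false, and it stores the
// computed variables in e before evaluating the rule.
func classify(r rule, e env, jobMetrics map[string]bool) (string, bool) {
	for _, m := range r.metrics {
		if !jobMetrics[m] {
			return "", false
		}
	}
	for _, req := range r.requirements {
		if !req(e) {
			return "", false
		}
	}
	for _, v := range r.variables {
		e[v.name] = v.expr(e)
	}
	if r.match(e) {
		return r.tag, true
	}
	return "", false
}

func main() {
	lowUtil := rule{
		tag:     "lowutilization",
		metrics: []string{"flops_any", "mem_bw"},
		requirements: []func(env) bool{
			func(e env) bool { return e["job.duration"] > e["job_min_duration_seconds"] },
		},
		variables: []variable{
			{"mem_bw_perc", func(e env) float64 { return 1.0 - e["mem_bw.avg"]/e["mem_bw.limits.peak"] }},
		},
		match: func(e env) bool { return e["flops_any.avg"] < e["flops_any.limits.alert"] },
	}
	e := env{
		"job.duration": 7200, "job_min_duration_seconds": 600,
		"flops_any.avg": 2, "flops_any.limits.alert": 10,
		"mem_bw.avg": 20, "mem_bw.limits.peak": 80,
	}
	tag, ok := classify(lowUtil, e, map[string]bool{"flops_any": true, "mem_bw": true})
	fmt.Println(tag, ok) // prints "lowutilization true"
}
```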
### Expression Language

Rules use the [expr](https://github.com/expr-lang/expr) language for expressions. Supported operations:

- **Arithmetic**: `+`, `-`, `*`, `/`, `%`, `^`
- **Comparison**: `==`, `!=`, `<`, `<=`, `>`, `>=`
- **Logical**: `&&`, `||`, `!`
- **Functions**: Standard math functions (see expr documentation)

### Hint Templates

Hints use Go's `text/template` syntax. Variables from the evaluation environment are accessible:

```
{{.flops_any.avg}}   # Access metric average
{{.job.duration}}    # Access job property
{{.my_variable}}     # Access computed variable
```

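A short Go sketch shows how such a hint renders with the standard `text/template` package. The nested-map layout of the environment and the helper name `renderHint` are assumptions made for this example; the values are invented.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderHint parses a hint template and executes it against the
// evaluation environment.
func renderHint(hint string, env map[string]any) (string, error) {
	tmpl, err := template.New("hint").Parse(hint)
	if err != nil {
		return "", fmt.Errorf("parsing hint: %w", err)
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, env); err != nil {
		return "", fmt.Errorf("rendering hint: %w", err)
	}
	return sb.String(), nil
}

func main() {
	// Dotted names such as flops_any.limits.alert resolve through nested maps.
	env := map[string]any{
		"flops_any": map[string]any{
			"avg":    12.5,
			"limits": map[string]any{"alert": 50.0},
		},
	}
	msg, err := renderHint(
		"Average flop rate {{.flops_any.avg}} falls below threshold {{.flops_any.limits.alert}}",
		env,
	)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(msg) // prints "Average flop rate 12.5 falls below threshold 50"
}
```

Go templates have no arithmetic operators, so any value that needs scaling or rounding beyond `printf` formatting is best computed as a variable first.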
### Adding New Classification Rules

1. Create a new JSON file in `var/tagger/jobclasses/` (e.g., `memoryLeak.json`)
2. Define the rule structure:

   ```json
   {
     "name": "Memory Leak Detection",
     "tag": "memory_leak",
     "parameters": ["memory_leak_slope_threshold"],
     "metrics": ["mem_used"],
     "requirements": ["job.duration > 3600"],
     "variables": [
       {
         "name": "mem_growth",
         "expr": "(mem_used.max - mem_used.min) / job.duration"
       }
     ],
     "rule": "mem_growth > memory_leak_slope_threshold",
     "hint": "Memory usage grew by {{.mem_growth}} per second"
   }
   ```

3. Add any new parameters to `parameters.json`
4. The file is automatically detected and loaded

## Configuration Paths

The tagger system reads from these paths (relative to cc-backend working directory):

- **Application patterns**: `./var/tagger/apps/`
- **Job classification rules**: `./var/tagger/jobclasses/`

These paths are defined as constants in the source code and cannot be changed without recompiling.

## Troubleshooting

### Tags Not Applied

1. **Check that tagging is enabled**: Verify `enable-job-taggers: true` is set in `config.json`

2. **Check that the configuration exists**:

   ```bash
   ls -la var/tagger/apps
   ls -la var/tagger/jobclasses
   ```

3. **Check logs for errors**:

   ```bash
   ./cc-backend -server -loglevel debug
   ```

4. **Verify file permissions**: Ensure cc-backend can read the configuration files

5. **For existing jobs**: Use `./cc-backend -apply-tags` to retroactively tag jobs

### Rules Not Matching

1. **Enable debug logging**: Set `loglevel: debug` to see detailed rule evaluation
2. **Check requirements**: Ensure all requirements in the rule are satisfied
3. **Verify metrics exist**: Classification rules require job metrics to be available
4. **Check metric names**: Ensure metric names match those in your cluster configuration

### File Watch Not Working

If changes to configuration files aren't detected:

1. Restart cc-backend to reload all configuration
2. Check that the filesystem supports file watching (network filesystems may not)
3. Check logs for file watch setup messages

## Best Practices

1. **Start Simple**: Begin with basic rules and refine based on results
2. **Use Requirements**: Filter out irrelevant jobs early with requirements
3. **Test Incrementally**: Add one rule at a time and verify behavior
4. **Document Rules**: Use descriptive names and clear hint messages
5. **Share Parameters**: Define common thresholds in `parameters.json` for consistency
6. **Version Control**: Keep your `var/tagger/` configuration in version control
7. **Backup Before Changes**: Test new rules on a copy before deploying to production

## Examples

### Simple Application Detection

**File: `var/tagger/apps/python.txt`**

```
python
python3
\.py
```

This detects jobs running Python scripts.

### Complex Classification Rule

**File: `var/tagger/jobclasses/cpuImbalance.json`**

```json
{
  "name": "CPU Load Imbalance",
  "tag": "cpu_imbalance",
  "parameters": ["core_load_imbalance_threshold_factor"],
  "metrics": ["cpu_load"],
  "requirements": ["job.numCores > 1", "job.duration > 600"],
  "variables": [
    {
      "name": "load_variance",
      "expr": "(cpu_load.max - cpu_load.min) / cpu_load.avg"
    },
    {
      "name": "load_variance_pct",
      "expr": "100.0 * (cpu_load.max - cpu_load.min) / cpu_load.avg"
    }
  ],
  "rule": "load_variance > core_load_imbalance_threshold_factor",
  "hint": "CPU load varies by {{printf \"%.1f%%\" .load_variance_pct}} across cores"
}
```

This detects jobs where CPU load is unevenly distributed across cores. The
percentage shown in the hint is computed as a separate variable because Go
templates have no arithmetic operators, and template values must be referenced
with a leading dot.

## Reference

### Configuration Options

**Main Configuration (`config.json`)**:

- `enable-job-taggers` (boolean, default: `false`) - Enables automatic job tagging system
  - Must be set to `true` to activate automatic tagging on job start/stop events
  - Does not affect the `-apply-tags` command line option

**Command Line Options**:

- `-apply-tags` - Apply all tagging rules to existing jobs in the database
  - Works independently of `enable-job-taggers` configuration
  - Useful for retroactively tagging jobs or re-evaluating with updated rules

### Default Configuration Location

The example configurations are provided in:

- `configs/tagger/apps/` - Example application patterns (16 applications)
- `configs/tagger/jobclasses/` - Example classification rules (3 rules)

Copy these to `var/tagger/` and customize for your environment.

### Tag Types

- `app` - Application tags (e.g., "vasp", "gromacs")
- `jobClass` - Classification tags (e.g., "lowutilization", "highload")

Tags can be queried and filtered in the ClusterCockpit UI and API.