Mirror of https://github.com/ClusterCockpit/cc-backend (synced 2026-02-12 05:51:45 +01:00)
305 CLAUDE.md (new file)
@@ -0,0 +1,305 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with
code in this repository.

## Project Overview

ClusterCockpit is a job-specific performance monitoring framework for HPC
clusters. This is a Golang backend that provides REST and GraphQL APIs, serves a
Svelte-based frontend, and manages job archives and metric data from various
time-series databases.

## Build and Development Commands

### Building

```bash
# Build everything (frontend + backend)
make

# Build only the frontend
make frontend

# Build only the backend (requires frontend to be built first)
go build -ldflags='-s -X main.date=$(date +"%Y-%m-%d:T%H:%M:%S") -X main.version=1.4.4 -X main.commit=$(git rev-parse --short HEAD)' ./cmd/cc-backend
```

### Testing

```bash
# Run all tests
make test

# Run tests with verbose output
go test -v ./...

# Run tests for a specific package
go test ./internal/repository
```

### Code Generation

```bash
# Regenerate GraphQL schema and resolvers (after modifying api/*.graphqls)
make graphql

# Regenerate Swagger/OpenAPI docs (after modifying API comments)
make swagger
```

### Frontend Development

```bash
cd web/frontend

# Install dependencies
npm install

# Build for production
npm run build

# Development mode with watch
npm run dev
```

### Running

```bash
# Initialize database and create admin user
./cc-backend -init-db -add-user demo:admin:demo

# Start server in development mode (enables GraphQL Playground and Swagger UI)
./cc-backend -server -dev -loglevel info

# Start demo with sample data
./startDemo.sh
```

## Architecture

### Backend Structure

The backend follows a layered architecture with clear separation of concerns:

- **cmd/cc-backend**: Entry point, orchestrates initialization of all subsystems
- **internal/repository**: Data access layer using repository pattern
  - Abstracts database operations (SQLite3 only)
  - Implements LRU caching for performance
  - Provides repositories for Job, User, Node, and Tag entities
  - Transaction support for batch operations
- **internal/api**: REST API endpoints (Swagger/OpenAPI documented)
- **internal/graph**: GraphQL API (uses gqlgen)
  - Schema in `api/*.graphqls`
  - Generated code in `internal/graph/generated/`
  - Resolvers in `internal/graph/schema.resolvers.go`
- **internal/auth**: Authentication layer
  - Supports local accounts, LDAP, OIDC, and JWT tokens
  - Implements rate limiting for login attempts
- **internal/metricstore**: Metric store with data loading API
  - In-memory metric storage with checkpointing
  - Query API for loading job metric data
- **internal/archiver**: Job archiving to file-based archive
- **internal/api/nats.go**: NATS-based API for job and node operations
  - Subscribes to NATS subjects for job events (start/stop)
  - Handles node state updates via NATS
  - Uses InfluxDB line protocol message format
- **pkg/archive**: Job archive backend implementations
  - File system backend (default)
  - S3 backend
  - SQLite backend (experimental)
- **pkg/nats**: NATS client and message decoding utilities

### Frontend Structure

- **web/frontend**: Svelte 5 application
  - Uses Rollup for building
  - Components organized by feature (analysis, job, user, etc.)
  - GraphQL client using @urql/svelte
  - Bootstrap 5 + SvelteStrap for UI
  - uPlot for time-series visualization
- **web/templates**: Server-side Go templates

### Key Concepts

**Job Archive**: Completed jobs are stored in a file-based archive following the
[ClusterCockpit job-archive specification](https://github.com/ClusterCockpit/cc-specifications/tree/master/job-archive).
Each job has a `meta.json` file with metadata and metric data files.

**Metric Data Repositories**: Time-series metric data is stored separately from
job metadata. The system supports multiple backends (cc-metric-store is
recommended). Configuration is per-cluster in `config.json`.

**Authentication Flow**:

1. Multiple authenticators can be configured (local, LDAP, OIDC, JWT)
2. Each authenticator's `CanLogin` method is called to determine if it should handle the request
3. The first authenticator that returns true performs the actual `Login` (see the sketch below)
4. JWT tokens are used for API authentication
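As a rough illustration of this chain, the sketch below shows one possible shape of such a flow. The interface and helper names are assumptions for illustration only, not the actual `internal/auth` API.

```go
package auth

import (
	"errors"
	"net/http"
)

// Authenticator is a hypothetical view of what each auth backend implements.
type Authenticator interface {
	// CanLogin reports whether this backend is responsible for the login request.
	CanLogin(user string, rw http.ResponseWriter, r *http.Request) bool
	// Login performs the actual authentication.
	Login(user string, rw http.ResponseWriter, r *http.Request) error
}

// tryLogin delegates to the first authenticator that claims the request.
func tryLogin(auths []Authenticator, user string, rw http.ResponseWriter, r *http.Request) error {
	for _, a := range auths {
		if a.CanLogin(user, rw, r) {
			return a.Login(user, rw, r)
		}
	}
	return errors.New("no authenticator accepted the login request")
}
```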
**Database Migrations**: SQL migrations in `internal/repository/migrations/` are
applied automatically on startup. Version tracking in `version` table.

**Scopes**: Metrics can be collected at different scopes:

- Node scope (always available)
- Core scope (for jobs with ≤8 nodes)
- Accelerator scope (for GPU/accelerator metrics)

## Configuration

- **config.json**: Main configuration (clusters, metric repositories, archive settings)
  - `main.apiSubjects`: NATS subject configuration (optional)
    - `subjectJobEvent`: Subject for job start/stop events (e.g., "cc.job.event")
    - `subjectNodeState`: Subject for node state updates (e.g., "cc.node.state")
  - `nats`: NATS client connection configuration (optional)
    - `address`: NATS server address (e.g., "nats://localhost:4222")
    - `username`: Authentication username (optional)
    - `password`: Authentication password (optional)
    - `creds-file-path`: Path to NATS credentials file (optional)
- **.env**: Environment variables (secrets like JWT keys)
  - Copy from `configs/env-template.txt`
  - NEVER commit this file
- **cluster.json**: Cluster topology and metric definitions (loaded from archive or config)

## Database

- Default: SQLite 3 (`./var/job.db`)
- Connection managed by `internal/repository`
- Schema version in `internal/repository/migration.go`

## Code Generation

**GraphQL** (gqlgen):

- Schema: `api/*.graphqls`
- Config: `gqlgen.yml`
- Generated code: `internal/graph/generated/`
- Custom resolvers: `internal/graph/schema.resolvers.go`
- Run `make graphql` after schema changes

**Swagger/OpenAPI**:

- Annotations in `internal/api/*.go`
- Generated docs: `api/docs.go`, `api/swagger.yaml`
- Run `make swagger` after API changes

## Testing Conventions

- Test files use `_test.go` suffix
- Test data in `testdata/` subdirectories
- Repository tests use in-memory SQLite
- API tests use httptest (see the sketch below)
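The following is a minimal, self-contained example of the httptest pattern; the handler and route are stand-ins, not an actual cc-backend endpoint.

```go
package api

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

func TestPingHandler(t *testing.T) {
	// Stand-in handler; real tests exercise the mounted REST API routes.
	handler := http.HandlerFunc(func(rw http.ResponseWriter, r *http.Request) {
		rw.Header().Set("Content-Type", "application/json")
		rw.Write([]byte(`{"status":"ok"}`))
	})

	req := httptest.NewRequest(http.MethodGet, "/api/ping", nil)
	rec := httptest.NewRecorder()
	handler.ServeHTTP(rec, req)

	if rec.Code != http.StatusOK {
		t.Fatalf("expected status 200, got %d", rec.Code)
	}
}
```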
## Common Workflows

### Adding a new GraphQL field

1. Edit schema in `api/*.graphqls`
2. Run `make graphql`
3. Implement resolver in `internal/graph/schema.resolvers.go`

### Adding a new REST endpoint

1. Add handler in `internal/api/*.go`
2. Add route in `internal/api/rest.go`
3. Add Swagger annotations (see the annotated sketch below)
4. Run `make swagger`
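A minimal sketch of such a handler with swaggo-style annotations follows. The route, query parameter, and response shape are invented for illustration and do not correspond to an existing cc-backend endpoint; consult `api/swagger.yaml` for the real routes.

```go
package api

import (
	"encoding/json"
	"net/http"
)

// getJobCount godoc
// @Summary   Return the number of jobs for a cluster
// @Produce   json
// @Param     cluster  query  string  true  "cluster name"
// @Success   200  {object}  map[string]int
// @Router    /jobs/count [get]
func getJobCount(rw http.ResponseWriter, r *http.Request) {
	cluster := r.URL.Query().Get("cluster")
	if cluster == "" {
		http.Error(rw, "missing cluster parameter", http.StatusBadRequest)
		return
	}

	// A real handler would query the repository layer here.
	rw.Header().Set("Content-Type", "application/json")
	json.NewEncoder(rw).Encode(map[string]int{"count": 0})
}
```

After the route is registered in `internal/api/rest.go`, `make swagger` regenerates `api/docs.go` and `api/swagger.yaml` from these annotations.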
### Adding a new metric data backend

1. Implement metric loading functions in `internal/metricstore/query.go`
2. Add cluster configuration to metric store initialization
3. Update config.json schema documentation

### Modifying database schema

1. Create new migration in `internal/repository/migrations/`
2. Increment `repository.Version`
3. Test with fresh database and existing database

## NATS API

The backend supports a NATS-based API as an alternative to the REST API for job and node operations.

### Setup

1. Configure NATS client connection in `config.json`:

```json
{
  "nats": {
    "address": "nats://localhost:4222",
    "username": "user",
    "password": "pass"
  }
}
```

2. Configure API subjects in `config.json` under `main`:

```json
{
  "main": {
    "apiSubjects": {
      "subjectJobEvent": "cc.job.event",
      "subjectNodeState": "cc.node.state"
    }
  }
}
```

### Message Format

Messages use **InfluxDB line protocol** format with the following structure:

#### Job Events

**Start Job:**

```
job,function=start_job event="{\"jobId\":123,\"user\":\"alice\",\"cluster\":\"test\", ...}" 1234567890000000000
```

**Stop Job:**

```
job,function=stop_job event="{\"jobId\":123,\"cluster\":\"test\",\"startTime\":1234567890,\"stopTime\":1234571490,\"jobState\":\"completed\"}" 1234571490000000000
```

**Tags:**

- `function`: Either `start_job` or `stop_job`

**Fields:**

- `event`: JSON payload containing job data (see REST API documentation for schema); a publishing sketch follows below
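To make the framing concrete, here is a minimal publisher sketch using the standard `nats.go` client. The subject, credentials, and payload fields are taken from the examples above; this is an illustrative sketch, not a prescribed client implementation.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://localhost:4222", nats.UserInfo("user", "pass"))
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// %q escapes the embedded quotes, producing event="{\"jobId\":123,...}".
	payload := `{"jobId":123,"user":"alice","cluster":"test","startTime":1234567890}`
	msg := fmt.Sprintf("job,function=start_job event=%q %d", payload, time.Now().UnixNano())

	if err := nc.Publish("cc.job.event", []byte(msg)); err != nil {
		log.Fatal(err)
	}
}
```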
#### Node State Updates

```json
{
  "cluster": "testcluster",
  "nodes": [
    {
      "hostname": "node001",
      "states": ["allocated"],
      "cpusAllocated": 8,
      "memoryAllocated": 16384,
      "gpusAllocated": 0,
      "jobsRunning": 1
    }
  ]
}
```

### Implementation Notes

- NATS API mirrors REST API functionality but uses messaging
- Job start/stop events are processed asynchronously
- Duplicate job detection is handled (same as REST API)
- All validation rules from REST API apply
- Messages are logged; no responses are sent back to publishers
- If NATS client is unavailable, API subscriptions are skipped (logged as warning)

## Dependencies

- Go 1.24.0+ (check go.mod for exact version)
- Node.js (for frontend builds)
- SQLite 3 (only supported database)
- Optional: NATS server for NATS API integration
71 README.md
@@ -22,19 +22,23 @@ switching from PHP Symfony to a Golang based solution are explained

## Overview

This is a Golang web backend for the ClusterCockpit job-specific performance
monitoring framework. It provides a REST API for integrating ClusterCockpit with
an HPC cluster batch system and external analysis scripts. Data exchange between
the web front-end and the back-end is based on a GraphQL API. The web frontend
is also served by the backend using [Svelte](https://svelte.dev/) components.
Layout and styling are based on [Bootstrap 5](https://getbootstrap.com/) using
monitoring framework. It provides a REST API and an optional NATS-based messaging
API for integrating ClusterCockpit with an HPC cluster batch system and external
analysis scripts. Data exchange between the web front-end and the back-end is
based on a GraphQL API. The web frontend is also served by the backend using
[Svelte](https://svelte.dev/) components. Layout and styling are based on
[Bootstrap 5](https://getbootstrap.com/) using
[Bootstrap Icons](https://icons.getbootstrap.com/).

The backend uses [SQLite 3](https://sqlite.org/) as a relational SQL database by
default. Optionally it can use a MySQL/MariaDB database server. While there are
metric data backends for the InfluxDB and Prometheus time series databases, the
only tested and supported setup is to use cc-metric-store as the metric data
backend. Documentation on how to integrate ClusterCockpit with other time series
databases will be added in the future.
The backend uses [SQLite 3](https://sqlite.org/) as the relational SQL database.
While there are metric data backends for the InfluxDB and Prometheus time series
databases, the only tested and supported setup is to use cc-metric-store as the
metric data backend. Documentation on how to integrate ClusterCockpit with other
time series databases will be added in the future.

For real-time integration with HPC systems, the backend can subscribe to
[NATS](https://nats.io/) subjects to receive job start/stop events and node
state updates, providing an alternative to REST API polling.

Completed batch jobs are stored in a file-based job archive according to
[this specification](https://github.com/ClusterCockpit/cc-specifications/tree/master/job-archive).

@@ -131,27 +135,58 @@ ln -s <your-existing-job-archive> ./var/job-archive

## Project file structure

- [`.github/`](https://github.com/ClusterCockpit/cc-backend/tree/master/.github)
  GitHub Actions workflows and dependabot configuration for CI/CD.
- [`api/`](https://github.com/ClusterCockpit/cc-backend/tree/master/api)
  contains the API schema files for the REST and GraphQL APIs. The REST API is
  documented in the OpenAPI 3.0 format in
  [./api/openapi.yaml](./api/openapi.yaml).
  [./api/swagger.yaml](./api/swagger.yaml). The GraphQL schema is in
  [./api/schema.graphqls](./api/schema.graphqls).
- [`cmd/cc-backend`](https://github.com/ClusterCockpit/cc-backend/tree/master/cmd/cc-backend)
  contains `main.go` for the main application.
  contains the main application entry point and CLI implementation.
- [`configs/`](https://github.com/ClusterCockpit/cc-backend/tree/master/configs)
  contains documentation about configuration and command line options and required
  environment variables. A sample configuration file is provided.
- [`docs/`](https://github.com/ClusterCockpit/cc-backend/tree/master/docs)
  contains more in-depth documentation.
  environment variables. Sample configuration files are provided.
- [`init/`](https://github.com/ClusterCockpit/cc-backend/tree/master/init)
  contains an example of setting up systemd for production use.
- [`internal/`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal)
  contains library source code that is not intended for use by others.
  - [`api`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/api)
    REST API handlers and NATS integration
  - [`archiver`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/archiver)
    Job archiving functionality
  - [`auth`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/auth)
    Authentication (local, LDAP, OIDC) and JWT token handling
  - [`config`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/config)
    Configuration management and validation
  - [`graph`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/graph)
    GraphQL schema and resolvers
  - [`importer`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/importer)
    Job data import and database initialization
  - [`metricstore`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/metricstore)
    In-memory metric data store with checkpointing and metric loading
  - [`metricdispatch`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/metricdispatch)
    Dispatches metric data loading to appropriate backends
  - [`repository`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/repository)
    Database repository layer for jobs and metadata
  - [`routerConfig`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/routerConfig)
    HTTP router configuration and middleware
  - [`tagger`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/tagger)
    Job classification and application detection
  - [`taskmanager`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/taskmanager)
    Background task management and scheduled jobs
- [`pkg/`](https://github.com/ClusterCockpit/cc-backend/tree/master/pkg)
  contains Go packages that can be used by other projects.
  - [`archive`](https://github.com/ClusterCockpit/cc-backend/tree/master/pkg/archive)
    Job archive backend implementations (filesystem, S3)
  - [`nats`](https://github.com/ClusterCockpit/cc-backend/tree/master/pkg/nats)
    NATS client and message handling
- [`tools/`](https://github.com/ClusterCockpit/cc-backend/tree/master/tools)
  Additional command line helper tools.
  - [`archive-manager`](https://github.com/ClusterCockpit/cc-backend/tree/master/tools/archive-manager)
    Commands for getting infos about and existing job archive.
    Commands for getting infos about an existing job archive.
  - [`archive-migration`](https://github.com/ClusterCockpit/cc-backend/tree/master/tools/archive-migration)
    Tool for migrating job archives between formats.
  - [`convert-pem-pubkey`](https://github.com/ClusterCockpit/cc-backend/tree/master/tools/convert-pem-pubkey)
    Tool to convert external pubkey for use in `cc-backend`.
  - [`gen-keypair`](https://github.com/ClusterCockpit/cc-backend/tree/master/tools/gen-keypair)

@@ -163,7 +198,7 @@ ln -s <your-existing-job-archive> ./var/job-archive

- [`frontend`](https://github.com/ClusterCockpit/cc-backend/tree/master/web/frontend)
  Svelte components and static assets for the frontend UI
- [`templates`](https://github.com/ClusterCockpit/cc-backend/tree/master/web/templates)
  Server-side Go templates
  Server-side Go templates, including monitoring views
- [`gqlgen.yml`](https://github.com/ClusterCockpit/cc-backend/blob/master/gqlgen.yml)
  Configures the behaviour and generation of
  [gqlgen](https://github.com/99designs/gqlgen).

@@ -458,6 +458,7 @@ input JobFilter {
  state: [JobState!]
  metricStats: [MetricStatItem!]
  shared: String
  schedule: String
  node: StringInput
}
1178 api/swagger.json (file diff suppressed because it is too large)
690 api/swagger.yaml (file diff suppressed because it is too large)
@@ -15,8 +15,8 @@ import (
    "github.com/ClusterCockpit/cc-backend/internal/config"
    "github.com/ClusterCockpit/cc-backend/internal/repository"
    "github.com/ClusterCockpit/cc-backend/pkg/archive"
    cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
    "github.com/ClusterCockpit/cc-lib/util"
    cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
    "github.com/ClusterCockpit/cc-lib/v2/util"
)

const envString = `
@@ -36,7 +36,7 @@ const configString = `
    "short-running-jobs-duration": 300,
    "resampling": {
      "minimumPoints": 600,
      "trigger": 180,
      "trigger": 300,
      "resolutions": [
        240,
        60
@@ -48,7 +48,7 @@ const configString = `
      "emission-constant": 317
    },
    "cron": {
      "commit-job-worker": "2m",
      "commit-job-worker": "1m",
      "duration-worker": "5m",
      "footprint-worker": "10m"
    },
@@ -60,31 +60,7 @@ const configString = `
    "jwts": {
      "max-age": "2000h"
    }
  },
  "clusters": [
    {
      "name": "name",
      "metricDataRepository": {
        "kind": "cc-metric-store",
        "url": "http://localhost:8082",
        "token": ""
      },
      "filterRanges": {
        "numNodes": {
          "from": 1,
          "to": 64
        },
        "duration": {
          "from": 0,
          "to": 86400
        },
        "startTime": {
          "from": "2023-01-01T00:00:00Z",
          "to": null
        }
      }
    }
  ]
}
}
`
@@ -105,9 +81,9 @@ func initEnv() {
        cclog.Abortf("Could not create default ./var folder with permissions '0o777'. Application initialization failed, exited.\nError: %s\n", err.Error())
    }

    err := repository.MigrateDB("sqlite3", "./var/job.db")
    err := repository.MigrateDB("./var/job.db")
    if err != nil {
        cclog.Abortf("Could not initialize default sqlite3 database as './var/job.db'. Application initialization failed, exited.\nError: %s\n", err.Error())
        cclog.Abortf("Could not initialize default SQLite database as './var/job.db'. Application initialization failed, exited.\nError: %s\n", err.Error())
    }
    if err := os.Mkdir("var/job-archive", 0o777); err != nil {
        cclog.Abortf("Could not create default ./var/job-archive folder with permissions '0o777'. Application initialization failed, exited.\nError: %s\n", err.Error())
@@ -24,23 +24,21 @@ import (
    "github.com/ClusterCockpit/cc-backend/internal/auth"
    "github.com/ClusterCockpit/cc-backend/internal/config"
    "github.com/ClusterCockpit/cc-backend/internal/importer"
    "github.com/ClusterCockpit/cc-backend/internal/memorystore"
    "github.com/ClusterCockpit/cc-backend/internal/metricdata"
    "github.com/ClusterCockpit/cc-backend/internal/metricstore"
    "github.com/ClusterCockpit/cc-backend/internal/repository"
    "github.com/ClusterCockpit/cc-backend/internal/tagger"
    "github.com/ClusterCockpit/cc-backend/internal/taskmanager"
    "github.com/ClusterCockpit/cc-backend/pkg/archive"
    "github.com/ClusterCockpit/cc-backend/pkg/nats"
    "github.com/ClusterCockpit/cc-backend/web"
    ccconf "github.com/ClusterCockpit/cc-lib/ccConfig"
    cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
    "github.com/ClusterCockpit/cc-lib/runtimeEnv"
    "github.com/ClusterCockpit/cc-lib/schema"
    "github.com/ClusterCockpit/cc-lib/util"
    ccconf "github.com/ClusterCockpit/cc-lib/v2/ccConfig"
    cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
    "github.com/ClusterCockpit/cc-lib/v2/runtime"
    "github.com/ClusterCockpit/cc-lib/v2/schema"
    "github.com/ClusterCockpit/cc-lib/v2/util"
    "github.com/google/gops/agent"
    "github.com/joho/godotenv"

    _ "github.com/go-sql-driver/mysql"
    _ "github.com/mattn/go-sqlite3"
)

@@ -104,12 +102,7 @@ func initConfiguration() error {
        return fmt.Errorf("main configuration must be present")
    }

    clustercfg := ccconf.GetPackageConfig("clusters")
    if clustercfg == nil {
        return fmt.Errorf("cluster configuration must be present")
    }

    config.Init(cfg, clustercfg)
    config.Init(cfg)
    return nil
}

@@ -120,30 +113,30 @@ func initDatabase() error {

func handleDatabaseCommands() error {
    if flagMigrateDB {
        err := repository.MigrateDB(config.Keys.DBDriver, config.Keys.DB)
        err := repository.MigrateDB(config.Keys.DB)
        if err != nil {
            return fmt.Errorf("migrating database to version %d: %w", repository.Version, err)
        }
        cclog.Exitf("MigrateDB Success: Migrated '%s' database at location '%s' to version %d.\n",
            config.Keys.DBDriver, config.Keys.DB, repository.Version)
        cclog.Exitf("MigrateDB Success: Migrated SQLite database at '%s' to version %d.\n",
            config.Keys.DB, repository.Version)
    }

    if flagRevertDB {
        err := repository.RevertDB(config.Keys.DBDriver, config.Keys.DB)
        err := repository.RevertDB(config.Keys.DB)
        if err != nil {
            return fmt.Errorf("reverting database to version %d: %w", repository.Version-1, err)
        }
        cclog.Exitf("RevertDB Success: Reverted '%s' database at location '%s' to version %d.\n",
            config.Keys.DBDriver, config.Keys.DB, repository.Version-1)
        cclog.Exitf("RevertDB Success: Reverted SQLite database at '%s' to version %d.\n",
            config.Keys.DB, repository.Version-1)
    }

    if flagForceDB {
        err := repository.ForceDB(config.Keys.DBDriver, config.Keys.DB)
        err := repository.ForceDB(config.Keys.DB)
        if err != nil {
            return fmt.Errorf("forcing database to version %d: %w", repository.Version, err)
        }
        cclog.Exitf("ForceDB Success: Forced '%s' database at location '%s' to version %d.\n",
            config.Keys.DBDriver, config.Keys.DB, repository.Version)
        cclog.Exitf("ForceDB Success: Forced SQLite database at '%s' to version %d.\n",
            config.Keys.DB, repository.Version)
    }

    return nil
@@ -278,16 +271,14 @@ func initSubsystems() error {
    // Initialize job archive
    archiveCfg := ccconf.GetPackageConfig("archive")
    if archiveCfg == nil {
        cclog.Debug("Archive configuration not found, using default archive configuration")
        archiveCfg = json.RawMessage(defaultArchiveConfig)
    }
    if err := archive.Init(archiveCfg, config.Keys.DisableArchive); err != nil {
        return fmt.Errorf("initializing archive: %w", err)
    }

    // Initialize metricdata
    if err := metricdata.Init(); err != nil {
        return fmt.Errorf("initializing metricdata repository: %w", err)
    }
    // Note: metricstore.Init() is called later in runServer() with proper configuration

    // Handle database re-initialization
    if flagReinitDB {
@@ -312,6 +303,8 @@ func initSubsystems() error {

    // Apply tags if requested
    if flagApplyTags {
        tagger.Init()

        if err := tagger.RunTaggers(); err != nil {
            return fmt.Errorf("running job taggers: %w", err)
        }
@@ -323,13 +316,17 @@ func initSubsystems() error {

func runServer(ctx context.Context) error {
    var wg sync.WaitGroup

    // Start metric store if enabled
    if memorystore.InternalCCMSFlag {
        mscfg := ccconf.GetPackageConfig("metric-store")
        if mscfg == nil {
            return fmt.Errorf("metric store configuration must be present")
        }
        memorystore.Init(mscfg, &wg)
    // Initialize metric store if configuration is provided
    mscfg := ccconf.GetPackageConfig("metric-store")
    if mscfg != nil {
        metricstore.Init(mscfg, &wg)

        // Inject repository as NodeProvider to break import cycle
        ms := metricstore.GetMemoryStore()
        jobRepo := repository.GetJobRepository()
        ms.SetNodeProvider(jobRepo)
    } else {
        return fmt.Errorf("missing metricstore configuration")
    }

    // Start archiver and task manager
@@ -372,7 +369,7 @@ func runServer(ctx context.Context) error {
    case <-ctx.Done():
    }

    runtimeEnv.SystemdNotifiy(false, "Shutting down ...")
    runtime.SystemdNotify(false, "Shutting down ...")
    srv.Shutdown(ctx)
    util.FsWatcherShutdown()
    taskmanager.Shutdown()
@@ -382,24 +379,39 @@ func runServer(ctx context.Context) error {
    if os.Getenv(envGOGC) == "" {
        debug.SetGCPercent(25)
    }
    runtimeEnv.SystemdNotifiy(true, "running")
    runtime.SystemdNotify(true, "running")

    // Wait for completion or error
    waitDone := make(chan struct{})
    go func() {
        wg.Wait()
        close(waitDone)
    }()

    go func() {
        <-waitDone
        close(errChan)
    }()

    // Check for server startup errors
    // Wait for either:
    // 1. An error from server startup
    // 2. Completion of all goroutines (normal shutdown or crash)
    select {
    case err := <-errChan:
        // errChan will be closed when waitDone is closed, which happens
        // when all goroutines complete (either from normal shutdown or error)
        if err != nil {
            return err
        }
    case <-time.After(100 * time.Millisecond):
        // Server started successfully, wait for completion
        if err := <-errChan; err != nil {
            return err
        // Give the server 100ms to start and report any immediate startup errors
        // After that, just wait for normal shutdown completion
        select {
        case err := <-errChan:
            if err != nil {
                return err
            }
        case <-waitDone:
            // Normal shutdown completed
        }
    }

@@ -29,12 +29,12 @@ import (
    "github.com/ClusterCockpit/cc-backend/internal/config"
    "github.com/ClusterCockpit/cc-backend/internal/graph"
    "github.com/ClusterCockpit/cc-backend/internal/graph/generated"
    "github.com/ClusterCockpit/cc-backend/internal/memorystore"
    "github.com/ClusterCockpit/cc-backend/internal/metricstore"
    "github.com/ClusterCockpit/cc-backend/internal/routerConfig"
    "github.com/ClusterCockpit/cc-backend/pkg/nats"
    "github.com/ClusterCockpit/cc-backend/web"
    cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
    "github.com/ClusterCockpit/cc-lib/runtimeEnv"
    cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
    "github.com/ClusterCockpit/cc-lib/v2/runtime"
    "github.com/gorilla/handlers"
    "github.com/gorilla/mux"
    httpSwagger "github.com/swaggo/http-swagger"
@@ -49,9 +49,10 @@ const (

// Server encapsulates the HTTP server state and dependencies
type Server struct {
    router    *mux.Router
    server    *http.Server
    apiHandle *api.RestAPI
    router        *mux.Router
    server        *http.Server
    restAPIHandle *api.RestAPI
    natsAPIHandle *api.NatsAPI
}

func onFailureResponse(rw http.ResponseWriter, r *http.Request, err error) {
@@ -104,7 +105,7 @@ func (s *Server) init() error {

    authHandle := auth.GetAuthInstance()

    s.apiHandle = api.New()
    s.restAPIHandle = api.New()

    info := map[string]any{}
    info["hasOpenIDConnect"] = false
@@ -240,15 +241,20 @@ func (s *Server) init() error {

    // Mount all /monitoring/... and /api/... routes.
    routerConfig.SetupRoutes(secured, buildInfo)
    s.apiHandle.MountAPIRoutes(securedapi)
    s.apiHandle.MountUserAPIRoutes(userapi)
    s.apiHandle.MountConfigAPIRoutes(configapi)
    s.apiHandle.MountFrontendAPIRoutes(frontendapi)
    s.restAPIHandle.MountAPIRoutes(securedapi)
    s.restAPIHandle.MountUserAPIRoutes(userapi)
    s.restAPIHandle.MountConfigAPIRoutes(configapi)
    s.restAPIHandle.MountFrontendAPIRoutes(frontendapi)

    if memorystore.InternalCCMSFlag {
        s.apiHandle.MountMetricStoreAPIRoutes(metricstoreapi)
    if config.Keys.APISubjects != nil {
        s.natsAPIHandle = api.NewNatsAPI()
        if err := s.natsAPIHandle.StartSubscriptions(); err != nil {
            return fmt.Errorf("starting NATS subscriptions: %w", err)
        }
    }

    s.restAPIHandle.MountMetricStoreAPIRoutes(metricstoreapi)

    if config.Keys.EmbedStaticFiles {
        if i, err := os.Stat("./var/img"); err == nil {
            if i.IsDir() {
@@ -339,7 +345,7 @@ func (s *Server) Start(ctx context.Context) error {
    // Because this program will want to bind to a privileged port (like 80), the listener must
    // be established first, then the user can be changed, and after that,
    // the actual http server can be started.
    if err := runtimeEnv.DropPrivileges(config.Keys.Group, config.Keys.User); err != nil {
    if err := runtime.DropPrivileges(config.Keys.Group, config.Keys.User); err != nil {
        return fmt.Errorf("dropping privileges: %w", err)
    }

@@ -375,9 +381,7 @@ func (s *Server) Shutdown(ctx context.Context) {
    }

    // Archive all the metric store data
    if memorystore.InternalCCMSFlag {
        memorystore.Shutdown()
    }
    metricstore.Shutdown()

    // Shutdown archiver with 10 second timeout for fast shutdown
    if err := archiver.Shutdown(10 * time.Second); err != nil {
@@ -1,106 +1,22 @@
{
  "main": {
    "addr": "127.0.0.1:8080",
    "short-running-jobs-duration": 300,
    "resampling": {
      "minimumPoints": 600,
      "trigger": 180,
      "resolutions": [
        240,
        60
      ]
    },
    "apiAllowedIPs": [
      "*"
    ],
    "emission-constant": 317
    "apiAllowedIPs": ["*"]
  },
  "cron": {
    "commit-job-worker": "2m",
    "duration-worker": "5m",
    "footprint-worker": "10m"
  },
  "archive": {
    "kind": "file",
    "path": "./var/job-archive"
    "commit-job-worker": "1m",
    "duration-worker": "3m",
    "footprint-worker": "5m"
  },
  "auth": {
    "jwts": {
      "max-age": "2000h"
    }
  },
  "nats": {
    "address": "nats://0.0.0.0:4222",
    "username": "root",
    "password": "root"
  },
  "clusters": [
    {
      "name": "fritz",
      "metricDataRepository": {
        "kind": "cc-metric-store-internal",
        "url": "http://localhost:8082",
        "token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"
      },
      "filterRanges": {
        "numNodes": {
          "from": 1,
          "to": 64
        },
        "duration": {
          "from": 0,
          "to": 86400
        },
        "startTime": {
          "from": "2022-01-01T00:00:00Z",
          "to": null
        }
      }
    },
    {
      "name": "alex",
      "metricDataRepository": {
        "kind": "cc-metric-store-internal",
        "url": "http://localhost:8082",
        "token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"
      },
      "filterRanges": {
        "numNodes": {
          "from": 1,
          "to": 64
        },
        "duration": {
          "from": 0,
          "to": 86400
        },
        "startTime": {
          "from": "2022-01-01T00:00:00Z",
          "to": null
        }
      }
    }
  ],
  "metric-store": {
    "checkpoints": {
      "file-format": "avro",
      "interval": "1h",
      "directory": "./var/checkpoints",
      "restore": "48h"
      "interval": "1h"
    },
    "archive": {
      "interval": "1h",
      "directory": "./var/archive"
    },
    "retention-in-memory": "48h",
    "subscriptions": [
      {
        "subscribe-to": "hpc-nats",
        "cluster-tag": "fritz"
      },
      {
        "subscribe-to": "hpc-nats",
        "cluster-tag": "alex"
      }
    ]
    "retention-in-memory": "12h"
  }
}
}

@@ -1,64 +0,0 @@
{
  "addr": "127.0.0.1:8080",
  "short-running-jobs-duration": 300,
  "archive": {
    "kind": "file",
    "path": "./var/job-archive"
  },
  "jwts": {
    "max-age": "2000h"
  },
  "db-driver": "mysql",
  "db": "clustercockpit:demo@tcp(127.0.0.1:3306)/clustercockpit",
  "enable-resampling": {
    "trigger": 30,
    "resolutions": [600, 300, 120, 60]
  },
  "emission-constant": 317,
  "clusters": [
    {
      "name": "fritz",
      "metricDataRepository": {
        "kind": "cc-metric-store",
        "url": "http://localhost:8082",
        "token": ""
      },
      "filterRanges": {
        "numNodes": {
          "from": 1,
          "to": 64
        },
        "duration": {
          "from": 0,
          "to": 86400
        },
        "startTime": {
          "from": "2022-01-01T00:00:00Z",
          "to": null
        }
      }
    },
    {
      "name": "alex",
      "metricDataRepository": {
        "kind": "cc-metric-store",
        "url": "http://localhost:8082",
        "token": ""
      },
      "filterRanges": {
        "numNodes": {
          "from": 1,
          "to": 64
        },
        "duration": {
          "from": 0,
          "to": 86400
        },
        "startTime": {
          "from": "2022-01-01T00:00:00Z",
          "to": null
        }
      }
    }
  ]
}

@@ -5,50 +5,61 @@
  "https-key-file": "/etc/letsencrypt/live/url/privkey.pem",
  "user": "clustercockpit",
  "group": "clustercockpit",
  "validate": false,
  "apiAllowedIPs": ["*"],
  "short-running-jobs-duration": 300,
  "enable-job-taggers": true,
  "resampling": {
    "minimumPoints": 600,
    "trigger": 180,
    "resolutions": [
      240,
      60
    ]
    "resolutions": [240, 60]
  },
  "apiSubjects": {
    "subjectJobEvent": "cc.job.event",
    "subjectNodeState": "cc.node.state"
  }
},
"nats": {
  "address": "nats://0.0.0.0:4222",
  "username": "root",
  "password": "root"
},
"auth": {
  "jwts": {
    "max-age": "2000h"
  }
},
"cron": {
  "commit-job-worker": "2m",
  "commit-job-worker": "1m",
  "duration-worker": "5m",
  "footprint-worker": "10m"
},
"archive": {
  "kind": "file",
  "path": "./var/job-archive"
},
"clusters": [
  {
    "name": "test",
    "metricDataRepository": {
      "kind": "cc-metric-store",
      "url": "http://localhost:8082",
      "token": "eyJhbGciOiJF-E-pQBQ"
    },
    "filterRanges": {
      "numNodes": {
        "from": 1,
        "to": 64
      },
      "duration": {
        "from": 0,
        "to": 86400
      },
      "startTime": {
        "from": "2022-01-01T00:00:00Z",
        "to": null
      }
    }
  }
  "kind": "s3",
  "endpoint": "http://x.x.x.x",
  "bucket": "jobarchive",
  "accessKey": "xx",
  "secretKey": "xx",
  "retention": {
    "policy": "move",
    "age": 365,
    "location": "./var/archive"
  }
]
},
"metric-store": {
  "checkpoints": {
    "interval": "1h"
  },
  "retention-in-memory": "12h",
  "subscriptions": [
    {
      "subscribe-to": "hpc-nats",
      "cluster-tag": "fritz"
    },
    {
      "subscribe-to": "hpc-nats",
      "cluster-tag": "alex"
    }
  ]
},
"ui-file": "ui-config.json"
}
22 configs/startJobPayload.json (new file)
@@ -0,0 +1,22 @@
{
  "cluster": "fritz",
  "jobId": 123000,
  "jobState": "running",
  "numAcc": 0,
  "numHwthreads": 72,
  "numNodes": 1,
  "partition": "main",
  "requestedMemory": 128000,
  "resources": [{ "hostname": "f0726" }],
  "startTime": 1649723812,
  "subCluster": "main",
  "submitTime": 1649723812,
  "user": "k106eb10",
  "project": "k106eb",
  "walltime": 86400,
  "metaData": {
    "slurmInfo": "JobId=398759\nJobName=myJob\nUserId=dummyUser\nGroupId=dummyGroup\nAccount=dummyAccount\nQOS=normal Requeue=False Restarts=0 BatchFlag=True\nTimeLimit=1439'\nSubmitTime=2023-02-09T14:10:18\nPartition=singlenode\nNodeList=xx\nNumNodes=xx NumCPUs=72 NumTasks=72 CPUs/Task=1\nNTasksPerNode:Socket:Core=0:None:None\nTRES_req=cpu=72,mem=250000M,node=1,billing=72\nTRES_alloc=cpu=72,node=1,billing=72\nCommand=myCmd\nWorkDir=myDir\nStdErr=\nStdOut=\n",
    "jobScript": "#!/bin/bash -l\n#SBATCH --job-name=dummy_job\n#SBATCH --time=23:59:00\n#SBATCH --partition=singlenode\n#SBATCH --ntasks=72\n#SBATCH --hint=multithread\n#SBATCH --chdir=/home/atuin/k106eb/dummy/\n#SBATCH --export=NONE\nunset SLURM_EXPORT_ENV\n\n#This is a dummy job script\n./mybinary\n",
    "jobName": "ams_pipeline"
  }
}
7 configs/stopJobPayload.json (new file)
@@ -0,0 +1,7 @@
{
  "cluster": "fritz",
  "jobId": 123000,
  "jobState": "completed",
  "startTime": 1649723812,
  "stopTime": 1649763839
}
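These payload files can be used to exercise the job start/stop API manually. The sketch below posts the start payload over HTTP; the route path and the `CCB_JWT` environment variable are assumptions for illustration, so check `api/swagger.yaml` for the authoritative endpoint and authentication details.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
)

func main() {
	payload, err := os.ReadFile("configs/startJobPayload.json")
	if err != nil {
		panic(err)
	}

	// Assumed route; the REST API documentation defines the real path.
	req, err := http.NewRequest(http.MethodPost,
		"http://localhost:8080/api/jobs/start_job/", bytes.NewReader(payload))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+os.Getenv("CCB_JWT")) // hypothetical env var holding a JWT

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```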
419 configs/tagger/README.md (new file)
@@ -0,0 +1,419 @@
# Job Tagging Configuration

ClusterCockpit provides automatic job tagging functionality to classify and
categorize jobs based on configurable rules. The tagging system consists of two
main components:

1. **Application Detection** - Identifies which application a job is running
2. **Job Classification** - Analyzes job performance characteristics and applies classification tags

## Directory Structure

```
configs/tagger/
├── apps/                 # Application detection patterns
│   ├── vasp.txt
│   ├── gromacs.txt
│   └── ...
└── jobclasses/           # Job classification rules
    ├── parameters.json
    ├── lowUtilization.json
    ├── highload.json
    └── ...
```

## Activating Tagger Rules

### Step 1: Copy Configuration Files

To activate tagging, review, adapt, and copy the configuration files from
`configs/tagger/` to `var/tagger/`:

```bash
# From the cc-backend root directory
mkdir -p var/tagger
cp -r configs/tagger/apps var/tagger/
cp -r configs/tagger/jobclasses var/tagger/
```

### Step 2: Enable Tagging in Configuration

Add or set the following configuration key in the `main` section of your `config.json`:

```json
{
  "enable-job-taggers": true
}
```

**Important**: Automatic tagging is disabled by default. You must explicitly
enable it by setting `enable-job-taggers: true` in the main configuration file.

### Step 3: Restart cc-backend

The tagger system automatically loads configuration from `./var/tagger/` at
startup. After copying the files and enabling the feature, restart cc-backend:

```bash
./cc-backend -server
```

### Step 4: Verify Configuration Loaded

Check the logs for messages indicating successful configuration loading:

```
[INFO] Setup file watch for ./var/tagger/apps
[INFO] Setup file watch for ./var/tagger/jobclasses
```

## How Tagging Works

### Automatic Tagging

When `enable-job-taggers` is set to `true` in the configuration, tags are
automatically applied when:

- **Job Start**: Application detection runs immediately when a job starts
- **Job Stop**: Job classification runs when a job completes

The system analyzes job metadata and metrics to determine appropriate tags.

**Note**: Automatic tagging only works for jobs that start or stop after the
feature is enabled. Existing jobs are not automatically retagged.

### Manual Tagging (Retroactive)

To apply tags to existing jobs in the database, use the `-apply-tags` command
line option:

```bash
./cc-backend -apply-tags
```

This processes all jobs in the database and applies current tagging rules. This
is useful when:

- You have existing jobs that were created before tagging was enabled
- You've added new tagging rules and want to apply them to historical data
- You've modified existing rules and want to re-evaluate all jobs

### Hot Reload

The tagger system watches the configuration directories for changes. You can
modify or add rules without restarting `cc-backend`:

- Changes to `var/tagger/apps/*` are detected automatically
- Changes to `var/tagger/jobclasses/*` are detected automatically

## Application Detection

Application detection identifies which software a job is running by matching
patterns in the job script.

### Configuration Format

Application patterns are stored in text files under `var/tagger/apps/`. Each
file contains one or more regular expression patterns (one per line) that match
against the job script.

**Example: `apps/vasp.txt`**

```
vasp
VASP
```

### How It Works

1. When a job starts, the system retrieves the job script from metadata
2. Each line in the app files is treated as a regex pattern
3. Patterns are matched case-insensitively against the lowercased job script (see the sketch below)
4. If a match is found, a tag of type `app` with the filename (without extension) is applied
5. Only the first matching application is tagged
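The matching step can be pictured roughly as follows. This is an illustrative sketch, not the actual tagger code; the function name and signature are made up.

```go
package tagger

import (
	"regexp"
	"strings"
)

// matchApp reports whether any pattern from an apps/<name>.txt file
// matches the lowercased job script.
func matchApp(patterns []string, jobScript string) bool {
	script := strings.ToLower(jobScript)
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			continue // skip invalid patterns
		}
		if re.MatchString(script) {
			return true
		}
	}
	return false
}
```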
### Adding New Applications

1. Create a new file in `var/tagger/apps/` (e.g., `tensorflow.txt`)
2. Add regex patterns, one per line:

```
tensorflow
tf\.keras
import tensorflow
```

3. The file is automatically detected and loaded

**Note**: The tag name will be the filename without the `.txt` extension (e.g., `tensorflow`).

## Job Classification

Job classification analyzes completed jobs based on their metrics and properties
to identify performance issues or characteristics.

### Configuration Format

Job classification rules are defined in JSON files under
`var/tagger/jobclasses/`. Each rule file defines:

- **Metrics required**: Which job metrics to analyze
- **Requirements**: Pre-conditions that must be met
- **Variables**: Computed values used in the rule
- **Rule expression**: Boolean expression that determines if the rule matches
- **Hint template**: Message displayed when the rule matches

### Parameters File

`jobclasses/parameters.json` defines shared threshold values used across multiple rules:

```json
{
  "lowcpuload_threshold_factor": 0.9,
  "highmemoryusage_threshold_factor": 0.9,
  "job_min_duration_seconds": 600.0,
  "sampling_interval_seconds": 30.0
}
```

### Rule File Structure

**Example: `jobclasses/lowUtilization.json`**

```json
{
  "name": "Low resource utilization",
  "tag": "lowutilization",
  "parameters": ["job_min_duration_seconds"],
  "metrics": ["flops_any", "mem_bw"],
  "requirements": [
    "job.shared == \"none\"",
    "job.duration > job_min_duration_seconds"
  ],
  "variables": [
    {
      "name": "mem_bw_perc",
      "expr": "1.0 - (mem_bw.avg / mem_bw.limits.peak)"
    }
  ],
  "rule": "flops_any.avg < flops_any.limits.alert",
  "hint": "Average flop rate {{.flops_any.avg}} falls below threshold {{.flops_any.limits.alert}}"
}
```

#### Field Descriptions

| Field          | Description                                                                     |
| -------------- | ------------------------------------------------------------------------------- |
| `name`         | Human-readable description of the rule                                          |
| `tag`          | Tag identifier applied when the rule matches                                    |
| `parameters`   | List of parameter names from `parameters.json` to include in rule environment   |
| `metrics`      | List of metrics required for evaluation (must be present in job data)           |
| `requirements` | Boolean expressions that must all be true for the rule to be evaluated          |
| `variables`    | Named expressions computed before evaluating the main rule                      |
| `rule`         | Boolean expression that determines if the job matches this classification       |
| `hint`         | Go template string for generating a user-visible message                        |

### Expression Environment

Expressions in `requirements`, `variables`, and `rule` have access to:

**Job Properties:**

- `job.shared` - Shared node allocation type
- `job.duration` - Job runtime in seconds
- `job.numCores` - Number of CPU cores
- `job.numNodes` - Number of nodes
- `job.jobState` - Job completion state
- `job.numAcc` - Number of accelerators
- `job.smt` - SMT setting

**Metric Statistics (for each metric in `metrics`):**

- `<metric>.min` - Minimum value
- `<metric>.max` - Maximum value
- `<metric>.avg` - Average value
- `<metric>.limits.peak` - Peak limit from cluster config
- `<metric>.limits.normal` - Normal threshold
- `<metric>.limits.caution` - Caution threshold
- `<metric>.limits.alert` - Alert threshold

**Parameters:**

- All parameters listed in the `parameters` field

**Variables:**

- All variables defined in the `variables` array

### Expression Language

Rules use the [expr](https://github.com/expr-lang/expr) language for expressions (a small evaluation sketch follows the list below). Supported operations:

- **Arithmetic**: `+`, `-`, `*`, `/`, `%`, `^`
- **Comparison**: `==`, `!=`, `<`, `<=`, `>`, `>=`
- **Logical**: `&&`, `||`, `!`
- **Functions**: Standard math functions (see expr documentation)
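A minimal sketch of evaluating such a rule with `expr-lang/expr` (the library listed in `go.mod`); the environment values are invented here, while the real tagger builds this environment from job data and `parameters.json`.

```go
package main

import (
	"fmt"

	"github.com/expr-lang/expr"
)

func main() {
	// Environment mirroring the documented rule inputs (values invented).
	env := map[string]any{
		"flops_any": map[string]any{
			"avg":    12.5,
			"limits": map[string]any{"alert": 50.0},
		},
		"job_min_duration_seconds": 600.0,
	}

	program, err := expr.Compile("flops_any.avg < flops_any.limits.alert", expr.Env(env))
	if err != nil {
		panic(err)
	}
	matched, err := expr.Run(program, env)
	if err != nil {
		panic(err)
	}
	fmt.Println("rule matched:", matched) // rule matched: true
}
```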
### Hint Templates

Hints use Go's `text/template` syntax. Variables from the evaluation environment are accessible (a rendering sketch follows):

```
{{.flops_any.avg}}    # Access metric average
{{.job.duration}}     # Access job property
{{.my_variable}}      # Access computed variable
```
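For illustration, rendering the hint from the `lowUtilization` example above could look like this; the nested-map layout is an assumption that mirrors the dotted access shown above.

```go
package main

import (
	"os"
	"text/template"
)

func main() {
	hint := "Average flop rate {{.flops_any.avg}} falls below threshold {{.flops_any.limits.alert}}"
	data := map[string]any{
		"flops_any": map[string]any{
			"avg":    12.5,
			"limits": map[string]any{"alert": 50.0},
		},
	}

	tmpl := template.Must(template.New("hint").Parse(hint))
	if err := tmpl.Execute(os.Stdout, data); err != nil {
		panic(err)
	}
	// Output: Average flop rate 12.5 falls below threshold 50
}
```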
### Adding New Classification Rules

1. Create a new JSON file in `var/tagger/jobclasses/` (e.g., `memoryLeak.json`)
2. Define the rule structure:

```json
{
  "name": "Memory Leak Detection",
  "tag": "memory_leak",
  "parameters": ["memory_leak_slope_threshold"],
  "metrics": ["mem_used"],
  "requirements": ["job.duration > 3600"],
  "variables": [
    {
      "name": "mem_growth",
      "expr": "(mem_used.max - mem_used.min) / job.duration"
    }
  ],
  "rule": "mem_growth > memory_leak_slope_threshold",
  "hint": "Memory usage grew by {{.mem_growth}} per second"
}
```

3. Add any new parameters to `parameters.json`
4. The file is automatically detected and loaded

## Configuration Paths

The tagger system reads from these paths (relative to cc-backend working directory):

- **Application patterns**: `./var/tagger/apps/`
- **Job classification rules**: `./var/tagger/jobclasses/`

These paths are defined as constants in the source code and cannot be changed without recompiling.

## Troubleshooting

### Tags Not Applied

1. **Check tagging is enabled**: Verify `enable-job-taggers: true` is set in `config.json`

2. **Check configuration exists**:

```bash
ls -la var/tagger/apps
ls -la var/tagger/jobclasses
```

3. **Check logs for errors**:

```bash
./cc-backend -server -loglevel debug
```

4. **Verify file permissions**: Ensure cc-backend can read the configuration files

5. **For existing jobs**: Use `./cc-backend -apply-tags` to retroactively tag jobs

### Rules Not Matching

1. **Enable debug logging**: Set `loglevel: debug` to see detailed rule evaluation
2. **Check requirements**: Ensure all requirements in the rule are satisfied
3. **Verify metrics exist**: Classification rules require job metrics to be available
4. **Check metric names**: Ensure metric names match those in your cluster configuration

### File Watch Not Working

If changes to configuration files aren't detected:

1. Restart cc-backend to reload all configuration
2. Check filesystem supports file watching (network filesystems may not)
3. Check logs for file watch setup messages

## Best Practices

1. **Start Simple**: Begin with basic rules and refine based on results
2. **Use Requirements**: Filter out irrelevant jobs early with requirements
3. **Test Incrementally**: Add one rule at a time and verify behavior
4. **Document Rules**: Use descriptive names and clear hint messages
5. **Share Parameters**: Define common thresholds in `parameters.json` for consistency
6. **Version Control**: Keep your `var/tagger/` configuration in version control
7. **Backup Before Changes**: Test new rules on a copy before deploying to production

## Examples

### Simple Application Detection

**File: `var/tagger/apps/python.txt`**

```
python
python3
\.py
```

This detects jobs running Python scripts.

### Complex Classification Rule

**File: `var/tagger/jobclasses/cpuImbalance.json`**

```json
{
  "name": "CPU Load Imbalance",
  "tag": "cpu_imbalance",
  "parameters": ["core_load_imbalance_threshold_factor"],
  "metrics": ["cpu_load"],
  "requirements": ["job.numCores > 1", "job.duration > 600"],
  "variables": [
    {
      "name": "load_variance",
      "expr": "(cpu_load.max - cpu_load.min) / cpu_load.avg"
    }
  ],
  "rule": "load_variance > core_load_imbalance_threshold_factor",
  "hint": "CPU load varies by {{printf \"%.1f%%\" (load_variance * 100)}} across cores"
}
```

This detects jobs where CPU load is unevenly distributed across cores.

## Reference

### Configuration Options

**Main Configuration (`config.json`)**:

- `enable-job-taggers` (boolean, default: `false`) - Enables automatic job tagging system
  - Must be set to `true` to activate automatic tagging on job start/stop events
  - Does not affect the `-apply-tags` command line option

**Command Line Options**:

- `-apply-tags` - Apply all tagging rules to existing jobs in the database
  - Works independently of `enable-job-taggers` configuration
  - Useful for retroactively tagging jobs or re-evaluating with updated rules

### Default Configuration Location

The example configurations are provided in:

- `configs/tagger/apps/` - Example application patterns (16 applications)
- `configs/tagger/jobclasses/` - Example classification rules (3 rules)

Copy these to `var/tagger/` and customize for your environment.

### Tag Types

- `app` - Application tags (e.g., "vasp", "gromacs")
- `jobClass` - Classification tags (e.g., "lowutilization", "highload")

Tags can be queried and filtered in the ClusterCockpit UI and API.
30 go.mod
@@ -11,7 +11,7 @@ tool (

require (
    github.com/99designs/gqlgen v0.17.85
    github.com/ClusterCockpit/cc-lib v1.0.2
    github.com/ClusterCockpit/cc-lib/v2 v2.1.0
    github.com/Masterminds/squirrel v1.5.4
    github.com/aws/aws-sdk-go-v2 v1.41.1
    github.com/aws/aws-sdk-go-v2/config v1.32.6
@@ -21,7 +21,6 @@ require (
    github.com/expr-lang/expr v1.17.7
    github.com/go-co-op/gocron/v2 v2.19.0
    github.com/go-ldap/ldap/v3 v3.4.12
    github.com/go-sql-driver/mysql v1.9.3
    github.com/golang-jwt/jwt/v5 v5.3.0
    github.com/golang-migrate/migrate/v4 v4.19.1
    github.com/google/gops v0.3.28
@@ -34,8 +33,6 @@ require (
    github.com/linkedin/goavro/v2 v2.14.1
    github.com/mattn/go-sqlite3 v1.14.33
    github.com/nats-io/nats.go v1.47.0
    github.com/prometheus/client_golang v1.23.2
    github.com/prometheus/common v0.67.4
    github.com/qustavo/sqlhooks/v2 v2.1.0
    github.com/santhosh-tekuri/jsonschema/v5 v5.3.1
    github.com/stretchr/testify v1.11.1
@@ -48,10 +45,10 @@ require (
)

require (
    filippo.io/edwards25519 v1.1.0 // indirect
    github.com/Azure/go-ntlmssp v0.0.0-20221128193559-754e69321358 // indirect
    github.com/KyleBanks/depth v1.2.1 // indirect
    github.com/agnivade/levenshtein v1.2.1 // indirect
    github.com/apapsch/go-jsonmerge/v2 v2.0.0 // indirect
    github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.7.4 // indirect
    github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.18.17 // indirect
    github.com/aws/aws-sdk-go-v2/internal/configsources v1.4.17 // indirect
@@ -67,8 +64,6 @@ require (
    github.com/aws/aws-sdk-go-v2/service/ssooidc v1.35.13 // indirect
    github.com/aws/aws-sdk-go-v2/service/sts v1.41.6 // indirect
    github.com/aws/smithy-go v1.24.0 // indirect
    github.com/beorn7/perks v1.0.1 // indirect
    github.com/cespare/xxhash/v2 v2.3.0 // indirect
    github.com/cpuguy83/go-md2man/v2 v2.0.7 // indirect
    github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc // indirect
    github.com/felixge/httpsnoop v1.0.4 // indirect
@@ -88,28 +83,27 @@ require (
    github.com/go-viper/mapstructure/v2 v2.4.0 // indirect
    github.com/goccy/go-yaml v1.19.0 // indirect
    github.com/golang/snappy v0.0.4 // indirect
    github.com/google/go-cmp v0.7.0 // indirect
    github.com/google/uuid v1.6.0 // indirect
    github.com/gorilla/securecookie v1.1.2 // indirect
    github.com/gorilla/websocket v1.5.3 // indirect
    github.com/hashicorp/golang-lru/v2 v2.0.7 // indirect
    github.com/influxdata/influxdb-client-go/v2 v2.14.0 // indirect
    github.com/influxdata/line-protocol v0.0.0-20210922203350-b1ad95c89adf // indirect
    github.com/jonboulle/clockwork v0.5.0 // indirect
    github.com/jpillora/backoff v1.0.0 // indirect
    github.com/json-iterator/go v1.1.12 // indirect
    github.com/klauspost/compress v1.18.1 // indirect
    github.com/klauspost/compress v1.18.2 // indirect
    github.com/kr/pretty v0.3.1 // indirect
    github.com/lann/builder v0.0.0-20180802200727-47ae307949d0 // indirect
    github.com/lann/ps v0.0.0-20150810152359-62de8c46ede0 // indirect
    github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd // indirect
    github.com/modern-go/reflect2 v1.0.2 // indirect
    github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 // indirect
    github.com/mwitkow/go-conntrack v0.0.0-20190716064945-2f068394615f // indirect
    github.com/nats-io/nkeys v0.4.11 // indirect
    github.com/nats-io/nkeys v0.4.12 // indirect
    github.com/nats-io/nuid v1.0.1 // indirect
    github.com/oapi-codegen/runtime v1.1.1 // indirect
    github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 // indirect
    github.com/prometheus/client_model v0.6.2 // indirect
    github.com/prometheus/procfs v0.16.1 // indirect
    github.com/prometheus/common v0.67.4 // indirect
    github.com/robfig/cron/v3 v3.0.1 // indirect
    github.com/russross/blackfriday/v2 v2.1.0 // indirect
    github.com/sosodev/duration v1.3.1 // indirect
    github.com/stmcginnis/gofish v0.20.0 // indirect
    github.com/stretchr/objx v0.5.2 // indirect
    github.com/swaggo/files v1.0.1 // indirect
    github.com/urfave/cli/v2 v2.27.7 // indirect
@@ -117,13 +111,13 @@ require (
    github.com/xrash/smetrics v0.0.0-20250705151800-55b8f293f342 // indirect
|
||||
go.yaml.in/yaml/v2 v2.4.3 // indirect
|
||||
go.yaml.in/yaml/v3 v3.0.4 // indirect
|
||||
golang.org/x/exp v0.0.0-20250620022241-b7579e27df2b // indirect
|
||||
golang.org/x/mod v0.31.0 // indirect
|
||||
golang.org/x/net v0.48.0 // indirect
|
||||
golang.org/x/sync v0.19.0 // indirect
|
||||
golang.org/x/sys v0.39.0 // indirect
|
||||
golang.org/x/text v0.32.0 // indirect
|
||||
golang.org/x/tools v0.40.0 // indirect
|
||||
google.golang.org/protobuf v1.36.11 // indirect
|
||||
gopkg.in/yaml.v3 v3.0.1 // indirect
|
||||
sigs.k8s.io/yaml v1.6.0 // indirect
|
||||
)
|
||||
|
||||
91 go.sum
@@ -2,22 +2,19 @@ filippo.io/edwards25519 v1.1.0 h1:FNf4tywRC1HmFuKW5xopWpigGjJKiJSV0Cqo0cJWDaA=
|
||||
filippo.io/edwards25519 v1.1.0/go.mod h1:BxyFTGdWcka3PhytdK4V28tE5sGfRvvvRV7EaN4VDT4=
|
||||
github.com/99designs/gqlgen v0.17.85 h1:EkGx3U2FDcxQm8YDLQSpXIAVmpDyZ3IcBMOJi2nH1S0=
|
||||
github.com/99designs/gqlgen v0.17.85/go.mod h1:yvs8s0bkQlRfqg03YXr3eR4OQUowVhODT/tHzCXnbOU=
|
||||
github.com/Azure/go-ansiterm v0.0.0-20230124172434-306776ec8161 h1:L/gRVlceqvL25UVaW/CKtUDjefjrs0SPonmDGUVOYP0=
|
||||
github.com/Azure/go-ansiterm v0.0.0-20230124172434-306776ec8161/go.mod h1:xomTg63KZ2rFqZQzSB4Vz2SUXa1BpHTVz9L5PTmPC4E=
|
||||
github.com/Azure/go-ntlmssp v0.0.0-20221128193559-754e69321358 h1:mFRzDkZVAjdal+s7s0MwaRv9igoPqLRdzOLzw/8Xvq8=
|
||||
github.com/Azure/go-ntlmssp v0.0.0-20221128193559-754e69321358/go.mod h1:chxPXzSsl7ZWRAuOIE23GDNzjWuZquvFlgA8xmpunjU=
|
||||
github.com/ClusterCockpit/cc-lib v1.0.2 h1:ZWn3oZkXgxrr3zSigBdlOOfayZ4Om4xL20DhmritPPg=
|
||||
github.com/ClusterCockpit/cc-lib v1.0.2/go.mod h1:UGdOvXEnjFqlnPSxtvtFwO6BtXYW6NnXFoud9FtN93k=
|
||||
github.com/ClusterCockpit/cc-lib/v2 v2.1.0 h1:B6l6h0IjfEuY9DU6aVM3fSsj24lQ1eudXK9QTKmJjqg=
|
||||
github.com/ClusterCockpit/cc-lib/v2 v2.1.0/go.mod h1:JuxMAuEOaLLNEnnL9U3ejha8kMvsSatLdKPZEgJw6iw=
|
||||
github.com/KyleBanks/depth v1.2.1 h1:5h8fQADFrWtarTdtDudMmGsC7GPbOAu6RVB3ffsVFHc=
|
||||
github.com/KyleBanks/depth v1.2.1/go.mod h1:jzSb9d0L43HxTQfT+oSA1EEp2q+ne2uh6XgeJcm8brE=
|
||||
github.com/Masterminds/squirrel v1.5.4 h1:uUcX/aBc8O7Fg9kaISIUsHXdKuqehiXAMQTYX8afzqM=
|
||||
github.com/Masterminds/squirrel v1.5.4/go.mod h1:NNaOrjSoIDfDA40n7sr2tPNZRfjzjA400rg+riTZj10=
|
||||
github.com/Microsoft/go-winio v0.6.2 h1:F2VQgta7ecxGYO8k3ZZz3RS8fVIXVxONVUPlNERoyfY=
|
||||
github.com/Microsoft/go-winio v0.6.2/go.mod h1:yd8OoFMLzJbo9gZq8j5qaps8bJ9aShtEA8Ipt1oGCvU=
|
||||
github.com/NVIDIA/go-nvml v0.13.0-1 h1:OLX8Jq3dONuPOQPC7rndB6+iDmDakw0XTYgzMxObkEw=
|
||||
github.com/NVIDIA/go-nvml v0.13.0-1/go.mod h1:+KNA7c7gIBH7SKSJ1ntlwkfN80zdx8ovl4hrK3LmPt4=
|
||||
github.com/PuerkitoBio/goquery v1.11.0 h1:jZ7pwMQXIITcUXNH83LLk+txlaEy6NVOfTuP43xxfqw=
|
||||
github.com/PuerkitoBio/goquery v1.11.0/go.mod h1:wQHgxUOU3JGuj3oD/QFfxUdlzW6xPHfqyHre6VMY4DQ=
|
||||
github.com/RaveNoX/go-jsoncommentstrip v1.0.0/go.mod h1:78ihd09MekBnJnxpICcwzCMzGrKSKYe4AqU6PDYYpjk=
|
||||
github.com/agnivade/levenshtein v1.2.1 h1:EHBY3UOn1gwdy/VbFwgo4cxecRznFk7fKWN1KOX7eoM=
|
||||
github.com/agnivade/levenshtein v1.2.1/go.mod h1:QVVI16kDrtSuwcpd0p1+xMC6Z/VfhtCyDIjcwga4/DU=
|
||||
github.com/alexbrainman/sspi v0.0.0-20250919150558-7d374ff0d59e h1:4dAU9FXIyQktpoUAgOJK3OTFc/xug0PCXYCqU0FgDKI=
|
||||
@@ -26,6 +23,8 @@ github.com/andreyvit/diff v0.0.0-20170406064948-c7f18ee00883 h1:bvNMNQO63//z+xNg
|
||||
github.com/andreyvit/diff v0.0.0-20170406064948-c7f18ee00883/go.mod h1:rCTlJbsFo29Kk6CurOXKm700vrz8f0KW0JNfpkRJY/8=
|
||||
github.com/andybalholm/cascadia v1.3.3 h1:AG2YHrzJIm4BZ19iwJ/DAua6Btl3IwJX+VI4kktS1LM=
|
||||
github.com/andybalholm/cascadia v1.3.3/go.mod h1:xNd9bqTn98Ln4DwST8/nG+H0yuB8Hmgu1YHNnWw0GeA=
|
||||
github.com/antithesishq/antithesis-sdk-go v0.5.0-default-no-op h1:Ucf+QxEKMbPogRO5guBNe5cgd9uZgfoJLOYs8WWhtjM=
|
||||
github.com/antithesishq/antithesis-sdk-go v0.5.0-default-no-op/go.mod h1:IUpT2DPAKh6i/YhSbt6Gl3v2yvUZjmKncl7U91fup7E=
|
||||
github.com/apapsch/go-jsonmerge/v2 v2.0.0 h1:axGnT1gRIfimI7gJifB699GoE/oq+F2MU7Dml6nw9rQ=
|
||||
github.com/apapsch/go-jsonmerge/v2 v2.0.0/go.mod h1:lvDnEdqiQrp0O42VQGgmlKpxL1AP2+08jFMw88y4klk=
|
||||
github.com/arbovm/levenshtein v0.0.0-20160628152529-48b4e1c0c4d0 h1:jfIu9sQUG6Ig+0+Ap1h4unLjW6YQJpKZVmUzxsD4E/Q=
|
||||
@@ -70,12 +69,9 @@ github.com/aws/smithy-go v1.24.0 h1:LpilSUItNPFr1eY85RYgTIg5eIEPtvFbskaFcmmIUnk=
|
||||
github.com/aws/smithy-go v1.24.0/go.mod h1:LEj2LM3rBRQJxPZTB4KuzZkaZYnZPnvgIhb4pu07mx0=
|
||||
github.com/beorn7/perks v1.0.1 h1:VlbKKnNfV8bJzeqoa4cOKqO6bYr3WgKZxO8Z16+hsOM=
|
||||
github.com/beorn7/perks v1.0.1/go.mod h1:G2ZrVWU2WbWT9wwq4/hrbKbnv/1ERSJQ0ibhJ6rlkpw=
|
||||
github.com/bmatcuk/doublestar v1.1.1/go.mod h1:UD6OnuiIn0yFxxA2le/rnRU1G4RaI4UvFv1sNto9p6w=
|
||||
github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs=
|
||||
github.com/cespare/xxhash/v2 v2.3.0/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XLFGgcrjCOs=
|
||||
github.com/containerd/errdefs v1.0.0 h1:tg5yIfIlQIrxYtu9ajqY42W3lpS19XqdxRQeEwYG8PI=
|
||||
github.com/containerd/errdefs v1.0.0/go.mod h1:+YBYIdtsnF4Iw6nWZhJcqGSg/dwvV7tyJ/kCkyJ2k+M=
|
||||
github.com/containerd/errdefs/pkg v0.3.0 h1:9IKJ06FvyNlexW690DXuQNx2KA2cUJXx151Xdx3ZPPE=
|
||||
github.com/containerd/errdefs/pkg v0.3.0/go.mod h1:NJw6s9HwNuRhnjJhM7pylWwMyAkmCQvQ4GpJHEqRLVk=
|
||||
github.com/coreos/go-oidc/v3 v3.17.0 h1:hWBGaQfbi0iVviX4ibC7bk8OKT5qNr4klBaCHVNvehc=
|
||||
github.com/coreos/go-oidc/v3 v3.17.0/go.mod h1:wqPbKFrVnE90vty060SB40FCJ8fTHTxSwyXJqZH+sI8=
|
||||
github.com/cpuguy83/go-md2man/v2 v2.0.7 h1:zbFlGlXEAKlwXpmvle3d8Oe3YnkKIK4xSRTd3sHPnBo=
|
||||
@@ -87,16 +83,6 @@ github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc h1:U9qPSI2PIWSS1
|
||||
github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
|
||||
github.com/dgryski/trifles v0.0.0-20230903005119-f50d829f2e54 h1:SG7nF6SRlWhcT7cNTs5R6Hk4V2lcmLz2NsG2VnInyNo=
|
||||
github.com/dgryski/trifles v0.0.0-20230903005119-f50d829f2e54/go.mod h1:if7Fbed8SFyPtHLHbg49SI7NAdJiC5WIA09pe59rfAA=
|
||||
github.com/dhui/dktest v0.4.6 h1:+DPKyScKSEp3VLtbMDHcUq6V5Lm5zfZZVb0Sk7Ahom4=
|
||||
github.com/dhui/dktest v0.4.6/go.mod h1:JHTSYDtKkvFNFHJKqCzVzqXecyv+tKt8EzceOmQOgbU=
|
||||
github.com/distribution/reference v0.6.0 h1:0IXCQ5g4/QMHHkarYzh5l+u8T3t73zM5QvfrDyIgxBk=
|
||||
github.com/distribution/reference v0.6.0/go.mod h1:BbU0aIcezP1/5jX/8MP0YiH4SdvB5Y4f/wlDRiLyi3E=
|
||||
github.com/docker/docker v28.3.3+incompatible h1:Dypm25kh4rmk49v1eiVbsAtpAsYURjYkaKubwuBdxEI=
|
||||
github.com/docker/docker v28.3.3+incompatible/go.mod h1:eEKB0N0r5NX/I1kEveEz05bcu8tLC/8azJZsviup8Sk=
|
||||
github.com/docker/go-connections v0.5.0 h1:USnMq7hx7gwdVZq1L49hLXaFtUdTADjXGp+uj1Br63c=
|
||||
github.com/docker/go-connections v0.5.0/go.mod h1:ov60Kzw0kKElRwhNs9UlUHAE/F9Fe6GLaXnqyDdmEXc=
|
||||
github.com/docker/go-units v0.5.0 h1:69rxXcBk27SvSaaxTtLh/8llcHD8vYHT7WSdRZ/jvr4=
|
||||
github.com/docker/go-units v0.5.0/go.mod h1:fgPhTUdO+D/Jk86RDLlptpiXQzgHJF7gydDDbaIK4Dk=
|
||||
github.com/expr-lang/expr v1.17.7 h1:Q0xY/e/2aCIp8g9s/LGvMDCC5PxYlvHgDZRQ4y16JX8=
|
||||
github.com/expr-lang/expr v1.17.7/go.mod h1:8/vRC7+7HBzESEqt5kKpYXxrxkr31SaO8r40VO/1IT4=
|
||||
github.com/felixge/httpsnoop v1.0.4 h1:NFTV2Zj1bL4mc9sqWACXbQFVBBg2W3GPvqp8/ESS2Wg=
|
||||
@@ -115,10 +101,6 @@ github.com/go-jose/go-jose/v4 v4.1.3 h1:CVLmWDhDVRa6Mi/IgCgaopNosCaHz7zrMeF9MlZR
|
||||
github.com/go-jose/go-jose/v4 v4.1.3/go.mod h1:x4oUasVrzR7071A4TnHLGSPpNOm2a21K9Kf04k1rs08=
|
||||
github.com/go-ldap/ldap/v3 v3.4.12 h1:1b81mv7MagXZ7+1r7cLTWmyuTqVqdwbtJSjC0DAp9s4=
|
||||
github.com/go-ldap/ldap/v3 v3.4.12/go.mod h1:+SPAGcTtOfmGsCb3h1RFiq4xpp4N636G75OEace8lNo=
|
||||
github.com/go-logr/logr v1.4.3 h1:CjnDlHq8ikf6E492q6eKboGOC0T8CDaOvkHCIg8idEI=
|
||||
github.com/go-logr/logr v1.4.3/go.mod h1:9T104GzyrTigFIr8wt5mBrctHMim0Nb2HLGrmQ40KvY=
|
||||
github.com/go-logr/stdr v1.2.2 h1:hSWxHoqTgW2S2qGc0LTAI563KZ5YKYRhT3MFKZMbjag=
|
||||
github.com/go-logr/stdr v1.2.2/go.mod h1:mMo/vtBO5dYbehREoey6XUKy/eSumjCCveDpRre4VKE=
|
||||
github.com/go-openapi/jsonpointer v0.22.3 h1:dKMwfV4fmt6Ah90zloTbUKWMD+0he+12XYAsPotrkn8=
|
||||
github.com/go-openapi/jsonpointer v0.22.3/go.mod h1:0lBbqeRsQ5lIanv3LHZBrmRGHLHcQoOXQnf88fHlGWo=
|
||||
github.com/go-openapi/jsonreference v0.21.3 h1:96Dn+MRPa0nYAR8DR1E03SblB5FJvh7W6krPI0Z7qMc=
|
||||
@@ -147,15 +129,12 @@ github.com/go-openapi/testify/enable/yaml/v2 v2.0.2/go.mod h1:kme83333GCtJQHXQ8U
|
||||
github.com/go-openapi/testify/v2 v2.0.2 h1:X999g3jeLcoY8qctY/c/Z8iBHTbwLz7R2WXd6Ub6wls=
|
||||
github.com/go-openapi/testify/v2 v2.0.2/go.mod h1:HCPmvFFnheKK2BuwSA0TbbdxJ3I16pjwMkYkP4Ywn54=
|
||||
github.com/go-sql-driver/mysql v1.4.1/go.mod h1:zAC/RDZ24gD3HViQzih4MyKcchzm+sOG5ZlKdlhCg5w=
|
||||
github.com/go-sql-driver/mysql v1.8.1 h1:LedoTUt/eveggdHS9qUFC1EFSa8bU2+1pZjSRpvNJ1Y=
|
||||
github.com/go-sql-driver/mysql v1.8.1/go.mod h1:wEBSXgmK//2ZFJyE+qWnIsVGmvmEKlqwuVSjsCm7DZg=
|
||||
github.com/go-sql-driver/mysql v1.9.3 h1:U/N249h2WzJ3Ukj8SowVFjdtZKfu9vlLZxjPXV1aweo=
|
||||
github.com/go-sql-driver/mysql v1.9.3/go.mod h1:qn46aNg1333BRMNU69Lq93t8du/dwxI64Gl8i5p1WMU=
|
||||
github.com/go-viper/mapstructure/v2 v2.4.0 h1:EBsztssimR/CONLSZZ04E8qAkxNYq4Qp9LvH92wZUgs=
|
||||
github.com/go-viper/mapstructure/v2 v2.4.0/go.mod h1:oJDH3BJKyqBA2TXFhDsKDGDTlndYOZ6rGS0BRZIxGhM=
|
||||
github.com/goccy/go-yaml v1.19.0 h1:EmkZ9RIsX+Uq4DYFowegAuJo8+xdX3T/2dwNPXbxEYE=
|
||||
github.com/goccy/go-yaml v1.19.0/go.mod h1:XBurs7gK8ATbW4ZPGKgcbrY1Br56PdM69F7LkFRi1kA=
|
||||
github.com/gogo/protobuf v1.3.2 h1:Ov1cvc58UF3b5XjBnZv7+opcTcQFZebYjWzi34vdm4Q=
|
||||
github.com/gogo/protobuf v1.3.2/go.mod h1:P1XiOD3dCwIKUDQYPy72D8LYyHL2YPYrpS2s69NZV8Q=
|
||||
github.com/golang-jwt/jwt/v5 v5.3.0 h1:pv4AsKCKKZuqlgs5sUmn4x8UlGa0kEVt/puTpKx9vvo=
|
||||
github.com/golang-jwt/jwt/v5 v5.3.0/go.mod h1:fxCRLWMO43lRc8nhHWY6LGqRcf+1gQWArsqaEUEa5bE=
|
||||
github.com/golang-migrate/migrate/v4 v4.19.1 h1:OCyb44lFuQfYXYLx1SCxPZQGU7mcaZ7gH9yH4jSFbBA=
|
||||
@@ -167,7 +146,8 @@ github.com/google/go-cmp v0.5.2/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/
|
||||
github.com/google/go-cmp v0.5.5/go.mod h1:v8dTdLbMG2kIc/vJvl+f65V22dbkXbowE6jgT/gNBxE=
|
||||
github.com/google/go-cmp v0.7.0 h1:wk8382ETsv4JYUZwIsn6YpYiWiBsYLSJiTsyBybVuN8=
|
||||
github.com/google/go-cmp v0.7.0/go.mod h1:pXiqmnSA92OHEEa9HXL2W4E7lf9JzCmGVUdgjX3N/iU=
|
||||
github.com/google/gofuzz v1.0.0/go.mod h1:dBl0BpW6vV/+mYPU4Po3pmUjxk6FQPldtuIdl/M65Eg=
|
||||
github.com/google/go-tpm v0.9.7 h1:u89J4tUUeDTlH8xxC3CTW7OHZjbjKoHdQ9W7gCUhtxA=
|
||||
github.com/google/go-tpm v0.9.7/go.mod h1:h9jEsEECg7gtLis0upRBQU+GhYVH6jMjrFxI8u6bVUY=
|
||||
github.com/google/gofuzz v1.2.0 h1:xRy4A+RhZaiKjJ1bPfwQ8sedCA+YS2YcCHW6ec7JMi0=
|
||||
github.com/google/gofuzz v1.2.0/go.mod h1:dBl0BpW6vV/+mYPU4Po3pmUjxk6FQPldtuIdl/M65Eg=
|
||||
github.com/google/gops v0.3.28 h1:2Xr57tqKAmQYRAfG12E+yLcoa2Y42UJo2lOrUFL9ark=
|
||||
@@ -217,12 +197,9 @@ github.com/joho/godotenv v1.5.1 h1:7eLL/+HRGLY0ldzfGMeQkb7vMd0as4CfYvUVzLqw0N0=
|
||||
github.com/joho/godotenv v1.5.1/go.mod h1:f4LDr5Voq0i2e/R5DDNOoa2zzDfwtkZa6DnEwAbqwq4=
|
||||
github.com/jonboulle/clockwork v0.5.0 h1:Hyh9A8u51kptdkR+cqRpT1EebBwTn1oK9YfGYbdFz6I=
|
||||
github.com/jonboulle/clockwork v0.5.0/go.mod h1:3mZlmanh0g2NDKO5TWZVJAfofYk64M7XN3SzBPjZF60=
|
||||
github.com/jpillora/backoff v1.0.0 h1:uvFg412JmmHBHw7iwprIxkPMI+sGQ4kzOWsMeHnm2EA=
|
||||
github.com/jpillora/backoff v1.0.0/go.mod h1:J/6gKK9jxlEcS3zixgDgUAsiuZ7yrSoa/FX5e0EB2j4=
|
||||
github.com/json-iterator/go v1.1.12 h1:PV8peI4a0ysnczrg+LtxykD8LfKY9ML6u2jnxaEnrnM=
|
||||
github.com/json-iterator/go v1.1.12/go.mod h1:e30LSqwooZae/UwlEbR2852Gd8hjQvJoHmT4TnhNGBo=
|
||||
github.com/klauspost/compress v1.18.1 h1:bcSGx7UbpBqMChDtsF28Lw6v/G94LPrrbMbdC3JH2co=
|
||||
github.com/klauspost/compress v1.18.1/go.mod h1:ZQFFVG+MdnR0P+l6wpXgIL4NTtwiKIdBnrBd8Nrxr+0=
|
||||
github.com/juju/gnuflag v0.0.0-20171113085948-2ce1bb71843d/go.mod h1:2PavIy+JPciBPrBUjwbNvtwB6RQlve+hkpll6QSNmOE=
|
||||
github.com/klauspost/compress v1.18.2 h1:iiPHWW0YrcFgpBYhsA6D1+fqHssJscY/Tm/y2Uqnapk=
|
||||
github.com/klauspost/compress v1.18.2/go.mod h1:R0h/fSBs8DE4ENlcrlib3PsXS61voFxhIs2DeRhCvJ4=
|
||||
github.com/kr/pretty v0.2.1/go.mod h1:ipq/a2n7PKx3OHsz4KJII5eveXtPO4qwEXGdVfWzfnI=
|
||||
github.com/kr/pretty v0.3.1 h1:flRD4NNwYAUpkphVc1HcthR4KEIFJ65n8Mw5qdRn3LE=
|
||||
github.com/kr/pretty v0.3.1/go.mod h1:hoEshYVHaxMs3cyo3Yncou5ZscifuDolrwPKZanG3xk=
|
||||
@@ -243,37 +220,25 @@ github.com/mattn/go-sqlite3 v1.10.0/go.mod h1:FPy6KqzDD04eiIsT53CuJW3U88zkxoIYsO
|
||||
github.com/mattn/go-sqlite3 v1.14.22/go.mod h1:Uh1q+B4BYcTPb+yiD3kU8Ct7aC0hY9fxUwlHK0RXw+Y=
|
||||
github.com/mattn/go-sqlite3 v1.14.33 h1:A5blZ5ulQo2AtayQ9/limgHEkFreKj1Dv226a1K73s0=
|
||||
github.com/mattn/go-sqlite3 v1.14.33/go.mod h1:Uh1q+B4BYcTPb+yiD3kU8Ct7aC0hY9fxUwlHK0RXw+Y=
|
||||
github.com/moby/docker-image-spec v1.3.1 h1:jMKff3w6PgbfSa69GfNg+zN/XLhfXJGnEx3Nl2EsFP0=
|
||||
github.com/moby/docker-image-spec v1.3.1/go.mod h1:eKmb5VW8vQEh/BAr2yvVNvuiJuY6UIocYsFu/DxxRpo=
|
||||
github.com/moby/term v0.5.0 h1:xt8Q1nalod/v7BqbG21f8mQPqH+xAaC9C3N3wfWbVP0=
|
||||
github.com/moby/term v0.5.0/go.mod h1:8FzsFHVUBGZdbDsJw/ot+X+d5HLUbvklYLJ9uGfcI3Y=
|
||||
github.com/modern-go/concurrent v0.0.0-20180228061459-e0a39a4cb421/go.mod h1:6dJC0mAP4ikYIbvyc7fijjWJddQyLn8Ig3JB5CqoB9Q=
|
||||
github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd h1:TRLaZ9cD/w8PVh93nsPXa1VrQ6jlwL5oN8l14QlcNfg=
|
||||
github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd/go.mod h1:6dJC0mAP4ikYIbvyc7fijjWJddQyLn8Ig3JB5CqoB9Q=
|
||||
github.com/modern-go/reflect2 v1.0.2 h1:xBagoLtFs94CBntxluKeaWgTMpvLxC4ur3nMaC9Gz0M=
|
||||
github.com/modern-go/reflect2 v1.0.2/go.mod h1:yWuevngMOJpCy52FWWMvUC8ws7m/LJsjYzDa0/r8luk=
|
||||
github.com/morikuni/aec v1.0.0 h1:nP9CBfwrvYnBRgY6qfDQkygYDmYwOilePFkwzv4dU8A=
|
||||
github.com/morikuni/aec v1.0.0/go.mod h1:BbKIizmSmc5MMPqRYbxO4ZU0S0+P200+tUnFx7PXmsc=
|
||||
github.com/minio/highwayhash v1.0.4-0.20251030100505-070ab1a87a76 h1:KGuD/pM2JpL9FAYvBrnBBeENKZNh6eNtjqytV6TYjnk=
|
||||
github.com/minio/highwayhash v1.0.4-0.20251030100505-070ab1a87a76/go.mod h1:GGYsuwP/fPD6Y9hMiXuapVvlIUEhFhMTh0rxU3ik1LQ=
|
||||
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 h1:C3w9PqII01/Oq1c1nUAm88MOHcQC9l5mIlSMApZMrHA=
|
||||
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822/go.mod h1:+n7T8mK8HuQTcFwEeznm/DIxMOiR9yIdICNftLE1DvQ=
|
||||
github.com/mwitkow/go-conntrack v0.0.0-20190716064945-2f068394615f h1:KUppIJq7/+SVif2QVs3tOP0zanoHgBEVAwHxUSIzRqU=
|
||||
github.com/mwitkow/go-conntrack v0.0.0-20190716064945-2f068394615f/go.mod h1:qRWi+5nqEBWmkhHvq77mSJWrCKwh8bxhgT7d/eI7P4U=
|
||||
github.com/nats-io/jwt/v2 v2.8.0 h1:K7uzyz50+yGZDO5o772eRE7atlcSEENpL7P+b74JV1g=
|
||||
github.com/nats-io/jwt/v2 v2.8.0/go.mod h1:me11pOkwObtcBNR8AiMrUbtVOUGkqYjMQZ6jnSdVUIA=
|
||||
github.com/nats-io/nats-server/v2 v2.12.3 h1:KRv+1n7lddMVgkJPQer+pt36TcO0ENxjilBmeWdjcHs=
|
||||
github.com/nats-io/nats-server/v2 v2.12.3/go.mod h1:MQXjG9WjyXKz9koWzUc3jYUMKD8x3CLmTNy91IQQz3Y=
|
||||
github.com/nats-io/nats.go v1.47.0 h1:YQdADw6J/UfGUd2Oy6tn4Hq6YHxCaJrVKayxxFqYrgM=
|
||||
github.com/nats-io/nats.go v1.47.0/go.mod h1:iRWIPokVIFbVijxuMQq4y9ttaBTMe0SFdlZfMDd+33g=
|
||||
github.com/nats-io/nkeys v0.4.11 h1:q44qGV008kYd9W1b1nEBkNzvnWxtRSQ7A8BoqRrcfa0=
|
||||
github.com/nats-io/nkeys v0.4.11/go.mod h1:szDimtgmfOi9n25JpfIdGw12tZFYXqhGxjhVxsatHVE=
|
||||
github.com/nats-io/nkeys v0.4.12 h1:nssm7JKOG9/x4J8II47VWCL1Ds29avyiQDRn0ckMvDc=
|
||||
github.com/nats-io/nkeys v0.4.12/go.mod h1:MT59A1HYcjIcyQDJStTfaOY6vhy9XTUjOFo+SVsvpBg=
|
||||
github.com/nats-io/nuid v1.0.1 h1:5iA8DT8V7q8WK2EScv2padNa/rTESc1KdnPw4TC2paw=
|
||||
github.com/nats-io/nuid v1.0.1/go.mod h1:19wcPz3Ph3q0Jbyiqsd0kePYG7A95tJPxeL+1OSON2c=
|
||||
github.com/niemeyer/pretty v0.0.0-20200227124842-a10e7caefd8e/go.mod h1:zD1mROLANZcx1PVRCS0qkT7pwLkGfwJo4zjcN/Tysno=
|
||||
github.com/oapi-codegen/runtime v1.1.1 h1:EXLHh0DXIJnWhdRPN2w4MXAzFyE4CskzhNLUmtpMYro=
|
||||
github.com/oapi-codegen/runtime v1.1.1/go.mod h1:SK9X900oXmPWilYR5/WKPzt3Kqxn/uS/+lbpREv+eCg=
|
||||
github.com/opencontainers/go-digest v1.0.0 h1:apOUWs51W5PlhuyGyz9FCeeBIOUDA/6nW8Oi/yOhh5U=
|
||||
github.com/opencontainers/go-digest v1.0.0/go.mod h1:0JzlMkj0TRzQZfJkVvzbP0HBR3IKzErnv2BNG4W4MAM=
|
||||
github.com/opencontainers/image-spec v1.1.0 h1:8SG7/vwALn54lVB/0yZ/MMwhFrPYtpEHQb2IpWsCzug=
|
||||
github.com/opencontainers/image-spec v1.1.0/go.mod h1:W4s4sFTMaBeK1BQLXbG4AdM2szdn85PY75RI83NrTrM=
|
||||
github.com/opentracing/opentracing-go v1.1.0/go.mod h1:UkNAQd3GIcIGf0SeVgPpRdFStlNbqXla1AfSYxPUl2o=
|
||||
github.com/pkg/errors v0.9.1 h1:FEBLx1zS214owpjy7qsBeixbURkuhQAwrK5UwLGTwt4=
|
||||
github.com/pkg/errors v0.9.1/go.mod h1:bwawxfHBFNV+L2hUp1rHADufV3IMtnDRdf1r5NINEl0=
|
||||
github.com/pkg/diff v0.0.0-20210226163009-20ebb0f2a09e/go.mod h1:pJLUxLENpZxwdsKMEsNbx1VGcRFpLqf3715MtcvvzbA=
|
||||
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
|
||||
github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 h1:Jamvg5psRIccs7FGNTlIRMkT8wgtp5eCXdBlqhYGL6U=
|
||||
github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
|
||||
@@ -289,6 +254,7 @@ github.com/qustavo/sqlhooks/v2 v2.1.0 h1:54yBemHnGHp/7xgT+pxwmIlMSDNYKx5JW5dfRAi
|
||||
github.com/qustavo/sqlhooks/v2 v2.1.0/go.mod h1:aMREyKo7fOKTwiLuWPsaHRXEmtqG4yREztO0idF83AU=
|
||||
github.com/robfig/cron/v3 v3.0.1 h1:WdRxkvbJztn8LMz/QEvLN5sBU+xKpSqwwUO1Pjr4qDs=
|
||||
github.com/robfig/cron/v3 v3.0.1/go.mod h1:eQICP3HwyT7UooqI/z+Ov+PtYAWygg1TEWWzGIFLtro=
|
||||
github.com/rogpeppe/go-internal v1.9.0/go.mod h1:WtVeX8xhTBvf0smdhujwtBcq4Qrzq/fJaraNFVN+nFs=
|
||||
github.com/rogpeppe/go-internal v1.14.1 h1:UQB4HGPB6osV0SQTLymcB4TgvyWu6ZyliaW0tI/otEQ=
|
||||
github.com/rogpeppe/go-internal v1.14.1/go.mod h1:MaRKkUm5W0goXpeCfT7UZI6fk/L7L7so1lCWt35ZSgc=
|
||||
github.com/russross/blackfriday/v2 v2.1.0 h1:JIOH55/0cWyOuilr9/qlrm0BSXldqnqwMsf35Ld67mk=
|
||||
@@ -299,6 +265,9 @@ github.com/sergi/go-diff v1.3.1 h1:xkr+Oxo4BOQKmkn/B9eMK0g5Kg/983T9DqqPHwYqD+8=
|
||||
github.com/sergi/go-diff v1.3.1/go.mod h1:aMJSSKb2lpPvRNec0+w3fl7LP9IOFzdc9Pa4NFbPK1I=
|
||||
github.com/sosodev/duration v1.3.1 h1:qtHBDMQ6lvMQsL15g4aopM4HEfOaYuhWBw3NPTtlqq4=
|
||||
github.com/sosodev/duration v1.3.1/go.mod h1:RQIBBX0+fMLc/D9+Jb/fwvVmo0eZvDDEERAikUR6SDg=
|
||||
github.com/spkg/bom v0.0.0-20160624110644-59b7046e48ad/go.mod h1:qLr4V1qq6nMqFKkMo8ZTx3f+BZEkzsRUY10Xsm2mwU0=
|
||||
github.com/stmcginnis/gofish v0.20.0 h1:hH2V2Qe898F2wWT1loApnkDUrXXiLKqbSlMaH3Y1n08=
|
||||
github.com/stmcginnis/gofish v0.20.0/go.mod h1:PzF5i8ecRG9A2ol8XT64npKUunyraJ+7t0kYMpQAtqU=
|
||||
github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME=
|
||||
github.com/stretchr/objx v0.4.0/go.mod h1:YvHI0jy2hoMjB+UWwv71VJQ9isScKT/TqJzVSSt89Yw=
|
||||
github.com/stretchr/objx v0.5.2 h1:xuMeJ0Sdp5ZMRXx/aWO6RZxdr3beISkG5/G/aIRr3pY=
|
||||
@@ -325,16 +294,6 @@ github.com/vektah/gqlparser/v2 v2.5.31/go.mod h1:c1I28gSOVNzlfc4WuDlqU7voQnsqI6O
|
||||
github.com/xrash/smetrics v0.0.0-20250705151800-55b8f293f342 h1:FnBeRrxr7OU4VvAzt5X7s6266i6cSVkkFPS0TuXWbIg=
|
||||
github.com/xrash/smetrics v0.0.0-20250705151800-55b8f293f342/go.mod h1:Ohn+xnUBiLI6FVj/9LpzZWtj1/D6lUovWYBkxHVV3aM=
|
||||
github.com/yuin/goldmark v1.4.13/go.mod h1:6yULJ656Px+3vBD8DxQVa3kxgyrAnzto9xy5taEt/CY=
|
||||
go.opentelemetry.io/auto/sdk v1.1.0 h1:cH53jehLUN6UFLY71z+NDOiNJqDdPRaXzTel0sJySYA=
|
||||
go.opentelemetry.io/auto/sdk v1.1.0/go.mod h1:3wSPjt5PWp2RhlCcmmOial7AvC4DQqZb7a7wCow3W8A=
|
||||
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.61.0 h1:F7Jx+6hwnZ41NSFTO5q4LYDtJRXBf2PD0rNBkeB/lus=
|
||||
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp v0.61.0/go.mod h1:UHB22Z8QsdRDrnAtX4PntOl36ajSxcdUMt1sF7Y6E7Q=
|
||||
go.opentelemetry.io/otel v1.37.0 h1:9zhNfelUvx0KBfu/gb+ZgeAfAgtWrfHJZcAqFC228wQ=
|
||||
go.opentelemetry.io/otel v1.37.0/go.mod h1:ehE/umFRLnuLa/vSccNq9oS1ErUlkkK71gMcN34UG8I=
|
||||
go.opentelemetry.io/otel/metric v1.37.0 h1:mvwbQS5m0tbmqML4NqK+e3aDiO02vsf/WgbsdpcPoZE=
|
||||
go.opentelemetry.io/otel/metric v1.37.0/go.mod h1:04wGrZurHYKOc+RKeye86GwKiTb9FKm1WHtO+4EVr2E=
|
||||
go.opentelemetry.io/otel/trace v1.37.0 h1:HLdcFNbRQBE2imdSEgm/kwqmQj1Or1l/7bW6mxVK7z4=
|
||||
go.opentelemetry.io/otel/trace v1.37.0/go.mod h1:TlgrlQ+PtQO5XFerSPUYG0JSgGyryXewPGyayAWSBS0=
|
||||
go.uber.org/goleak v1.3.0 h1:2K3zAYmnTNqV73imy9J1T3WC+gmCePx2hEGkimedGto=
|
||||
go.uber.org/goleak v1.3.0/go.mod h1:CoHD4mav9JJNrW/WLlf7HGZPjdw8EucARQHekz1X6bE=
|
||||
go.yaml.in/yaml/v2 v2.4.3 h1:6gvOSjQoTB3vt1l+CU+tSyi/HOjfOjRLJ4YwYZGwRO0=
|
||||
|
||||
52 gqlgen.yml
@@ -52,51 +52,51 @@ models:
|
||||
- github.com/99designs/gqlgen/graphql.Int64
|
||||
- github.com/99designs/gqlgen/graphql.Int32
|
||||
Job:
|
||||
model: "github.com/ClusterCockpit/cc-lib/schema.Job"
|
||||
model: "github.com/ClusterCockpit/cc-lib/v2/schema.Job"
|
||||
fields:
|
||||
tags:
|
||||
resolver: true
|
||||
metaData:
|
||||
resolver: true
|
||||
Cluster:
|
||||
model: "github.com/ClusterCockpit/cc-lib/schema.Cluster"
|
||||
model: "github.com/ClusterCockpit/cc-lib/v2/schema.Cluster"
|
||||
fields:
|
||||
partitions:
|
||||
resolver: true
|
||||
# Node:
|
||||
# model: "github.com/ClusterCockpit/cc-lib/schema.Node"
|
||||
# model: "github.com/ClusterCockpit/cc-lib/v2/schema.Node"
|
||||
# fields:
|
||||
# metaData:
|
||||
# resolver: true
|
||||
NullableFloat: { model: "github.com/ClusterCockpit/cc-lib/schema.Float" }
|
||||
MetricScope: { model: "github.com/ClusterCockpit/cc-lib/schema.MetricScope" }
|
||||
MetricValue: { model: "github.com/ClusterCockpit/cc-lib/schema.MetricValue" }
|
||||
NullableFloat: { model: "github.com/ClusterCockpit/cc-lib/v2/schema.Float" }
|
||||
MetricScope: { model: "github.com/ClusterCockpit/cc-lib/v2/schema.MetricScope" }
|
||||
MetricValue: { model: "github.com/ClusterCockpit/cc-lib/v2/schema.MetricValue" }
|
||||
JobStatistics:
|
||||
{ model: "github.com/ClusterCockpit/cc-lib/schema.JobStatistics" }
|
||||
{ model: "github.com/ClusterCockpit/cc-lib/v2/schema.JobStatistics" }
|
||||
GlobalMetricListItem:
|
||||
{ model: "github.com/ClusterCockpit/cc-lib/schema.GlobalMetricListItem" }
|
||||
{ model: "github.com/ClusterCockpit/cc-lib/v2/schema.GlobalMetricListItem" }
|
||||
ClusterSupport:
|
||||
{ model: "github.com/ClusterCockpit/cc-lib/schema.ClusterSupport" }
|
||||
Tag: { model: "github.com/ClusterCockpit/cc-lib/schema.Tag" }
|
||||
Resource: { model: "github.com/ClusterCockpit/cc-lib/schema.Resource" }
|
||||
JobState: { model: "github.com/ClusterCockpit/cc-lib/schema.JobState" }
|
||||
Node: { model: "github.com/ClusterCockpit/cc-lib/schema.Node" }
|
||||
{ model: "github.com/ClusterCockpit/cc-lib/v2/schema.ClusterSupport" }
|
||||
Tag: { model: "github.com/ClusterCockpit/cc-lib/v2/schema.Tag" }
|
||||
Resource: { model: "github.com/ClusterCockpit/cc-lib/v2/schema.Resource" }
|
||||
JobState: { model: "github.com/ClusterCockpit/cc-lib/v2/schema.JobState" }
|
||||
Node: { model: "github.com/ClusterCockpit/cc-lib/v2/schema.Node" }
|
||||
SchedulerState:
|
||||
{ model: "github.com/ClusterCockpit/cc-lib/schema.SchedulerState" }
|
||||
{ model: "github.com/ClusterCockpit/cc-lib/v2/schema.SchedulerState" }
|
||||
HealthState:
|
||||
{ model: "github.com/ClusterCockpit/cc-lib/schema.MonitoringState" }
|
||||
JobMetric: { model: "github.com/ClusterCockpit/cc-lib/schema.JobMetric" }
|
||||
Series: { model: "github.com/ClusterCockpit/cc-lib/schema.Series" }
|
||||
{ model: "github.com/ClusterCockpit/cc-lib/v2/schema.MonitoringState" }
|
||||
JobMetric: { model: "github.com/ClusterCockpit/cc-lib/v2/schema.JobMetric" }
|
||||
Series: { model: "github.com/ClusterCockpit/cc-lib/v2/schema.Series" }
|
||||
MetricStatistics:
|
||||
{ model: "github.com/ClusterCockpit/cc-lib/schema.MetricStatistics" }
|
||||
{ model: "github.com/ClusterCockpit/cc-lib/v2/schema.MetricStatistics" }
|
||||
MetricConfig:
|
||||
{ model: "github.com/ClusterCockpit/cc-lib/schema.MetricConfig" }
|
||||
{ model: "github.com/ClusterCockpit/cc-lib/v2/schema.MetricConfig" }
|
||||
SubClusterConfig:
|
||||
{ model: "github.com/ClusterCockpit/cc-lib/schema.SubClusterConfig" }
|
||||
Accelerator: { model: "github.com/ClusterCockpit/cc-lib/schema.Accelerator" }
|
||||
Topology: { model: "github.com/ClusterCockpit/cc-lib/schema.Topology" }
|
||||
{ model: "github.com/ClusterCockpit/cc-lib/v2/schema.SubClusterConfig" }
|
||||
Accelerator: { model: "github.com/ClusterCockpit/cc-lib/v2/schema.Accelerator" }
|
||||
Topology: { model: "github.com/ClusterCockpit/cc-lib/v2/schema.Topology" }
|
||||
FilterRanges:
|
||||
{ model: "github.com/ClusterCockpit/cc-lib/schema.FilterRanges" }
|
||||
SubCluster: { model: "github.com/ClusterCockpit/cc-lib/schema.SubCluster" }
|
||||
StatsSeries: { model: "github.com/ClusterCockpit/cc-lib/schema.StatsSeries" }
|
||||
Unit: { model: "github.com/ClusterCockpit/cc-lib/schema.Unit" }
|
||||
{ model: "github.com/ClusterCockpit/cc-lib/v2/schema.FilterRanges" }
|
||||
SubCluster: { model: "github.com/ClusterCockpit/cc-lib/v2/schema.SubCluster" }
|
||||
StatsSeries: { model: "github.com/ClusterCockpit/cc-lib/v2/schema.StatsSeries" }
|
||||
Unit: { model: "github.com/ClusterCockpit/cc-lib/v2/schema.Unit" }
|
||||
|
||||
@@ -3,7 +3,7 @@ Description=ClusterCockpit Web Server
Documentation=https://github.com/ClusterCockpit/cc-backend
Wants=network-online.target
After=network-online.target
After=mariadb.service mysql.service
# Database is file-based SQLite - no service dependency required

[Service]
WorkingDirectory=/opt/monitoring/cc-backend

@@ -23,47 +23,38 @@ import (
|
||||
"github.com/ClusterCockpit/cc-backend/internal/auth"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/config"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/graph"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/metricDataDispatcher"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/metricdata"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/metricdispatch"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/metricstore"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/repository"
|
||||
"github.com/ClusterCockpit/cc-backend/pkg/archive"
|
||||
ccconf "github.com/ClusterCockpit/cc-lib/ccConfig"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
ccconf "github.com/ClusterCockpit/cc-lib/v2/ccConfig"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
"github.com/gorilla/mux"
|
||||
|
||||
_ "github.com/mattn/go-sqlite3"
|
||||
)
|
||||
|
||||
func setup(t *testing.T) *api.RestAPI {
|
||||
repository.ResetConnection()
|
||||
|
||||
const testconfig = `{
|
||||
"main": {
|
||||
"addr": "0.0.0.0:8080",
|
||||
"validate": false,
|
||||
"apiAllowedIPs": [
|
||||
"*"
|
||||
]
|
||||
},
|
||||
"main": {
|
||||
"addr": "0.0.0.0:8080",
|
||||
"validate": false,
|
||||
"apiAllowedIPs": [
|
||||
"*"
|
||||
]
|
||||
},
|
||||
"archive": {
|
||||
"kind": "file",
|
||||
"path": "./var/job-archive"
|
||||
"kind": "file",
|
||||
"path": "./var/job-archive"
|
||||
},
|
||||
"auth": {
|
||||
"jwts": {
|
||||
"max-age": "2m"
|
||||
"jwts": {
|
||||
"max-age": "2m"
|
||||
}
|
||||
}
|
||||
},
|
||||
"clusters": [
|
||||
{
|
||||
"name": "testcluster",
|
||||
"metricDataRepository": {"kind": "test", "url": "bla:8081"},
|
||||
"filterRanges": {
|
||||
"numNodes": { "from": 1, "to": 64 },
|
||||
"duration": { "from": 0, "to": 86400 },
|
||||
"startTime": { "from": "2022-01-01T00:00:00Z", "to": null }
|
||||
}
|
||||
}
|
||||
]
|
||||
}`
|
||||
const testclusterJSON = `{
|
||||
"name": "testcluster",
|
||||
@@ -141,7 +132,7 @@ func setup(t *testing.T) *api.RestAPI {
|
||||
}
|
||||
|
||||
dbfilepath := filepath.Join(tmpdir, "test.db")
|
||||
err := repository.MigrateDB("sqlite3", dbfilepath)
|
||||
err := repository.MigrateDB(dbfilepath)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
@@ -155,11 +146,7 @@ func setup(t *testing.T) *api.RestAPI {
|
||||
|
||||
// Load and check main configuration
|
||||
if cfg := ccconf.GetPackageConfig("main"); cfg != nil {
|
||||
if clustercfg := ccconf.GetPackageConfig("clusters"); clustercfg != nil {
|
||||
config.Init(cfg, clustercfg)
|
||||
} else {
|
||||
cclog.Abort("Cluster configuration must be present")
|
||||
}
|
||||
config.Init(cfg)
|
||||
} else {
|
||||
cclog.Abort("Main configuration must be present")
|
||||
}
|
||||
@@ -171,9 +158,7 @@ func setup(t *testing.T) *api.RestAPI {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
if err := metricdata.Init(); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
// metricstore initialization removed - it's initialized via callback in tests
|
||||
|
||||
archiver.Start(repository.GetJobRepository(), context.Background())
|
||||
|
||||
@@ -190,11 +175,9 @@ func setup(t *testing.T) *api.RestAPI {
|
||||
}
|
||||
|
||||
func cleanup() {
|
||||
// Gracefully shutdown archiver with timeout
|
||||
if err := archiver.Shutdown(5 * time.Second); err != nil {
|
||||
cclog.Warnf("Archiver shutdown timeout in tests: %v", err)
|
||||
}
|
||||
// TODO: Clear all caches, reset all modules, etc...
|
||||
}
|
||||
|
||||
/*
|
||||
@@ -221,7 +204,7 @@ func TestRestApi(t *testing.T) {
|
||||
},
|
||||
}
|
||||
|
||||
metricdata.TestLoadDataCallback = func(job *schema.Job, metrics []string, scopes []schema.MetricScope, ctx context.Context, resolution int) (schema.JobData, error) {
|
||||
metricstore.TestLoadDataCallback = func(job *schema.Job, metrics []string, scopes []schema.MetricScope, ctx context.Context, resolution int) (schema.JobData, error) {
|
||||
return testData, nil
|
||||
}
|
||||
|
||||
@@ -230,7 +213,7 @@ func TestRestApi(t *testing.T) {
|
||||
r.StrictSlash(true)
|
||||
restapi.MountAPIRoutes(r)
|
||||
|
||||
var TestJobId int64 = 123
|
||||
var TestJobID int64 = 123
|
||||
TestClusterName := "testcluster"
|
||||
var TestStartTime int64 = 123456789
|
||||
|
||||
@@ -280,7 +263,7 @@ func TestRestApi(t *testing.T) {
|
||||
}
|
||||
// resolver := graph.GetResolverInstance()
|
||||
restapi.JobRepository.SyncJobs()
|
||||
job, err := restapi.JobRepository.Find(&TestJobId, &TestClusterName, &TestStartTime)
|
||||
job, err := restapi.JobRepository.Find(&TestJobID, &TestClusterName, &TestStartTime)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
@@ -338,7 +321,7 @@ func TestRestApi(t *testing.T) {
|
||||
}
|
||||
|
||||
// Archiving happens asynchronously, will be completed in cleanup
|
||||
job, err := restapi.JobRepository.Find(&TestJobId, &TestClusterName, &TestStartTime)
|
||||
job, err := restapi.JobRepository.Find(&TestJobID, &TestClusterName, &TestStartTime)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
@@ -366,7 +349,7 @@ func TestRestApi(t *testing.T) {
|
||||
}
|
||||
|
||||
t.Run("CheckArchive", func(t *testing.T) {
|
||||
data, err := metricDataDispatcher.LoadData(stoppedJob, []string{"load_one"}, []schema.MetricScope{schema.MetricScopeNode}, context.Background(), 60)
|
||||
data, err := metricdispatch.LoadData(stoppedJob, []string{"load_one"}, []schema.MetricScope{schema.MetricScopeNode}, context.Background(), 60)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
@@ -13,7 +13,7 @@ import (
|
||||
|
||||
"github.com/ClusterCockpit/cc-backend/internal/repository"
|
||||
"github.com/ClusterCockpit/cc-backend/pkg/archive"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
)
|
||||
|
||||
// GetClustersAPIResponse model
|
||||
@@ -27,7 +27,7 @@ type GetClustersAPIResponse struct {
|
||||
// @description Get a list of all cluster configs. Specific cluster can be requested using query parameter.
|
||||
// @produce json
|
||||
// @param cluster query string false "Job Cluster"
|
||||
// @success 200 {object} api.GetClustersApiResponse "Array of clusters"
|
||||
// @success 200 {object} api.GetClustersAPIResponse "Array of clusters"
|
||||
// @failure 400 {object} api.ErrorResponse "Bad Request"
|
||||
// @failure 401 {object} api.ErrorResponse "Unauthorized"
|
||||
// @failure 403 {object} api.ErrorResponse "Forbidden"
|
||||
|
||||
1178 internal/api/docs.go (file diff suppressed because it is too large)
@@ -22,11 +22,11 @@ import (
|
||||
"github.com/ClusterCockpit/cc-backend/internal/graph"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/graph/model"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/importer"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/metricDataDispatcher"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/metricdispatch"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/repository"
|
||||
"github.com/ClusterCockpit/cc-backend/pkg/archive"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
"github.com/gorilla/mux"
|
||||
)
|
||||
|
||||
@@ -104,7 +104,7 @@ type JobMetricWithName struct {
|
||||
// @param items-per-page query int false "Items per page (Default: 25)"
|
||||
// @param page query int false "Page Number (Default: 1)"
|
||||
// @param with-metadata query bool false "Include metadata (e.g. jobScript) in response"
|
||||
// @success 200 {object} api.GetJobsApiResponse "Job array and page info"
|
||||
// @success 200 {object} api.GetJobsAPIResponse "Job array and page info"
|
||||
// @failure 400 {object} api.ErrorResponse "Bad Request"
|
||||
// @failure 401 {object} api.ErrorResponse "Unauthorized"
|
||||
// @failure 403 {object} api.ErrorResponse "Forbidden"
|
||||
@@ -232,7 +232,7 @@ func (api *RestAPI) getJobs(rw http.ResponseWriter, r *http.Request) {
|
||||
// @produce json
|
||||
// @param id path int true "Database ID of Job"
|
||||
// @param all-metrics query bool false "Include all available metrics"
|
||||
// @success 200 {object} api.GetJobApiResponse "Job resource"
|
||||
// @success 200 {object} api.GetJobAPIResponse "Job resource"
|
||||
// @failure 400 {object} api.ErrorResponse "Bad Request"
|
||||
// @failure 401 {object} api.ErrorResponse "Unauthorized"
|
||||
// @failure 403 {object} api.ErrorResponse "Forbidden"
|
||||
@@ -293,7 +293,7 @@ func (api *RestAPI) getCompleteJobByID(rw http.ResponseWriter, r *http.Request)
|
||||
}
|
||||
|
||||
if r.URL.Query().Get("all-metrics") == "true" {
|
||||
data, err = metricDataDispatcher.LoadData(job, nil, scopes, r.Context(), resolution)
|
||||
data, err = metricdispatch.LoadData(job, nil, scopes, r.Context(), resolution)
|
||||
if err != nil {
|
||||
cclog.Warnf("REST: error while loading all-metrics job data for JobID %d on %s", job.JobID, job.Cluster)
|
||||
return
|
||||
@@ -324,8 +324,8 @@ func (api *RestAPI) getCompleteJobByID(rw http.ResponseWriter, r *http.Request)
|
||||
// @accept json
|
||||
// @produce json
|
||||
// @param id path int true "Database ID of Job"
|
||||
// @param request body api.GetJobApiRequest true "Array of metric names"
|
||||
// @success 200 {object} api.GetJobApiResponse "Job resource"
|
||||
// @param request body api.GetJobAPIRequest true "Array of metric names"
|
||||
// @success 200 {object} api.GetJobAPIResponse "Job resource"
|
||||
// @failure 400 {object} api.ErrorResponse "Bad Request"
|
||||
// @failure 401 {object} api.ErrorResponse "Unauthorized"
|
||||
// @failure 403 {object} api.ErrorResponse "Forbidden"
|
||||
@@ -389,7 +389,7 @@ func (api *RestAPI) getJobByID(rw http.ResponseWriter, r *http.Request) {
|
||||
resolution = max(resolution, mc.Timestep)
|
||||
}
|
||||
|
||||
data, err := metricDataDispatcher.LoadData(job, metrics, scopes, r.Context(), resolution)
|
||||
data, err := metricdispatch.LoadData(job, metrics, scopes, r.Context(), resolution)
|
||||
if err != nil {
|
||||
cclog.Warnf("REST: error while loading job data for JobID %d on %s", job.JobID, job.Cluster)
|
||||
return
|
||||
@@ -478,7 +478,7 @@ func (api *RestAPI) editMeta(rw http.ResponseWriter, r *http.Request) {
|
||||
// @accept json
|
||||
// @produce json
|
||||
// @param id path int true "Job Database ID"
|
||||
// @param request body api.TagJobApiRequest true "Array of tag-objects to add"
|
||||
// @param request body api.TagJobAPIRequest true "Array of tag-objects to add"
|
||||
// @success 200 {object} schema.Job "Updated job resource"
|
||||
// @failure 400 {object} api.ErrorResponse "Bad Request"
|
||||
// @failure 401 {object} api.ErrorResponse "Unauthorized"
|
||||
@@ -542,7 +542,7 @@ func (api *RestAPI) tagJob(rw http.ResponseWriter, r *http.Request) {
|
||||
// @accept json
|
||||
// @produce json
|
||||
// @param id path int true "Job Database ID"
|
||||
// @param request body api.TagJobApiRequest true "Array of tag-objects to remove"
|
||||
// @param request body api.TagJobAPIRequest true "Array of tag-objects to remove"
|
||||
// @success 200 {object} schema.Job "Updated job resource"
|
||||
// @failure 400 {object} api.ErrorResponse "Bad Request"
|
||||
// @failure 401 {object} api.ErrorResponse "Unauthorized"
|
||||
@@ -606,7 +606,7 @@ func (api *RestAPI) removeTagJob(rw http.ResponseWriter, r *http.Request) {
|
||||
// @description Tags will be removed from respective archive files.
|
||||
// @accept json
|
||||
// @produce plain
|
||||
// @param request body api.TagJobApiRequest true "Array of tag-objects to remove"
|
||||
// @param request body api.TagJobAPIRequest true "Array of tag-objects to remove"
|
||||
// @success 200 {string} string "Success Response"
|
||||
// @failure 400 {object} api.ErrorResponse "Bad Request"
|
||||
// @failure 401 {object} api.ErrorResponse "Unauthorized"
|
||||
@@ -650,7 +650,7 @@ func (api *RestAPI) removeTags(rw http.ResponseWriter, r *http.Request) {
|
||||
// @accept json
|
||||
// @produce json
|
||||
// @param request body schema.Job true "Job to add"
|
||||
// @success 201 {object} api.DefaultApiResponse "Job added successfully"
|
||||
// @success 201 {object} api.DefaultAPIResponse "Job added successfully"
|
||||
// @failure 400 {object} api.ErrorResponse "Bad Request"
|
||||
// @failure 401 {object} api.ErrorResponse "Unauthorized"
|
||||
// @failure 403 {object} api.ErrorResponse "Forbidden"
|
||||
@@ -728,7 +728,7 @@ func (api *RestAPI) startJob(rw http.ResponseWriter, r *http.Request) {
|
||||
// @description Job to stop is specified by request body. All fields are required in this case.
|
||||
// @description Returns full job resource information according to 'Job' scheme.
|
||||
// @produce json
|
||||
// @param request body api.StopJobApiRequest true "All fields required"
|
||||
// @param request body api.StopJobAPIRequest true "All fields required"
|
||||
// @success 200 {object} schema.Job "Success message"
|
||||
// @failure 400 {object} api.ErrorResponse "Bad Request"
|
||||
// @failure 401 {object} api.ErrorResponse "Unauthorized"
|
||||
@@ -754,7 +754,6 @@ func (api *RestAPI) stopJobByRequest(rw http.ResponseWriter, r *http.Request) {
|
||||
return
|
||||
}
|
||||
|
||||
// cclog.Printf("loading db job for stopJobByRequest... : stopJobApiRequest=%v", req)
|
||||
job, err = api.JobRepository.Find(req.JobID, req.Cluster, req.StartTime)
|
||||
if err != nil {
|
||||
// Try cached jobs if not found in main repository
|
||||
@@ -776,7 +775,7 @@ func (api *RestAPI) stopJobByRequest(rw http.ResponseWriter, r *http.Request) {
|
||||
// @description Job to remove is specified by database ID. This will not remove the job from the job archive.
|
||||
// @produce json
|
||||
// @param id path int true "Database ID of Job"
|
||||
// @success 200 {object} api.DefaultApiResponse "Success message"
|
||||
// @success 200 {object} api.DefaultAPIResponse "Success message"
|
||||
// @failure 400 {object} api.ErrorResponse "Bad Request"
|
||||
// @failure 401 {object} api.ErrorResponse "Unauthorized"
|
||||
// @failure 403 {object} api.ErrorResponse "Forbidden"
|
||||
@@ -820,8 +819,8 @@ func (api *RestAPI) deleteJobByID(rw http.ResponseWriter, r *http.Request) {
|
||||
// @description Job to delete is specified by request body. All fields are required in this case.
|
||||
// @accept json
|
||||
// @produce json
|
||||
// @param request body api.DeleteJobApiRequest true "All fields required"
|
||||
// @success 200 {object} api.DefaultApiResponse "Success message"
|
||||
// @param request body api.DeleteJobAPIRequest true "All fields required"
|
||||
// @success 200 {object} api.DefaultAPIResponse "Success message"
|
||||
// @failure 400 {object} api.ErrorResponse "Bad Request"
|
||||
// @failure 401 {object} api.ErrorResponse "Unauthorized"
|
||||
// @failure 403 {object} api.ErrorResponse "Forbidden"
|
||||
@@ -873,7 +872,7 @@ func (api *RestAPI) deleteJobByRequest(rw http.ResponseWriter, r *http.Request)
|
||||
// @description Remove all jobs with start time before timestamp. The jobs will not be removed from the job archive.
|
||||
// @produce json
|
||||
// @param ts path int true "Unix epoch timestamp"
|
||||
// @success 200 {object} api.DefaultApiResponse "Success message"
|
||||
// @success 200 {object} api.DefaultAPIResponse "Success message"
|
||||
// @failure 400 {object} api.ErrorResponse "Bad Request"
|
||||
// @failure 401 {object} api.ErrorResponse "Unauthorized"
|
||||
// @failure 403 {object} api.ErrorResponse "Forbidden"
|
||||
|
||||
@@ -15,8 +15,8 @@ import (
|
||||
"strconv"
|
||||
"strings"
|
||||
|
||||
"github.com/ClusterCockpit/cc-backend/internal/memorystore"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/metricstore"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
|
||||
"github.com/influxdata/line-protocol/v2/lineprotocol"
|
||||
)
|
||||
@@ -58,7 +58,7 @@ func freeMetrics(rw http.ResponseWriter, r *http.Request) {
|
||||
return
|
||||
}
|
||||
|
||||
ms := memorystore.GetMemoryStore()
|
||||
ms := metricstore.GetMemoryStore()
|
||||
n := 0
|
||||
for _, sel := range selectors {
|
||||
bn, err := ms.Free(sel, to)
|
||||
@@ -97,9 +97,9 @@ func writeMetrics(rw http.ResponseWriter, r *http.Request) {
|
||||
return
|
||||
}
|
||||
|
||||
ms := memorystore.GetMemoryStore()
|
||||
ms := metricstore.GetMemoryStore()
|
||||
dec := lineprotocol.NewDecoderWithBytes(bytes)
|
||||
if err := memorystore.DecodeLine(dec, ms, r.URL.Query().Get("cluster")); err != nil {
|
||||
if err := metricstore.DecodeLine(dec, ms, r.URL.Query().Get("cluster")); err != nil {
|
||||
cclog.Errorf("/api/write error: %s", err.Error())
|
||||
handleError(err, http.StatusBadRequest, rw)
|
||||
return
|
||||
@@ -129,7 +129,7 @@ func debugMetrics(rw http.ResponseWriter, r *http.Request) {
|
||||
selector = strings.Split(raw, ":")
|
||||
}
|
||||
|
||||
ms := memorystore.GetMemoryStore()
|
||||
ms := metricstore.GetMemoryStore()
|
||||
if err := ms.DebugDump(bufio.NewWriter(rw), selector); err != nil {
|
||||
handleError(err, http.StatusBadRequest, rw)
|
||||
return
|
||||
@@ -162,7 +162,7 @@ func metricsHealth(rw http.ResponseWriter, r *http.Request) {
|
||||
|
||||
selector := []string{rawCluster, rawNode}
|
||||
|
||||
ms := memorystore.GetMemoryStore()
|
||||
ms := metricstore.GetMemoryStore()
|
||||
if err := ms.HealthCheck(bufio.NewWriter(rw), selector); err != nil {
|
||||
handleError(err, http.StatusBadRequest, rw)
|
||||
return
|
||||
@@ -6,9 +6,9 @@
|
||||
package api
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
"database/sql"
|
||||
"encoding/json"
|
||||
"strings"
|
||||
"sync"
|
||||
"time"
|
||||
|
||||
@@ -17,12 +17,48 @@ import (
|
||||
"github.com/ClusterCockpit/cc-backend/internal/importer"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/repository"
|
||||
"github.com/ClusterCockpit/cc-backend/pkg/nats"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
lp "github.com/ClusterCockpit/cc-lib/v2/ccMessage"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/receivers"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
influx "github.com/influxdata/line-protocol/v2/lineprotocol"
|
||||
)
|
||||
|
||||
// NatsAPI provides NATS subscription-based handlers for Job and Node operations.
|
||||
// It mirrors the functionality of the REST API but uses NATS messaging.
|
||||
// It mirrors the functionality of the REST API but uses NATS messaging with
|
||||
// InfluxDB line protocol as the message format.
|
||||
//
|
||||
// # Message Format
|
||||
//
|
||||
// All NATS messages use InfluxDB line protocol format (https://docs.influxdata.com/influxdb/v2.0/reference/syntax/line-protocol/)
|
||||
// with the following structure:
|
||||
//
|
||||
// measurement,tag1=value1,tag2=value2 field1=value1,field2=value2 timestamp
|
||||
//
|
||||
// # Job Events
|
||||
//
|
||||
// Job start/stop events use the "job" measurement with a "function" tag to distinguish operations:
|
||||
//
|
||||
// job,function=start_job event="{...JSON payload...}" <timestamp>
|
||||
// job,function=stop_job event="{...JSON payload...}" <timestamp>
|
||||
//
|
||||
// The JSON payload in the "event" field follows the schema.Job or StopJobAPIRequest structure.
|
||||
//
|
||||
// Example job start message:
|
||||
//
|
||||
// job,function=start_job event="{\"jobId\":1001,\"user\":\"testuser\",\"cluster\":\"testcluster\",...}" 1234567890000000000
|
||||
//
|
||||
// # Node State Events
|
||||
//
|
||||
// Node state updates use the "nodestate" measurement with cluster information:
|
||||
//
|
||||
// nodestate event="{...JSON payload...}" <timestamp>
|
||||
//
|
||||
// The JSON payload follows the UpdateNodeStatesRequest structure.
|
||||
//
|
||||
// Example node state message:
|
||||
//
|
||||
// nodestate event="{\"cluster\":\"testcluster\",\"nodes\":[{\"hostname\":\"node01\",\"states\":[\"idle\"]}]}" 1234567890000000000
|
||||
type NatsAPI struct {
|
||||
JobRepository *repository.JobRepository
|
||||
// RepositoryMutex protects job creation operations from race conditions
|
||||
@@ -50,11 +86,7 @@ func (api *NatsAPI) StartSubscriptions() error {
|
||||
|
||||
s := config.Keys.APISubjects
|
||||
|
||||
if err := client.Subscribe(s.SubjectJobStart, api.handleStartJob); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
if err := client.Subscribe(s.SubjectJobStop, api.handleStopJob); err != nil {
|
||||
if err := client.Subscribe(s.SubjectJobEvent, api.handleJobEvent); err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
@@ -67,26 +99,96 @@ func (api *NatsAPI) StartSubscriptions() error {
|
||||
return nil
|
||||
}
|
||||
|
||||
// processJobEvent routes job event messages to the appropriate handler based on the "function" tag.
|
||||
// Validates that required tags and fields are present before processing.
|
||||
func (api *NatsAPI) processJobEvent(msg lp.CCMessage) {
|
||||
function, ok := msg.GetTag("function")
|
||||
if !ok {
|
||||
cclog.Errorf("Job event is missing required tag 'function': measurement=%s", msg.Name())
|
||||
return
|
||||
}
|
||||
|
||||
switch function {
|
||||
case "start_job":
|
||||
v, ok := msg.GetEventValue()
|
||||
if !ok {
|
||||
cclog.Errorf("Job start event is missing event field with JSON payload")
|
||||
return
|
||||
}
|
||||
api.handleStartJob(v)
|
||||
|
||||
case "stop_job":
|
||||
v, ok := msg.GetEventValue()
|
||||
if !ok {
|
||||
cclog.Errorf("Job stop event is missing event field with JSON payload")
|
||||
return
|
||||
}
|
||||
api.handleStopJob(v)
|
||||
|
||||
default:
|
||||
cclog.Warnf("Unknown job event function '%s', expected 'start_job' or 'stop_job'", function)
|
||||
}
|
||||
}
|
||||
|
||||
// handleJobEvent processes job-related messages received via NATS using InfluxDB line protocol.
|
||||
// The message must be in line protocol format with measurement="job" and include:
|
||||
// - tag "function" with value "start_job" or "stop_job"
|
||||
// - field "event" containing JSON payload (schema.Job or StopJobAPIRequest)
|
||||
//
|
||||
// Example: job,function=start_job event="{\"jobId\":1001,...}" 1234567890000000000
|
||||
func (api *NatsAPI) handleJobEvent(subject string, data []byte) {
|
||||
if len(data) == 0 {
|
||||
cclog.Warnf("NATS %s: received empty message", subject)
|
||||
return
|
||||
}
|
||||
|
||||
d := influx.NewDecoderWithBytes(data)
|
||||
|
||||
for d.Next() {
|
||||
m, err := receivers.DecodeInfluxMessage(d)
|
||||
if err != nil {
|
||||
cclog.Errorf("NATS %s: failed to decode InfluxDB line protocol message: %v", subject, err)
|
||||
return
|
||||
}
|
||||
|
||||
if !m.IsEvent() {
|
||||
cclog.Debugf("NATS %s: received non-event message, skipping", subject)
|
||||
continue
|
||||
}
|
||||
|
||||
if m.Name() == "job" {
|
||||
api.processJobEvent(m)
|
||||
} else {
|
||||
cclog.Debugf("NATS %s: unexpected measurement name '%s', expected 'job'", subject, m.Name())
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// handleStartJob processes job start messages received via NATS.
|
||||
// Expected JSON payload follows the schema.Job structure.
|
||||
func (api *NatsAPI) handleStartJob(subject string, data []byte) {
|
||||
// The payload parameter contains JSON following the schema.Job structure.
|
||||
// Jobs are validated, checked for duplicates, and inserted into the database.
|
||||
func (api *NatsAPI) handleStartJob(payload string) {
|
||||
if payload == "" {
|
||||
cclog.Error("NATS start job: payload is empty")
|
||||
return
|
||||
}
|
||||
req := schema.Job{
|
||||
Shared: "none",
|
||||
MonitoringStatus: schema.MonitoringStatusRunningOrArchiving,
|
||||
}
|
||||
|
||||
dec := json.NewDecoder(bytes.NewReader(data))
|
||||
dec := json.NewDecoder(strings.NewReader(payload))
|
||||
dec.DisallowUnknownFields()
|
||||
if err := dec.Decode(&req); err != nil {
|
||||
cclog.Errorf("NATS %s: parsing request failed: %v", subject, err)
|
||||
cclog.Errorf("NATS start job: parsing request failed: %v", err)
|
||||
return
|
||||
}
|
||||
|
||||
cclog.Debugf("NATS %s: %s", subject, req.GoString())
|
||||
cclog.Debugf("NATS start job: %s", req.GoString())
|
||||
req.State = schema.JobStateRunning
|
||||
|
||||
if err := importer.SanityChecks(&req); err != nil {
|
||||
cclog.Errorf("NATS %s: sanity check failed: %v", subject, err)
|
||||
cclog.Errorf("NATS start job: sanity check failed: %v", err)
|
||||
return
|
||||
}
|
||||
|
||||
@@ -96,14 +198,14 @@ func (api *NatsAPI) handleStartJob(subject string, data []byte) {
|
||||
|
||||
jobs, err := api.JobRepository.FindAll(&req.JobID, &req.Cluster, nil)
|
||||
if err != nil && err != sql.ErrNoRows {
|
||||
cclog.Errorf("NATS %s: checking for duplicate failed: %v", subject, err)
|
||||
cclog.Errorf("NATS start job: checking for duplicate failed: %v", err)
|
||||
return
|
||||
}
|
||||
if err == nil {
|
||||
for _, job := range jobs {
|
||||
if (req.StartTime - job.StartTime) < secondsPerDay {
|
||||
cclog.Errorf("NATS %s: job with jobId %d, cluster %s already exists (dbid: %d)",
|
||||
subject, req.JobID, req.Cluster, job.ID)
|
||||
cclog.Errorf("NATS start job: job with jobId %d, cluster %s already exists (dbid: %d)",
|
||||
req.JobID, req.Cluster, job.ID)
|
||||
return
|
||||
}
|
||||
}
|
||||
@@ -111,14 +213,14 @@ func (api *NatsAPI) handleStartJob(subject string, data []byte) {
|
||||
|
||||
id, err := api.JobRepository.Start(&req)
|
||||
if err != nil {
|
||||
cclog.Errorf("NATS %s: insert into database failed: %v", subject, err)
|
||||
cclog.Errorf("NATS start job: insert into database failed: %v", err)
|
||||
return
|
||||
}
|
||||
unlockOnce.Do(api.RepositoryMutex.Unlock)
|
||||
|
||||
for _, tag := range req.Tags {
|
||||
if _, err := api.JobRepository.AddTagOrCreate(nil, id, tag.Type, tag.Name, tag.Scope); err != nil {
|
||||
cclog.Errorf("NATS %s: adding tag to new job %d failed: %v", subject, id, err)
|
||||
cclog.Errorf("NATS start job: adding tag to new job %d failed: %v", id, err)
|
||||
return
|
||||
}
|
||||
}
|
||||
@@ -128,19 +230,24 @@ func (api *NatsAPI) handleStartJob(subject string, data []byte) {
|
||||
}
|
||||
|
||||
// handleStopJob processes job stop messages received via NATS.
|
||||
// Expected JSON payload follows the StopJobAPIRequest structure.
|
||||
func (api *NatsAPI) handleStopJob(subject string, data []byte) {
|
||||
// The payload parameter contains JSON following the StopJobAPIRequest structure.
|
||||
// The job is marked as stopped in the database and archiving is triggered if monitoring is enabled.
|
||||
func (api *NatsAPI) handleStopJob(payload string) {
|
||||
if payload == "" {
|
||||
cclog.Error("NATS stop job: payload is empty")
|
||||
return
|
||||
}
|
||||
var req StopJobAPIRequest
|
||||
|
||||
dec := json.NewDecoder(bytes.NewReader(data))
|
||||
dec := json.NewDecoder(strings.NewReader(payload))
|
||||
dec.DisallowUnknownFields()
|
||||
if err := dec.Decode(&req); err != nil {
|
||||
cclog.Errorf("NATS %s: parsing request failed: %v", subject, err)
|
||||
cclog.Errorf("NATS job stop: parsing request failed: %v", err)
|
||||
return
|
||||
}
|
||||
|
||||
if req.JobID == nil {
|
||||
cclog.Errorf("NATS %s: the field 'jobId' is required", subject)
|
||||
cclog.Errorf("NATS job stop: the field 'jobId' is required")
|
||||
return
|
||||
}
|
||||
|
||||
@@ -148,28 +255,28 @@ func (api *NatsAPI) handleStopJob(subject string, data []byte) {
|
||||
if err != nil {
|
||||
cachedJob, cachedErr := api.JobRepository.FindCached(req.JobID, req.Cluster, req.StartTime)
|
||||
if cachedErr != nil {
|
||||
cclog.Errorf("NATS %s: finding job failed: %v (cached lookup also failed: %v)",
|
||||
subject, err, cachedErr)
|
||||
cclog.Errorf("NATS job stop: finding job failed: %v (cached lookup also failed: %v)",
|
||||
err, cachedErr)
|
||||
return
|
||||
}
|
||||
job = cachedJob
|
||||
}
|
||||
|
||||
if job.State != schema.JobStateRunning {
|
||||
cclog.Errorf("NATS %s: jobId %d (id %d) on %s: job has already been stopped (state is: %s)",
|
||||
subject, job.JobID, job.ID, job.Cluster, job.State)
|
||||
cclog.Errorf("NATS job stop: jobId %d (id %d) on %s: job has already been stopped (state is: %s)",
|
||||
job.JobID, job.ID, job.Cluster, job.State)
|
||||
return
|
||||
}
|
||||
|
||||
if job.StartTime > req.StopTime {
|
||||
cclog.Errorf("NATS %s: jobId %d (id %d) on %s: stopTime %d must be >= startTime %d",
|
||||
subject, job.JobID, job.ID, job.Cluster, req.StopTime, job.StartTime)
|
||||
cclog.Errorf("NATS job stop: jobId %d (id %d) on %s: stopTime %d must be >= startTime %d",
|
||||
job.JobID, job.ID, job.Cluster, req.StopTime, job.StartTime)
|
||||
return
|
||||
}
|
||||
|
||||
if req.State != "" && !req.State.Valid() {
|
||||
cclog.Errorf("NATS %s: jobId %d (id %d) on %s: invalid job state: %#v",
|
||||
subject, job.JobID, job.ID, job.Cluster, req.State)
|
||||
cclog.Errorf("NATS job stop: jobId %d (id %d) on %s: invalid job state: %#v",
|
||||
job.JobID, job.ID, job.Cluster, req.State)
|
||||
return
|
||||
} else if req.State == "" {
|
||||
req.State = schema.JobStateCompleted
|
||||
@@ -182,8 +289,8 @@ func (api *NatsAPI) handleStopJob(subject string, data []byte) {
|
||||
|
||||
if err := api.JobRepository.Stop(*job.ID, job.Duration, job.State, job.MonitoringStatus); err != nil {
|
||||
if err := api.JobRepository.StopCached(*job.ID, job.Duration, job.State, job.MonitoringStatus); err != nil {
|
||||
cclog.Errorf("NATS %s: jobId %d (id %d) on %s: marking job as '%s' failed: %v",
|
||||
subject, job.JobID, job.ID, job.Cluster, job.State, err)
|
||||
cclog.Errorf("NATS job stop: jobId %d (id %d) on %s: marking job as '%s' failed: %v",
|
||||
job.JobID, job.ID, job.Cluster, job.State, err)
|
||||
return
|
||||
}
|
||||
}
|
||||
@@ -198,15 +305,21 @@ func (api *NatsAPI) handleStopJob(subject string, data []byte) {
|
||||
archiver.TriggerArchiving(job)
|
||||
}
|
||||
|
||||
// handleNodeState processes node state update messages received via NATS.
|
||||
// Expected JSON payload follows the UpdateNodeStatesRequest structure.
|
||||
func (api *NatsAPI) handleNodeState(subject string, data []byte) {
|
||||
// processNodestateEvent extracts and processes node state data from the InfluxDB message.
|
||||
// Updates node states in the repository for all nodes in the payload.
|
||||
func (api *NatsAPI) processNodestateEvent(msg lp.CCMessage) {
|
||||
v, ok := msg.GetEventValue()
|
||||
if !ok {
|
||||
cclog.Errorf("Nodestate event is missing event field with JSON payload")
|
||||
return
|
||||
}
|
||||
|
||||
var req UpdateNodeStatesRequest
|
||||
|
||||
dec := json.NewDecoder(bytes.NewReader(data))
|
||||
dec := json.NewDecoder(strings.NewReader(v))
|
||||
dec.DisallowUnknownFields()
|
||||
if err := dec.Decode(&req); err != nil {
|
||||
cclog.Errorf("NATS %s: parsing request failed: %v", subject, err)
|
||||
cclog.Errorf("NATS nodestate: parsing request failed: %v", err)
|
||||
return
|
||||
}
|
||||
|
||||
@@ -224,8 +337,44 @@ func (api *NatsAPI) handleNodeState(subject string, data []byte) {
|
||||
JobsRunning: node.JobsRunning,
|
||||
}
|
||||
|
||||
repo.UpdateNodeState(node.Hostname, req.Cluster, &nodeState)
|
||||
if err := repo.UpdateNodeState(node.Hostname, req.Cluster, &nodeState); err != nil {
|
||||
cclog.Errorf("NATS nodestate: updating node state for %s on %s failed: %v",
|
||||
node.Hostname, req.Cluster, err)
|
||||
}
|
||||
}
|
||||
|
||||
cclog.Debugf("NATS %s: updated %d node states for cluster %s", subject, len(req.Nodes), req.Cluster)
|
||||
cclog.Debugf("NATS nodestate: updated %d node states for cluster %s", len(req.Nodes), req.Cluster)
|
||||
}
|
||||
|
||||
// handleNodeState processes node state update messages received via NATS using InfluxDB line protocol.
// The message must be in line protocol format with measurement="nodestate" and include:
// - field "event" containing JSON payload (UpdateNodeStatesRequest)
//
// Example: nodestate event="{\"cluster\":\"testcluster\",\"nodes\":[...]}" 1234567890000000000
func (api *NatsAPI) handleNodeState(subject string, data []byte) {
	if len(data) == 0 {
		cclog.Warnf("NATS %s: received empty message", subject)
		return
	}

	d := influx.NewDecoderWithBytes(data)

	for d.Next() {
		m, err := receivers.DecodeInfluxMessage(d)
		if err != nil {
			cclog.Errorf("NATS %s: failed to decode InfluxDB line protocol message: %v", subject, err)
			return
		}

		if !m.IsEvent() {
			cclog.Warnf("NATS %s: received non-event message, skipping", subject)
			continue
		}

		if m.Name() == "nodestate" {
			api.processNodestateEvent(m)
		} else {
			cclog.Warnf("NATS %s: unexpected measurement name '%s', expected 'nodestate'", subject, m.Name())
		}
	}
}
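
As a complementary sketch, a sender could build the nodestate event JSON programmatically and wrap it in line protocol before publishing. The field names are copied from the test payloads below; the subject name and server URL are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Node state update mirroring the UpdateNodeStatesRequest JSON used in the tests.
	event := map[string]any{
		"cluster": "testcluster",
		"nodes": []map[string]any{
			{"hostname": "host123", "states": []string{"allocated"}, "cpusAllocated": 8,
				"memoryAllocated": 16384, "gpusAllocated": 0, "jobsRunning": 1},
		},
	}
	raw, err := json.Marshal(event)
	if err != nil {
		panic(err)
	}

	// Wrap the JSON in line protocol: measurement "nodestate" with a single
	// string field "event". Inner quotes must be escaped for line protocol.
	escaped := strings.ReplaceAll(string(raw), `"`, `\"`)
	line := fmt.Sprintf(`nodestate event="%s" %d`, escaped, time.Now().UnixNano())

	nc, err := nats.Connect(nats.DefaultURL) // assumes a local NATS server
	if err != nil {
		panic(err)
	}
	defer nc.Close()

	// "cc.node.events" is a placeholder subject; the real one is configured.
	if err := nc.Publish("cc.node.events", []byte(line)); err != nil {
		panic(err)
	}
}
```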
|
||||
|
||||
947
internal/api/nats_test.go
Normal file
@@ -0,0 +1,947 @@
|
||||
// Copyright (C) NHR@FAU, University Erlangen-Nuremberg.
|
||||
// All rights reserved. This file is part of cc-backend.
|
||||
// Use of this source code is governed by a MIT-style
|
||||
// license that can be found in the LICENSE file.
|
||||
package api
|
||||
|
||||
import (
|
||||
"context"
|
||||
"database/sql"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"os"
|
||||
"path/filepath"
|
||||
"testing"
|
||||
"time"
|
||||
|
||||
"github.com/ClusterCockpit/cc-backend/internal/archiver"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/auth"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/config"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/graph"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/metricstore"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/repository"
|
||||
"github.com/ClusterCockpit/cc-backend/pkg/archive"
|
||||
ccconf "github.com/ClusterCockpit/cc-lib/v2/ccConfig"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
lp "github.com/ClusterCockpit/cc-lib/v2/ccMessage"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
|
||||
_ "github.com/mattn/go-sqlite3"
|
||||
)
|
||||
|
||||
func setupNatsTest(t *testing.T) *NatsAPI {
|
||||
repository.ResetConnection()
|
||||
|
||||
const testconfig = `{
|
||||
"main": {
|
||||
"addr": "0.0.0.0:8080",
|
||||
"validate": false,
|
||||
"apiAllowedIPs": [
|
||||
"*"
|
||||
]
|
||||
},
|
||||
"archive": {
|
||||
"kind": "file",
|
||||
"path": "./var/job-archive"
|
||||
},
|
||||
"auth": {
|
||||
"jwts": {
|
||||
"max-age": "2m"
|
||||
}
|
||||
}
|
||||
}`
|
||||
const testclusterJSON = `{
|
||||
"name": "testcluster",
|
||||
"subClusters": [
|
||||
{
|
||||
"name": "sc1",
|
||||
"nodes": "host123,host124,host125",
|
||||
"processorType": "Intel Core i7-4770",
|
||||
"socketsPerNode": 1,
|
||||
"coresPerSocket": 4,
|
||||
"threadsPerCore": 2,
|
||||
"flopRateScalar": {
|
||||
"unit": {
|
||||
"prefix": "G",
|
||||
"base": "F/s"
|
||||
},
|
||||
"value": 14
|
||||
},
|
||||
"flopRateSimd": {
|
||||
"unit": {
|
||||
"prefix": "G",
|
||||
"base": "F/s"
|
||||
},
|
||||
"value": 112
|
||||
},
|
||||
"memoryBandwidth": {
|
||||
"unit": {
|
||||
"prefix": "G",
|
||||
"base": "B/s"
|
||||
},
|
||||
"value": 24
|
||||
},
|
||||
"numberOfNodes": 70,
|
||||
"topology": {
|
||||
"node": [0, 1, 2, 3, 4, 5, 6, 7],
|
||||
"socket": [[0, 1, 2, 3, 4, 5, 6, 7]],
|
||||
"memoryDomain": [[0, 1, 2, 3, 4, 5, 6, 7]],
|
||||
"die": [[0, 1, 2, 3, 4, 5, 6, 7]],
|
||||
"core": [[0], [1], [2], [3], [4], [5], [6], [7]]
|
||||
}
|
||||
}
|
||||
],
|
||||
"metricConfig": [
|
||||
{
|
||||
"name": "load_one",
|
||||
"unit": { "base": ""},
|
||||
"scope": "node",
|
||||
"timestep": 60,
|
||||
"aggregation": "avg",
|
||||
"peak": 8,
|
||||
"normal": 0,
|
||||
"caution": 0,
|
||||
"alert": 0
|
||||
}
|
||||
]
|
||||
}`
|
||||
|
||||
cclog.Init("info", true)
|
||||
tmpdir := t.TempDir()
|
||||
jobarchive := filepath.Join(tmpdir, "job-archive")
|
||||
if err := os.Mkdir(jobarchive, 0o777); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
if err := os.WriteFile(filepath.Join(jobarchive, "version.txt"), fmt.Appendf(nil, "%d", 3), 0o666); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
if err := os.Mkdir(filepath.Join(jobarchive, "testcluster"), 0o777); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
if err := os.WriteFile(filepath.Join(jobarchive, "testcluster", "cluster.json"), []byte(testclusterJSON), 0o666); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
dbfilepath := filepath.Join(tmpdir, "test.db")
|
||||
err := repository.MigrateDB(dbfilepath)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
cfgFilePath := filepath.Join(tmpdir, "config.json")
|
||||
if err := os.WriteFile(cfgFilePath, []byte(testconfig), 0o666); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
ccconf.Init(cfgFilePath)
|
||||
|
||||
// Load and check main configuration
|
||||
if cfg := ccconf.GetPackageConfig("main"); cfg != nil {
|
||||
config.Init(cfg)
|
||||
} else {
|
||||
cclog.Abort("Main configuration must be present")
|
||||
}
|
||||
archiveCfg := fmt.Sprintf("{\"kind\": \"file\",\"path\": \"%s\"}", jobarchive)
|
||||
|
||||
repository.Connect("sqlite3", dbfilepath)
|
||||
|
||||
if err := archive.Init(json.RawMessage(archiveCfg), config.Keys.DisableArchive); err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
|
||||
// metricstore initialization removed - it's initialized via callback in tests
|
||||
|
||||
archiver.Start(repository.GetJobRepository(), context.Background())
|
||||
|
||||
if cfg := ccconf.GetPackageConfig("auth"); cfg != nil {
|
||||
auth.Init(&cfg)
|
||||
} else {
|
||||
cclog.Warn("Authentication disabled due to missing configuration")
|
||||
auth.Init(nil)
|
||||
}
|
||||
|
||||
graph.Init()
|
||||
|
||||
return NewNatsAPI()
|
||||
}
|
||||
|
||||
func cleanupNatsTest() {
|
||||
if err := archiver.Shutdown(5 * time.Second); err != nil {
|
||||
cclog.Warnf("Archiver shutdown timeout in tests: %v", err)
|
||||
}
|
||||
}
|
||||
|
||||
func TestNatsHandleStartJob(t *testing.T) {
|
||||
natsAPI := setupNatsTest(t)
|
||||
t.Cleanup(cleanupNatsTest)
|
||||
|
||||
tests := []struct {
|
||||
name string
|
||||
payload string
|
||||
expectError bool
|
||||
validateJob func(t *testing.T, job *schema.Job)
|
||||
shouldFindJob bool
|
||||
}{
|
||||
{
|
||||
name: "valid job start",
|
||||
payload: `{
|
||||
"jobId": 1001,
|
||||
"user": "testuser1",
|
||||
"project": "testproj1",
|
||||
"cluster": "testcluster",
|
||||
"partition": "main",
|
||||
"walltime": 7200,
|
||||
"numNodes": 1,
|
||||
"numHwthreads": 8,
|
||||
"numAcc": 0,
|
||||
"shared": "none",
|
||||
"monitoringStatus": 1,
|
||||
"smt": 1,
|
||||
"resources": [
|
||||
{
|
||||
"hostname": "host123",
|
||||
"hwthreads": [0, 1, 2, 3, 4, 5, 6, 7]
|
||||
}
|
||||
],
|
||||
"startTime": 1234567890
|
||||
}`,
|
||||
expectError: false,
|
||||
shouldFindJob: true,
|
||||
validateJob: func(t *testing.T, job *schema.Job) {
|
||||
if job.JobID != 1001 {
|
||||
t.Errorf("expected JobID 1001, got %d", job.JobID)
|
||||
}
|
||||
if job.User != "testuser1" {
|
||||
t.Errorf("expected user testuser1, got %s", job.User)
|
||||
}
|
||||
if job.State != schema.JobStateRunning {
|
||||
t.Errorf("expected state running, got %s", job.State)
|
||||
}
|
||||
},
|
||||
},
|
||||
{
|
||||
name: "invalid JSON",
|
||||
payload: `{
|
||||
"jobId": "not a number",
|
||||
"user": "testuser2"
|
||||
}`,
|
||||
expectError: true,
|
||||
shouldFindJob: false,
|
||||
},
|
||||
{
|
||||
name: "missing required fields",
|
||||
payload: `{
|
||||
"jobId": 1002
|
||||
}`,
|
||||
expectError: true,
|
||||
shouldFindJob: false,
|
||||
},
|
||||
{
|
||||
name: "job with unknown fields (should fail due to DisallowUnknownFields)",
|
||||
payload: `{
|
||||
"jobId": 1003,
|
||||
"user": "testuser3",
|
||||
"project": "testproj3",
|
||||
"cluster": "testcluster",
|
||||
"partition": "main",
|
||||
"walltime": 3600,
|
||||
"numNodes": 1,
|
||||
"numHwthreads": 8,
|
||||
"unknownField": "should cause error",
|
||||
"startTime": 1234567900
|
||||
}`,
|
||||
expectError: true,
|
||||
shouldFindJob: false,
|
||||
},
|
||||
{
|
||||
name: "job with tags",
|
||||
payload: `{
|
||||
"jobId": 1004,
|
||||
"user": "testuser4",
|
||||
"project": "testproj4",
|
||||
"cluster": "testcluster",
|
||||
"partition": "main",
|
||||
"walltime": 3600,
|
||||
"numNodes": 1,
|
||||
"numHwthreads": 8,
|
||||
"numAcc": 0,
|
||||
"shared": "none",
|
||||
"monitoringStatus": 1,
|
||||
"smt": 1,
|
||||
"resources": [
|
||||
{
|
||||
"hostname": "host123",
|
||||
"hwthreads": [0, 1, 2, 3]
|
||||
}
|
||||
],
|
||||
"tags": [
|
||||
{
|
||||
"type": "test",
|
||||
"name": "testtag",
|
||||
"scope": "testuser4"
|
||||
}
|
||||
],
|
||||
"startTime": 1234567910
|
||||
}`,
|
||||
expectError: false,
|
||||
shouldFindJob: true,
|
||||
validateJob: func(t *testing.T, job *schema.Job) {
|
||||
if job.JobID != 1004 {
|
||||
t.Errorf("expected JobID 1004, got %d", job.JobID)
|
||||
}
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
natsAPI.handleStartJob(tt.payload)
|
||||
natsAPI.JobRepository.SyncJobs()
|
||||
|
||||
// Allow some time for async operations
|
||||
time.Sleep(100 * time.Millisecond)
|
||||
|
||||
if tt.shouldFindJob {
|
||||
// Extract jobId from payload
|
||||
var payloadMap map[string]any
|
||||
json.Unmarshal([]byte(tt.payload), &payloadMap)
|
||||
jobID := int64(payloadMap["jobId"].(float64))
|
||||
cluster := payloadMap["cluster"].(string)
|
||||
startTime := int64(payloadMap["startTime"].(float64))
|
||||
|
||||
job, err := natsAPI.JobRepository.Find(&jobID, &cluster, &startTime)
|
||||
if err != nil {
|
||||
if !tt.expectError {
|
||||
t.Fatalf("expected to find job, but got error: %v", err)
|
||||
}
|
||||
return
|
||||
}
|
||||
|
||||
if tt.validateJob != nil {
|
||||
tt.validateJob(t, job)
|
||||
}
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestNatsHandleStopJob(t *testing.T) {
|
||||
natsAPI := setupNatsTest(t)
|
||||
t.Cleanup(cleanupNatsTest)
|
||||
|
||||
// First, create a running job
|
||||
startPayload := `{
|
||||
"jobId": 2001,
|
||||
"user": "testuser",
|
||||
"project": "testproj",
|
||||
"cluster": "testcluster",
|
||||
"partition": "main",
|
||||
"walltime": 3600,
|
||||
"numNodes": 1,
|
||||
"numHwthreads": 8,
|
||||
"numAcc": 0,
|
||||
"shared": "none",
|
||||
"monitoringStatus": 1,
|
||||
"smt": 1,
|
||||
"resources": [
|
||||
{
|
||||
"hostname": "host123",
|
||||
"hwthreads": [0, 1, 2, 3, 4, 5, 6, 7]
|
||||
}
|
||||
],
|
||||
"startTime": 1234567890
|
||||
}`
|
||||
|
||||
natsAPI.handleStartJob(startPayload)
|
||||
natsAPI.JobRepository.SyncJobs()
|
||||
time.Sleep(100 * time.Millisecond)
|
||||
|
||||
tests := []struct {
|
||||
name string
|
||||
payload string
|
||||
expectError bool
|
||||
validateJob func(t *testing.T, job *schema.Job)
|
||||
setupJobFunc func() // Optional: create specific test job
|
||||
}{
|
||||
{
|
||||
name: "valid job stop - completed",
|
||||
payload: `{
|
||||
"jobId": 2001,
|
||||
"cluster": "testcluster",
|
||||
"startTime": 1234567890,
|
||||
"jobState": "completed",
|
||||
"stopTime": 1234571490
|
||||
}`,
|
||||
expectError: false,
|
||||
validateJob: func(t *testing.T, job *schema.Job) {
|
||||
if job.State != schema.JobStateCompleted {
|
||||
t.Errorf("expected state completed, got %s", job.State)
|
||||
}
|
||||
expectedDuration := int32(1234571490 - 1234567890)
|
||||
if job.Duration != expectedDuration {
|
||||
t.Errorf("expected duration %d, got %d", expectedDuration, job.Duration)
|
||||
}
|
||||
},
|
||||
},
|
||||
{
|
||||
name: "valid job stop - failed",
|
||||
setupJobFunc: func() {
|
||||
startPayloadFailed := `{
|
||||
"jobId": 2002,
|
||||
"user": "testuser",
|
||||
"project": "testproj",
|
||||
"cluster": "testcluster",
|
||||
"partition": "main",
|
||||
"walltime": 3600,
|
||||
"numNodes": 1,
|
||||
"numHwthreads": 8,
|
||||
"numAcc": 0,
|
||||
"shared": "none",
|
||||
"monitoringStatus": 1,
|
||||
"smt": 1,
|
||||
"resources": [
|
||||
{
|
||||
"hostname": "host123",
|
||||
"hwthreads": [0, 1, 2, 3]
|
||||
}
|
||||
],
|
||||
"startTime": 1234567900
|
||||
}`
|
||||
natsAPI.handleStartJob(startPayloadFailed)
|
||||
natsAPI.JobRepository.SyncJobs()
|
||||
time.Sleep(100 * time.Millisecond)
|
||||
},
|
||||
payload: `{
|
||||
"jobId": 2002,
|
||||
"cluster": "testcluster",
|
||||
"startTime": 1234567900,
|
||||
"jobState": "failed",
|
||||
"stopTime": 1234569900
|
||||
}`,
|
||||
expectError: false,
|
||||
validateJob: func(t *testing.T, job *schema.Job) {
|
||||
if job.State != schema.JobStateFailed {
|
||||
t.Errorf("expected state failed, got %s", job.State)
|
||||
}
|
||||
},
|
||||
},
|
||||
{
|
||||
name: "invalid JSON",
|
||||
payload: `{
|
||||
"jobId": "not a number"
|
||||
}`,
|
||||
expectError: true,
|
||||
},
|
||||
{
|
||||
name: "missing jobId",
|
||||
payload: `{
|
||||
"cluster": "testcluster",
|
||||
"jobState": "completed",
|
||||
"stopTime": 1234571490
|
||||
}`,
|
||||
expectError: true,
|
||||
},
|
||||
{
|
||||
name: "invalid job state",
|
||||
setupJobFunc: func() {
|
||||
startPayloadInvalid := `{
|
||||
"jobId": 2003,
|
||||
"user": "testuser",
|
||||
"project": "testproj",
|
||||
"cluster": "testcluster",
|
||||
"partition": "main",
|
||||
"walltime": 3600,
|
||||
"numNodes": 1,
|
||||
"numHwthreads": 8,
|
||||
"numAcc": 0,
|
||||
"shared": "none",
|
||||
"monitoringStatus": 1,
|
||||
"smt": 1,
|
||||
"resources": [
|
||||
{
|
||||
"hostname": "host123",
|
||||
"hwthreads": [0, 1]
|
||||
}
|
||||
],
|
||||
"startTime": 1234567910
|
||||
}`
|
||||
natsAPI.handleStartJob(startPayloadInvalid)
|
||||
natsAPI.JobRepository.SyncJobs()
|
||||
time.Sleep(100 * time.Millisecond)
|
||||
},
|
||||
payload: `{
|
||||
"jobId": 2003,
|
||||
"cluster": "testcluster",
|
||||
"startTime": 1234567910,
|
||||
"jobState": "invalid_state",
|
||||
"stopTime": 1234571510
|
||||
}`,
|
||||
expectError: true,
|
||||
},
|
||||
{
|
||||
name: "stopTime before startTime",
|
||||
setupJobFunc: func() {
|
||||
startPayloadTime := `{
|
||||
"jobId": 2004,
|
||||
"user": "testuser",
|
||||
"project": "testproj",
|
||||
"cluster": "testcluster",
|
||||
"partition": "main",
|
||||
"walltime": 3600,
|
||||
"numNodes": 1,
|
||||
"numHwthreads": 8,
|
||||
"numAcc": 0,
|
||||
"shared": "none",
|
||||
"monitoringStatus": 1,
|
||||
"smt": 1,
|
||||
"resources": [
|
||||
{
|
||||
"hostname": "host123",
|
||||
"hwthreads": [0]
|
||||
}
|
||||
],
|
||||
"startTime": 1234567920
|
||||
}`
|
||||
natsAPI.handleStartJob(startPayloadTime)
|
||||
natsAPI.JobRepository.SyncJobs()
|
||||
time.Sleep(100 * time.Millisecond)
|
||||
},
|
||||
payload: `{
|
||||
"jobId": 2004,
|
||||
"cluster": "testcluster",
|
||||
"startTime": 1234567920,
|
||||
"jobState": "completed",
|
||||
"stopTime": 1234567900
|
||||
}`,
|
||||
expectError: true,
|
||||
},
|
||||
{
|
||||
name: "job not found",
|
||||
payload: `{
|
||||
"jobId": 99999,
|
||||
"cluster": "testcluster",
|
||||
"startTime": 1234567890,
|
||||
"jobState": "completed",
|
||||
"stopTime": 1234571490
|
||||
}`,
|
||||
expectError: true,
|
||||
},
|
||||
}
|
||||
|
||||
testData := schema.JobData{
|
||||
"load_one": map[schema.MetricScope]*schema.JobMetric{
|
||||
schema.MetricScopeNode: {
|
||||
Unit: schema.Unit{Base: "load"},
|
||||
Timestep: 60,
|
||||
Series: []schema.Series{
|
||||
{
|
||||
Hostname: "host123",
|
||||
Statistics: schema.MetricStatistics{Min: 0.1, Avg: 0.2, Max: 0.3},
|
||||
Data: []schema.Float{0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.3, 0.3, 0.3},
|
||||
},
|
||||
},
|
||||
},
|
||||
},
|
||||
}
|
||||
|
||||
metricstore.TestLoadDataCallback = func(job *schema.Job, metrics []string, scopes []schema.MetricScope, ctx context.Context, resolution int) (schema.JobData, error) {
|
||||
return testData, nil
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
if tt.setupJobFunc != nil {
|
||||
tt.setupJobFunc()
|
||||
}
|
||||
|
||||
natsAPI.handleStopJob(tt.payload)
|
||||
|
||||
// Allow some time for async operations
|
||||
time.Sleep(100 * time.Millisecond)
|
||||
|
||||
if !tt.expectError && tt.validateJob != nil {
|
||||
// Extract job details from payload
|
||||
var payloadMap map[string]any
|
||||
json.Unmarshal([]byte(tt.payload), &payloadMap)
|
||||
jobID := int64(payloadMap["jobId"].(float64))
|
||||
cluster := payloadMap["cluster"].(string)
|
||||
|
||||
var startTime *int64
|
||||
if st, ok := payloadMap["startTime"]; ok {
|
||||
t := int64(st.(float64))
|
||||
startTime = &t
|
||||
}
|
||||
|
||||
job, err := natsAPI.JobRepository.Find(&jobID, &cluster, startTime)
|
||||
if err != nil {
|
||||
t.Fatalf("expected to find job, but got error: %v", err)
|
||||
}
|
||||
|
||||
tt.validateJob(t, job)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestNatsHandleNodeState(t *testing.T) {
|
||||
natsAPI := setupNatsTest(t)
|
||||
t.Cleanup(cleanupNatsTest)
|
||||
|
||||
tests := []struct {
|
||||
name string
|
||||
data []byte
|
||||
expectError bool
|
||||
validateFn func(t *testing.T)
|
||||
}{
|
||||
{
|
||||
name: "valid node state update",
|
||||
data: []byte(`nodestate event="{\"cluster\":\"testcluster\",\"nodes\":[{\"hostname\":\"host123\",\"states\":[\"allocated\"],\"cpusAllocated\":8,\"memoryAllocated\":16384,\"gpusAllocated\":0,\"jobsRunning\":1}]}" 1234567890000000000`),
|
||||
expectError: false,
|
||||
validateFn: func(t *testing.T) {
|
||||
// In a full test, we would verify the node state was updated in the database
|
||||
// For now, just ensure no error occurred
|
||||
},
|
||||
},
|
||||
{
|
||||
name: "multiple nodes",
|
||||
data: []byte(`nodestate event="{\"cluster\":\"testcluster\",\"nodes\":[{\"hostname\":\"host123\",\"states\":[\"idle\"],\"cpusAllocated\":0,\"memoryAllocated\":0,\"gpusAllocated\":0,\"jobsRunning\":0},{\"hostname\":\"host124\",\"states\":[\"allocated\"],\"cpusAllocated\":4,\"memoryAllocated\":8192,\"gpusAllocated\":1,\"jobsRunning\":1}]}" 1234567890000000000`),
|
||||
expectError: false,
|
||||
},
|
||||
{
|
||||
name: "invalid JSON in event field",
|
||||
data: []byte(`nodestate event="{\"cluster\":\"testcluster\",\"nodes\":\"not an array\"}" 1234567890000000000`),
|
||||
expectError: true,
|
||||
},
|
||||
{
|
||||
name: "empty nodes array",
|
||||
data: []byte(`nodestate event="{\"cluster\":\"testcluster\",\"nodes\":[]}" 1234567890000000000`),
|
||||
expectError: false, // Empty array should not cause error
|
||||
},
|
||||
{
|
||||
name: "invalid line protocol format",
|
||||
data: []byte(`invalid line protocol format`),
|
||||
expectError: true,
|
||||
},
|
||||
{
|
||||
name: "empty data",
|
||||
data: []byte(``),
|
||||
expectError: false, // Should be handled gracefully with warning
|
||||
},
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
natsAPI.handleNodeState("test.subject", tt.data)
|
||||
|
||||
// Allow some time for async operations
|
||||
time.Sleep(50 * time.Millisecond)
|
||||
|
||||
if tt.validateFn != nil {
|
||||
tt.validateFn(t)
|
||||
}
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestNatsProcessJobEvent(t *testing.T) {
|
||||
natsAPI := setupNatsTest(t)
|
||||
t.Cleanup(cleanupNatsTest)
|
||||
|
||||
msgStartJob, err := lp.NewMessage(
|
||||
"job",
|
||||
map[string]string{"function": "start_job"},
|
||||
nil,
|
||||
map[string]any{
|
||||
"event": `{
|
||||
"jobId": 3001,
|
||||
"user": "testuser",
|
||||
"project": "testproj",
|
||||
"cluster": "testcluster",
|
||||
"partition": "main",
|
||||
"walltime": 3600,
|
||||
"numNodes": 1,
|
||||
"numHwthreads": 8,
|
||||
"numAcc": 0,
|
||||
"shared": "none",
|
||||
"monitoringStatus": 1,
|
||||
"smt": 1,
|
||||
"resources": [
|
||||
{
|
||||
"hostname": "host123",
|
||||
"hwthreads": [0, 1, 2, 3]
|
||||
}
|
||||
],
|
||||
"startTime": 1234567890
|
||||
}`,
|
||||
},
|
||||
time.Now(),
|
||||
)
|
||||
if err != nil {
|
||||
t.Fatalf("failed to create test message: %v", err)
|
||||
}
|
||||
|
||||
msgMissingTag, err := lp.NewMessage(
|
||||
"job",
|
||||
map[string]string{},
|
||||
nil,
|
||||
map[string]any{
|
||||
"event": `{}`,
|
||||
},
|
||||
time.Now(),
|
||||
)
|
||||
if err != nil {
|
||||
t.Fatalf("failed to create test message: %v", err)
|
||||
}
|
||||
|
||||
msgUnknownFunc, err := lp.NewMessage(
|
||||
"job",
|
||||
map[string]string{"function": "unknown_function"},
|
||||
nil,
|
||||
map[string]any{
|
||||
"event": `{}`,
|
||||
},
|
||||
time.Now(),
|
||||
)
|
||||
if err != nil {
|
||||
t.Fatalf("failed to create test message: %v", err)
|
||||
}
|
||||
|
||||
tests := []struct {
|
||||
name string
|
||||
message lp.CCMessage
|
||||
expectError bool
|
||||
}{
|
||||
{
|
||||
name: "start_job function",
|
||||
message: msgStartJob,
|
||||
expectError: false,
|
||||
},
|
||||
{
|
||||
name: "missing function tag",
|
||||
message: msgMissingTag,
|
||||
expectError: true,
|
||||
},
|
||||
{
|
||||
name: "unknown function",
|
||||
message: msgUnknownFunc,
|
||||
expectError: false,
|
||||
},
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
natsAPI.processJobEvent(tt.message)
|
||||
time.Sleep(50 * time.Millisecond)
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestNatsHandleJobEvent(t *testing.T) {
|
||||
natsAPI := setupNatsTest(t)
|
||||
t.Cleanup(cleanupNatsTest)
|
||||
|
||||
tests := []struct {
|
||||
name string
|
||||
data []byte
|
||||
expectError bool
|
||||
}{
|
||||
{
|
||||
name: "valid influx line protocol",
|
||||
data: []byte(`job,function=start_job event="{\"jobId\":4001,\"user\":\"testuser\",\"project\":\"testproj\",\"cluster\":\"testcluster\",\"partition\":\"main\",\"walltime\":3600,\"numNodes\":1,\"numHwthreads\":8,\"numAcc\":0,\"shared\":\"none\",\"monitoringStatus\":1,\"smt\":1,\"resources\":[{\"hostname\":\"host123\",\"hwthreads\":[0,1,2,3]}],\"startTime\":1234567890}" 1234567890000000000`),
|
||||
expectError: false,
|
||||
},
|
||||
{
|
||||
name: "invalid influx line protocol",
|
||||
data: []byte(`invalid line protocol format`),
|
||||
expectError: true,
|
||||
},
|
||||
{
|
||||
name: "empty data",
|
||||
data: []byte(``),
|
||||
expectError: false, // Decoder should handle empty input gracefully
|
||||
},
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
// HandleJobEvent doesn't return errors, it logs them
|
||||
// We're just ensuring it doesn't panic
|
||||
natsAPI.handleJobEvent("test.subject", tt.data)
|
||||
time.Sleep(50 * time.Millisecond)
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestNatsHandleJobEventEdgeCases(t *testing.T) {
|
||||
natsAPI := setupNatsTest(t)
|
||||
t.Cleanup(cleanupNatsTest)
|
||||
|
||||
tests := []struct {
|
||||
name string
|
||||
data []byte
|
||||
expectError bool
|
||||
description string
|
||||
}{
|
||||
{
|
||||
name: "non-event message (metric data)",
|
||||
data: []byte(`job,function=start_job value=123.45 1234567890000000000`),
|
||||
expectError: false,
|
||||
description: "Should skip non-event messages gracefully",
|
||||
},
|
||||
{
|
||||
name: "wrong measurement name",
|
||||
data: []byte(`wrongmeasurement,function=start_job event="{}" 1234567890000000000`),
|
||||
expectError: false,
|
||||
description: "Should warn about unexpected measurement but not fail",
|
||||
},
|
||||
{
|
||||
name: "missing event field",
|
||||
data: []byte(`job,function=start_job other_field="value" 1234567890000000000`),
|
||||
expectError: true,
|
||||
description: "Should error when event field is missing",
|
||||
},
|
||||
{
|
||||
name: "multiple measurements in one message",
|
||||
data: []byte("job,function=start_job event=\"{}\" 1234567890000000000\njob,function=stop_job event=\"{}\" 1234567890000000000"),
|
||||
expectError: false,
|
||||
description: "Should process multiple lines",
|
||||
},
|
||||
{
|
||||
name: "escaped quotes in JSON payload",
|
||||
data: []byte(`job,function=start_job event="{\"jobId\":6001,\"user\":\"test\\\"user\",\"cluster\":\"test\"}" 1234567890000000000`),
|
||||
expectError: true,
|
||||
description: "Should handle escaped quotes (though JSON parsing may fail)",
|
||||
},
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
natsAPI.handleJobEvent("test.subject", tt.data)
|
||||
time.Sleep(50 * time.Millisecond)
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestNatsHandleNodeStateEdgeCases(t *testing.T) {
|
||||
natsAPI := setupNatsTest(t)
|
||||
t.Cleanup(cleanupNatsTest)
|
||||
|
||||
tests := []struct {
|
||||
name string
|
||||
data []byte
|
||||
expectError bool
|
||||
description string
|
||||
}{
|
||||
{
|
||||
name: "missing cluster field in JSON",
|
||||
data: []byte(`nodestate event="{\"nodes\":[]}" 1234567890000000000`),
|
||||
expectError: true,
|
||||
description: "Should fail when cluster is missing",
|
||||
},
|
||||
{
|
||||
name: "malformed JSON with unescaped quotes",
|
||||
data: []byte(`nodestate event="{\"cluster\":\"test"cluster\",\"nodes\":[]}" 1234567890000000000`),
|
||||
expectError: true,
|
||||
description: "Should fail on malformed JSON",
|
||||
},
|
||||
{
|
||||
name: "unicode characters in hostname",
|
||||
data: []byte(`nodestate event="{\"cluster\":\"testcluster\",\"nodes\":[{\"hostname\":\"host-ñ123\",\"states\":[\"idle\"],\"cpusAllocated\":0,\"memoryAllocated\":0,\"gpusAllocated\":0,\"jobsRunning\":0}]}" 1234567890000000000`),
|
||||
expectError: false,
|
||||
description: "Should handle unicode characters",
|
||||
},
|
||||
{
|
||||
name: "very large node count",
|
||||
data: []byte(`nodestate event="{\"cluster\":\"testcluster\",\"nodes\":[{\"hostname\":\"node1\",\"states\":[\"idle\"],\"cpusAllocated\":0,\"memoryAllocated\":0,\"gpusAllocated\":0,\"jobsRunning\":0},{\"hostname\":\"node2\",\"states\":[\"idle\"],\"cpusAllocated\":0,\"memoryAllocated\":0,\"gpusAllocated\":0,\"jobsRunning\":0},{\"hostname\":\"node3\",\"states\":[\"idle\"],\"cpusAllocated\":0,\"memoryAllocated\":0,\"gpusAllocated\":0,\"jobsRunning\":0}]}" 1234567890000000000`),
|
||||
expectError: false,
|
||||
description: "Should handle multiple nodes efficiently",
|
||||
},
|
||||
{
|
||||
name: "timestamp in past",
|
||||
data: []byte(`nodestate event="{\"cluster\":\"testcluster\",\"nodes\":[]}" 1000000000000000000`),
|
||||
expectError: false,
|
||||
description: "Should accept any valid timestamp",
|
||||
},
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
natsAPI.handleNodeState("test.subject", tt.data)
|
||||
time.Sleep(50 * time.Millisecond)
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestNatsHandleStartJobDuplicatePrevention(t *testing.T) {
|
||||
natsAPI := setupNatsTest(t)
|
||||
t.Cleanup(cleanupNatsTest)
|
||||
|
||||
// Start a job
|
||||
payload := `{
|
||||
"jobId": 5001,
|
||||
"user": "testuser",
|
||||
"project": "testproj",
|
||||
"cluster": "testcluster",
|
||||
"partition": "main",
|
||||
"walltime": 3600,
|
||||
"numNodes": 1,
|
||||
"numHwthreads": 8,
|
||||
"numAcc": 0,
|
||||
"shared": "none",
|
||||
"monitoringStatus": 1,
|
||||
"smt": 1,
|
||||
"resources": [
|
||||
{
|
||||
"hostname": "host123",
|
||||
"hwthreads": [0, 1, 2, 3]
|
||||
}
|
||||
],
|
||||
"startTime": 1234567890
|
||||
}`
|
||||
|
||||
natsAPI.handleStartJob(payload)
|
||||
natsAPI.JobRepository.SyncJobs()
|
||||
time.Sleep(100 * time.Millisecond)
|
||||
|
||||
// Try to start the same job again (within 24 hours)
|
||||
duplicatePayload := `{
|
||||
"jobId": 5001,
|
||||
"user": "testuser",
|
||||
"project": "testproj",
|
||||
"cluster": "testcluster",
|
||||
"partition": "main",
|
||||
"walltime": 3600,
|
||||
"numNodes": 1,
|
||||
"numHwthreads": 8,
|
||||
"numAcc": 0,
|
||||
"shared": "none",
|
||||
"monitoringStatus": 1,
|
||||
"smt": 1,
|
||||
"resources": [
|
||||
{
|
||||
"hostname": "host123",
|
||||
"hwthreads": [0, 1, 2, 3]
|
||||
}
|
||||
],
|
||||
"startTime": 1234567900
|
||||
}`
|
||||
|
||||
natsAPI.handleStartJob(duplicatePayload)
|
||||
natsAPI.JobRepository.SyncJobs()
|
||||
time.Sleep(100 * time.Millisecond)
|
||||
|
||||
// Verify only one job exists
|
||||
jobID := int64(5001)
|
||||
cluster := "testcluster"
|
||||
jobs, err := natsAPI.JobRepository.FindAll(&jobID, &cluster, nil)
|
||||
if err != nil && err != sql.ErrNoRows {
|
||||
t.Fatalf("unexpected error: %v", err)
|
||||
}
|
||||
|
||||
if len(jobs) != 1 {
|
||||
t.Errorf("expected 1 job, got %d", len(jobs))
|
||||
}
|
||||
}
|
||||
@@ -12,7 +12,7 @@ import (
|
||||
"time"
|
||||
|
||||
"github.com/ClusterCockpit/cc-backend/internal/repository"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
)
|
||||
|
||||
type UpdateNodeStatesRequest struct {
|
||||
@@ -47,7 +47,7 @@ func determineState(states []string) schema.SchedulerState {
|
||||
// @description Required query-parameter defines if all users or only users with additional special roles are returned.
|
||||
// @produce json
|
||||
// @param request body UpdateNodeStatesRequest true "Request body containing nodes and their states"
|
||||
// @success 200 {object} api.DefaultApiResponse "Success message"
|
||||
// @success 200 {object} api.DefaultAPIResponse "Success message"
|
||||
// @failure 400 {object} api.ErrorResponse "Bad Request"
|
||||
// @failure 401 {object} api.ErrorResponse "Unauthorized"
|
||||
// @failure 403 {object} api.ErrorResponse "Forbidden"
|
||||
|
||||
@@ -22,9 +22,9 @@ import (
|
||||
"github.com/ClusterCockpit/cc-backend/internal/auth"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/config"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/repository"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
"github.com/ClusterCockpit/cc-lib/util"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/util"
|
||||
"github.com/gorilla/mux"
|
||||
)
|
||||
|
||||
@@ -48,6 +48,7 @@ import (
|
||||
const (
|
||||
noticeFilePath = "./var/notice.txt"
|
||||
noticeFilePerms = 0o644
|
||||
maxNoticeLength = 10000 // Maximum allowed notice content length in characters
|
||||
)
|
||||
|
||||
type RestAPI struct {
|
||||
@@ -61,6 +62,7 @@ type RestAPI struct {
|
||||
RepositoryMutex sync.Mutex
|
||||
}
|
||||
|
||||
// New creates and initializes a new RestAPI instance with configured dependencies.
|
||||
func New() *RestAPI {
|
||||
return &RestAPI{
|
||||
JobRepository: repository.GetJobRepository(),
|
||||
@@ -69,6 +71,8 @@ func New() *RestAPI {
|
||||
}
|
||||
}
|
||||
|
||||
// MountAPIRoutes registers REST API endpoints for job and cluster management.
// These routes use JWT token authentication via the X-Auth-Token header.
func (api *RestAPI) MountAPIRoutes(r *mux.Router) {
	r.StrictSlash(true)
	// REST API Uses TokenAuth
@@ -79,8 +83,11 @@ func (api *RestAPI) MountAPIRoutes(r *mux.Router) {
	// Slurm node state
	r.HandleFunc("/nodestate/", api.updateNodeStates).Methods(http.MethodPost, http.MethodPut)
	// Job Handler
	r.HandleFunc("/jobs/start_job/", api.startJob).Methods(http.MethodPost, http.MethodPut)
	r.HandleFunc("/jobs/stop_job/", api.stopJobByRequest).Methods(http.MethodPost, http.MethodPut)
	if config.Keys.APISubjects == nil {
		cclog.Info("Enabling REST start/stop job API")
		r.HandleFunc("/jobs/start_job/", api.startJob).Methods(http.MethodPost, http.MethodPut)
		r.HandleFunc("/jobs/stop_job/", api.stopJobByRequest).Methods(http.MethodPost, http.MethodPut)
	}
	r.HandleFunc("/jobs/", api.getJobs).Methods(http.MethodGet)
	r.HandleFunc("/jobs/{id}", api.getJobByID).Methods(http.MethodPost)
	r.HandleFunc("/jobs/{id}", api.getCompleteJobByID).Methods(http.MethodGet)
@@ -100,6 +107,8 @@ func (api *RestAPI) MountAPIRoutes(r *mux.Router) {
	}
}
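
A minimal client-side sketch of calling one of these routes follows; only the X-Auth-Token header name is taken from the comment above, while the host, the /api prefix, and the token value are assumptions.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Hypothetical request against the job list endpoint; adjust host and
	// prefix to the actual router mount point of the deployment.
	req, err := http.NewRequest(http.MethodGet, "http://localhost:8080/api/jobs/", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-Auth-Token", "<JWT for an API user>")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status, string(body))
}
```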
|
||||
|
||||
// MountUserAPIRoutes registers user-accessible REST API endpoints.
|
||||
// These are limited endpoints for regular users with JWT token authentication.
|
||||
func (api *RestAPI) MountUserAPIRoutes(r *mux.Router) {
|
||||
r.StrictSlash(true)
|
||||
// REST API Uses TokenAuth
|
||||
@@ -109,6 +118,8 @@ func (api *RestAPI) MountUserAPIRoutes(r *mux.Router) {
|
||||
r.HandleFunc("/jobs/metrics/{id}", api.getJobMetrics).Methods(http.MethodGet)
|
||||
}
|
||||
|
||||
// MountMetricStoreAPIRoutes registers metric storage API endpoints.
|
||||
// These endpoints handle metric data ingestion and health checks with JWT token authentication.
|
||||
func (api *RestAPI) MountMetricStoreAPIRoutes(r *mux.Router) {
|
||||
// REST API Uses TokenAuth
|
||||
// Note: StrictSlash handles trailing slash variations automatically
|
||||
@@ -123,6 +134,8 @@ func (api *RestAPI) MountMetricStoreAPIRoutes(r *mux.Router) {
|
||||
r.HandleFunc("/api/healthcheck/", metricsHealth).Methods(http.MethodGet)
|
||||
}
|
||||
|
||||
// MountConfigAPIRoutes registers configuration and user management endpoints.
|
||||
// These routes use session-based authentication and require admin privileges.
|
||||
func (api *RestAPI) MountConfigAPIRoutes(r *mux.Router) {
|
||||
r.StrictSlash(true)
|
||||
// Settings Frontend Uses SessionAuth
|
||||
@@ -136,6 +149,8 @@ func (api *RestAPI) MountConfigAPIRoutes(r *mux.Router) {
|
||||
}
|
||||
}
|
||||
|
||||
// MountFrontendAPIRoutes registers frontend-specific API endpoints.
|
||||
// These routes support JWT generation and user configuration updates with session authentication.
|
||||
func (api *RestAPI) MountFrontendAPIRoutes(r *mux.Router) {
|
||||
r.StrictSlash(true)
|
||||
// Settings Frontend Uses SessionAuth
|
||||
@@ -157,6 +172,8 @@ type DefaultAPIResponse struct {
|
||||
Message string `json:"msg"`
|
||||
}
|
||||
|
||||
// handleError writes a standardized JSON error response with the given status code.
|
||||
// It logs the error at WARN level and ensures proper Content-Type headers are set.
|
||||
func handleError(err error, statusCode int, rw http.ResponseWriter) {
|
||||
cclog.Warnf("REST ERROR : %s", err.Error())
|
||||
rw.Header().Add("Content-Type", "application/json")
|
||||
@@ -169,15 +186,38 @@ func handleError(err error, statusCode int, rw http.ResponseWriter) {
|
||||
}
|
||||
}
|
||||
|
||||
// decode reads JSON from r into val with strict validation that rejects unknown fields.
func decode(r io.Reader, val any) error {
	dec := json.NewDecoder(r)
	dec.DisallowUnknownFields()
	return dec.Decode(val)
}
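
A short usage sketch for this helper, assuming it is called from a handler in the same package; the request type and its fields are purely illustrative.

```go
// Illustrative only: a hypothetical request type decoded with the strict helper.
// (Requires "fmt" and "net/http" in addition to the package's existing imports.)
type exampleRequest struct {
	Type string `json:"type"`
	Name string `json:"name"`
}

func exampleHandler(rw http.ResponseWriter, r *http.Request) {
	var req exampleRequest
	if err := decode(r.Body, &req); err != nil {
		// Unknown JSON fields are rejected as well, not just malformed JSON.
		handleError(fmt.Errorf("parsing request body failed: %w", err), http.StatusBadRequest, rw)
		return
	}
	// ... use req ...
}
```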
|
||||
|
||||
func (api *RestAPI) editNotice(rw http.ResponseWriter, r *http.Request) {
|
||||
// SecuredCheck() only worked with TokenAuth: Removed
|
||||
// validatePathComponent checks if a path component contains potentially malicious patterns
|
||||
// that could be used for path traversal attacks. Returns an error if validation fails.
|
||||
func validatePathComponent(component, componentName string) error {
|
||||
if strings.Contains(component, "..") ||
|
||||
strings.Contains(component, "/") ||
|
||||
strings.Contains(component, "\\") {
|
||||
return fmt.Errorf("invalid %s", componentName)
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// editNotice godoc
|
||||
// @summary Update system notice
|
||||
// @tags Config
|
||||
// @description Updates the notice.txt file content. Only admins are allowed. Content is limited to 10000 characters.
|
||||
// @accept mpfd
|
||||
// @produce plain
|
||||
// @param new-content formData string true "New notice content (max 10000 characters)"
|
||||
// @success 200 {string} string "Update Notice Content Success"
|
||||
// @failure 400 {object} ErrorResponse "Bad Request"
|
||||
// @failure 403 {object} ErrorResponse "Forbidden"
|
||||
// @failure 500 {object} ErrorResponse "Internal Server Error"
|
||||
// @security ApiKeyAuth
|
||||
// @router /notice/ [post]
|
||||
func (api *RestAPI) editNotice(rw http.ResponseWriter, r *http.Request) {
|
||||
if user := repository.GetUserFromContext(r.Context()); !user.HasRole(schema.RoleAdmin) {
|
||||
handleError(fmt.Errorf("only admins are allowed to update the notice.txt file"), http.StatusForbidden, rw)
|
||||
return
|
||||
@@ -186,9 +226,8 @@ func (api *RestAPI) editNotice(rw http.ResponseWriter, r *http.Request) {
|
||||
// Get Value
|
||||
newContent := r.FormValue("new-content")
|
||||
|
||||
// Validate content length to prevent DoS
|
||||
if len(newContent) > 10000 {
|
||||
handleError(fmt.Errorf("notice content exceeds maximum length of 10000 characters"), http.StatusBadRequest, rw)
|
||||
if len(newContent) > maxNoticeLength {
|
||||
handleError(fmt.Errorf("notice content exceeds maximum length of %d characters", maxNoticeLength), http.StatusBadRequest, rw)
|
||||
return
|
||||
}
|
||||
|
||||
@@ -200,7 +239,9 @@ func (api *RestAPI) editNotice(rw http.ResponseWriter, r *http.Request) {
|
||||
handleError(fmt.Errorf("creating notice file failed: %w", err), http.StatusInternalServerError, rw)
|
||||
return
|
||||
}
|
||||
ntxt.Close()
|
||||
if err := ntxt.Close(); err != nil {
|
||||
cclog.Warnf("Failed to close notice file: %v", err)
|
||||
}
|
||||
}
|
||||
|
||||
if err := os.WriteFile(noticeFilePath, []byte(newContent), noticeFilePerms); err != nil {
|
||||
@@ -210,13 +251,30 @@ func (api *RestAPI) editNotice(rw http.ResponseWriter, r *http.Request) {
|
||||
|
||||
rw.Header().Set("Content-Type", "text/plain")
|
||||
rw.WriteHeader(http.StatusOK)
|
||||
var msg []byte
|
||||
if newContent != "" {
|
||||
rw.Write([]byte("Update Notice Content Success"))
|
||||
msg = []byte("Update Notice Content Success")
|
||||
} else {
|
||||
rw.Write([]byte("Empty Notice Content Success"))
|
||||
msg = []byte("Empty Notice Content Success")
|
||||
}
|
||||
if _, err := rw.Write(msg); err != nil {
|
||||
cclog.Errorf("Failed to write response: %v", err)
|
||||
}
|
||||
}
|
||||
|
||||
// getJWT godoc
|
||||
// @summary Generate JWT token
|
||||
// @tags Frontend
|
||||
// @description Generates a JWT token for a user. Admins can generate tokens for any user, regular users only for themselves.
|
||||
// @accept mpfd
|
||||
// @produce plain
|
||||
// @param username formData string true "Username to generate JWT for"
|
||||
// @success 200 {string} string "JWT token"
|
||||
// @failure 403 {object} ErrorResponse "Forbidden"
|
||||
// @failure 404 {object} ErrorResponse "User Not Found"
|
||||
// @failure 500 {object} ErrorResponse "Internal Server Error"
|
||||
// @security ApiKeyAuth
|
||||
// @router /jwt/ [get]
|
||||
func (api *RestAPI) getJWT(rw http.ResponseWriter, r *http.Request) {
|
||||
rw.Header().Set("Content-Type", "text/plain")
|
||||
username := r.FormValue("username")
|
||||
@@ -241,12 +299,22 @@ func (api *RestAPI) getJWT(rw http.ResponseWriter, r *http.Request) {
|
||||
}
|
||||
|
||||
rw.WriteHeader(http.StatusOK)
|
||||
rw.Write([]byte(jwt))
|
||||
if _, err := rw.Write([]byte(jwt)); err != nil {
|
||||
cclog.Errorf("Failed to write JWT response: %v", err)
|
||||
}
|
||||
}
|
||||
|
||||
// getRoles godoc
|
||||
// @summary Get available roles
|
||||
// @tags Config
|
||||
// @description Returns a list of valid user roles. Only admins are allowed.
|
||||
// @produce json
|
||||
// @success 200 {array} string "List of role names"
|
||||
// @failure 403 {object} ErrorResponse "Forbidden"
|
||||
// @failure 500 {object} ErrorResponse "Internal Server Error"
|
||||
// @security ApiKeyAuth
|
||||
// @router /roles/ [get]
|
||||
func (api *RestAPI) getRoles(rw http.ResponseWriter, r *http.Request) {
|
||||
// SecuredCheck() only worked with TokenAuth: Removed
|
||||
|
||||
user := repository.GetUserFromContext(r.Context())
|
||||
if !user.HasRole(schema.RoleAdmin) {
|
||||
handleError(fmt.Errorf("only admins are allowed to fetch a list of roles"), http.StatusForbidden, rw)
|
||||
@@ -265,6 +333,18 @@ func (api *RestAPI) getRoles(rw http.ResponseWriter, r *http.Request) {
|
||||
}
|
||||
}
|
||||
|
||||
// updateConfiguration godoc
|
||||
// @summary Update user configuration
|
||||
// @tags Frontend
|
||||
// @description Updates a user's configuration key-value pair.
|
||||
// @accept mpfd
|
||||
// @produce plain
|
||||
// @param key formData string true "Configuration key"
|
||||
// @param value formData string true "Configuration value"
|
||||
// @success 200 {string} string "success"
|
||||
// @failure 500 {object} ErrorResponse "Internal Server Error"
|
||||
// @security ApiKeyAuth
|
||||
// @router /configuration/ [post]
|
||||
func (api *RestAPI) updateConfiguration(rw http.ResponseWriter, r *http.Request) {
|
||||
rw.Header().Set("Content-Type", "text/plain")
|
||||
key, value := r.FormValue("key"), r.FormValue("value")
|
||||
@@ -275,9 +355,25 @@ func (api *RestAPI) updateConfiguration(rw http.ResponseWriter, r *http.Request)
|
||||
}
|
||||
|
||||
rw.WriteHeader(http.StatusOK)
|
||||
rw.Write([]byte("success"))
|
||||
if _, err := rw.Write([]byte("success")); err != nil {
|
||||
cclog.Errorf("Failed to write response: %v", err)
|
||||
}
|
||||
}
|
||||
|
||||
// putMachineState godoc
|
||||
// @summary Store machine state
|
||||
// @tags Machine State
|
||||
// @description Stores machine state data for a specific cluster node. Validates cluster and host names to prevent path traversal.
|
||||
// @accept json
|
||||
// @produce plain
|
||||
// @param cluster path string true "Cluster name"
|
||||
// @param host path string true "Host name"
|
||||
// @success 201 "Created"
|
||||
// @failure 400 {object} ErrorResponse "Bad Request"
|
||||
// @failure 404 {object} ErrorResponse "Machine state not enabled"
|
||||
// @failure 500 {object} ErrorResponse "Internal Server Error"
|
||||
// @security ApiKeyAuth
|
||||
// @router /machine_state/{cluster}/{host} [put]
|
||||
func (api *RestAPI) putMachineState(rw http.ResponseWriter, r *http.Request) {
|
||||
if api.MachineStateDir == "" {
|
||||
handleError(fmt.Errorf("machine state not enabled"), http.StatusNotFound, rw)
|
||||
@@ -288,13 +384,12 @@ func (api *RestAPI) putMachineState(rw http.ResponseWriter, r *http.Request) {
|
||||
cluster := vars["cluster"]
|
||||
host := vars["host"]
|
||||
|
||||
// Validate cluster and host to prevent path traversal attacks
|
||||
if strings.Contains(cluster, "..") || strings.Contains(cluster, "/") || strings.Contains(cluster, "\\") {
|
||||
handleError(fmt.Errorf("invalid cluster name"), http.StatusBadRequest, rw)
|
||||
if err := validatePathComponent(cluster, "cluster name"); err != nil {
|
||||
handleError(err, http.StatusBadRequest, rw)
|
||||
return
|
||||
}
|
||||
if strings.Contains(host, "..") || strings.Contains(host, "/") || strings.Contains(host, "\\") {
|
||||
handleError(fmt.Errorf("invalid host name"), http.StatusBadRequest, rw)
|
||||
if err := validatePathComponent(host, "host name"); err != nil {
|
||||
handleError(err, http.StatusBadRequest, rw)
|
||||
return
|
||||
}
|
||||
|
||||
@@ -320,6 +415,18 @@ func (api *RestAPI) putMachineState(rw http.ResponseWriter, r *http.Request) {
|
||||
rw.WriteHeader(http.StatusCreated)
|
||||
}
|
||||
|
||||
// getMachineState godoc
|
||||
// @summary Retrieve machine state
|
||||
// @tags Machine State
|
||||
// @description Retrieves stored machine state data for a specific cluster node. Validates cluster and host names to prevent path traversal.
|
||||
// @produce json
|
||||
// @param cluster path string true "Cluster name"
|
||||
// @param host path string true "Host name"
|
||||
// @success 200 {object} object "Machine state JSON data"
|
||||
// @failure 400 {object} ErrorResponse "Bad Request"
|
||||
// @failure 404 {object} ErrorResponse "Machine state not enabled or file not found"
|
||||
// @security ApiKeyAuth
|
||||
// @router /machine_state/{cluster}/{host} [get]
|
||||
func (api *RestAPI) getMachineState(rw http.ResponseWriter, r *http.Request) {
|
||||
if api.MachineStateDir == "" {
|
||||
handleError(fmt.Errorf("machine state not enabled"), http.StatusNotFound, rw)
|
||||
@@ -330,13 +437,12 @@ func (api *RestAPI) getMachineState(rw http.ResponseWriter, r *http.Request) {
|
||||
cluster := vars["cluster"]
|
||||
host := vars["host"]
|
||||
|
||||
// Validate cluster and host to prevent path traversal attacks
|
||||
if strings.Contains(cluster, "..") || strings.Contains(cluster, "/") || strings.Contains(cluster, "\\") {
|
||||
handleError(fmt.Errorf("invalid cluster name"), http.StatusBadRequest, rw)
|
||||
if err := validatePathComponent(cluster, "cluster name"); err != nil {
|
||||
handleError(err, http.StatusBadRequest, rw)
|
||||
return
|
||||
}
|
||||
if strings.Contains(host, "..") || strings.Contains(host, "/") || strings.Contains(host, "\\") {
|
||||
handleError(fmt.Errorf("invalid host name"), http.StatusBadRequest, rw)
|
||||
if err := validatePathComponent(host, "host name"); err != nil {
|
||||
handleError(err, http.StatusBadRequest, rw)
|
||||
return
|
||||
}
|
||||
|
||||
|
||||
@@ -11,8 +11,8 @@ import (
|
||||
"net/http"
|
||||
|
||||
"github.com/ClusterCockpit/cc-backend/internal/repository"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
"github.com/gorilla/mux"
|
||||
)
|
||||
|
||||
@@ -31,7 +31,7 @@ type APIReturnedUser struct {
|
||||
// @description Required query-parameter defines if all users or only users with additional special roles are returned.
|
||||
// @produce json
|
||||
// @param not-just-user query bool true "If returned list should contain all users or only users with additional special roles"
|
||||
// @success 200 {array} api.ApiReturnedUser "List of users returned successfully"
|
||||
// @success 200 {array} api.APIReturnedUser "List of users returned successfully"
|
||||
// @failure 400 {string} string "Bad Request"
|
||||
// @failure 401 {string} string "Unauthorized"
|
||||
// @failure 403 {string} string "Forbidden"
|
||||
|
||||
@@ -106,7 +106,7 @@ Data is archived at the highest available resolution (typically 60s intervals).

```go
// In archiver.go ArchiveJob() function
jobData, err := metricDataDispatcher.LoadData(job, allMetrics, scopes, ctx, 300)
jobData, err := metricdispatch.LoadData(job, allMetrics, scopes, ctx, 300)
// 0 = highest resolution
// 300 = 5-minute resolution
```

@@ -185,6 +185,6 @@ Internal state is protected by:

## Dependencies

- `internal/repository`: Database operations for job metadata
- `internal/metricDataDispatcher`: Loading metric data from various backends
- `internal/metricdispatch`: Loading metric data from various backends
- `pkg/archive`: Archive backend abstraction (filesystem, S3, SQLite)
- `cc-lib/schema`: Job and metric data structures
|
||||
|
||||
@@ -54,8 +54,8 @@ import (
|
||||
"time"
|
||||
|
||||
"github.com/ClusterCockpit/cc-backend/internal/repository"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
sq "github.com/Masterminds/squirrel"
|
||||
)
|
||||
|
||||
|
||||
@@ -10,10 +10,10 @@ import (
|
||||
"math"
|
||||
|
||||
"github.com/ClusterCockpit/cc-backend/internal/config"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/metricDataDispatcher"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/metricdispatch"
|
||||
"github.com/ClusterCockpit/cc-backend/pkg/archive"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
)
|
||||
|
||||
// ArchiveJob archives a completed job's metric data to the configured archive backend.
|
||||
@@ -60,7 +60,7 @@ func ArchiveJob(job *schema.Job, ctx context.Context) (*schema.Job, error) {
|
||||
scopes = append(scopes, schema.MetricScopeAccelerator)
|
||||
}
|
||||
|
||||
	jobData, err := metricDataDispatcher.LoadData(job, allMetrics, scopes, ctx, 0) // 0 resolution value retrieves highest res (60s)
	jobData, err := metricdispatch.LoadData(job, allMetrics, scopes, ctx, 0) // 0 resolution value retrieves highest res (60s)
	if err != nil {
		cclog.Error("Error while loading job data for archiving")
|
||||
return nil, err
|
||||
|
||||
@@ -25,9 +25,9 @@ import (
|
||||
|
||||
"github.com/ClusterCockpit/cc-backend/internal/config"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/repository"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
"github.com/ClusterCockpit/cc-lib/util"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/util"
|
||||
"github.com/gorilla/sessions"
|
||||
)
|
||||
|
||||
@@ -40,7 +40,7 @@ type Authenticator interface {
|
||||
// authenticator should attempt the login. This method should not perform
|
||||
// expensive operations or actual authentication.
|
||||
CanLogin(user *schema.User, username string, rw http.ResponseWriter, r *http.Request) (*schema.User, bool)
|
||||
|
||||
|
||||
// Login performs the actually authentication for the user.
|
||||
// It returns the authenticated user or an error if authentication fails.
|
||||
// The user parameter may be nil if the user doesn't exist in the database yet.
|
||||
@@ -65,13 +65,13 @@ var ipUserLimiters sync.Map
|
||||
func getIPUserLimiter(ip, username string) *rate.Limiter {
|
||||
key := ip + ":" + username
|
||||
now := time.Now()
|
||||
|
||||
|
||||
if entry, ok := ipUserLimiters.Load(key); ok {
|
||||
rle := entry.(*rateLimiterEntry)
|
||||
rle.lastUsed = now
|
||||
return rle.limiter
|
||||
}
|
||||
|
||||
|
||||
// More aggressive rate limiting: 5 attempts per 15 minutes
|
||||
newLimiter := rate.NewLimiter(rate.Every(15*time.Minute/5), 5)
|
||||
ipUserLimiters.Store(key, &rateLimiterEntry{
|
||||
@@ -176,7 +176,7 @@ func (auth *Authentication) AuthViaSession(
|
||||
func Init(authCfg *json.RawMessage) {
|
||||
initOnce.Do(func() {
|
||||
authInstance = &Authentication{}
|
||||
|
||||
|
||||
// Start background cleanup of rate limiters
|
||||
startRateLimiterCleanup()
|
||||
|
||||
@@ -272,7 +272,7 @@ func handleUserSync(user *schema.User, syncUserOnLogin, updateUserOnLogin bool)
|
||||
cclog.Errorf("Error while loading user '%s': %v", user.Username, err)
|
||||
return
|
||||
}
|
||||
|
||||
|
||||
if err == sql.ErrNoRows && syncUserOnLogin { // Add new user
|
||||
if err := r.AddUser(user); err != nil {
|
||||
cclog.Errorf("Error while adding user '%s' to DB: %v", user.Username, err)
|
||||
|
||||
@@ -15,25 +15,25 @@ import (
|
||||
func TestGetIPUserLimiter(t *testing.T) {
|
||||
ip := "192.168.1.1"
|
||||
username := "testuser"
|
||||
|
||||
|
||||
// Get limiter for the first time
|
||||
limiter1 := getIPUserLimiter(ip, username)
|
||||
if limiter1 == nil {
|
||||
t.Fatal("Expected limiter to be created")
|
||||
}
|
||||
|
||||
|
||||
// Get the same limiter again
|
||||
limiter2 := getIPUserLimiter(ip, username)
|
||||
if limiter1 != limiter2 {
|
||||
t.Error("Expected to get the same limiter instance")
|
||||
}
|
||||
|
||||
|
||||
// Get a different limiter for different user
|
||||
limiter3 := getIPUserLimiter(ip, "otheruser")
|
||||
if limiter1 == limiter3 {
|
||||
t.Error("Expected different limiter for different user")
|
||||
}
|
||||
|
||||
|
||||
// Get a different limiter for different IP
|
||||
limiter4 := getIPUserLimiter("192.168.1.2", username)
|
||||
if limiter1 == limiter4 {
|
||||
@@ -45,16 +45,16 @@ func TestGetIPUserLimiter(t *testing.T) {
|
||||
func TestRateLimiterBehavior(t *testing.T) {
|
||||
ip := "10.0.0.1"
|
||||
username := "ratelimituser"
|
||||
|
||||
|
||||
limiter := getIPUserLimiter(ip, username)
|
||||
|
||||
|
||||
// Should allow first 5 attempts
|
||||
for i := 0; i < 5; i++ {
|
||||
if !limiter.Allow() {
|
||||
t.Errorf("Request %d should be allowed within rate limit", i+1)
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
// 6th attempt should be blocked
|
||||
if limiter.Allow() {
|
||||
t.Error("Request 6 should be blocked by rate limiter")
|
||||
@@ -65,19 +65,19 @@ func TestRateLimiterBehavior(t *testing.T) {
|
||||
func TestCleanupOldRateLimiters(t *testing.T) {
|
||||
// Clear all existing limiters first to avoid interference from other tests
|
||||
cleanupOldRateLimiters(time.Now().Add(24 * time.Hour))
|
||||
|
||||
|
||||
// Create some new rate limiters
|
||||
limiter1 := getIPUserLimiter("1.1.1.1", "user1")
|
||||
limiter2 := getIPUserLimiter("2.2.2.2", "user2")
|
||||
|
||||
|
||||
if limiter1 == nil || limiter2 == nil {
|
||||
t.Fatal("Failed to create test limiters")
|
||||
}
|
||||
|
||||
|
||||
// Cleanup limiters older than 1 second from now (should keep both)
|
||||
time.Sleep(10 * time.Millisecond) // Small delay to ensure timestamp difference
|
||||
cleanupOldRateLimiters(time.Now().Add(-1 * time.Second))
|
||||
|
||||
|
||||
// Verify they still exist (should get same instance)
|
||||
if getIPUserLimiter("1.1.1.1", "user1") != limiter1 {
|
||||
t.Error("Limiter 1 was incorrectly cleaned up")
|
||||
@@ -85,10 +85,10 @@ func TestCleanupOldRateLimiters(t *testing.T) {
|
||||
if getIPUserLimiter("2.2.2.2", "user2") != limiter2 {
|
||||
t.Error("Limiter 2 was incorrectly cleaned up")
|
||||
}
|
||||
|
||||
|
||||
// Cleanup limiters older than 1 hour from now (should remove both)
|
||||
cleanupOldRateLimiters(time.Now().Add(2 * time.Hour))
|
||||
|
||||
|
||||
// Getting them again should create new instances
|
||||
newLimiter1 := getIPUserLimiter("1.1.1.1", "user1")
|
||||
if newLimiter1 == limiter1 {
|
||||
@@ -107,14 +107,14 @@ func TestIPv4Extraction(t *testing.T) {
|
||||
{"IPv4 without port", "192.168.1.1", "192.168.1.1"},
|
||||
{"Localhost with port", "127.0.0.1:3000", "127.0.0.1"},
|
||||
}
|
||||
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
result := tt.input
|
||||
if host, _, err := net.SplitHostPort(result); err == nil {
|
||||
result = host
|
||||
}
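// net.SplitHostPort only succeeds when a port is present (e.g. "192.168.1.1:8080" -> "192.168.1.1"); inputs without a port return an error and are kept unchanged.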
|
||||
|
||||
|
||||
if result != tt.expected {
|
||||
t.Errorf("Expected %s, got %s", tt.expected, result)
|
||||
}
|
||||
@@ -122,7 +122,7 @@ func TestIPv4Extraction(t *testing.T) {
|
||||
}
|
||||
}
|
||||
|
||||
// TestIPv6Extraction tests extracting IPv6 addresses
|
||||
// TestIPv6Extraction tests extracting IPv6 addresses
|
||||
func TestIPv6Extraction(t *testing.T) {
|
||||
tests := []struct {
|
||||
name string
|
||||
@@ -134,14 +134,14 @@ func TestIPv6Extraction(t *testing.T) {
|
||||
{"IPv6 without port", "2001:db8::1", "2001:db8::1"},
|
||||
{"IPv6 localhost", "::1", "::1"},
|
||||
}
|
||||
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
result := tt.input
|
||||
if host, _, err := net.SplitHostPort(result); err == nil {
|
||||
result = host
|
||||
}
|
||||
|
||||
|
||||
if result != tt.expected {
|
||||
t.Errorf("Expected %s, got %s", tt.expected, result)
|
||||
}
|
||||
@@ -160,14 +160,14 @@ func TestIPExtractionEdgeCases(t *testing.T) {
|
||||
{"Empty string", "", ""},
|
||||
{"Just port", ":8080", ""},
|
||||
}
|
||||
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
result := tt.input
|
||||
if host, _, err := net.SplitHostPort(result); err == nil {
|
||||
result = host
|
||||
}
|
||||
|
||||
|
||||
if result != tt.expected {
|
||||
t.Errorf("Expected %s, got %s", tt.expected, result)
|
||||
}
|
||||
|
||||
@@ -14,8 +14,8 @@ import (
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
"github.com/golang-jwt/jwt/v5"
|
||||
)
|
||||
|
||||
@@ -101,20 +101,20 @@ func (ja *JWTAuthenticator) AuthViaJWT(
|
||||
|
||||
// Token is valid, extract payload
|
||||
claims := token.Claims.(jwt.MapClaims)
|
||||
|
||||
|
||||
// Use shared helper to get user from JWT claims
|
||||
var user *schema.User
|
||||
user, err = getUserFromJWT(claims, Keys.JwtConfig.ValidateUser, schema.AuthToken, -1)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
|
||||
|
||||
// If not validating user, we only get roles from JWT (no projects for this auth method)
|
||||
if !Keys.JwtConfig.ValidateUser {
|
||||
user.Roles = extractRolesFromClaims(claims, false)
|
||||
user.Projects = nil // Standard JWT auth doesn't include projects
|
||||
}
|
||||
|
||||
|
||||
return user, nil
|
||||
}
|
||||
|
||||
|
||||
@@ -12,8 +12,8 @@ import (
|
||||
"net/http"
|
||||
"os"
|
||||
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
"github.com/golang-jwt/jwt/v5"
|
||||
)
|
||||
|
||||
@@ -146,13 +146,13 @@ func (ja *JWTCookieSessionAuthenticator) Login(
|
||||
}
|
||||
|
||||
claims := token.Claims.(jwt.MapClaims)
|
||||
|
||||
|
||||
// Use shared helper to get user from JWT claims
|
||||
user, err = getUserFromJWT(claims, jc.ValidateUser, schema.AuthSession, schema.AuthViaToken)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
|
||||
|
||||
// Sync or update user if configured
|
||||
if !jc.ValidateUser && (jc.SyncUserOnLogin || jc.UpdateUserOnLogin) {
|
||||
handleTokenUser(user)
|
||||
|
||||
@@ -11,8 +11,8 @@ import (
|
||||
"fmt"
|
||||
|
||||
"github.com/ClusterCockpit/cc-backend/internal/repository"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
"github.com/golang-jwt/jwt/v5"
|
||||
)
|
||||
|
||||
@@ -28,7 +28,7 @@ func extractStringFromClaims(claims jwt.MapClaims, key string) string {
|
||||
// If validateRoles is true, only valid roles are returned
|
||||
func extractRolesFromClaims(claims jwt.MapClaims, validateRoles bool) []string {
|
||||
var roles []string
|
||||
|
||||
|
||||
if rawroles, ok := claims["roles"].([]any); ok {
|
||||
for _, rr := range rawroles {
|
||||
if r, ok := rr.(string); ok {
|
||||
@@ -42,14 +42,14 @@ func extractRolesFromClaims(claims jwt.MapClaims, validateRoles bool) []string {
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
return roles
|
||||
}
|
||||
|
||||
// extractProjectsFromClaims extracts projects from JWT claims
|
||||
func extractProjectsFromClaims(claims jwt.MapClaims) []string {
|
||||
projects := make([]string, 0)
|
||||
|
||||
|
||||
if rawprojs, ok := claims["projects"].([]any); ok {
|
||||
for _, pp := range rawprojs {
|
||||
if p, ok := pp.(string); ok {
|
||||
@@ -61,7 +61,7 @@ func extractProjectsFromClaims(claims jwt.MapClaims) []string {
|
||||
projects = append(projects, projSlice...)
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
return projects
|
||||
}
|
||||
|
||||
@@ -72,14 +72,14 @@ func extractNameFromClaims(claims jwt.MapClaims) string {
|
||||
if name, ok := claims["name"].(string); ok {
|
||||
return name
|
||||
}
|
||||
|
||||
|
||||
// Try nested structure: {name: {values: [...]}}
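// e.g. {"name": {"values": ["Jane", "Doe"]}} yields "Jane Doe" (values are joined with single spaces).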
|
||||
if wrap, ok := claims["name"].(map[string]any); ok {
|
||||
if vals, ok := wrap["values"].([]any); ok {
|
||||
if len(vals) == 0 {
|
||||
return ""
|
||||
}
|
||||
|
||||
|
||||
name := fmt.Sprintf("%v", vals[0])
|
||||
for i := 1; i < len(vals); i++ {
|
||||
name += fmt.Sprintf(" %v", vals[i])
|
||||
@@ -87,7 +87,7 @@ func extractNameFromClaims(claims jwt.MapClaims) string {
|
||||
return name
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
return ""
|
||||
}
|
||||
|
||||
@@ -100,7 +100,7 @@ func getUserFromJWT(claims jwt.MapClaims, validateUser bool, authType schema.Aut
|
||||
if sub == "" {
|
||||
return nil, errors.New("missing 'sub' claim in JWT")
|
||||
}
|
||||
|
||||
|
||||
if validateUser {
|
||||
// Validate user against database
|
||||
ur := repository.GetUserRepository()
|
||||
@@ -109,22 +109,22 @@ func getUserFromJWT(claims jwt.MapClaims, validateUser bool, authType schema.Aut
|
||||
cclog.Errorf("Error while loading user '%v': %v", sub, err)
|
||||
return nil, fmt.Errorf("database error: %w", err)
|
||||
}
|
||||
|
||||
|
||||
// Deny any logins for unknown usernames
|
||||
if user == nil || err == sql.ErrNoRows {
|
||||
cclog.Warn("Could not find user from JWT in internal database.")
|
||||
return nil, errors.New("unknown user")
|
||||
}
|
||||
|
||||
|
||||
// Return database user (with database roles)
|
||||
return user, nil
|
||||
}
|
||||
|
||||
|
||||
// Create user from JWT claims
|
||||
name := extractNameFromClaims(claims)
|
||||
roles := extractRolesFromClaims(claims, true) // Validate roles
|
||||
projects := extractProjectsFromClaims(claims)
|
||||
|
||||
|
||||
return &schema.User{
|
||||
Username: sub,
|
||||
Name: name,
|
||||
|
||||
@@ -8,7 +8,7 @@ package auth
|
||||
import (
|
||||
"testing"
|
||||
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
"github.com/golang-jwt/jwt/v5"
|
||||
)
|
||||
|
||||
@@ -19,7 +19,7 @@ func TestExtractStringFromClaims(t *testing.T) {
|
||||
"email": "test@example.com",
|
||||
"age": 25, // not a string
|
||||
}
|
||||
|
||||
|
||||
tests := []struct {
|
||||
name string
|
||||
key string
|
||||
@@ -30,7 +30,7 @@ func TestExtractStringFromClaims(t *testing.T) {
|
||||
{"Non-existent key", "missing", ""},
|
||||
{"Non-string value", "age", ""},
|
||||
}
|
||||
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
result := extractStringFromClaims(claims, tt.key)
|
||||
@@ -88,16 +88,16 @@ func TestExtractRolesFromClaims(t *testing.T) {
|
||||
expected: []string{},
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
result := extractRolesFromClaims(tt.claims, tt.validateRoles)
|
||||
|
||||
|
||||
if len(result) != len(tt.expected) {
|
||||
t.Errorf("Expected %d roles, got %d", len(tt.expected), len(result))
|
||||
return
|
||||
}
|
||||
|
||||
|
||||
for i, role := range result {
|
||||
if i >= len(tt.expected) || role != tt.expected[i] {
|
||||
t.Errorf("Expected role %s at position %d, got %s", tt.expected[i], i, role)
|
||||
@@ -141,16 +141,16 @@ func TestExtractProjectsFromClaims(t *testing.T) {
|
||||
expected: []string{"project1", "project2"}, // Should skip non-strings
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
result := extractProjectsFromClaims(tt.claims)
|
||||
|
||||
|
||||
if len(result) != len(tt.expected) {
|
||||
t.Errorf("Expected %d projects, got %d", len(tt.expected), len(result))
|
||||
return
|
||||
}
|
||||
|
||||
|
||||
for i, project := range result {
|
||||
if i >= len(tt.expected) || project != tt.expected[i] {
|
||||
t.Errorf("Expected project %s at position %d, got %s", tt.expected[i], i, project)
|
||||
@@ -216,7 +216,7 @@ func TestExtractNameFromClaims(t *testing.T) {
|
||||
expected: "123 Smith", // Should convert to string
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
result := extractNameFromClaims(tt.claims)
|
||||
@@ -235,29 +235,28 @@ func TestGetUserFromJWT_NoValidation(t *testing.T) {
|
||||
"roles": []any{"user", "admin"},
|
||||
"projects": []any{"project1", "project2"},
|
||||
}
|
||||
|
||||
|
||||
user, err := getUserFromJWT(claims, false, schema.AuthToken, -1)
|
||||
|
||||
if err != nil {
|
||||
t.Fatalf("Unexpected error: %v", err)
|
||||
}
|
||||
|
||||
|
||||
if user.Username != "testuser" {
|
||||
t.Errorf("Expected username 'testuser', got '%s'", user.Username)
|
||||
}
|
||||
|
||||
|
||||
if user.Name != "Test User" {
|
||||
t.Errorf("Expected name 'Test User', got '%s'", user.Name)
|
||||
}
|
||||
|
||||
|
||||
if len(user.Roles) != 2 {
|
||||
t.Errorf("Expected 2 roles, got %d", len(user.Roles))
|
||||
}
|
||||
|
||||
|
||||
if len(user.Projects) != 2 {
|
||||
t.Errorf("Expected 2 projects, got %d", len(user.Projects))
|
||||
}
|
||||
|
||||
|
||||
if user.AuthType != schema.AuthToken {
|
||||
t.Errorf("Expected AuthType %v, got %v", schema.AuthToken, user.AuthType)
|
||||
}
|
||||
@@ -268,13 +267,13 @@ func TestGetUserFromJWT_MissingSub(t *testing.T) {
|
||||
claims := jwt.MapClaims{
|
||||
"name": "Test User",
|
||||
}
|
||||
|
||||
|
||||
_, err := getUserFromJWT(claims, false, schema.AuthToken, -1)
|
||||
|
||||
|
||||
if err == nil {
|
||||
t.Error("Expected error for missing sub claim")
|
||||
}
|
||||
|
||||
|
||||
if err.Error() != "missing 'sub' claim in JWT" {
|
||||
t.Errorf("Expected specific error message, got: %v", err)
|
||||
}
|
||||
|
||||
@@ -13,8 +13,8 @@ import (
|
||||
"os"
|
||||
"strings"
|
||||
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
"github.com/golang-jwt/jwt/v5"
|
||||
)
|
||||
|
||||
@@ -75,13 +75,13 @@ func (ja *JWTSessionAuthenticator) Login(
|
||||
}
|
||||
|
||||
claims := token.Claims.(jwt.MapClaims)
|
||||
|
||||
|
||||
// Use shared helper to get user from JWT claims
|
||||
user, err = getUserFromJWT(claims, Keys.JwtConfig.ValidateUser, schema.AuthSession, schema.AuthViaToken)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
|
||||
|
||||
// Sync or update user if configured
|
||||
if !Keys.JwtConfig.ValidateUser && (Keys.JwtConfig.SyncUserOnLogin || Keys.JwtConfig.UpdateUserOnLogin) {
|
||||
handleTokenUser(user)
|
||||
|
||||
@@ -13,8 +13,8 @@ import (
|
||||
"strings"
|
||||
|
||||
"github.com/ClusterCockpit/cc-backend/internal/repository"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
"github.com/go-ldap/ldap/v3"
|
||||
)
|
||||
|
||||
|
||||
@@ -9,8 +9,8 @@ import (
|
||||
"fmt"
|
||||
"net/http"
|
||||
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
"golang.org/x/crypto/bcrypt"
|
||||
)
|
||||
|
||||
|
||||
@@ -15,8 +15,8 @@ import (
|
||||
"time"
|
||||
|
||||
"github.com/ClusterCockpit/cc-backend/internal/repository"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
"github.com/coreos/go-oidc/v3/oidc"
|
||||
"github.com/gorilla/mux"
|
||||
"golang.org/x/oauth2"
|
||||
@@ -59,7 +59,7 @@ func NewOIDC(a *Authentication) *OIDC {
|
||||
// Use context with timeout for provider initialization
|
||||
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
|
||||
defer cancel()
|
||||
|
||||
|
||||
provider, err := oidc.NewProvider(ctx, Keys.OpenIDConfig.Provider)
|
||||
if err != nil {
|
||||
cclog.Fatal(err)
|
||||
@@ -119,7 +119,7 @@ func (oa *OIDC) OAuth2Callback(rw http.ResponseWriter, r *http.Request) {
|
||||
// Exchange authorization code for token with timeout
|
||||
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
|
||||
defer cancel()
|
||||
|
||||
|
||||
token, err := oa.client.Exchange(ctx, code, oauth2.VerifierOption(codeVerifier))
|
||||
if err != nil {
|
||||
http.Error(rw, "Failed to exchange token: "+err.Error(), http.StatusInternalServerError)
|
||||
|
||||
@@ -11,8 +11,8 @@ import (
|
||||
"encoding/json"
|
||||
"time"
|
||||
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/resampler"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/resampler"
|
||||
)
|
||||
|
||||
type ProgramConfig struct {
|
||||
@@ -37,10 +37,10 @@ type ProgramConfig struct {
|
||||
EmbedStaticFiles bool `json:"embed-static-files"`
|
||||
StaticFiles string `json:"static-files"`
|
||||
|
||||
// 'sqlite3' or 'mysql' (mysql will work for mariadb as well)
|
||||
// Database driver - only 'sqlite3' is supported
|
||||
DBDriver string `json:"db-driver"`
|
||||
|
||||
// For sqlite3 a filename, for mysql a DSN in this format: https://github.com/go-sql-driver/mysql#dsn-data-source-name (Without query parameters!).
|
||||
// Path to SQLite database file
|
||||
DB string `json:"db"`
|
||||
|
||||
// Keep all metric data in the metric data repositories,
|
||||
@@ -90,8 +90,7 @@ type ResampleConfig struct {
|
||||
}
|
||||
|
||||
type NATSConfig struct {
|
||||
SubjectJobStart string `json:"subjectJobStart"`
|
||||
SubjectJobStop string `json:"subjectJobStop"`
|
||||
SubjectJobEvent string `json:"subjectJobEvent"`
|
||||
SubjectNodeState string `json:"subjectNodeState"`
|
||||
}
|
||||
|
||||
@@ -112,14 +111,6 @@ type FilterRanges struct {
|
||||
StartTime *TimeRange `json:"startTime"`
|
||||
}
|
||||
|
||||
type ClusterConfig struct {
|
||||
Name string `json:"name"`
|
||||
FilterRanges *FilterRanges `json:"filterRanges"`
|
||||
MetricDataRepository json.RawMessage `json:"metricDataRepository"`
|
||||
}
|
||||
|
||||
var Clusters []*ClusterConfig
|
||||
|
||||
var Keys ProgramConfig = ProgramConfig{
|
||||
Addr: "localhost:8080",
|
||||
DisableAuthentication: false,
|
||||
@@ -133,7 +124,7 @@ var Keys ProgramConfig = ProgramConfig{
|
||||
ShortRunningJobsDuration: 5 * 60,
|
||||
}
|
||||
|
||||
func Init(mainConfig json.RawMessage, clusterConfig json.RawMessage) {
|
||||
func Init(mainConfig json.RawMessage) {
|
||||
Validate(configSchema, mainConfig)
|
||||
dec := json.NewDecoder(bytes.NewReader(mainConfig))
|
||||
dec.DisallowUnknownFields()
|
||||
@@ -141,17 +132,6 @@ func Init(mainConfig json.RawMessage, clusterConfig json.RawMessage) {
|
||||
cclog.Abortf("Config Init: Could not decode config file '%s'.\nError: %s\n", mainConfig, err.Error())
|
||||
}
|
||||
|
||||
Validate(clustersSchema, clusterConfig)
|
||||
dec = json.NewDecoder(bytes.NewReader(clusterConfig))
|
||||
dec.DisallowUnknownFields()
|
||||
if err := dec.Decode(&Clusters); err != nil {
|
||||
cclog.Abortf("Config Init: Could not decode config file '%s'.\nError: %s\n", mainConfig, err.Error())
|
||||
}
|
||||
|
||||
if len(Clusters) < 1 {
|
||||
cclog.Abort("Config Init: At least one cluster required in config. Exited with error.")
|
||||
}
|
||||
|
||||
if Keys.EnableResampling != nil && Keys.EnableResampling.MinimumPoints > 0 {
|
||||
resampler.SetMinimumRequiredPoints(Keys.EnableResampling.MinimumPoints)
|
||||
}
|
||||
|
||||
@@ -8,19 +8,15 @@ package config
|
||||
import (
|
||||
"testing"
|
||||
|
||||
ccconf "github.com/ClusterCockpit/cc-lib/ccConfig"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
ccconf "github.com/ClusterCockpit/cc-lib/v2/ccConfig"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
)
|
||||
|
||||
func TestInit(t *testing.T) {
|
||||
fp := "../../configs/config.json"
|
||||
ccconf.Init(fp)
|
||||
if cfg := ccconf.GetPackageConfig("main"); cfg != nil {
|
||||
if clustercfg := ccconf.GetPackageConfig("clusters"); clustercfg != nil {
|
||||
Init(cfg, clustercfg)
|
||||
} else {
|
||||
cclog.Abort("Cluster configuration must be present")
|
||||
}
|
||||
Init(cfg)
|
||||
} else {
|
||||
cclog.Abort("Main configuration must be present")
|
||||
}
|
||||
@@ -34,11 +30,7 @@ func TestInitMinimal(t *testing.T) {
|
||||
fp := "../../configs/config-demo.json"
|
||||
ccconf.Init(fp)
|
||||
if cfg := ccconf.GetPackageConfig("main"); cfg != nil {
|
||||
if clustercfg := ccconf.GetPackageConfig("clusters"); clustercfg != nil {
|
||||
Init(cfg, clustercfg)
|
||||
} else {
|
||||
cclog.Abort("Cluster configuration must be present")
|
||||
}
|
||||
Init(cfg)
|
||||
} else {
|
||||
cclog.Abort("Main configuration must be present")
|
||||
}
|
||||
|
||||
@@ -41,7 +41,7 @@ var configSchema = `
|
||||
"type": "string"
|
||||
},
|
||||
"db": {
|
||||
"description": "For sqlite3 a filename, for mysql a DSN in this format: https://github.com/go-sql-driver/mysql#dsn-data-source-name (Without query parameters!).",
|
||||
"description": "Path to SQLite database file (e.g., './var/job.db')",
|
||||
"type": "string"
|
||||
},
|
||||
"disable-archive": {
|
||||
@@ -119,87 +119,22 @@ var configSchema = `
|
||||
}
|
||||
},
|
||||
"required": ["trigger", "resolutions"]
|
||||
},
|
||||
"apiSubjects": {
|
||||
"description": "NATS subjects configuration for subscribing to job and node events.",
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"subjectJobEvent": {
|
||||
"description": "NATS subject for job events (start_job, stop_job)",
|
||||
"type": "string"
|
||||
},
|
||||
"subjectNodeState": {
|
||||
"description": "NATS subject for node state updates",
|
||||
"type": "string"
|
||||
}
|
||||
},
|
||||
"required": ["subjectJobEvent", "subjectNodeState"]
|
||||
}
|
||||
},
|
||||
"required": ["apiAllowedIPs"]
|
||||
}`
|
||||
|
||||
var clustersSchema = `
|
||||
{
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"name": {
|
||||
"description": "The name of the cluster.",
|
||||
"type": "string"
|
||||
},
|
||||
"metricDataRepository": {
|
||||
"description": "Type of the metric data repository for this cluster",
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"kind": {
|
||||
"type": "string",
|
||||
"enum": ["influxdb", "prometheus", "cc-metric-store", "cc-metric-store-internal", "test"]
|
||||
},
|
||||
"url": {
|
||||
"type": "string"
|
||||
},
|
||||
"token": {
|
||||
"type": "string"
|
||||
}
|
||||
},
|
||||
"required": ["kind"]
|
||||
},
|
||||
"filterRanges": {
|
||||
"description": "This option controls the slider ranges for the UI controls of numNodes, duration, and startTime.",
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"numNodes": {
|
||||
"description": "UI slider range for number of nodes",
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"from": {
|
||||
"type": "integer"
|
||||
},
|
||||
"to": {
|
||||
"type": "integer"
|
||||
}
|
||||
},
|
||||
"required": ["from", "to"]
|
||||
},
|
||||
"duration": {
|
||||
"description": "UI slider range for duration",
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"from": {
|
||||
"type": "integer"
|
||||
},
|
||||
"to": {
|
||||
"type": "integer"
|
||||
}
|
||||
},
|
||||
"required": ["from", "to"]
|
||||
},
|
||||
"startTime": {
|
||||
"description": "UI slider range for start time",
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"from": {
|
||||
"type": "string",
|
||||
"format": "date-time"
|
||||
},
|
||||
"to": {
|
||||
"type": "null"
|
||||
}
|
||||
},
|
||||
"required": ["from", "to"]
|
||||
}
|
||||
},
|
||||
"required": ["numNodes", "duration", "startTime"]
|
||||
}
|
||||
},
|
||||
"required": ["name", "metricDataRepository", "filterRanges"],
|
||||
"minItems": 1
|
||||
}
|
||||
}`
|
||||
|
||||
@@ -8,7 +8,7 @@ package config
|
||||
import (
|
||||
"encoding/json"
|
||||
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/santhosh-tekuri/jsonschema/v5"
|
||||
)
|
||||
|
||||
|
||||
File diff suppressed because it is too large
@@ -10,7 +10,7 @@ import (
|
||||
"time"
|
||||
|
||||
"github.com/ClusterCockpit/cc-backend/internal/config"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
)
|
||||
|
||||
type ClusterMetricWithName struct {
|
||||
@@ -82,6 +82,7 @@ type JobFilter struct {
|
||||
State []schema.JobState `json:"state,omitempty"`
|
||||
MetricStats []*MetricStatItem `json:"metricStats,omitempty"`
|
||||
Shared *string `json:"shared,omitempty"`
|
||||
Schedule *string `json:"schedule,omitempty"`
|
||||
Node *StringInput `json:"node,omitempty"`
|
||||
}
|
||||
|
||||
|
||||
@@ -4,7 +4,7 @@ import (
|
||||
"sync"
|
||||
|
||||
"github.com/ClusterCockpit/cc-backend/internal/repository"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/jmoiron/sqlx"
|
||||
)
|
||||
|
||||
|
||||
@@ -3,7 +3,7 @@ package graph
|
||||
// This file will be automatically regenerated based on the schema, any resolver
|
||||
// implementations
|
||||
// will be copied through when generating and any unknown code will be moved to the end.
|
||||
// Code generated by github.com/99designs/gqlgen version v0.17.84
|
||||
// Code generated by github.com/99designs/gqlgen version v0.17.85
|
||||
|
||||
import (
|
||||
"context"
|
||||
@@ -19,11 +19,11 @@ import (
|
||||
"github.com/ClusterCockpit/cc-backend/internal/config"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/graph/generated"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/graph/model"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/metricDataDispatcher"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/metricdispatch"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/repository"
|
||||
"github.com/ClusterCockpit/cc-backend/pkg/archive"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
)
|
||||
|
||||
// Partitions is the resolver for the partitions field.
|
||||
@@ -283,7 +283,7 @@ func (r *mutationResolver) RemoveTagFromList(ctx context.Context, tagIds []strin
|
||||
// Test Access: Admins && Admin Tag OR Everyone && Private Tag
|
||||
if user.HasRole(schema.RoleAdmin) && (tscope == "global" || tscope == "admin") || user.Username == tscope {
|
||||
// Remove from DB
|
||||
if err = r.Repo.RemoveTagById(tid); err != nil {
|
||||
if err = r.Repo.RemoveTagByID(tid); err != nil {
|
||||
cclog.Warn("Error while removing tag")
|
||||
return nil, err
|
||||
} else {
|
||||
@@ -484,7 +484,7 @@ func (r *queryResolver) JobMetrics(ctx context.Context, id string, metrics []str
|
||||
return nil, err
|
||||
}
|
||||
|
||||
data, err := metricDataDispatcher.LoadData(job, metrics, scopes, ctx, *resolution)
|
||||
data, err := metricdispatch.LoadData(job, metrics, scopes, ctx, *resolution)
|
||||
if err != nil {
|
||||
cclog.Warn("Error while loading job data")
|
||||
return nil, err
|
||||
@@ -512,7 +512,7 @@ func (r *queryResolver) JobStats(ctx context.Context, id string, metrics []strin
|
||||
return nil, err
|
||||
}
|
||||
|
||||
data, err := metricDataDispatcher.LoadJobStats(job, metrics, ctx)
|
||||
data, err := metricdispatch.LoadJobStats(job, metrics, ctx)
|
||||
if err != nil {
|
||||
cclog.Warnf("Error while loading jobStats data for job id %s", id)
|
||||
return nil, err
|
||||
@@ -537,7 +537,7 @@ func (r *queryResolver) ScopedJobStats(ctx context.Context, id string, metrics [
|
||||
return nil, err
|
||||
}
|
||||
|
||||
data, err := metricDataDispatcher.LoadScopedJobStats(job, metrics, scopes, ctx)
|
||||
data, err := metricdispatch.LoadScopedJobStats(job, metrics, scopes, ctx)
|
||||
if err != nil {
|
||||
cclog.Warnf("Error while loading scopedJobStats data for job id %s", id)
|
||||
return nil, err
|
||||
@@ -590,21 +590,24 @@ func (r *queryResolver) Jobs(ctx context.Context, filter []*model.JobFilter, pag
|
||||
|
||||
// Note: Even if App-Default 'config.Keys.UiDefaults["job_list_usePaging"]' is set, always return hasNextPage boolean.
|
||||
// Users can decide in frontend to use continuous scroll, even if app-default is paging!
|
||||
// Skip if page.ItemsPerPage == -1 ("Load All" -> No Next Page required, Status Dashboards)
|
||||
/*
|
||||
Example Page 4 @ 10 IpP : Does item 41 exist?
|
||||
Minimal Page 41 @ 1 IpP : If len(result) is 1, Page 5 @ 10 IpP exists.
|
||||
*/
|
||||
nextPage := &model.PageRequest{
|
||||
ItemsPerPage: 1,
|
||||
Page: ((page.Page * page.ItemsPerPage) + 1),
|
||||
hasNextPage := false
|
||||
if page.ItemsPerPage != -1 {
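// e.g. page 4 @ 10 items per page probes item (4*10)+1 = 41 with ItemsPerPage 1; exactly one result means page 5 exists.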
|
||||
nextPage := &model.PageRequest{
|
||||
ItemsPerPage: 1,
|
||||
Page: ((page.Page * page.ItemsPerPage) + 1),
|
||||
}
|
||||
nextJobs, err := r.Repo.QueryJobs(ctx, filter, nextPage, order)
|
||||
if err != nil {
|
||||
cclog.Warn("Error while querying next jobs")
|
||||
return nil, err
|
||||
}
|
||||
hasNextPage = len(nextJobs) == 1
|
||||
}
|
||||
nextJobs, err := r.Repo.QueryJobs(ctx, filter, nextPage, order)
|
||||
if err != nil {
|
||||
cclog.Warn("Error while querying next jobs")
|
||||
return nil, err
|
||||
}
|
||||
|
||||
hasNextPage := len(nextJobs) == 1
|
||||
|
||||
return &model.JobResultList{Items: jobs, Count: &count, HasNextPage: &hasNextPage}, nil
|
||||
}
|
||||
@@ -702,7 +705,7 @@ func (r *queryResolver) JobsMetricStats(ctx context.Context, filter []*model.Job
|
||||
|
||||
res := []*model.JobStats{}
|
||||
for _, job := range jobs {
|
||||
data, err := metricDataDispatcher.LoadJobStats(job, metrics, ctx)
|
||||
data, err := metricdispatch.LoadJobStats(job, metrics, ctx)
|
||||
if err != nil {
|
||||
cclog.Warnf("Error while loading comparison jobStats data for job id %d", job.JobID)
|
||||
continue
|
||||
@@ -753,13 +756,19 @@ func (r *queryResolver) NodeMetrics(ctx context.Context, cluster string, nodes [
|
||||
return nil, errors.New("you need to be administrator or support staff for this query")
|
||||
}
|
||||
|
||||
defaultMetrics := make([]string, 0)
|
||||
for _, mc := range archive.GetCluster(cluster).MetricConfig {
|
||||
defaultMetrics = append(defaultMetrics, mc.Name)
|
||||
}
|
||||
if metrics == nil {
|
||||
for _, mc := range archive.GetCluster(cluster).MetricConfig {
|
||||
metrics = append(metrics, mc.Name)
|
||||
}
|
||||
metrics = defaultMetrics
|
||||
} else {
|
||||
metrics = slices.DeleteFunc(metrics, func(metric string) bool {
|
||||
return !slices.Contains(defaultMetrics, metric) // Remove undefined metrics.
|
||||
})
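// e.g. requested ["flops_any", "bogus"] against configured ["flops_any", "mem_bw"] leaves ["flops_any"].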
|
||||
}
|
||||
|
||||
data, err := metricDataDispatcher.LoadNodeData(cluster, metrics, nodes, scopes, from, to, ctx)
|
||||
data, err := metricdispatch.LoadNodeData(cluster, metrics, nodes, scopes, from, to, ctx)
|
||||
if err != nil {
|
||||
cclog.Warn("error while loading node data")
|
||||
return nil, err
|
||||
@@ -825,7 +834,7 @@ func (r *queryResolver) NodeMetricsList(ctx context.Context, cluster string, sub
|
||||
}
|
||||
}
|
||||
|
||||
data, err := metricDataDispatcher.LoadNodeListData(cluster, subCluster, nodes, metrics, scopes, *resolution, from, to, ctx)
|
||||
data, err := metricdispatch.LoadNodeListData(cluster, subCluster, nodes, metrics, scopes, *resolution, from, to, ctx)
|
||||
if err != nil {
|
||||
cclog.Warn("error while loading node data (Resolver.NodeMetricsList")
|
||||
return nil, err
|
||||
@@ -880,7 +889,7 @@ func (r *queryResolver) ClusterMetrics(ctx context.Context, cluster string, metr
|
||||
|
||||
// 'nodes' == nil -> Defaults to all nodes of cluster for existing query workflow
|
||||
scopes := []schema.MetricScope{"node"}
|
||||
data, err := metricDataDispatcher.LoadNodeData(cluster, metrics, nil, scopes, from, to, ctx)
|
||||
data, err := metricdispatch.LoadNodeData(cluster, metrics, nil, scopes, from, to, ctx)
|
||||
if err != nil {
|
||||
cclog.Warn("error while loading node data")
|
||||
return nil, err
|
||||
@@ -972,12 +981,10 @@ func (r *Resolver) Query() generated.QueryResolver { return &queryResolver{r} }
|
||||
// SubCluster returns generated.SubClusterResolver implementation.
|
||||
func (r *Resolver) SubCluster() generated.SubClusterResolver { return &subClusterResolver{r} }
|
||||
|
||||
type (
|
||||
clusterResolver struct{ *Resolver }
|
||||
jobResolver struct{ *Resolver }
|
||||
metricValueResolver struct{ *Resolver }
|
||||
mutationResolver struct{ *Resolver }
|
||||
nodeResolver struct{ *Resolver }
|
||||
queryResolver struct{ *Resolver }
|
||||
subClusterResolver struct{ *Resolver }
|
||||
)
|
||||
type clusterResolver struct{ *Resolver }
|
||||
type jobResolver struct{ *Resolver }
|
||||
type metricValueResolver struct{ *Resolver }
|
||||
type mutationResolver struct{ *Resolver }
|
||||
type nodeResolver struct{ *Resolver }
|
||||
type queryResolver struct{ *Resolver }
|
||||
type subClusterResolver struct{ *Resolver }
|
||||
|
||||
@@ -13,9 +13,9 @@ import (
|
||||
|
||||
"github.com/99designs/gqlgen/graphql"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/graph/model"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/metricDataDispatcher"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/metricdispatch"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
)
|
||||
|
||||
const MAX_JOBS_FOR_ANALYSIS = 500
|
||||
@@ -55,7 +55,7 @@ func (r *queryResolver) rooflineHeatmap(
|
||||
// resolution = max(resolution, mc.Timestep)
|
||||
// }
|
||||
|
||||
jobdata, err := metricDataDispatcher.LoadData(job, []string{"flops_any", "mem_bw"}, []schema.MetricScope{schema.MetricScopeNode}, ctx, 0)
|
||||
jobdata, err := metricdispatch.LoadData(job, []string{"flops_any", "mem_bw"}, []schema.MetricScope{schema.MetricScopeNode}, ctx, 0)
|
||||
if err != nil {
|
||||
cclog.Errorf("Error while loading roofline metrics for job %d", job.ID)
|
||||
return nil, err
|
||||
@@ -128,7 +128,7 @@ func (r *queryResolver) jobsFootprints(ctx context.Context, filter []*model.JobF
|
||||
continue
|
||||
}
|
||||
|
||||
if err := metricDataDispatcher.LoadAverages(job, metrics, avgs, ctx); err != nil {
|
||||
if err := metricdispatch.LoadAverages(job, metrics, avgs, ctx); err != nil {
|
||||
cclog.Error("Error while loading averages for footprint")
|
||||
return nil, err
|
||||
}
|
||||
|
||||
@@ -2,6 +2,7 @@
|
||||
// All rights reserved. This file is part of cc-backend.
|
||||
// Use of this source code is governed by a MIT-style
|
||||
// license that can be found in the LICENSE file.
|
||||
|
||||
package importer
|
||||
|
||||
import (
|
||||
@@ -14,8 +15,8 @@ import (
|
||||
"github.com/ClusterCockpit/cc-backend/internal/config"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/repository"
|
||||
"github.com/ClusterCockpit/cc-backend/pkg/archive"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
)
|
||||
|
||||
// HandleImportFlag imports jobs from file pairs specified in a comma-separated flag string.
|
||||
|
||||
@@ -16,8 +16,8 @@ import (
|
||||
"github.com/ClusterCockpit/cc-backend/internal/importer"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/repository"
|
||||
"github.com/ClusterCockpit/cc-backend/pkg/archive"
|
||||
ccconf "github.com/ClusterCockpit/cc-lib/ccConfig"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
ccconf "github.com/ClusterCockpit/cc-lib/v2/ccConfig"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
)
|
||||
|
||||
// copyFile copies a file from source path to destination path.
|
||||
@@ -56,36 +56,8 @@ func setup(t *testing.T) *repository.JobRepository {
|
||||
"archive": {
|
||||
"kind": "file",
|
||||
"path": "./var/job-archive"
|
||||
},
|
||||
"clusters": [
|
||||
{
|
||||
"name": "testcluster",
|
||||
"metricDataRepository": {"kind": "test", "url": "bla:8081"},
|
||||
"filterRanges": {
|
||||
"numNodes": { "from": 1, "to": 64 },
|
||||
"duration": { "from": 0, "to": 86400 },
|
||||
"startTime": { "from": "2022-01-01T00:00:00Z", "to": null }
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "fritz",
|
||||
"metricDataRepository": {"kind": "test", "url": "bla:8081"},
|
||||
"filterRanges": {
|
||||
"numNodes": { "from": 1, "to": 944 },
|
||||
"duration": { "from": 0, "to": 86400 },
|
||||
"startTime": { "from": "2022-01-01T00:00:00Z", "to": null }
|
||||
}
|
||||
},
|
||||
{
|
||||
"name": "taurus",
|
||||
"metricDataRepository": {"kind": "test", "url": "bla:8081"},
|
||||
"filterRanges": {
|
||||
"numNodes": { "from": 1, "to": 4000 },
|
||||
"duration": { "from": 0, "to": 604800 },
|
||||
"startTime": { "from": "2010-01-01T00:00:00Z", "to": null }
|
||||
}
|
||||
}
|
||||
]}`
|
||||
}
|
||||
}`
|
||||
|
||||
cclog.Init("info", true)
|
||||
tmpdir := t.TempDir()
|
||||
@@ -107,7 +79,7 @@ func setup(t *testing.T) *repository.JobRepository {
|
||||
}
|
||||
|
||||
dbfilepath := filepath.Join(tmpdir, "test.db")
|
||||
err := repository.MigrateDB("sqlite3", dbfilepath)
|
||||
err := repository.MigrateDB(dbfilepath)
|
||||
if err != nil {
|
||||
t.Fatal(err)
|
||||
}
|
||||
@@ -121,11 +93,7 @@ func setup(t *testing.T) *repository.JobRepository {
|
||||
|
||||
// Load and check main configuration
|
||||
if cfg := ccconf.GetPackageConfig("main"); cfg != nil {
|
||||
if clustercfg := ccconf.GetPackageConfig("clusters"); clustercfg != nil {
|
||||
config.Init(cfg, clustercfg)
|
||||
} else {
|
||||
t.Fatal("Cluster configuration must be present")
|
||||
}
|
||||
config.Init(cfg)
|
||||
} else {
|
||||
t.Fatal("Main configuration must be present")
|
||||
}
|
||||
|
||||
@@ -22,8 +22,8 @@ import (
|
||||
|
||||
"github.com/ClusterCockpit/cc-backend/internal/repository"
|
||||
"github.com/ClusterCockpit/cc-backend/pkg/archive"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/v2/schema"
|
||||
)
|
||||
|
||||
const (
|
||||
|
||||
@@ -2,12 +2,13 @@
|
||||
// All rights reserved. This file is part of cc-backend.
|
||||
// Use of this source code is governed by a MIT-style
|
||||
// license that can be found in the LICENSE file.
|
||||
|
||||
package importer
|
||||
|
||||
import (
|
||||
"math"
|
||||
|
||||
ccunits "github.com/ClusterCockpit/cc-lib/ccUnits"
|
||||
ccunits "github.com/ClusterCockpit/cc-lib/v2/ccUnits"
|
||||
)
|
||||
|
||||
// getNormalizationFactor calculates the scaling factor needed to normalize a value
|
||||
|
||||
@@ -8,7 +8,7 @@ import (
|
||||
"fmt"
|
||||
"testing"
|
||||
|
||||
ccunits "github.com/ClusterCockpit/cc-lib/ccUnits"
|
||||
ccunits "github.com/ClusterCockpit/cc-lib/v2/ccUnits"
|
||||
)
|
||||
|
||||
// TestNormalizeFactor tests the normalization of large byte values to gigabyte prefix.
|
||||
|
||||
@@ -1,95 +0,0 @@
|
||||
// Copyright (C) NHR@FAU, University Erlangen-Nuremberg.
|
||||
// All rights reserved. This file is part of cc-backend.
|
||||
// Use of this source code is governed by a MIT-style
|
||||
// license that can be found in the LICENSE file.
|
||||
|
||||
package memorystore
|
||||
|
||||
const configSchema = `{
|
||||
"type": "object",
|
||||
"description": "Configuration specific to built-in metric-store.",
|
||||
"properties": {
|
||||
"checkpoints": {
|
||||
"description": "Configuration for checkpointing the metrics within metric-store",
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"file-format": {
|
||||
"description": "Specify the type of checkpoint file. There are 2 variants: 'avro' and 'json'. If nothing is specified, 'avro' is default.",
|
||||
"type": "string"
|
||||
},
|
||||
"interval": {
|
||||
"description": "Interval at which the metrics should be checkpointed.",
|
||||
"type": "string"
|
||||
},
|
||||
"directory": {
|
||||
"description": "Specify the parent directy in which the checkpointed files should be placed.",
|
||||
"type": "string"
|
||||
},
|
||||
"restore": {
|
||||
"description": "When cc-backend starts up, look for checkpointed files that are less than X hours old and load metrics from these selected checkpoint files.",
|
||||
"type": "string"
|
||||
}
|
||||
}
|
||||
},
|
||||
"archive": {
|
||||
"description": "Configuration for archiving the already checkpointed files.",
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"interval": {
|
||||
"description": "Interval at which the checkpointed files should be archived.",
|
||||
"type": "string"
|
||||
},
|
||||
"directory": {
|
||||
"description": "Specify the parent directy in which the archived files should be placed.",
|
||||
"type": "string"
|
||||
}
|
||||
}
|
||||
},
|
||||
"retention-in-memory": {
|
||||
"description": "Keep the metrics within memory for given time interval. Retention for X hours, then the metrics would be freed.",
|
||||
"type": "string"
|
||||
},
|
||||
"nats": {
|
||||
"description": "Configuration for accepting published data through NATS.",
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"address": {
|
||||
"description": "Address of the NATS server.",
|
||||
"type": "string"
|
||||
},
|
||||
"username": {
|
||||
"description": "Optional: If configured with username/password method.",
|
||||
"type": "string"
|
||||
},
|
||||
"password": {
|
||||
"description": "Optional: If configured with username/password method.",
|
||||
"type": "string"
|
||||
},
|
||||
"creds-file-path": {
|
||||
"description": "Optional: If configured with Credential File method. Path to your NATS cred file.",
|
||||
"type": "string"
|
||||
},
|
||||
"subscriptions": {
|
||||
"description": "Array of various subscriptions. Allows to subscibe to different subjects and publishers.",
|
||||
"type": "array",
|
||||
"items": {
|
||||
"type": "object",
|
||||
"properties": {
|
||||
"subscribe-to": {
|
||||
"description": "Channel name",
|
||||
"type": "string"
|
||||
},
|
||||
"cluster-tag": {
|
||||
"description": "Optional: Allow lines without a cluster tag, use this as default",
|
||||
"type": "string"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}`
|
||||
@@ -1,381 +0,0 @@
|
||||
// Copyright (C) NHR@FAU, University Erlangen-Nuremberg.
|
||||
// All rights reserved. This file is part of cc-backend.
|
||||
// Use of this source code is governed by a MIT-style
|
||||
// license that can be found in the LICENSE file.
|
||||
package metricDataDispatcher
|
||||
|
||||
import (
|
||||
"context"
|
||||
"fmt"
|
||||
"math"
|
||||
"time"
|
||||
|
||||
"github.com/ClusterCockpit/cc-backend/internal/config"
|
||||
"github.com/ClusterCockpit/cc-backend/internal/metricdata"
|
||||
"github.com/ClusterCockpit/cc-backend/pkg/archive"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/lrucache"
|
||||
"github.com/ClusterCockpit/cc-lib/resampler"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
)
|
||||
|
||||
var cache *lrucache.Cache = lrucache.New(128 * 1024 * 1024)
|
||||
|
||||
func cacheKey(
|
||||
job *schema.Job,
|
||||
metrics []string,
|
||||
scopes []schema.MetricScope,
|
||||
resolution int,
|
||||
) string {
|
||||
// Duration and StartTime do not need to be in the cache key as StartTime is less unique than
|
||||
// job.ID and the TTL of the cache entry makes sure it does not stay there forever.
|
||||
return fmt.Sprintf("%d(%s):[%v],[%v]-%d",
|
||||
job.ID, job.State, metrics, scopes, resolution)
|
||||
}
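// Illustrative key for job ID 1337 in state "running", metrics [flops_any mem_bw], scope [node], resolution 60: "1337(running):[[flops_any mem_bw]],[[node]]-60".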
|
||||
|
||||
// Fetches the metric data for a job.
|
||||
func LoadData(job *schema.Job,
|
||||
metrics []string,
|
||||
scopes []schema.MetricScope,
|
||||
ctx context.Context,
|
||||
resolution int,
|
||||
) (schema.JobData, error) {
|
||||
data := cache.Get(cacheKey(job, metrics, scopes, resolution), func() (_ any, ttl time.Duration, size int) {
|
||||
var jd schema.JobData
|
||||
var err error
|
||||
|
||||
if job.State == schema.JobStateRunning ||
|
||||
job.MonitoringStatus == schema.MonitoringStatusRunningOrArchiving ||
|
||||
config.Keys.DisableArchive {
|
||||
|
||||
repo, err := metricdata.GetMetricDataRepo(job.Cluster)
|
||||
if err != nil {
|
||||
return fmt.Errorf("METRICDATA/METRICDATA > no metric data repository configured for '%s'", job.Cluster), 0, 0
|
||||
}
|
||||
|
||||
if scopes == nil {
|
||||
scopes = append(scopes, schema.MetricScopeNode)
|
||||
}
|
||||
|
||||
if metrics == nil {
|
||||
cluster := archive.GetCluster(job.Cluster)
|
||||
for _, mc := range cluster.MetricConfig {
|
||||
metrics = append(metrics, mc.Name)
|
||||
}
|
||||
}
|
||||
|
||||
jd, err = repo.LoadData(job, metrics, scopes, ctx, resolution)
|
||||
if err != nil {
|
||||
if len(jd) != 0 {
|
||||
cclog.Warnf("partial error: %s", err.Error())
|
||||
// return err, 0, 0 // Reactivating will block archiving on one partial error
|
||||
} else {
|
||||
cclog.Error("Error while loading job data from metric repository")
|
||||
return err, 0, 0
|
||||
}
|
||||
}
|
||||
size = jd.Size()
|
||||
} else {
|
||||
var jd_temp schema.JobData
|
||||
jd_temp, err = archive.GetHandle().LoadJobData(job)
|
||||
if err != nil {
|
||||
cclog.Error("Error while loading job data from archive")
|
||||
return err, 0, 0
|
||||
}
|
||||
|
||||
// Deep copy the cached archive hashmap
|
||||
jd = metricdata.DeepCopy(jd_temp)
|
||||
|
||||
// Resampling for archived data.
|
||||
// Pass the resolution from frontend here.
|
||||
for _, v := range jd {
|
||||
for _, v_ := range v {
|
||||
timestep := int64(0)
|
||||
for i := 0; i < len(v_.Series); i += 1 {
|
||||
v_.Series[i].Data, timestep, err = resampler.LargestTriangleThreeBucket(v_.Series[i].Data, int64(v_.Timestep), int64(resolution))
|
||||
if err != nil {
|
||||
return err, 0, 0
|
||||
}
|
||||
}
|
||||
v_.Timestep = int(timestep)
|
||||
}
|
||||
}
|
||||
|
||||
// Avoid sending unrequested data to the client:
|
||||
if metrics != nil || scopes != nil {
|
||||
if metrics == nil {
|
||||
metrics = make([]string, 0, len(jd))
|
||||
for k := range jd {
|
||||
metrics = append(metrics, k)
|
||||
}
|
||||
}
|
||||
|
||||
res := schema.JobData{}
|
||||
for _, metric := range metrics {
|
||||
if perscope, ok := jd[metric]; ok {
|
||||
if len(perscope) > 1 {
|
||||
subset := make(map[schema.MetricScope]*schema.JobMetric)
|
||||
for _, scope := range scopes {
|
||||
if jm, ok := perscope[scope]; ok {
|
||||
subset[scope] = jm
|
||||
}
|
||||
}
|
||||
|
||||
if len(subset) > 0 {
|
||||
perscope = subset
|
||||
}
|
||||
}
|
||||
|
||||
res[metric] = perscope
|
||||
}
|
||||
}
|
||||
jd = res
|
||||
}
|
||||
size = jd.Size()
|
||||
}
|
||||
|
||||
ttl = 5 * time.Hour
|
||||
if job.State == schema.JobStateRunning {
|
||||
ttl = 2 * time.Minute
|
||||
}
|
||||
|
||||
// FIXME: Review: Is this really necessary or correct.
|
||||
// Note: Lines 147-170 formerly known as prepareJobData(jobData, scopes)
|
||||
// For /monitoring/job/<job> and some other places, flops_any and mem_bw need
|
||||
// to be available at the scope 'node'. If a job has a lot of nodes,
|
||||
// statisticsSeries should be available so that a min/median/max Graph can be
|
||||
// used instead of a lot of single lines.
|
||||
// NOTE: New StatsSeries will always be calculated as 'min/median/max'
|
||||
// Existing (archived) StatsSeries can be 'min/mean/max'!
|
||||
const maxSeriesSize int = 15
|
||||
for _, scopes := range jd {
|
||||
for _, jm := range scopes {
|
||||
if jm.StatisticsSeries != nil || len(jm.Series) <= maxSeriesSize {
|
||||
continue
|
||||
}
|
||||
|
||||
jm.AddStatisticsSeries()
|
||||
}
|
||||
}
|
||||
|
||||
nodeScopeRequested := false
|
||||
for _, scope := range scopes {
|
||||
if scope == schema.MetricScopeNode {
|
||||
nodeScopeRequested = true
|
||||
}
|
||||
}
|
||||
|
||||
if nodeScopeRequested {
|
||||
jd.AddNodeScope("flops_any")
|
||||
jd.AddNodeScope("mem_bw")
|
||||
}
|
||||
|
||||
// Round Resulting Stat Values
|
||||
jd.RoundMetricStats()
|
||||
|
||||
return jd, ttl, size
|
||||
})
|
||||
|
||||
if err, ok := data.(error); ok {
|
||||
cclog.Error("Error in returned dataset")
|
||||
return nil, err
|
||||
}
|
||||
|
||||
return data.(schema.JobData), nil
|
||||
}
|
||||
|
||||
// Used for the jobsFootprint GraphQL-Query. TODO: Rename/Generalize.
|
||||
func LoadAverages(
|
||||
job *schema.Job,
|
||||
metrics []string,
|
||||
data [][]schema.Float,
|
||||
ctx context.Context,
|
||||
) error {
|
||||
if job.State != schema.JobStateRunning && !config.Keys.DisableArchive {
|
||||
return archive.LoadAveragesFromArchive(job, metrics, data) // #166 change also here?
|
||||
}
|
||||
|
||||
repo, err := metricdata.GetMetricDataRepo(job.Cluster)
|
||||
if err != nil {
|
||||
return fmt.Errorf("METRICDATA/METRICDATA > no metric data repository configured for '%s'", job.Cluster)
|
||||
}
|
||||
|
||||
stats, err := repo.LoadStats(job, metrics, ctx) // #166 how to handle stats for acc normalization?
|
||||
if err != nil {
|
||||
cclog.Errorf("Error while loading statistics for job %v (User %v, Project %v)", job.JobID, job.User, job.Project)
|
||||
return err
|
||||
}
|
||||
|
||||
for i, m := range metrics {
|
||||
nodes, ok := stats[m]
|
||||
if !ok {
|
||||
data[i] = append(data[i], schema.NaN)
|
||||
continue
|
||||
}
|
||||
|
||||
sum := 0.0
|
||||
for _, node := range nodes {
|
||||
sum += node.Avg
|
||||
}
|
||||
data[i] = append(data[i], schema.Float(sum))
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
// Used for statsTable in frontend: Return scoped statistics by metric.
|
||||
func LoadScopedJobStats(
|
||||
job *schema.Job,
|
||||
metrics []string,
|
||||
scopes []schema.MetricScope,
|
||||
ctx context.Context,
|
||||
) (schema.ScopedJobStats, error) {
|
||||
if job.State != schema.JobStateRunning && !config.Keys.DisableArchive {
|
||||
return archive.LoadScopedStatsFromArchive(job, metrics, scopes)
|
||||
}
|
||||
|
||||
repo, err := metricdata.GetMetricDataRepo(job.Cluster)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("job %d: no metric data repository configured for '%s'", job.JobID, job.Cluster)
|
||||
}
|
||||
|
||||
scopedStats, err := repo.LoadScopedStats(job, metrics, scopes, ctx)
|
||||
if err != nil {
|
||||
cclog.Errorf("error while loading scoped statistics for job %d (User %s, Project %s)", job.JobID, job.User, job.Project)
|
||||
return nil, err
|
||||
}
|
||||
|
||||
return scopedStats, nil
|
||||
}
|
||||
|
||||
// Used for polar plots in frontend: Aggregates statistics for all nodes to single values for job per metric.
|
||||
func LoadJobStats(
|
||||
job *schema.Job,
|
||||
metrics []string,
|
||||
ctx context.Context,
|
||||
) (map[string]schema.MetricStatistics, error) {
|
||||
if job.State != schema.JobStateRunning && !config.Keys.DisableArchive {
|
||||
return archive.LoadStatsFromArchive(job, metrics)
|
||||
}
|
||||
|
||||
data := make(map[string]schema.MetricStatistics, len(metrics))
|
||||
repo, err := metricdata.GetMetricDataRepo(job.Cluster)
|
||||
if err != nil {
|
||||
return data, fmt.Errorf("job %d: no metric data repository configured for '%s'", job.JobID, job.Cluster)
|
||||
}
|
||||
|
||||
stats, err := repo.LoadStats(job, metrics, ctx)
|
||||
if err != nil {
|
||||
cclog.Errorf("error while loading statistics for job %d (User %s, Project %s)", job.JobID, job.User, job.Project)
|
||||
return data, err
|
||||
}
|
||||
|
||||
for _, m := range metrics {
|
||||
sum, avg, min, max := 0.0, 0.0, 0.0, 0.0
|
||||
nodes, ok := stats[m]
|
||||
if !ok {
|
||||
data[m] = schema.MetricStatistics{Min: min, Avg: avg, Max: max}
|
||||
continue
|
||||
}
|
||||
|
||||
for _, node := range nodes {
|
||||
sum += node.Avg
|
||||
min = math.Min(min, node.Min)
|
||||
max = math.Max(max, node.Max)
|
||||
}
|
||||
|
||||
data[m] = schema.MetricStatistics{
|
||||
Avg: (math.Round((sum/float64(job.NumNodes))*100) / 100),
|
||||
Min: (math.Round(min*100) / 100),
|
||||
Max: (math.Round(max*100) / 100),
|
||||
}
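// math.Round(x*100)/100 rounds to two decimals, e.g. a summed average of 3.579 across 2 nodes is stored as 1.79.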
|
||||
}
|
||||
|
||||
return data, nil
|
||||
}
|
||||
|
||||
// Used for the classic node/system view. Returns a map of nodes to a map of metrics.
|
||||
func LoadNodeData(
|
||||
cluster string,
|
||||
metrics, nodes []string,
|
||||
scopes []schema.MetricScope,
|
||||
from, to time.Time,
|
||||
ctx context.Context,
|
||||
) (map[string]map[string][]*schema.JobMetric, error) {
|
||||
repo, err := metricdata.GetMetricDataRepo(cluster)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("METRICDATA/METRICDATA > no metric data repository configured for '%s'", cluster)
|
||||
}
|
||||
|
||||
if metrics == nil {
|
||||
for _, m := range archive.GetCluster(cluster).MetricConfig {
|
||||
metrics = append(metrics, m.Name)
|
||||
}
|
||||
}
|
||||
|
||||
data, err := repo.LoadNodeData(cluster, metrics, nodes, scopes, from, to, ctx)
|
||||
if err != nil {
|
||||
if len(data) != 0 {
|
||||
cclog.Warnf("partial error: %s", err.Error())
|
||||
} else {
|
||||
cclog.Error("Error while loading node data from metric repository")
|
||||
return nil, err
|
||||
}
|
||||
}
|
||||
|
||||
if data == nil {
|
||||
return nil, fmt.Errorf("METRICDATA/METRICDATA > the metric data repository for '%s' does not support this query", cluster)
|
||||
}
|
||||
|
||||
return data, nil
|
||||
}
|
||||
|
||||
func LoadNodeListData(
|
||||
cluster, subCluster string,
|
||||
nodes []string,
|
||||
metrics []string,
|
||||
scopes []schema.MetricScope,
|
||||
resolution int,
|
||||
from, to time.Time,
|
||||
ctx context.Context,
|
||||
) (map[string]schema.JobData, error) {
|
||||
repo, err := metricdata.GetMetricDataRepo(cluster)
|
||||
if err != nil {
|
||||
return nil, fmt.Errorf("METRICDATA/METRICDATA > no metric data repository configured for '%s'", cluster)
|
||||
}
|
||||
|
||||
if metrics == nil {
|
||||
for _, m := range archive.GetCluster(cluster).MetricConfig {
|
||||
metrics = append(metrics, m.Name)
|
||||
}
|
||||
}
|
||||
|
||||
data, err := repo.LoadNodeListData(cluster, subCluster, nodes, metrics, scopes, resolution, from, to, ctx)
|
||||
if err != nil {
|
||||
if len(data) != 0 {
|
||||
cclog.Warnf("partial error: %s", err.Error())
|
||||
} else {
|
||||
cclog.Error("Error while loading node data from metric repository")
|
||||
return nil, err
|
||||
}
|
||||
}
|
||||
|
||||
// NOTE: New StatsSeries will always be calculated as 'min/median/max'
|
||||
const maxSeriesSize int = 8
|
||||
for _, jd := range data {
|
||||
for _, scopes := range jd {
|
||||
for _, jm := range scopes {
|
||||
if jm.StatisticsSeries != nil || len(jm.Series) < maxSeriesSize {
|
||||
continue
|
||||
}
|
||||
jm.AddStatisticsSeries()
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if data == nil {
|
||||
return nil, fmt.Errorf("METRICDATA/METRICDATA > the metric data repository for '%s' does not support this query", cluster)
|
||||
}
|
||||
|
||||
return data, nil
|
||||
}
|
||||
File diff suppressed because it is too large
@@ -1,88 +0,0 @@
// Copyright (C) NHR@FAU, University Erlangen-Nuremberg.
// All rights reserved. This file is part of cc-backend.
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package metricdata

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/ClusterCockpit/cc-backend/internal/config"
	"github.com/ClusterCockpit/cc-backend/internal/memorystore"
	cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
	"github.com/ClusterCockpit/cc-lib/schema"
)

type MetricDataRepository interface {
	// Initialize this MetricDataRepository. One instance of
	// this interface will only ever be responsible for one cluster.
	Init(rawConfig json.RawMessage) error

	// Return the JobData for the given job, only with the requested metrics.
	LoadData(job *schema.Job, metrics []string, scopes []schema.MetricScope, ctx context.Context, resolution int) (schema.JobData, error)

	// Return a map of metrics to a map of nodes to the metric statistics of the job. node scope only.
	LoadStats(job *schema.Job, metrics []string, ctx context.Context) (map[string]map[string]schema.MetricStatistics, error)

	// Return a map of metrics to a map of scopes to the scoped metric statistics of the job.
	LoadScopedStats(job *schema.Job, metrics []string, scopes []schema.MetricScope, ctx context.Context) (schema.ScopedJobStats, error)

	// Return a map of hosts to a map of metrics at the requested scopes (currently only node) for that node.
	LoadNodeData(cluster string, metrics, nodes []string, scopes []schema.MetricScope, from, to time.Time, ctx context.Context) (map[string]map[string][]*schema.JobMetric, error)

	// Return a map of hosts to a map of metrics to a map of scopes for multiple nodes.
	LoadNodeListData(cluster, subCluster string, nodes, metrics []string, scopes []schema.MetricScope, resolution int, from, to time.Time, ctx context.Context) (map[string]schema.JobData, error)
}

var metricDataRepos map[string]MetricDataRepository = map[string]MetricDataRepository{}

func Init() error {
	for _, cluster := range config.Clusters {
		if cluster.MetricDataRepository != nil {
			var kind struct {
				Kind string `json:"kind"`
			}
			if err := json.Unmarshal(cluster.MetricDataRepository, &kind); err != nil {
				cclog.Warn("Error while unmarshaling raw json MetricDataRepository")
				return err
			}

			var mdr MetricDataRepository
			switch kind.Kind {
			case "cc-metric-store":
				mdr = &CCMetricStore{}
			case "cc-metric-store-internal":
				mdr = &CCMetricStoreInternal{}
				memorystore.InternalCCMSFlag = true
			case "prometheus":
				mdr = &PrometheusDataRepository{}
			case "test":
				mdr = &TestMetricDataRepository{}
			default:
				return fmt.Errorf("METRICDATA/METRICDATA > Unknown MetricDataRepository %v for cluster %v", kind.Kind, cluster.Name)
			}

			if err := mdr.Init(cluster.MetricDataRepository); err != nil {
				cclog.Errorf("Error initializing MetricDataRepository %v for cluster %v", kind.Kind, cluster.Name)
				return err
			}
			metricDataRepos[cluster.Name] = mdr
		}
	}
	return nil
}

func GetMetricDataRepo(cluster string) (MetricDataRepository, error) {
	var err error
	repo, ok := metricDataRepos[cluster]

	if !ok {
		err = fmt.Errorf("METRICDATA/METRICDATA > no metric data repository configured for '%s'", cluster)
	}

	return repo, err
}
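As a hedged aside (not part of the commit), this is roughly how a cluster's metric-data-repository JSON drives the backend selection above; the values are invented, only the "kind" discriminator and the backend names come from the switch statement:

```go
// Hypothetical per-cluster config fragment; "url" follows the Prometheus repository
// config shown further below in this diff, all values are made up.
raw := json.RawMessage(`{"kind": "prometheus", "url": "http://localhost:9090"}`)
var kind struct {
	Kind string `json:"kind"`
}
if err := json.Unmarshal(raw, &kind); err == nil {
	// kind.Kind == "prometheus": Init constructs a PrometheusDataRepository
	// and then hands the full raw message to its Init method.
}
```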
@@ -1,587 +0,0 @@
|
||||
// Copyright (C) 2022 DKRZ
|
||||
// All rights reserved. This file is part of cc-backend.
|
||||
// Use of this source code is governed by a MIT-style
|
||||
// license that can be found in the LICENSE file.
|
||||
package metricdata
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
"context"
|
||||
"encoding/json"
|
||||
"errors"
|
||||
"fmt"
|
||||
"math"
|
||||
"net/http"
|
||||
"os"
|
||||
"regexp"
|
||||
"sort"
|
||||
"strings"
|
||||
"sync"
|
||||
"text/template"
|
||||
"time"
|
||||
|
||||
"github.com/ClusterCockpit/cc-backend/pkg/archive"
|
||||
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
|
||||
"github.com/ClusterCockpit/cc-lib/schema"
|
||||
promapi "github.com/prometheus/client_golang/api"
|
||||
promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
|
||||
promcfg "github.com/prometheus/common/config"
|
||||
promm "github.com/prometheus/common/model"
|
||||
)
|
||||
|
||||
type PrometheusDataRepositoryConfig struct {
|
||||
Url string `json:"url"`
|
||||
Username string `json:"username,omitempty"`
|
||||
Suffix string `json:"suffix,omitempty"`
|
||||
Templates map[string]string `json:"query-templates"`
|
||||
}
|
||||
|
||||
type PrometheusDataRepository struct {
|
||||
client promapi.Client
|
||||
queryClient promv1.API
|
||||
suffix string
|
||||
templates map[string]*template.Template
|
||||
}
|
||||
|
||||
type PromQLArgs struct {
|
||||
Nodes string
|
||||
}
|
||||
|
||||
type Trie map[rune]Trie
|
||||
|
||||
var logOnce sync.Once
|
||||
|
||||
func contains(s []schema.MetricScope, str schema.MetricScope) bool {
|
||||
for _, v := range s {
|
||||
if v == str {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
|
||||
func MinMaxMean(data []schema.Float) (float64, float64, float64) {
|
||||
if len(data) == 0 {
|
||||
return 0.0, 0.0, 0.0
|
||||
}
|
||||
min := math.MaxFloat64
|
||||
max := -math.MaxFloat64
|
||||
var sum float64
|
||||
var n float64
|
||||
for _, val := range data {
|
||||
if val.IsNaN() {
|
||||
continue
|
||||
}
|
||||
sum += float64(val)
|
||||
n += 1
|
||||
if float64(val) > max {
|
||||
max = float64(val)
|
||||
}
|
||||
if float64(val) < min {
|
||||
min = float64(val)
|
||||
}
|
||||
}
|
||||
return min, max, sum / n
|
||||
}
|
||||
|
||||
// Rewritten from
|
||||
// https://github.com/ermanh/trieregex/blob/master/trieregex/trieregex.py
|
||||
func nodeRegex(nodes []string) string {
|
||||
root := Trie{}
|
||||
// add runes of each compute node to trie
|
||||
for _, node := range nodes {
|
||||
_trie := root
|
||||
for _, c := range node {
|
||||
if _, ok := _trie[c]; !ok {
|
||||
_trie[c] = Trie{}
|
||||
}
|
||||
_trie = _trie[c]
|
||||
}
|
||||
_trie['*'] = Trie{}
|
||||
}
|
||||
// recursively build regex from rune trie
|
||||
var trieRegex func(trie Trie, reset bool) string
|
||||
trieRegex = func(trie Trie, reset bool) string {
|
||||
if reset == true {
|
||||
trie = root
|
||||
}
|
||||
if len(trie) == 0 {
|
||||
return ""
|
||||
}
|
||||
if len(trie) == 1 {
|
||||
for key, _trie := range trie {
|
||||
if key == '*' {
|
||||
return ""
|
||||
}
|
||||
return regexp.QuoteMeta(string(key)) + trieRegex(_trie, false)
|
||||
}
|
||||
} else {
|
||||
sequences := []string{}
|
||||
for key, _trie := range trie {
|
||||
if key != '*' {
|
||||
sequences = append(sequences, regexp.QuoteMeta(string(key))+trieRegex(_trie, false))
|
||||
}
|
||||
}
|
||||
sort.Slice(sequences, func(i, j int) bool {
|
||||
return (-len(sequences[i]) < -len(sequences[j])) || (sequences[i] < sequences[j])
|
||||
})
|
||||
var result string
|
||||
// single edge from this tree node
|
||||
if len(sequences) == 1 {
|
||||
result = sequences[0]
|
||||
if len(result) > 1 {
|
||||
result = "(?:" + result + ")"
|
||||
}
|
||||
// multiple edges, each length 1
|
||||
} else if s := strings.Join(sequences, ""); len(s) == len(sequences) {
|
||||
// char or numeric range
|
||||
if len(s)-1 == int(s[len(s)-1])-int(s[0]) {
|
||||
result = fmt.Sprintf("[%c-%c]", s[0], s[len(s)-1])
|
||||
// char or numeric set
|
||||
} else {
|
||||
result = "[" + s + "]"
|
||||
}
|
||||
// multiple edges of different lengths
|
||||
} else {
|
||||
result = "(?:" + strings.Join(sequences, "|") + ")"
|
||||
}
|
||||
if _, ok := trie['*']; ok {
|
||||
result += "?"
|
||||
}
|
||||
return result
|
||||
}
|
||||
return ""
|
||||
}
|
||||
return trieRegex(root, true)
|
||||
}
|
||||
|
||||
func (pdb *PrometheusDataRepository) Init(rawConfig json.RawMessage) error {
|
||||
var config PrometheusDataRepositoryConfig
|
||||
// parse config
|
||||
if err := json.Unmarshal(rawConfig, &config); err != nil {
|
||||
cclog.Warn("Error while unmarshaling raw json config")
|
||||
return err
|
||||
}
|
||||
// support basic authentication
|
||||
var rt http.RoundTripper = nil
|
||||
if prom_pw := os.Getenv("PROMETHEUS_PASSWORD"); prom_pw != "" && config.Username != "" {
|
||||
prom_pw := promcfg.Secret(prom_pw)
|
||||
rt = promcfg.NewBasicAuthRoundTripper(promcfg.NewInlineSecret(config.Username), promcfg.NewInlineSecret(string(prom_pw)), promapi.DefaultRoundTripper)
|
||||
} else {
|
||||
if config.Username != "" {
|
||||
return errors.New("METRICDATA/PROMETHEUS > Prometheus username provided, but PROMETHEUS_PASSWORD not set")
|
||||
}
|
||||
}
|
||||
// init client
|
||||
client, err := promapi.NewClient(promapi.Config{
|
||||
Address: config.Url,
|
||||
RoundTripper: rt,
|
||||
})
|
||||
if err != nil {
|
||||
cclog.Error("Error while initializing new prometheus client")
|
||||
return err
|
||||
}
|
||||
// init query client
|
||||
pdb.client = client
|
||||
pdb.queryClient = promv1.NewAPI(pdb.client)
|
||||
// site config
|
||||
pdb.suffix = config.Suffix
|
||||
// init query templates
|
||||
pdb.templates = make(map[string]*template.Template)
|
||||
for metric, templ := range config.Templates {
|
||||
pdb.templates[metric], err = template.New(metric).Parse(templ)
|
||||
if err == nil {
|
||||
cclog.Debugf("Added PromQL template for %s: %s", metric, templ)
|
||||
} else {
|
||||
cclog.Warnf("Failed to parse PromQL template %s for metric %s", templ, metric)
|
||||
}
|
||||
}
|
||||
return nil
|
||||
}
|
||||
|
||||
// TODO: respect scope argument
|
||||
func (pdb *PrometheusDataRepository) FormatQuery(
|
||||
metric string,
|
||||
scope schema.MetricScope,
|
||||
nodes []string,
|
||||
cluster string,
|
||||
) (string, error) {
|
||||
args := PromQLArgs{}
|
||||
if len(nodes) > 0 {
|
||||
args.Nodes = fmt.Sprintf("(%s)%s", nodeRegex(nodes), pdb.suffix)
|
||||
} else {
|
||||
args.Nodes = fmt.Sprintf(".*%s", pdb.suffix)
|
||||
}
|
||||
|
||||
buf := &bytes.Buffer{}
|
||||
if templ, ok := pdb.templates[metric]; ok {
|
||||
err := templ.Execute(buf, args)
|
||||
if err != nil {
|
||||
return "", errors.New(fmt.Sprintf("METRICDATA/PROMETHEUS > Error compiling template %v", templ))
|
||||
} else {
|
||||
query := buf.String()
|
||||
cclog.Debugf("PromQL: %s", query)
|
||||
return query, nil
|
||||
}
|
||||
} else {
|
||||
return "", errors.New(fmt.Sprintf("METRICDATA/PROMETHEUS > No PromQL for metric %s configured.", metric))
|
||||
}
|
||||
}
|
||||
|
||||
// Convert PromAPI row to CC schema.Series
|
||||
func (pdb *PrometheusDataRepository) RowToSeries(
|
||||
from time.Time,
|
||||
step int64,
|
||||
steps int64,
|
||||
row *promm.SampleStream,
|
||||
) schema.Series {
|
||||
ts := from.Unix()
|
||||
hostname := strings.TrimSuffix(string(row.Metric["exported_instance"]), pdb.suffix)
|
||||
// init array of expected length with NaN
|
||||
values := make([]schema.Float, steps+1)
|
||||
for i := range values {
|
||||
values[i] = schema.NaN
|
||||
}
|
||||
// copy recorded values from prom sample pair
|
||||
for _, v := range row.Values {
|
||||
idx := (v.Timestamp.Unix() - ts) / step
|
||||
values[idx] = schema.Float(v.Value)
|
||||
}
|
||||
min, max, mean := MinMaxMean(values)
|
||||
// output struct
|
||||
return schema.Series{
|
||||
Hostname: hostname,
|
||||
Data: values,
|
||||
Statistics: schema.MetricStatistics{
|
||||
Avg: mean,
|
||||
Min: min,
|
||||
Max: max,
|
||||
},
|
||||
}
|
||||
}
|
||||
|
||||
func (pdb *PrometheusDataRepository) LoadData(
|
||||
job *schema.Job,
|
||||
metrics []string,
|
||||
scopes []schema.MetricScope,
|
||||
ctx context.Context,
|
||||
resolution int,
|
||||
) (schema.JobData, error) {
|
||||
// TODO respect requested scope
|
||||
if len(scopes) == 0 || !contains(scopes, schema.MetricScopeNode) {
|
||||
scopes = append(scopes, schema.MetricScopeNode)
|
||||
}
|
||||
|
||||
jobData := make(schema.JobData)
|
||||
// parse job specs
|
||||
nodes := make([]string, len(job.Resources))
|
||||
for i, resource := range job.Resources {
|
||||
nodes[i] = resource.Hostname
|
||||
}
|
||||
from := time.Unix(job.StartTime, 0)
|
||||
to := time.Unix(job.StartTime+int64(job.Duration), 0)
|
||||
|
||||
for _, scope := range scopes {
|
||||
if scope != schema.MetricScopeNode {
|
||||
logOnce.Do(func() {
|
||||
cclog.Infof("Scope '%s' requested, but not yet supported: Will return 'node' scope only.", scope)
|
||||
})
|
||||
continue
|
||||
}
|
||||
|
||||
for _, metric := range metrics {
|
||||
metricConfig := archive.GetMetricConfig(job.Cluster, metric)
|
||||
if metricConfig == nil {
|
||||
cclog.Warnf("Error in LoadData: Metric %s for cluster %s not configured", metric, job.Cluster)
|
||||
return nil, errors.New("Prometheus config error")
|
||||
}
|
||||
query, err := pdb.FormatQuery(metric, scope, nodes, job.Cluster)
|
||||
if err != nil {
|
||||
cclog.Warn("Error while formatting prometheus query")
|
||||
return nil, err
|
||||
}
|
||||
|
||||
// ranged query over all job nodes
|
||||
r := promv1.Range{
|
||||
Start: from,
|
||||
End: to,
|
||||
Step: time.Duration(metricConfig.Timestep * 1e9),
|
||||
}
|
||||
result, warnings, err := pdb.queryClient.QueryRange(ctx, query, r)
|
||||
if err != nil {
|
||||
cclog.Errorf("Prometheus query error in LoadData: %v\nQuery: %s", err, query)
|
||||
return nil, errors.New("Prometheus query error")
|
||||
}
|
||||
if len(warnings) > 0 {
|
||||
cclog.Warnf("Warnings: %v\n", warnings)
|
||||
}
|
||||
|
||||
// init data structures
|
||||
if _, ok := jobData[metric]; !ok {
|
||||
jobData[metric] = make(map[schema.MetricScope]*schema.JobMetric)
|
||||
}
|
||||
jobMetric, ok := jobData[metric][scope]
|
||||
if !ok {
|
||||
jobMetric = &schema.JobMetric{
|
||||
Unit: metricConfig.Unit,
|
||||
Timestep: metricConfig.Timestep,
|
||||
Series: make([]schema.Series, 0),
|
||||
}
|
||||
}
|
||||
step := int64(metricConfig.Timestep)
|
||||
steps := int64(to.Sub(from).Seconds()) / step
|
||||
// iter rows of host, metric, values
|
||||
for _, row := range result.(promm.Matrix) {
|
||||
jobMetric.Series = append(jobMetric.Series,
|
||||
pdb.RowToSeries(from, step, steps, row))
|
||||
}
|
||||
// only add metric if at least one host returned data
|
||||
if !ok && len(jobMetric.Series) > 0 {
|
||||
jobData[metric][scope] = jobMetric
|
||||
}
|
||||
// sort by hostname to get uniform coloring
|
||||
sort.Slice(jobMetric.Series, func(i, j int) bool {
|
||||
return (jobMetric.Series[i].Hostname < jobMetric.Series[j].Hostname)
|
||||
})
|
||||
}
|
||||
}
|
||||
return jobData, nil
|
||||
}
|
||||
|
||||
// TODO change implementation to precomputed/cached stats
|
||||
func (pdb *PrometheusDataRepository) LoadStats(
|
||||
job *schema.Job,
|
||||
metrics []string,
|
||||
ctx context.Context,
|
||||
) (map[string]map[string]schema.MetricStatistics, error) {
|
||||
// map of metrics of nodes of stats
|
||||
stats := map[string]map[string]schema.MetricStatistics{}
|
||||
|
||||
data, err := pdb.LoadData(job, metrics, []schema.MetricScope{schema.MetricScopeNode}, ctx, 0 /*resolution here*/)
|
||||
if err != nil {
|
||||
cclog.Warn("Error while loading job for stats")
|
||||
return nil, err
|
||||
}
|
||||
for metric, metricData := range data {
|
||||
stats[metric] = make(map[string]schema.MetricStatistics)
|
||||
for _, series := range metricData[schema.MetricScopeNode].Series {
|
||||
stats[metric][series.Hostname] = series.Statistics
|
||||
}
|
||||
}
|
||||
|
||||
return stats, nil
|
||||
}
|
||||
|
||||
func (pdb *PrometheusDataRepository) LoadNodeData(
|
||||
cluster string,
|
||||
metrics, nodes []string,
|
||||
scopes []schema.MetricScope,
|
||||
from, to time.Time,
|
||||
ctx context.Context,
|
||||
) (map[string]map[string][]*schema.JobMetric, error) {
|
||||
t0 := time.Now()
|
||||
// Map of hosts of metrics of value slices
|
||||
data := make(map[string]map[string][]*schema.JobMetric)
|
||||
// query db for each metric
|
||||
// TODO: scopes seems to be always empty
|
||||
if len(scopes) == 0 || !contains(scopes, schema.MetricScopeNode) {
|
||||
scopes = append(scopes, schema.MetricScopeNode)
|
||||
}
|
||||
for _, scope := range scopes {
|
||||
if scope != schema.MetricScopeNode {
|
||||
logOnce.Do(func() {
|
||||
cclog.Infof("Note: Scope '%s' requested, but not yet supported: Will return 'node' scope only.", scope)
|
||||
})
|
||||
continue
|
||||
}
|
||||
for _, metric := range metrics {
|
||||
metricConfig := archive.GetMetricConfig(cluster, metric)
|
||||
if metricConfig == nil {
|
||||
cclog.Warnf("Error in LoadNodeData: Metric %s for cluster %s not configured", metric, cluster)
|
||||
return nil, errors.New("Prometheus config error")
|
||||
}
|
||||
query, err := pdb.FormatQuery(metric, scope, nodes, cluster)
|
||||
if err != nil {
|
||||
cclog.Warn("Error while formatting prometheus query")
|
||||
return nil, err
|
||||
}
|
||||
|
||||
// ranged query over all nodes
|
||||
r := promv1.Range{
|
||||
Start: from,
|
||||
End: to,
|
||||
Step: time.Duration(metricConfig.Timestep * 1e9),
|
||||
}
|
||||
result, warnings, err := pdb.queryClient.QueryRange(ctx, query, r)
|
||||
if err != nil {
|
||||
cclog.Errorf("Prometheus query error in LoadNodeData: %v\n", err)
|
||||
return nil, errors.New("Prometheus query error")
|
||||
}
|
||||
if len(warnings) > 0 {
|
||||
cclog.Warnf("Warnings: %v\n", warnings)
|
||||
}
|
||||
|
||||
step := int64(metricConfig.Timestep)
|
||||
steps := int64(to.Sub(from).Seconds()) / step
|
||||
|
||||
// iter rows of host, metric, values
|
||||
for _, row := range result.(promm.Matrix) {
|
||||
hostname := strings.TrimSuffix(string(row.Metric["exported_instance"]), pdb.suffix)
|
||||
hostdata, ok := data[hostname]
|
||||
if !ok {
|
||||
hostdata = make(map[string][]*schema.JobMetric)
|
||||
data[hostname] = hostdata
|
||||
}
|
||||
// output per host and metric
|
||||
hostdata[metric] = append(hostdata[metric], &schema.JobMetric{
|
||||
Unit: metricConfig.Unit,
|
||||
Timestep: metricConfig.Timestep,
|
||||
Series: []schema.Series{pdb.RowToSeries(from, step, steps, row)},
|
||||
},
|
||||
)
|
||||
}
|
||||
}
|
||||
}
|
||||
t1 := time.Since(t0)
|
||||
cclog.Debugf("LoadNodeData of %v nodes took %s", len(data), t1)
|
||||
return data, nil
|
||||
}
|
||||
|
||||
// Implemented by NHR@FAU; Used in Job-View StatsTable
|
||||
func (pdb *PrometheusDataRepository) LoadScopedStats(
|
||||
job *schema.Job,
|
||||
metrics []string,
|
||||
scopes []schema.MetricScope,
|
||||
ctx context.Context,
|
||||
) (schema.ScopedJobStats, error) {
|
||||
// Assumption: pdb.loadData() only returns series node-scope - use node scope for statsTable
|
||||
scopedJobStats := make(schema.ScopedJobStats)
|
||||
data, err := pdb.LoadData(job, metrics, []schema.MetricScope{schema.MetricScopeNode}, ctx, 0 /*resolution here*/)
|
||||
if err != nil {
|
||||
cclog.Warn("Error while loading job for scopedJobStats")
|
||||
return nil, err
|
||||
}
|
||||
|
||||
for metric, metricData := range data {
|
||||
for _, scope := range scopes {
|
||||
if scope != schema.MetricScopeNode {
|
||||
logOnce.Do(func() {
|
||||
cclog.Infof("Note: Scope '%s' requested, but not yet supported: Will return 'node' scope only.", scope)
|
||||
})
|
||||
continue
|
||||
}
|
||||
|
||||
if _, ok := scopedJobStats[metric]; !ok {
|
||||
scopedJobStats[metric] = make(map[schema.MetricScope][]*schema.ScopedStats)
|
||||
}
|
||||
|
||||
if _, ok := scopedJobStats[metric][scope]; !ok {
|
||||
scopedJobStats[metric][scope] = make([]*schema.ScopedStats, 0)
|
||||
}
|
||||
|
||||
for _, series := range metricData[scope].Series {
|
||||
scopedJobStats[metric][scope] = append(scopedJobStats[metric][scope], &schema.ScopedStats{
|
||||
Hostname: series.Hostname,
|
||||
Data: &series.Statistics,
|
||||
})
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return scopedJobStats, nil
|
||||
}
|
||||
|
||||
// Implemented by NHR@FAU; Used in NodeList-View
|
||||
func (pdb *PrometheusDataRepository) LoadNodeListData(
|
||||
cluster, subCluster string,
|
||||
nodes []string,
|
||||
metrics []string,
|
||||
scopes []schema.MetricScope,
|
||||
resolution int,
|
||||
from, to time.Time,
|
||||
ctx context.Context,
|
||||
) (map[string]schema.JobData, error) {
|
||||
// Assumption: pdb.loadData() only returns series node-scope - use node scope for NodeList
|
||||
|
||||
// Fetch Data, based on pdb.LoadNodeData()
|
||||
t0 := time.Now()
|
||||
// Map of hosts of jobData
|
||||
data := make(map[string]schema.JobData)
|
||||
|
||||
// query db for each metric
|
||||
// TODO: scopes seems to be always empty
|
||||
if len(scopes) == 0 || !contains(scopes, schema.MetricScopeNode) {
|
||||
scopes = append(scopes, schema.MetricScopeNode)
|
||||
}
|
||||
|
||||
for _, scope := range scopes {
|
||||
if scope != schema.MetricScopeNode {
|
||||
logOnce.Do(func() {
|
||||
cclog.Infof("Note: Scope '%s' requested, but not yet supported: Will return 'node' scope only.", scope)
|
||||
})
|
||||
continue
|
||||
}
|
||||
|
||||
for _, metric := range metrics {
|
||||
metricConfig := archive.GetMetricConfig(cluster, metric)
|
||||
if metricConfig == nil {
|
||||
cclog.Warnf("Error in LoadNodeListData: Metric %s for cluster %s not configured", metric, cluster)
|
||||
return nil, errors.New("Prometheus config error")
|
||||
}
|
||||
query, err := pdb.FormatQuery(metric, scope, nodes, cluster)
|
||||
if err != nil {
|
||||
cclog.Warn("Error while formatting prometheus query")
|
||||
return nil, err
|
||||
}
|
||||
|
||||
// ranged query over all nodes
|
||||
r := promv1.Range{
|
||||
Start: from,
|
||||
End: to,
|
||||
Step: time.Duration(metricConfig.Timestep * 1e9),
|
||||
}
|
||||
result, warnings, err := pdb.queryClient.QueryRange(ctx, query, r)
|
||||
if err != nil {
|
||||
cclog.Errorf("Prometheus query error in LoadNodeData: %v\n", err)
|
||||
return nil, errors.New("Prometheus query error")
|
||||
}
|
||||
if len(warnings) > 0 {
|
||||
cclog.Warnf("Warnings: %v\n", warnings)
|
||||
}
|
||||
|
||||
step := int64(metricConfig.Timestep)
|
||||
steps := int64(to.Sub(from).Seconds()) / step
|
||||
|
||||
// iter rows of host, metric, values
|
||||
for _, row := range result.(promm.Matrix) {
|
||||
hostname := strings.TrimSuffix(string(row.Metric["exported_instance"]), pdb.suffix)
|
||||
|
||||
hostdata, ok := data[hostname]
|
||||
if !ok {
|
||||
hostdata = make(schema.JobData)
|
||||
data[hostname] = hostdata
|
||||
}
|
||||
|
||||
metricdata, ok := hostdata[metric]
|
||||
if !ok {
|
||||
metricdata = make(map[schema.MetricScope]*schema.JobMetric)
|
||||
data[hostname][metric] = metricdata
|
||||
}
|
||||
|
||||
// output per host, metric and scope
|
||||
scopeData, ok := metricdata[scope]
|
||||
if !ok {
|
||||
scopeData = &schema.JobMetric{
|
||||
Unit: metricConfig.Unit,
|
||||
Timestep: metricConfig.Timestep,
|
||||
Series: []schema.Series{pdb.RowToSeries(from, step, steps, row)},
|
||||
}
|
||||
data[hostname][metric][scope] = scopeData
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
t1 := time.Since(t0)
|
||||
cclog.Debugf("LoadNodeListData of %v nodes took %s", len(data), t1)
|
||||
return data, nil
|
||||
}
|
||||
@@ -1,118 +0,0 @@
// Copyright (C) NHR@FAU, University Erlangen-Nuremberg.
// All rights reserved. This file is part of cc-backend.
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package metricdata

import (
	"context"
	"encoding/json"
	"time"

	"github.com/ClusterCockpit/cc-lib/schema"
)

var TestLoadDataCallback func(job *schema.Job, metrics []string, scopes []schema.MetricScope, ctx context.Context, resolution int) (schema.JobData, error) = func(job *schema.Job, metrics []string, scopes []schema.MetricScope, ctx context.Context, resolution int) (schema.JobData, error) {
	panic("TODO")
}

// TestMetricDataRepository is only a mock for unit-testing.
type TestMetricDataRepository struct{}

func (tmdr *TestMetricDataRepository) Init(_ json.RawMessage) error {
	return nil
}

func (tmdr *TestMetricDataRepository) LoadData(
	job *schema.Job,
	metrics []string,
	scopes []schema.MetricScope,
	ctx context.Context,
	resolution int,
) (schema.JobData, error) {
	return TestLoadDataCallback(job, metrics, scopes, ctx, resolution)
}

func (tmdr *TestMetricDataRepository) LoadStats(
	job *schema.Job,
	metrics []string,
	ctx context.Context,
) (map[string]map[string]schema.MetricStatistics, error) {
	panic("TODO")
}

func (tmdr *TestMetricDataRepository) LoadScopedStats(
	job *schema.Job,
	metrics []string,
	scopes []schema.MetricScope,
	ctx context.Context,
) (schema.ScopedJobStats, error) {
	panic("TODO")
}

func (tmdr *TestMetricDataRepository) LoadNodeData(
	cluster string,
	metrics, nodes []string,
	scopes []schema.MetricScope,
	from, to time.Time,
	ctx context.Context,
) (map[string]map[string][]*schema.JobMetric, error) {
	panic("TODO")
}

func (tmdr *TestMetricDataRepository) LoadNodeListData(
	cluster, subCluster string,
	nodes []string,
	metrics []string,
	scopes []schema.MetricScope,
	resolution int,
	from, to time.Time,
	ctx context.Context,
) (map[string]schema.JobData, error) {
	panic("TODO")
}

func DeepCopy(jdTemp schema.JobData) schema.JobData {
	jd := make(schema.JobData, len(jdTemp))
	for k, v := range jdTemp {
		jd[k] = make(map[schema.MetricScope]*schema.JobMetric, len(jdTemp[k]))
		for k_, v_ := range v {
			jd[k][k_] = new(schema.JobMetric)
			jd[k][k_].Series = make([]schema.Series, len(v_.Series))
			for i := 0; i < len(v_.Series); i += 1 {
				jd[k][k_].Series[i].Data = make([]schema.Float, len(v_.Series[i].Data))
				copy(jd[k][k_].Series[i].Data, v_.Series[i].Data)
				jd[k][k_].Series[i].Hostname = v_.Series[i].Hostname
				jd[k][k_].Series[i].Id = v_.Series[i].Id
				jd[k][k_].Series[i].Statistics.Avg = v_.Series[i].Statistics.Avg
				jd[k][k_].Series[i].Statistics.Min = v_.Series[i].Statistics.Min
				jd[k][k_].Series[i].Statistics.Max = v_.Series[i].Statistics.Max
			}
			jd[k][k_].Timestep = v_.Timestep
			jd[k][k_].Unit.Base = v_.Unit.Base
			jd[k][k_].Unit.Prefix = v_.Unit.Prefix
			if v_.StatisticsSeries != nil {
				// Init Slices
				jd[k][k_].StatisticsSeries = new(schema.StatsSeries)
				jd[k][k_].StatisticsSeries.Max = make([]schema.Float, len(v_.StatisticsSeries.Max))
				jd[k][k_].StatisticsSeries.Min = make([]schema.Float, len(v_.StatisticsSeries.Min))
				jd[k][k_].StatisticsSeries.Median = make([]schema.Float, len(v_.StatisticsSeries.Median))
				jd[k][k_].StatisticsSeries.Mean = make([]schema.Float, len(v_.StatisticsSeries.Mean))
				// Copy Data
				copy(jd[k][k_].StatisticsSeries.Max, v_.StatisticsSeries.Max)
				copy(jd[k][k_].StatisticsSeries.Min, v_.StatisticsSeries.Min)
				copy(jd[k][k_].StatisticsSeries.Median, v_.StatisticsSeries.Median)
				copy(jd[k][k_].StatisticsSeries.Mean, v_.StatisticsSeries.Mean)
				// Handle Percentiles
				for k__, v__ := range v_.StatisticsSeries.Percentiles {
					jd[k][k_].StatisticsSeries.Percentiles[k__] = make([]schema.Float, len(v__))
					copy(jd[k][k_].StatisticsSeries.Percentiles[k__], v__)
				}
			} else {
				jd[k][k_].StatisticsSeries = v_.StatisticsSeries
			}
		}
	}
	return jd
}
internal/metricdispatch/dataLoader.go (new file, 490 lines)
@@ -0,0 +1,490 @@
// Copyright (C) NHR@FAU, University Erlangen-Nuremberg.
// All rights reserved. This file is part of cc-backend.
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

// Package metricdispatch provides a unified interface for loading and caching job metric data.
//
// This package serves as a central dispatcher that routes metric data requests to the appropriate
// backend based on job state. For running jobs, data is fetched from the metric store (e.g., cc-metric-store).
// For completed jobs, data is retrieved from the file-based job archive.
//
// # Key Features
//
// - Automatic backend selection based on job state (running vs. archived)
// - LRU cache for performance optimization (128 MB default cache size)
// - Data resampling using Largest Triangle Three Bucket algorithm for archived data
// - Automatic statistics series generation for jobs with many nodes
// - Support for scoped metrics (node, socket, accelerator, core)
//
// # Cache Behavior
//
// Cached data has different TTL (time-to-live) values depending on job state:
// - Running jobs: 2 minutes (data changes frequently)
// - Completed jobs: 5 hours (data is static)
//
// The cache key is based on job ID, state, requested metrics, scopes, and resolution.
//
// # Usage
//
// The primary entry point is LoadData, which automatically handles both running and archived jobs:
//
//	jobData, err := metricdispatch.LoadData(job, metrics, scopes, ctx, resolution)
//	if err != nil {
//		// Handle error
//	}
//
// For statistics only, use LoadJobStats, LoadScopedJobStats, or LoadAverages depending on the required format.
package metricdispatch

import (
	"context"
	"fmt"
	"math"
	"time"

	"github.com/ClusterCockpit/cc-backend/internal/config"
	"github.com/ClusterCockpit/cc-backend/internal/metricstore"
	"github.com/ClusterCockpit/cc-backend/pkg/archive"
	cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
	"github.com/ClusterCockpit/cc-lib/v2/lrucache"
	"github.com/ClusterCockpit/cc-lib/v2/resampler"
	"github.com/ClusterCockpit/cc-lib/v2/schema"
)

// cache is an LRU cache with 128 MB capacity for storing loaded job metric data.
// The cache reduces load on both the metric store and archive backends.
var cache *lrucache.Cache = lrucache.New(128 * 1024 * 1024)

// cacheKey generates a unique cache key for a job's metric data based on job ID, state,
// requested metrics, scopes, and resolution. Duration and StartTime are intentionally excluded
// because job.ID is more unique and the cache TTL ensures entries don't persist indefinitely.
func cacheKey(
	job *schema.Job,
	metrics []string,
	scopes []schema.MetricScope,
	resolution int,
) string {
	return fmt.Sprintf("%d(%s):[%v],[%v]-%d",
		job.ID, job.State, metrics, scopes, resolution)
}
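For orientation, a sketch of the key this produces for a concrete request; the job ID, metric list, and resolution are invented for the example:

```go
// Hypothetical values, not taken from the repository.
key := cacheKey(job, []string{"cpu_load", "flops_any"}, []schema.MetricScope{schema.MetricScopeNode}, 600)
// With job.ID == 42 and a running job this yields something like
// "42(running):[[cpu_load flops_any]],[[node]]-600",
// so any change in state, metric selection, scope, or resolution addresses a different cache entry.
```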

// LoadData retrieves metric data for a job from the appropriate backend (memory store for running jobs,
// archive for completed jobs) and applies caching, resampling, and statistics generation as needed.
//
// For running jobs or when archive is disabled, data is fetched from the metric store.
// For completed archived jobs, data is loaded from the job archive and resampled if needed.
//
// Parameters:
// - job: The job for which to load metric data
// - metrics: List of metric names to load (nil loads all metrics for the cluster)
// - scopes: Metric scopes to include (nil defaults to node scope)
// - ctx: Context for cancellation and timeouts
// - resolution: Target number of data points for resampling (only applies to archived data)
//
// Returns the loaded job data and any error encountered. For partial errors (some metrics failed),
// the function returns the successfully loaded data with a warning logged.
func LoadData(job *schema.Job,
	metrics []string,
	scopes []schema.MetricScope,
	ctx context.Context,
	resolution int,
) (schema.JobData, error) {
	data := cache.Get(cacheKey(job, metrics, scopes, resolution), func() (_ any, ttl time.Duration, size int) {
		var jd schema.JobData
		var err error

		if job.State == schema.JobStateRunning ||
			job.MonitoringStatus == schema.MonitoringStatusRunningOrArchiving ||
			config.Keys.DisableArchive {

			if scopes == nil {
				scopes = append(scopes, schema.MetricScopeNode)
			}

			if metrics == nil {
				cluster := archive.GetCluster(job.Cluster)
				for _, mc := range cluster.MetricConfig {
					metrics = append(metrics, mc.Name)
				}
			}

			jd, err = metricstore.LoadData(job, metrics, scopes, ctx, resolution)
			if err != nil {
				if len(jd) != 0 {
					cclog.Warnf("partial error loading metrics from store for job %d (user: %s, project: %s): %s",
						job.JobID, job.User, job.Project, err.Error())
				} else {
					cclog.Errorf("failed to load job data from metric store for job %d (user: %s, project: %s): %s",
						job.JobID, job.User, job.Project, err.Error())
					return err, 0, 0
				}
			}
			size = jd.Size()
		} else {
			var jdTemp schema.JobData
			jdTemp, err = archive.GetHandle().LoadJobData(job)
			if err != nil {
				cclog.Errorf("failed to load job data from archive for job %d (user: %s, project: %s): %s",
					job.JobID, job.User, job.Project, err.Error())
				return err, 0, 0
			}

			jd = deepCopy(jdTemp)

			// Resample archived data using Largest Triangle Three Bucket algorithm to reduce data points
			// to the requested resolution, improving transfer performance and client-side rendering.
			for _, v := range jd {
				for _, v_ := range v {
					timestep := int64(0)
					for i := 0; i < len(v_.Series); i += 1 {
						v_.Series[i].Data, timestep, err = resampler.LargestTriangleThreeBucket(v_.Series[i].Data, int64(v_.Timestep), int64(resolution))
						if err != nil {
							return err, 0, 0
						}
					}
					v_.Timestep = int(timestep)
				}
			}

			// Filter job data to only include requested metrics and scopes, avoiding unnecessary data transfer.
			if metrics != nil || scopes != nil {
				if metrics == nil {
					metrics = make([]string, 0, len(jd))
					for k := range jd {
						metrics = append(metrics, k)
					}
				}

				res := schema.JobData{}
				for _, metric := range metrics {
					if perscope, ok := jd[metric]; ok {
						if len(perscope) > 1 {
							subset := make(map[schema.MetricScope]*schema.JobMetric)
							for _, scope := range scopes {
								if jm, ok := perscope[scope]; ok {
									subset[scope] = jm
								}
							}

							if len(subset) > 0 {
								perscope = subset
							}
						}

						res[metric] = perscope
					}
				}
				jd = res
			}
			size = jd.Size()
		}

		ttl = 5 * time.Hour
		if job.State == schema.JobStateRunning {
			ttl = 2 * time.Minute
		}

		// Generate statistics series for jobs with many nodes to enable min/median/max graphs
		// instead of overwhelming the UI with individual node lines. Note that newly calculated
		// statistics use min/median/max, while archived statistics may use min/mean/max.
		const maxSeriesSize int = 15
		for _, scopes := range jd {
			for _, jm := range scopes {
				if jm.StatisticsSeries != nil || len(jm.Series) <= maxSeriesSize {
					continue
				}

				jm.AddStatisticsSeries()
			}
		}

		nodeScopeRequested := false
		for _, scope := range scopes {
			if scope == schema.MetricScopeNode {
				nodeScopeRequested = true
			}
		}

		if nodeScopeRequested {
			jd.AddNodeScope("flops_any")
			jd.AddNodeScope("mem_bw")
		}

		// Round Resulting Stat Values
		jd.RoundMetricStats()

		return jd, ttl, size
	})

	if err, ok := data.(error); ok {
		cclog.Errorf("error in cached dataset for job %d: %s", job.JobID, err.Error())
		return nil, err
	}

	return data.(schema.JobData), nil
}

// LoadAverages computes average values for the specified metrics across all nodes of a job.
// For running jobs, it loads statistics from the metric store. For completed jobs, it uses
// the pre-calculated averages from the job archive. The results are appended to the data slice.
func LoadAverages(
	job *schema.Job,
	metrics []string,
	data [][]schema.Float,
	ctx context.Context,
) error {
	if job.State != schema.JobStateRunning && !config.Keys.DisableArchive {
		return archive.LoadAveragesFromArchive(job, metrics, data) // #166 change also here?
	}

	stats, err := metricstore.LoadStats(job, metrics, ctx)
	if err != nil {
		cclog.Errorf("failed to load statistics from metric store for job %d (user: %s, project: %s): %s",
			job.JobID, job.User, job.Project, err.Error())
		return err
	}

	for i, m := range metrics {
		nodes, ok := stats[m]
		if !ok {
			data[i] = append(data[i], schema.NaN)
			continue
		}

		sum := 0.0
		for _, node := range nodes {
			sum += node.Avg
		}
		data[i] = append(data[i], schema.Float(sum))
	}

	return nil
}

// LoadScopedJobStats retrieves job statistics organized by metric scope (node, socket, core, accelerator).
// For running jobs, statistics are computed from the metric store. For completed jobs, pre-calculated
// statistics are loaded from the job archive.
func LoadScopedJobStats(
	job *schema.Job,
	metrics []string,
	scopes []schema.MetricScope,
	ctx context.Context,
) (schema.ScopedJobStats, error) {
	if job.State != schema.JobStateRunning && !config.Keys.DisableArchive {
		return archive.LoadScopedStatsFromArchive(job, metrics, scopes)
	}

	scopedStats, err := metricstore.LoadScopedStats(job, metrics, scopes, ctx)
	if err != nil {
		cclog.Errorf("failed to load scoped statistics from metric store for job %d (user: %s, project: %s): %s",
			job.JobID, job.User, job.Project, err.Error())
		return nil, err
	}

	return scopedStats, nil
}

// LoadJobStats retrieves aggregated statistics (min/avg/max) for each requested metric across all job nodes.
// For running jobs, statistics are computed from the metric store. For completed jobs, pre-calculated
// statistics are loaded from the job archive.
func LoadJobStats(
	job *schema.Job,
	metrics []string,
	ctx context.Context,
) (map[string]schema.MetricStatistics, error) {
	if job.State != schema.JobStateRunning && !config.Keys.DisableArchive {
		return archive.LoadStatsFromArchive(job, metrics)
	}

	data := make(map[string]schema.MetricStatistics, len(metrics))

	stats, err := metricstore.LoadStats(job, metrics, ctx)
	if err != nil {
		cclog.Errorf("failed to load statistics from metric store for job %d (user: %s, project: %s): %s",
			job.JobID, job.User, job.Project, err.Error())
		return data, err
	}

	for _, m := range metrics {
		sum, avg, min, max := 0.0, 0.0, 0.0, 0.0
		nodes, ok := stats[m]
		if !ok {
			data[m] = schema.MetricStatistics{Min: min, Avg: avg, Max: max}
			continue
		}

		for _, node := range nodes {
			sum += node.Avg
			min = math.Min(min, node.Min)
			max = math.Max(max, node.Max)
		}

		data[m] = schema.MetricStatistics{
			Avg: (math.Round((sum/float64(job.NumNodes))*100) / 100),
			Min: (math.Round(min*100) / 100),
			Max: (math.Round(max*100) / 100),
		}
	}

	return data, nil
}

// LoadNodeData retrieves metric data for specific nodes in a cluster within a time range.
// This is used for node monitoring views and system status pages. Data is always fetched from
// the metric store (not the archive) since it's for current/recent node status monitoring.
//
// Returns a nested map structure: node -> metric -> scoped data.
func LoadNodeData(
	cluster string,
	metrics, nodes []string,
	scopes []schema.MetricScope,
	from, to time.Time,
	ctx context.Context,
) (map[string]map[string][]*schema.JobMetric, error) {
	if metrics == nil {
		for _, m := range archive.GetCluster(cluster).MetricConfig {
			metrics = append(metrics, m.Name)
		}
	}

	data, err := metricstore.LoadNodeData(cluster, metrics, nodes, scopes, from, to, ctx)
	if err != nil {
		if len(data) != 0 {
			cclog.Warnf("partial error loading node data from metric store for cluster %s: %s", cluster, err.Error())
		} else {
			cclog.Errorf("failed to load node data from metric store for cluster %s: %s", cluster, err.Error())
			return nil, err
		}
	}

	if data == nil {
		return nil, fmt.Errorf("metric store for cluster '%s' does not support node data queries", cluster)
	}

	return data, nil
}

// LoadNodeListData retrieves time-series metric data for multiple nodes within a time range,
// with optional resampling and automatic statistics generation for large datasets.
// This is used for comparing multiple nodes or displaying node status over time.
//
// Returns a map of node names to their job-like metric data structures.
func LoadNodeListData(
	cluster, subCluster string,
	nodes []string,
	metrics []string,
	scopes []schema.MetricScope,
	resolution int,
	from, to time.Time,
	ctx context.Context,
) (map[string]schema.JobData, error) {
	if metrics == nil {
		for _, m := range archive.GetCluster(cluster).MetricConfig {
			metrics = append(metrics, m.Name)
		}
	}

	data, err := metricstore.LoadNodeListData(cluster, subCluster, nodes, metrics, scopes, resolution, from, to, ctx)
	if err != nil {
		if len(data) != 0 {
			cclog.Warnf("partial error loading node list data from metric store for cluster %s, subcluster %s: %s",
				cluster, subCluster, err.Error())
		} else {
			cclog.Errorf("failed to load node list data from metric store for cluster %s, subcluster %s: %s",
				cluster, subCluster, err.Error())
			return nil, err
		}
	}

	// Generate statistics series for datasets with many series to improve visualization performance.
	// Statistics are calculated as min/median/max.
	const maxSeriesSize int = 8
	for _, jd := range data {
		for _, scopes := range jd {
			for _, jm := range scopes {
				if jm.StatisticsSeries != nil || len(jm.Series) < maxSeriesSize {
					continue
				}
				jm.AddStatisticsSeries()
			}
		}
	}

	if data == nil {
		return nil, fmt.Errorf("metric store for cluster '%s' does not support node list queries", cluster)
	}

	return data, nil
}

// deepCopy creates a deep copy of JobData to prevent cache corruption when modifying
// archived data (e.g., during resampling). This ensures the cached archive data remains
// immutable while allowing per-request transformations.
func deepCopy(source schema.JobData) schema.JobData {
	result := make(schema.JobData, len(source))

	for metricName, scopeMap := range source {
		result[metricName] = make(map[schema.MetricScope]*schema.JobMetric, len(scopeMap))

		for scope, jobMetric := range scopeMap {
			result[metricName][scope] = copyJobMetric(jobMetric)
		}
	}

	return result
}

func copyJobMetric(src *schema.JobMetric) *schema.JobMetric {
	dst := &schema.JobMetric{
		Timestep: src.Timestep,
		Unit:     src.Unit,
		Series:   make([]schema.Series, len(src.Series)),
	}

	for i := range src.Series {
		dst.Series[i] = copySeries(&src.Series[i])
	}

	if src.StatisticsSeries != nil {
		dst.StatisticsSeries = copyStatisticsSeries(src.StatisticsSeries)
	}

	return dst
}

func copySeries(src *schema.Series) schema.Series {
	dst := schema.Series{
		Hostname:   src.Hostname,
		Id:         src.Id,
		Statistics: src.Statistics,
		Data:       make([]schema.Float, len(src.Data)),
	}

	copy(dst.Data, src.Data)
	return dst
}

func copyStatisticsSeries(src *schema.StatsSeries) *schema.StatsSeries {
	dst := &schema.StatsSeries{
		Min:    make([]schema.Float, len(src.Min)),
		Mean:   make([]schema.Float, len(src.Mean)),
		Median: make([]schema.Float, len(src.Median)),
		Max:    make([]schema.Float, len(src.Max)),
	}

	copy(dst.Min, src.Min)
	copy(dst.Mean, src.Mean)
	copy(dst.Median, src.Median)
	copy(dst.Max, src.Max)

	if len(src.Percentiles) > 0 {
		dst.Percentiles = make(map[int][]schema.Float, len(src.Percentiles))
		for percentile, values := range src.Percentiles {
			dst.Percentiles[percentile] = make([]schema.Float, len(values))
			copy(dst.Percentiles[percentile], values)
		}
	}

	return dst
}
internal/metricdispatch/dataLoader_test.go (new file, 125 lines)
@@ -0,0 +1,125 @@
// Copyright (C) NHR@FAU, University Erlangen-Nuremberg.
// All rights reserved. This file is part of cc-backend.
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package metricdispatch

import (
	"testing"

	"github.com/ClusterCockpit/cc-lib/v2/schema"
)

func TestDeepCopy(t *testing.T) {
	nodeId := "0"
	original := schema.JobData{
		"cpu_load": {
			schema.MetricScopeNode: &schema.JobMetric{
				Timestep: 60,
				Unit:     schema.Unit{Base: "load", Prefix: ""},
				Series: []schema.Series{
					{
						Hostname: "node001",
						Id:       &nodeId,
						Data:     []schema.Float{1.0, 2.0, 3.0},
						Statistics: schema.MetricStatistics{
							Min: 1.0,
							Avg: 2.0,
							Max: 3.0,
						},
					},
				},
				StatisticsSeries: &schema.StatsSeries{
					Min:    []schema.Float{1.0, 1.5, 2.0},
					Mean:   []schema.Float{2.0, 2.5, 3.0},
					Median: []schema.Float{2.0, 2.5, 3.0},
					Max:    []schema.Float{3.0, 3.5, 4.0},
					Percentiles: map[int][]schema.Float{
						25: {1.5, 2.0, 2.5},
						75: {2.5, 3.0, 3.5},
					},
				},
			},
		},
	}

	copied := deepCopy(original)

	original["cpu_load"][schema.MetricScopeNode].Series[0].Data[0] = 999.0
	original["cpu_load"][schema.MetricScopeNode].StatisticsSeries.Min[0] = 888.0
	original["cpu_load"][schema.MetricScopeNode].StatisticsSeries.Percentiles[25][0] = 777.0

	if copied["cpu_load"][schema.MetricScopeNode].Series[0].Data[0] != 1.0 {
		t.Errorf("Series data was not deeply copied: got %v, want 1.0",
			copied["cpu_load"][schema.MetricScopeNode].Series[0].Data[0])
	}

	if copied["cpu_load"][schema.MetricScopeNode].StatisticsSeries.Min[0] != 1.0 {
		t.Errorf("StatisticsSeries was not deeply copied: got %v, want 1.0",
			copied["cpu_load"][schema.MetricScopeNode].StatisticsSeries.Min[0])
	}

	if copied["cpu_load"][schema.MetricScopeNode].StatisticsSeries.Percentiles[25][0] != 1.5 {
		t.Errorf("Percentiles was not deeply copied: got %v, want 1.5",
			copied["cpu_load"][schema.MetricScopeNode].StatisticsSeries.Percentiles[25][0])
	}

	if copied["cpu_load"][schema.MetricScopeNode].Timestep != 60 {
		t.Errorf("Timestep not copied correctly: got %v, want 60",
			copied["cpu_load"][schema.MetricScopeNode].Timestep)
	}

	if copied["cpu_load"][schema.MetricScopeNode].Series[0].Hostname != "node001" {
		t.Errorf("Hostname not copied correctly: got %v, want node001",
			copied["cpu_load"][schema.MetricScopeNode].Series[0].Hostname)
	}
}

func TestDeepCopyNilStatisticsSeries(t *testing.T) {
	original := schema.JobData{
		"mem_used": {
			schema.MetricScopeNode: &schema.JobMetric{
				Timestep: 60,
				Series: []schema.Series{
					{
						Hostname: "node001",
						Data:     []schema.Float{1.0, 2.0},
					},
				},
				StatisticsSeries: nil,
			},
		},
	}

	copied := deepCopy(original)

	if copied["mem_used"][schema.MetricScopeNode].StatisticsSeries != nil {
		t.Errorf("StatisticsSeries should be nil, got %v",
			copied["mem_used"][schema.MetricScopeNode].StatisticsSeries)
	}
}

func TestDeepCopyEmptyPercentiles(t *testing.T) {
	original := schema.JobData{
		"cpu_load": {
			schema.MetricScopeNode: &schema.JobMetric{
				Timestep: 60,
				Series:   []schema.Series{},
				StatisticsSeries: &schema.StatsSeries{
					Min:         []schema.Float{1.0},
					Mean:        []schema.Float{2.0},
					Median:      []schema.Float{2.0},
					Max:         []schema.Float{3.0},
					Percentiles: nil,
				},
			},
		},
	}

	copied := deepCopy(original)

	if copied["cpu_load"][schema.MetricScopeNode].StatisticsSeries.Percentiles != nil {
		t.Errorf("Percentiles should be nil when source is nil/empty")
	}
}
@@ -3,14 +3,15 @@
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package memorystore
package metricstore

import (
	"errors"
	"fmt"
	"math"

	"github.com/ClusterCockpit/cc-lib/schema"
	"github.com/ClusterCockpit/cc-lib/util"
	"github.com/ClusterCockpit/cc-lib/v2/schema"
	"github.com/ClusterCockpit/cc-lib/v2/util"
)

var (
@@ -124,6 +125,9 @@ func FetchData(req APIQueryRequest) (*APIQueryResponse, error) {
	req.WithData = true
	ms := GetMemoryStore()
	if ms == nil {
		return nil, fmt.Errorf("memorystore not initialized")
	}

	response := APIQueryResponse{
		Results: make([][]APIMetricData, 0, len(req.Queries)),
@@ -3,7 +3,7 @@
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package memorystore
package metricstore

import (
	"archive/zip"
@@ -18,13 +18,13 @@ import (
	"sync/atomic"
	"time"

	cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
	cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
)

func Archiving(wg *sync.WaitGroup, ctx context.Context) {
	go func() {
		defer wg.Done()
		d, err := time.ParseDuration(Keys.Archive.Interval)
		d, err := time.ParseDuration(Keys.Archive.ArchiveInterval)
		if err != nil {
			cclog.Fatalf("[METRICSTORE]> error parsing archive interval duration: %v\n", err)
		}
@@ -3,7 +3,7 @@
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package memorystore
package metricstore

import (
	"bufio"
@@ -19,13 +19,15 @@ import (
	"sync/atomic"
	"time"

	cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
	"github.com/ClusterCockpit/cc-lib/schema"
	cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
	"github.com/ClusterCockpit/cc-lib/v2/schema"
	"github.com/linkedin/goavro/v2"
)

var NumAvroWorkers int = DefaultAvroWorkers
var startUp bool = true
var (
	NumAvroWorkers int  = DefaultAvroWorkers
	startUp        bool = true
)

func (as *AvroStore) ToCheckpoint(dir string, dumpAll bool) (int, error) {
	levels := make([]*AvroLevel, 0)
@@ -3,7 +3,7 @@
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package memorystore
package metricstore

import (
	"context"
@@ -11,7 +11,7 @@ import (
	"strconv"
	"sync"

	cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
	cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
)

func DataStaging(wg *sync.WaitGroup, ctx context.Context) {
@@ -30,8 +30,51 @@ func DataStaging(wg *sync.WaitGroup, ctx context.Context) {
		for {
			select {
			case <-ctx.Done():
				return
			case val := <-LineProtocolMessages:
				// Drain any remaining messages in channel before exiting
				for {
					select {
					case val, ok := <-LineProtocolMessages:
						if !ok {
							// Channel closed
							return
						}
						// Process remaining message
						freq, err := GetMetricFrequency(val.MetricName)
						if err != nil {
							continue
						}

						metricName := ""
						for _, selectorName := range val.Selector {
							metricName += selectorName + SelectorDelimiter
						}
						metricName += val.MetricName

						var selector []string
						selector = append(selector, val.Cluster, val.Node, strconv.FormatInt(freq, 10))

						if !stringSlicesEqual(oldSelector, selector) {
							avroLevel = avroStore.root.findAvroLevelOrCreate(selector)
							if avroLevel == nil {
								cclog.Errorf("Error creating or finding the level with cluster : %s, node : %s, metric : %s\n", val.Cluster, val.Node, val.MetricName)
							}
							oldSelector = slices.Clone(selector)
						}

						if avroLevel != nil {
							avroLevel.addMetric(metricName, val.Value, val.Timestamp, int(freq))
						}
					default:
						// No more messages, exit
						return
					}
				}
			case val, ok := <-LineProtocolMessages:
				if !ok {
					// Channel closed, exit gracefully
					return
				}

				// Fetch the frequency of the metric from the global configuration
				freq, err := GetMetricFrequency(val.MetricName)
				if err != nil {
@@ -65,7 +108,9 @@ func DataStaging(wg *sync.WaitGroup, ctx context.Context) {
					oldSelector = slices.Clone(selector)
				}

				avroLevel.addMetric(metricName, val.Value, val.Timestamp, int(freq))
				if avroLevel != nil {
					avroLevel.addMetric(metricName, val.Value, val.Timestamp, int(freq))
				}
			}
		}
	}()
@@ -3,12 +3,12 @@
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package memorystore
package metricstore

import (
	"sync"

	"github.com/ClusterCockpit/cc-lib/schema"
	"github.com/ClusterCockpit/cc-lib/v2/schema"
)

var (
@@ -3,13 +3,13 @@
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package memorystore
package metricstore

import (
	"errors"
	"sync"

	"github.com/ClusterCockpit/cc-lib/schema"
	"github.com/ClusterCockpit/cc-lib/v2/schema"
)

// BufferCap is the default buffer capacity.
@@ -3,7 +3,7 @@
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package memorystore
package metricstore

import (
	"bufio"
@@ -23,8 +23,8 @@ import (
	"sync/atomic"
	"time"

	cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
	"github.com/ClusterCockpit/cc-lib/schema"
	cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
	"github.com/ClusterCockpit/cc-lib/v2/schema"
	"github.com/linkedin/goavro/v2"
)

@@ -408,7 +408,6 @@ func (m *MemoryStore) FromCheckpointFiles(dir string, from int64) (int, error) {
		return m.FromCheckpoint(dir, from, altFormat)
	}

	cclog.Print("[METRICSTORE]> No valid checkpoint files found in the directory")
	return 0, nil
}

@@ -3,7 +3,7 @@
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package memorystore
package metricstore

import (
	"fmt"
@@ -19,38 +19,49 @@ const (
	DefaultAvroCheckpointInterval = time.Minute
)

var InternalCCMSFlag bool = false
type Checkpoints struct {
	FileFormat string `json:"file-format"`
	Interval   string `json:"interval"`
	RootDir    string `json:"directory"`
}

type Debug struct {
	DumpToFile string `json:"dump-to-file"`
	EnableGops bool   `json:"gops"`
}

type Archive struct {
	ArchiveInterval string `json:"interval"`
	RootDir         string `json:"directory"`
	DeleteInstead   bool   `json:"delete-instead"`
}

type Subscriptions []struct {
	// Channel name
	SubscribeTo string `json:"subscribe-to"`

	// Allow lines without a cluster tag, use this as default, optional
	ClusterTag string `json:"cluster-tag"`
}

type MetricStoreConfig struct {
	// Number of concurrent workers for checkpoint and archive operations.
	// If not set or 0, defaults to min(runtime.NumCPU()/2+1, 10)
	NumWorkers  int `json:"num-workers"`
	Checkpoints struct {
		FileFormat string `json:"file-format"`
		Interval   string `json:"interval"`
		RootDir    string `json:"directory"`
		Restore    string `json:"restore"`
	} `json:"checkpoints"`
	Debug struct {
		DumpToFile string `json:"dump-to-file"`
		EnableGops bool   `json:"gops"`
	} `json:"debug"`
	RetentionInMemory string `json:"retention-in-memory"`
	Archive           struct {
		Interval      string `json:"interval"`
		RootDir       string `json:"directory"`
		DeleteInstead bool   `json:"delete-instead"`
	} `json:"archive"`
	Subscriptions []struct {
		// Channel name
		SubscribeTo string `json:"subscribe-to"`

		// Allow lines without a cluster tag, use this as default, optional
		ClusterTag string `json:"cluster-tag"`
	} `json:"subscriptions"`
	NumWorkers        int            `json:"num-workers"`
	RetentionInMemory string         `json:"retention-in-memory"`
	MemoryCap         int            `json:"memory-cap"`
	Checkpoints       Checkpoints    `json:"checkpoints"`
	Debug             *Debug         `json:"debug"`
	Archive           *Archive       `json:"archive"`
	Subscriptions     *Subscriptions `json:"nats-subscriptions"`
}

var Keys MetricStoreConfig
var Keys MetricStoreConfig = MetricStoreConfig{
	Checkpoints: Checkpoints{
		FileFormat: "avro",
		RootDir:    "./var/checkpoints",
	},
}

// AggregationStrategy for aggregation over multiple values at different cpus/sockets/..., not time!
type AggregationStrategy int
internal/metricstore/configSchema.go (Normal file, 77 lines)
@@ -0,0 +1,77 @@
// Copyright (C) NHR@FAU, University Erlangen-Nuremberg.
// All rights reserved. This file is part of cc-backend.
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package metricstore

const configSchema = `{
"type": "object",
"description": "Configuration specific to built-in metric-store.",
"properties": {
"num-workers": {
"description": "Number of concurrent workers for checkpoint and archive operations",
"type": "integer"
},
"checkpoints": {
"description": "Configuration for checkpointing the metrics within metric-store",
"type": "object",
"properties": {
"file-format": {
"description": "Specify the type of checkpoint file. There are 2 variants: 'avro' and 'json'. If nothing is specified, 'avro' is default.",
"type": "string"
},
"interval": {
"description": "Interval at which the metrics should be checkpointed.",
"type": "string"
},
"directory": {
"description": "Specify the parent directy in which the checkpointed files should be placed.",
"type": "string"
}
},
"required": ["interval"]
},
"archive": {
"description": "Configuration for archiving the already checkpointed files.",
"type": "object",
"properties": {
"interval": {
"description": "Interval at which the checkpointed files should be archived.",
"type": "string"
},
"directory": {
"description": "Specify the directy in which the archived files should be placed.",
"type": "string"
}
},
"required": ["interval", "directory"]
},
"retention-in-memory": {
"description": "Keep the metrics within memory for given time interval. Retention for X hours, then the metrics would be freed.",
"type": "string"
},
"memory-cap": {
"description": "Upper memory capacity limit used by metricstore in GB",
"type": "integer"
},
"nats-subscriptions": {
"description": "Array of various subscriptions. Allows to subscibe to different subjects and publishers.",
"type": "array",
"items": {
"type": "object",
"properties": {
"subscribe-to": {
"description": "Channel name",
"type": "string"
},
"cluster-tag": {
"description": "Optional: Allow lines without a cluster tag, use this as default",
"type": "string"
}
}
}
}
},
"required": ["checkpoints", "retention-in-memory"]
}`
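For orientation, a minimal metric-store configuration block that would satisfy the schema above might look as follows. This is only a sketch: the interval durations, directory paths, subject name, and cluster tag are illustrative assumptions, not values taken from the repository; only the "avro" file format and the "./var/checkpoints" directory mirror the Go defaults set in the config shown earlier.

```json
{
  "checkpoints": {
    "file-format": "avro",
    "interval": "1h",
    "directory": "./var/checkpoints"
  },
  "archive": {
    "interval": "24h",
    "directory": "./var/archive"
  },
  "retention-in-memory": "48h",
  "num-workers": 4,
  "memory-cap": 16,
  "nats-subscriptions": [
    { "subscribe-to": "hpc-nats", "cluster-tag": "fritz" }
  ]
}
```

Per the schema, only "checkpoints" (with its "interval") and "retention-in-memory" are required; "archive", "num-workers", "memory-cap", and "nats-subscriptions" are optional and enable archiving, worker tuning, a memory ceiling, and NATS ingestion respectively.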
@@ -3,7 +3,7 @@
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package memorystore
package metricstore

import (
"bufio"
@@ -3,7 +3,7 @@
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package memorystore
package metricstore

import (
"bufio"
@@ -3,13 +3,13 @@
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package memorystore
package metricstore

import (
"sync"
"unsafe"

"github.com/ClusterCockpit/cc-lib/util"
"github.com/ClusterCockpit/cc-lib/v2/util"
)

// Could also be called "node" as this forms a node in a tree structure.
@@ -72,6 +72,29 @@ func (l *Level) findLevelOrCreate(selector []string, nMetrics int) *Level {
return child.findLevelOrCreate(selector[1:], nMetrics)
}

func (l *Level) collectPaths(currentDepth, targetDepth int, currentPath []string, results *[][]string) {
l.lock.RLock()
defer l.lock.RUnlock()

for key, child := range l.children {
if child == nil {
continue
}

// We explicitly make a new slice and copy data to avoid sharing underlying arrays between siblings
newPath := make([]string, len(currentPath))
copy(newPath, currentPath)
newPath = append(newPath, key)

// Check depth, and just return if depth reached
if currentDepth+1 == targetDepth {
*results = append(*results, newPath)
} else {
child.collectPaths(currentDepth+1, targetDepth, newPath, results)
}
}
}

func (l *Level) free(t int64) (int, error) {
l.lock.Lock()
defer l.lock.Unlock()
@@ -3,7 +3,7 @@
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package memorystore
package metricstore

import (
"context"
@@ -12,8 +12,8 @@ import (
"time"

"github.com/ClusterCockpit/cc-backend/pkg/nats"
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
"github.com/ClusterCockpit/cc-lib/schema"
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
"github.com/ClusterCockpit/cc-lib/v2/schema"
"github.com/influxdata/line-protocol/v2/lineprotocol"
)

@@ -29,29 +29,30 @@ func ReceiveNats(ms *MemoryStore,
}

var wg sync.WaitGroup

msgs := make(chan []byte, workers*2)

for _, sc := range Keys.Subscriptions {
for _, sc := range *Keys.Subscriptions {
clusterTag := sc.ClusterTag
if workers > 1 {
wg.Add(workers)

for range workers {
go func() {
defer wg.Done()
for m := range msgs {
dec := lineprotocol.NewDecoderWithBytes(m)
if err := DecodeLine(dec, ms, clusterTag); err != nil {
cclog.Errorf("error: %s", err.Error())
}
}

wg.Done()
}()
}

nc.Subscribe(sc.SubscribeTo, func(subject string, data []byte) {
msgs <- data
select {
case msgs <- data:
case <-ctx.Done():
}
})
} else {
nc.Subscribe(sc.SubscribeTo, func(subject string, data []byte) {
@@ -64,7 +65,11 @@ func ReceiveNats(ms *MemoryStore,
cclog.Infof("NATS subscription to '%s' established", sc.SubscribeTo)
}

close(msgs)
go func() {
<-ctx.Done()
close(msgs)
}()

wg.Wait()

return nil
@@ -3,7 +3,7 @@
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

// Package memorystore provides an efficient in-memory time-series metric storage system
// Package metricstore provides an efficient in-memory time-series metric storage system
// with support for hierarchical data organization, checkpointing, and archiving.
//
// The package organizes metrics in a tree structure (cluster → host → component) and
@@ -17,7 +17,7 @@
// - Concurrent checkpoint/archive workers
// - Support for sum and average aggregation
// - NATS integration for metric ingestion
package memorystore
package metricstore

import (
"bytes"
@@ -25,15 +25,16 @@ import (
"encoding/json"
"errors"
"runtime"
"slices"
"sync"
"time"

"github.com/ClusterCockpit/cc-backend/internal/config"
"github.com/ClusterCockpit/cc-backend/pkg/archive"
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
"github.com/ClusterCockpit/cc-lib/resampler"
"github.com/ClusterCockpit/cc-lib/schema"
"github.com/ClusterCockpit/cc-lib/util"
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
"github.com/ClusterCockpit/cc-lib/v2/resampler"
"github.com/ClusterCockpit/cc-lib/v2/schema"
"github.com/ClusterCockpit/cc-lib/v2/util"
)

var (
@@ -44,6 +45,15 @@ var (
shutdownFunc context.CancelFunc
)

// NodeProvider provides information about nodes currently in use by running jobs.
// This interface allows metricstore to query job information without directly
// depending on the repository package, breaking the import cycle.
type NodeProvider interface {
// GetUsedNodes returns a map of cluster names to sorted lists of unique hostnames
// that are currently in use by jobs that started before the given timestamp.
GetUsedNodes(ts int64) (map[string][]string, error)
}

type Metric struct {
Name string
Value schema.Float
@@ -51,8 +61,9 @@ type Metric struct {
}

type MemoryStore struct {
Metrics map[string]MetricConfig
root Level
Metrics map[string]MetricConfig
root Level
nodeProvider NodeProvider // Injected dependency for querying running jobs
}

func Init(rawConfig json.RawMessage, wg *sync.WaitGroup) {
@@ -61,7 +72,7 @@ func Init(rawConfig json.RawMessage, wg *sync.WaitGroup) {
if rawConfig != nil {
config.Validate(configSchema, rawConfig)
dec := json.NewDecoder(bytes.NewReader(rawConfig))
// dec.DisallowUnknownFields()
dec.DisallowUnknownFields()
if err := dec.Decode(&Keys); err != nil {
cclog.Abortf("[METRICSTORE]> Metric Store Config Init: Could not decode config file '%s'.\nError: %s\n", rawConfig, err.Error())
}
@@ -74,7 +85,7 @@ func Init(rawConfig json.RawMessage, wg *sync.WaitGroup) {
cclog.Debugf("[METRICSTORE]> Using %d workers for checkpoint/archive operations\n", Keys.NumWorkers)

// Helper function to add metric configuration
addMetricConfig := func(mc schema.MetricConfig) {
addMetricConfig := func(mc *schema.MetricConfig) {
agg, err := AssignAggregationStrategy(mc.Aggregation)
if err != nil {
cclog.Warnf("Could not find aggregation strategy for metric config '%s': %s", mc.Name, err.Error())
@@ -88,7 +99,7 @@ func Init(rawConfig json.RawMessage, wg *sync.WaitGroup) {

for _, c := range archive.Clusters {
for _, mc := range c.MetricConfig {
addMetricConfig(*mc)
addMetricConfig(mc)
}

for _, sc := range c.SubClusters {
@@ -103,7 +114,7 @@ func Init(rawConfig json.RawMessage, wg *sync.WaitGroup) {

ms := GetMemoryStore()

d, err := time.ParseDuration(Keys.Checkpoints.Restore)
d, err := time.ParseDuration(Keys.RetentionInMemory)
if err != nil {
cclog.Fatal(err)
}
@@ -128,11 +139,21 @@ func Init(rawConfig json.RawMessage, wg *sync.WaitGroup) {

ctx, shutdown := context.WithCancel(context.Background())

wg.Add(4)
retentionGoroutines := 1
checkpointingGoroutines := 1
dataStagingGoroutines := 1
archivingGoroutines := 0
if Keys.Archive != nil {
archivingGoroutines = 1
}
totalGoroutines := retentionGoroutines + checkpointingGoroutines + dataStagingGoroutines + archivingGoroutines
wg.Add(totalGoroutines)

Retention(wg, ctx)
Checkpointing(wg, ctx)
Archiving(wg, ctx)
if Keys.Archive != nil {
Archiving(wg, ctx)
}
DataStaging(wg, ctx)

// Note: Signal handling has been removed from this function.
@@ -141,9 +162,11 @@ func Init(rawConfig json.RawMessage, wg *sync.WaitGroup) {
// Store the shutdown function for later use by Shutdown()
shutdownFunc = shutdown

err = ReceiveNats(ms, 1, ctx)
if err != nil {
cclog.Fatal(err)
if Keys.Subscriptions != nil {
err = ReceiveNats(ms, 1, ctx)
if err != nil {
cclog.Fatal(err)
}
}
}

@@ -183,12 +206,23 @@ func GetMemoryStore() *MemoryStore {
return msInstance
}

// SetNodeProvider sets the NodeProvider implementation for the MemoryStore.
// This must be called during initialization to provide job state information
// for selective buffer retention during Free operations.
// If not set, the Free function will fall back to freeing all buffers.
func (ms *MemoryStore) SetNodeProvider(provider NodeProvider) {
ms.nodeProvider = provider
}

func Shutdown() {
// Cancel the context to signal all background goroutines to stop
if shutdownFunc != nil {
shutdownFunc()
}

if Keys.Checkpoints.FileFormat != "json" {
close(LineProtocolMessages)
}

cclog.Infof("[METRICSTORE]> Writing to '%s'...\n", Keys.Checkpoints.RootDir)
var files int
var err error
@@ -199,7 +233,6 @@ func Shutdown() {
files, err = ms.ToCheckpoint(Keys.Checkpoints.RootDir, lastCheckpoint.Unix(), time.Now().Unix())
} else {
files, err = GetAvroStore().ToCheckpoint(Keys.Checkpoints.RootDir, true)
close(LineProtocolMessages)
}

if err != nil {
@@ -208,15 +241,6 @@ func Shutdown() {
cclog.Infof("[METRICSTORE]> Done! (%d files written)\n", files)
}

func getName(m *MemoryStore, i int) string {
for key, val := range m.Metrics {
if val.offset == i {
return key
}
}
return ""
}

func Retention(wg *sync.WaitGroup, ctx context.Context) {
ms := GetMemoryStore()

@@ -244,7 +268,8 @@ func Retention(wg *sync.WaitGroup, ctx context.Context) {
case <-ticker.C:
t := time.Now().Add(-d)
cclog.Infof("[METRICSTORE]> start freeing buffers (older than %s)...\n", t.Format(time.RFC3339))
freed, err := ms.Free(nil, t.Unix())

freed, err := Free(ms, t)
if err != nil {
cclog.Errorf("[METRICSTORE]> freeing up buffers failed: %s\n", err.Error())
} else {
@@ -255,6 +280,104 @@ func Retention(wg *sync.WaitGroup, ctx context.Context) {
}()
}

func Free(ms *MemoryStore, t time.Time) (int, error) {
// If no NodeProvider is configured, free all buffers older than t
if ms.nodeProvider == nil {
return ms.Free(nil, t.Unix())
}

excludeSelectors, err := ms.nodeProvider.GetUsedNodes(t.Unix())
if err != nil {
return 0, err
}

// excludeSelectors := make(map[string][]string, 0)

// excludeSelectors := map[string][]string{
// "alex": {"a0122", "a0123", "a0225"},
// "fritz": {"f0201", "f0202"},
// }

switch lenMap := len(excludeSelectors); lenMap {

// If the length of the map returned by GetUsedNodes() is 0,
// then use default Free method with nil selector
case 0:
return ms.Free(nil, t.Unix())

// Else formulate selectors, exclude those from the map
// and free the rest of the selectors
default:
selectors := GetSelectors(ms, excludeSelectors)
return FreeSelected(ms, selectors, t)
}
}

// A function to free specific selectors. Used when we want to retain some specific nodes
// beyond the retention time.
func FreeSelected(ms *MemoryStore, selectors [][]string, t time.Time) (int, error) {
freed := 0

for _, selector := range selectors {

freedBuffers, err := ms.Free(selector, t.Unix())
if err != nil {
cclog.Errorf("error while freeing selected buffers: %#v", err)
}
freed += freedBuffers

}

return freed, nil
}

// This function will populate all the second last levels - meaning nodes
// From that we can exclude the specific selectosr/node we want to retain.
func GetSelectors(ms *MemoryStore, excludeSelectors map[string][]string) [][]string {
allSelectors := ms.GetPaths(2)

filteredSelectors := make([][]string, 0, len(allSelectors))

for _, path := range allSelectors {
if len(path) < 2 {
continue
}

key := path[0] // The "Key" (Level 1)
value := path[1] // The "Value" (Level 2)

exclude := false

// Check if the key exists in our exclusion map
if excludedValues, exists := excludeSelectors[key]; exists {
// The key exists, now check if the specific value is in the exclusion list
if slices.Contains(excludedValues, value) {
exclude = true
}
}

if !exclude {
filteredSelectors = append(filteredSelectors, path)
}
}

// fmt.Printf("All selectors: %#v\n\n", allSelectors)
// fmt.Printf("filteredSelectors: %#v\n\n", filteredSelectors)

return filteredSelectors
}

// GetPaths returns a list of lists (paths) to the specified depth.
func (ms *MemoryStore) GetPaths(targetDepth int) [][]string {
var results [][]string

// Start recursion. Initial path is empty.
// We treat Root as depth 0.
ms.root.collectPaths(0, targetDepth, []string{}, &results)

return results
}

// Write all values in `metrics` to the level specified by `selector` for time `ts`.
// Look at `findLevelOrCreate` for how selectors work.
func (m *MemoryStore) Write(selector []string, ts int64, metrics []Metric) error {
@@ -3,12 +3,12 @@
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package memorystore
package metricstore

import (
"testing"

"github.com/ClusterCockpit/cc-lib/schema"
"github.com/ClusterCockpit/cc-lib/v2/schema"
)

func TestAssignAggregationStrategy(t *testing.T) {
@@ -131,7 +131,7 @@ func TestBufferWrite(t *testing.T) {

func TestBufferRead(t *testing.T) {
b := newBuffer(100, 10)

// Write some test data
b.write(100, schema.Float(1.0))
b.write(110, schema.Float(2.0))
@@ -3,56 +3,41 @@
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package metricdata
package metricstore

import (
"context"
"encoding/json"
"fmt"
"strconv"
"strings"
"time"

"github.com/ClusterCockpit/cc-backend/internal/memorystore"
"github.com/ClusterCockpit/cc-backend/pkg/archive"
cclog "github.com/ClusterCockpit/cc-lib/ccLogger"
"github.com/ClusterCockpit/cc-lib/schema"
cclog "github.com/ClusterCockpit/cc-lib/v2/ccLogger"
"github.com/ClusterCockpit/cc-lib/v2/schema"
)

// Bloat Code
type CCMetricStoreConfigInternal struct {
Kind string `json:"kind"`
Url string `json:"url"`
Token string `json:"token"`
// TestLoadDataCallback allows tests to override LoadData behavior
var TestLoadDataCallback func(job *schema.Job, metrics []string, scopes []schema.MetricScope, ctx context.Context, resolution int) (schema.JobData, error)

// If metrics are known to this MetricDataRepository under a different
// name than in the `metricConfig` section of the 'cluster.json',
// provide this optional mapping of local to remote name for this metric.
Renamings map[string]string `json:"metricRenamings"`
}

// Bloat Code
type CCMetricStoreInternal struct{}

// Bloat Code
func (ccms *CCMetricStoreInternal) Init(rawConfig json.RawMessage) error {
return nil
}

func (ccms *CCMetricStoreInternal) LoadData(
func LoadData(
job *schema.Job,
metrics []string,
scopes []schema.MetricScope,
ctx context.Context,
resolution int,
) (schema.JobData, error) {
queries, assignedScope, err := ccms.buildQueries(job, metrics, scopes, int64(resolution))
if TestLoadDataCallback != nil {
return TestLoadDataCallback(job, metrics, scopes, ctx, resolution)
}

queries, assignedScope, err := buildQueries(job, metrics, scopes, int64(resolution))
if err != nil {
cclog.Errorf("Error while building queries for jobId %d, Metrics %v, Scopes %v: %s", job.JobID, metrics, scopes, err.Error())
return nil, err
}

req := memorystore.APIQueryRequest{
req := APIQueryRequest{
Cluster: job.Cluster,
From: job.StartTime,
To: job.StartTime + int64(job.Duration),
@@ -61,7 +46,7 @@ func (ccms *CCMetricStoreInternal) LoadData(
WithData: true,
}

resBody, err := memorystore.FetchData(req)
resBody, err := FetchData(req)
if err != nil {
cclog.Errorf("Error while fetching data : %s", err.Error())
return nil, err
@@ -149,13 +134,13 @@
acceleratorString = string(schema.MetricScopeAccelerator)
)

func (ccms *CCMetricStoreInternal) buildQueries(
func buildQueries(
job *schema.Job,
metrics []string,
scopes []schema.MetricScope,
resolution int64,
) ([]memorystore.APIQuery, []schema.MetricScope, error) {
queries := make([]memorystore.APIQuery, 0, len(metrics)*len(scopes)*len(job.Resources))
) ([]APIQuery, []schema.MetricScope, error) {
queries := make([]APIQuery, 0, len(metrics)*len(scopes)*len(job.Resources))
assignedScope := []schema.MetricScope{}

subcluster, scerr := archive.GetSubCluster(job.Cluster, job.SubCluster)
@@ -217,7 +202,7 @@ func (ccms *CCMetricStoreInternal) buildQueries(
continue
}

queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: host.Hostname,
Aggregate: false,
@@ -235,7 +220,7 @@ func (ccms *CCMetricStoreInternal) buildQueries(
continue
}

queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: host.Hostname,
Aggregate: true,
@@ -249,7 +234,7 @@ func (ccms *CCMetricStoreInternal) buildQueries(

// HWThread -> HWThead
if nativeScope == schema.MetricScopeHWThread && scope == schema.MetricScopeHWThread {
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: host.Hostname,
Aggregate: false,
@@ -265,7 +250,7 @@ func (ccms *CCMetricStoreInternal) buildQueries(
if nativeScope == schema.MetricScopeHWThread && scope == schema.MetricScopeCore {
cores, _ := topology.GetCoresFromHWThreads(hwthreads)
for _, core := range cores {
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: host.Hostname,
Aggregate: true,
@@ -282,7 +267,7 @@ func (ccms *CCMetricStoreInternal) buildQueries(
if nativeScope == schema.MetricScopeHWThread && scope == schema.MetricScopeSocket {
sockets, _ := topology.GetSocketsFromHWThreads(hwthreads)
for _, socket := range sockets {
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: host.Hostname,
Aggregate: true,
@@ -297,7 +282,7 @@ func (ccms *CCMetricStoreInternal) buildQueries(

// HWThread -> Node
if nativeScope == schema.MetricScopeHWThread && scope == schema.MetricScopeNode {
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: host.Hostname,
Aggregate: true,
@@ -312,7 +297,7 @@ func (ccms *CCMetricStoreInternal) buildQueries(
// Core -> Core
if nativeScope == schema.MetricScopeCore && scope == schema.MetricScopeCore {
cores, _ := topology.GetCoresFromHWThreads(hwthreads)
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: host.Hostname,
Aggregate: false,
@@ -328,7 +313,7 @@ func (ccms *CCMetricStoreInternal) buildQueries(
if nativeScope == schema.MetricScopeCore && scope == schema.MetricScopeSocket {
sockets, _ := topology.GetSocketsFromCores(hwthreads)
for _, socket := range sockets {
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: host.Hostname,
Aggregate: true,
@@ -344,7 +329,7 @@ func (ccms *CCMetricStoreInternal) buildQueries(
// Core -> Node
if nativeScope == schema.MetricScopeCore && scope == schema.MetricScopeNode {
cores, _ := topology.GetCoresFromHWThreads(hwthreads)
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: host.Hostname,
Aggregate: true,
@@ -359,7 +344,7 @@ func (ccms *CCMetricStoreInternal) buildQueries(
// MemoryDomain -> MemoryDomain
if nativeScope == schema.MetricScopeMemoryDomain && scope == schema.MetricScopeMemoryDomain {
sockets, _ := topology.GetMemoryDomainsFromHWThreads(hwthreads)
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: host.Hostname,
Aggregate: false,
@@ -374,7 +359,7 @@ func (ccms *CCMetricStoreInternal) buildQueries(
// MemoryDoman -> Node
if nativeScope == schema.MetricScopeMemoryDomain && scope == schema.MetricScopeNode {
sockets, _ := topology.GetMemoryDomainsFromHWThreads(hwthreads)
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: host.Hostname,
Aggregate: true,
@@ -389,7 +374,7 @@ func (ccms *CCMetricStoreInternal) buildQueries(
// Socket -> Socket
if nativeScope == schema.MetricScopeSocket && scope == schema.MetricScopeSocket {
sockets, _ := topology.GetSocketsFromHWThreads(hwthreads)
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: host.Hostname,
Aggregate: false,
@@ -404,7 +389,7 @@ func (ccms *CCMetricStoreInternal) buildQueries(
// Socket -> Node
if nativeScope == schema.MetricScopeSocket && scope == schema.MetricScopeNode {
sockets, _ := topology.GetSocketsFromHWThreads(hwthreads)
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: host.Hostname,
Aggregate: true,
@@ -418,7 +403,7 @@ func (ccms *CCMetricStoreInternal) buildQueries(

// Node -> Node
if nativeScope == schema.MetricScopeNode && scope == schema.MetricScopeNode {
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: host.Hostname,
Resolution: resolution,
@@ -435,18 +420,18 @@ func (ccms *CCMetricStoreInternal) buildQueries(
return queries, assignedScope, nil
}

func (ccms *CCMetricStoreInternal) LoadStats(
func LoadStats(
job *schema.Job,
metrics []string,
ctx context.Context,
) (map[string]map[string]schema.MetricStatistics, error) {
queries, _, err := ccms.buildQueries(job, metrics, []schema.MetricScope{schema.MetricScopeNode}, 0) // #166 Add scope shere for analysis view accelerator normalization?
queries, _, err := buildQueries(job, metrics, []schema.MetricScope{schema.MetricScopeNode}, 0) // #166 Add scope shere for analysis view accelerator normalization?
if err != nil {
cclog.Errorf("Error while building queries for jobId %d, Metrics %v: %s", job.JobID, metrics, err.Error())
return nil, err
}

req := memorystore.APIQueryRequest{
req := APIQueryRequest{
Cluster: job.Cluster,
From: job.StartTime,
To: job.StartTime + int64(job.Duration),
@@ -455,7 +440,7 @@ func (ccms *CCMetricStoreInternal) LoadStats(
WithData: false,
}

resBody, err := memorystore.FetchData(req)
resBody, err := FetchData(req)
if err != nil {
cclog.Errorf("Error while fetching data : %s", err.Error())
return nil, err
@@ -492,20 +477,19 @@ func (ccms *CCMetricStoreInternal) LoadStats(
return stats, nil
}

// Used for Job-View Statistics Table
func (ccms *CCMetricStoreInternal) LoadScopedStats(
func LoadScopedStats(
job *schema.Job,
metrics []string,
scopes []schema.MetricScope,
ctx context.Context,
) (schema.ScopedJobStats, error) {
queries, assignedScope, err := ccms.buildQueries(job, metrics, scopes, 0)
queries, assignedScope, err := buildQueries(job, metrics, scopes, 0)
if err != nil {
cclog.Errorf("Error while building queries for jobId %d, Metrics %v, Scopes %v: %s", job.JobID, metrics, scopes, err.Error())
return nil, err
}

req := memorystore.APIQueryRequest{
req := APIQueryRequest{
Cluster: job.Cluster,
From: job.StartTime,
To: job.StartTime + int64(job.Duration),
@@ -514,7 +498,7 @@ func (ccms *CCMetricStoreInternal) LoadScopedStats(
WithData: false,
}

resBody, err := memorystore.FetchData(req)
resBody, err := FetchData(req)
if err != nil {
cclog.Errorf("Error while fetching data : %s", err.Error())
return nil, err
@@ -583,15 +567,14 @@ func (ccms *CCMetricStoreInternal) LoadScopedStats(
return scopedJobStats, nil
}

// Used for Systems-View Node-Overview
func (ccms *CCMetricStoreInternal) LoadNodeData(
func LoadNodeData(
cluster string,
metrics, nodes []string,
scopes []schema.MetricScope,
from, to time.Time,
ctx context.Context,
) (map[string]map[string][]*schema.JobMetric, error) {
req := memorystore.APIQueryRequest{
req := APIQueryRequest{
Cluster: cluster,
From: from.Unix(),
To: to.Unix(),
@@ -604,7 +587,7 @@ func (ccms *CCMetricStoreInternal) LoadNodeData(
} else {
for _, node := range nodes {
for _, metric := range metrics {
req.Queries = append(req.Queries, memorystore.APIQuery{
req.Queries = append(req.Queries, APIQuery{
Hostname: node,
Metric: metric,
Resolution: 0, // Default for Node Queries: Will return metric $Timestep Resolution
@@ -613,7 +596,7 @@ func (ccms *CCMetricStoreInternal) LoadNodeData(
}
}

resBody, err := memorystore.FetchData(req)
resBody, err := FetchData(req)
if err != nil {
cclog.Errorf("Error while fetching data : %s", err.Error())
return nil, err
@@ -622,7 +605,7 @@ func (ccms *CCMetricStoreInternal) LoadNodeData(
var errors []string
data := make(map[string]map[string][]*schema.JobMetric)
for i, res := range resBody.Results {
var query memorystore.APIQuery
var query APIQuery
if resBody.Queries != nil {
query = resBody.Queries[i]
} else {
@@ -673,8 +656,7 @@ func (ccms *CCMetricStoreInternal) LoadNodeData(
return data, nil
}

// Used for Systems-View Node-List
func (ccms *CCMetricStoreInternal) LoadNodeListData(
func LoadNodeListData(
cluster, subCluster string,
nodes []string,
metrics []string,
@@ -683,15 +665,14 @@ func (ccms *CCMetricStoreInternal) LoadNodeListData(
from, to time.Time,
ctx context.Context,
) (map[string]schema.JobData, error) {

// Note: Order of node data is not guaranteed after this point
queries, assignedScope, err := ccms.buildNodeQueries(cluster, subCluster, nodes, metrics, scopes, int64(resolution))
queries, assignedScope, err := buildNodeQueries(cluster, subCluster, nodes, metrics, scopes, int64(resolution))
if err != nil {
cclog.Errorf("Error while building node queries for Cluster %s, SubCLuster %s, Metrics %v, Scopes %v: %s", cluster, subCluster, metrics, scopes, err.Error())
return nil, err
}

req := memorystore.APIQueryRequest{
req := APIQueryRequest{
Cluster: cluster,
Queries: queries,
From: from.Unix(),
@@ -700,7 +681,7 @@ func (ccms *CCMetricStoreInternal) LoadNodeListData(
WithData: true,
}

resBody, err := memorystore.FetchData(req)
resBody, err := FetchData(req)
if err != nil {
cclog.Errorf("Error while fetching data : %s", err.Error())
return nil, err
@@ -709,7 +690,7 @@ func (ccms *CCMetricStoreInternal) LoadNodeListData(
var errors []string
data := make(map[string]schema.JobData)
for i, row := range resBody.Results {
var query memorystore.APIQuery
var query APIQuery
if resBody.Queries != nil {
query = resBody.Queries[i]
} else {
@@ -789,15 +770,15 @@ func (ccms *CCMetricStoreInternal) LoadNodeListData(
return data, nil
}

func (ccms *CCMetricStoreInternal) buildNodeQueries(
func buildNodeQueries(
cluster string,
subCluster string,
nodes []string,
metrics []string,
scopes []schema.MetricScope,
resolution int64,
) ([]memorystore.APIQuery, []schema.MetricScope, error) {
queries := make([]memorystore.APIQuery, 0, len(metrics)*len(scopes)*len(nodes))
) ([]APIQuery, []schema.MetricScope, error) {
queries := make([]APIQuery, 0, len(metrics)*len(scopes)*len(nodes))
assignedScope := []schema.MetricScope{}

// Get Topol before loop if subCluster given
@@ -812,7 +793,6 @@ func (ccms *CCMetricStoreInternal) buildNodeQueries(
}

for _, metric := range metrics {
metric := metric
mc := archive.GetMetricConfig(cluster, metric)
if mc == nil {
// return nil, fmt.Errorf("METRICDATA/CCMS > metric '%s' is not specified for cluster '%s'", metric, cluster)
@@ -880,7 +860,7 @@ func (ccms *CCMetricStoreInternal) buildNodeQueries(
continue
}

queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: hostname,
Aggregate: false,
@@ -898,7 +878,7 @@ func (ccms *CCMetricStoreInternal) buildNodeQueries(
continue
}

queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: hostname,
Aggregate: true,
@@ -912,7 +892,7 @@ func (ccms *CCMetricStoreInternal) buildNodeQueries(

// HWThread -> HWThead
if nativeScope == schema.MetricScopeHWThread && scope == schema.MetricScopeHWThread {
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: hostname,
Aggregate: false,
@@ -928,7 +908,7 @@ func (ccms *CCMetricStoreInternal) buildNodeQueries(
if nativeScope == schema.MetricScopeHWThread && scope == schema.MetricScopeCore {
cores, _ := topology.GetCoresFromHWThreads(topology.Node)
for _, core := range cores {
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: hostname,
Aggregate: true,
@@ -945,7 +925,7 @@ func (ccms *CCMetricStoreInternal) buildNodeQueries(
if nativeScope == schema.MetricScopeHWThread && scope == schema.MetricScopeSocket {
sockets, _ := topology.GetSocketsFromHWThreads(topology.Node)
for _, socket := range sockets {
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: hostname,
Aggregate: true,
@@ -960,7 +940,7 @@ func (ccms *CCMetricStoreInternal) buildNodeQueries(

// HWThread -> Node
if nativeScope == schema.MetricScopeHWThread && scope == schema.MetricScopeNode {
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: hostname,
Aggregate: true,
@@ -975,7 +955,7 @@ func (ccms *CCMetricStoreInternal) buildNodeQueries(
// Core -> Core
if nativeScope == schema.MetricScopeCore && scope == schema.MetricScopeCore {
cores, _ := topology.GetCoresFromHWThreads(topology.Node)
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: hostname,
Aggregate: false,
@@ -991,7 +971,7 @@ func (ccms *CCMetricStoreInternal) buildNodeQueries(
if nativeScope == schema.MetricScopeCore && scope == schema.MetricScopeSocket {
sockets, _ := topology.GetSocketsFromCores(topology.Node)
for _, socket := range sockets {
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: hostname,
Aggregate: true,
@@ -1007,7 +987,7 @@ func (ccms *CCMetricStoreInternal) buildNodeQueries(
// Core -> Node
if nativeScope == schema.MetricScopeCore && scope == schema.MetricScopeNode {
cores, _ := topology.GetCoresFromHWThreads(topology.Node)
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: hostname,
Aggregate: true,
@@ -1022,7 +1002,7 @@ func (ccms *CCMetricStoreInternal) buildNodeQueries(
// MemoryDomain -> MemoryDomain
if nativeScope == schema.MetricScopeMemoryDomain && scope == schema.MetricScopeMemoryDomain {
sockets, _ := topology.GetMemoryDomainsFromHWThreads(topology.Node)
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: hostname,
Aggregate: false,
@@ -1037,7 +1017,7 @@ func (ccms *CCMetricStoreInternal) buildNodeQueries(
// MemoryDoman -> Node
if nativeScope == schema.MetricScopeMemoryDomain && scope == schema.MetricScopeNode {
sockets, _ := topology.GetMemoryDomainsFromHWThreads(topology.Node)
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: hostname,
Aggregate: true,
@@ -1052,7 +1032,7 @@ func (ccms *CCMetricStoreInternal) buildNodeQueries(
// Socket -> Socket
if nativeScope == schema.MetricScopeSocket && scope == schema.MetricScopeSocket {
sockets, _ := topology.GetSocketsFromHWThreads(topology.Node)
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: hostname,
Aggregate: false,
@@ -1067,7 +1047,7 @@ func (ccms *CCMetricStoreInternal) buildNodeQueries(
// Socket -> Node
if nativeScope == schema.MetricScopeSocket && scope == schema.MetricScopeNode {
sockets, _ := topology.GetSocketsFromHWThreads(topology.Node)
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: hostname,
Aggregate: true,
@@ -1081,7 +1061,7 @@ func (ccms *CCMetricStoreInternal) buildNodeQueries(

// Node -> Node
if nativeScope == schema.MetricScopeNode && scope == schema.MetricScopeNode {
queries = append(queries, memorystore.APIQuery{
queries = append(queries, APIQuery{
Metric: metric,
Hostname: hostname,
Resolution: resolution,
@@ -3,13 +3,13 @@
// Use of this source code is governed by a MIT-style
// license that can be found in the LICENSE file.

package memorystore
package metricstore

import (
"errors"
"math"

"github.com/ClusterCockpit/cc-lib/util"
"github.com/ClusterCockpit/cc-lib/v2/util"
)

type Stats struct {
Some files were not shown because too many files have changed in this diff.