diff --git a/CLAUDE.md b/CLAUDE.md
index 379b4db..67412a7 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -100,11 +100,15 @@ The backend follows a layered architecture with clear separation of concerns:
   - Pluggable backends: cc-metric-store, Prometheus, InfluxDB
   - Each cluster can have a different metric data backend
 - **internal/archiver**: Job archiving to file-based archive
+- **internal/api/nats.go**: NATS-based API for job and node operations
+  - Subscribes to NATS subjects for job events (start/stop)
+  - Handles node state updates via NATS
+  - Uses InfluxDB line protocol message format
 - **pkg/archive**: Job archive backend implementations
   - File system backend (default)
   - S3 backend
   - SQLite backend (experimental)
-- **pkg/nats**: NATS integration for metric ingestion
+- **pkg/nats**: NATS client and message decoding utilities
 
 ### Frontend Structure
 
@@ -146,6 +150,14 @@ applied automatically on startup. Version tracking in `version` table.
 ## Configuration
 
 - **config.json**: Main configuration (clusters, metric repositories, archive settings)
+  - `main.apiSubjects`: NATS subject configuration (optional)
+    - `subjectJobEvent`: Subject for job start/stop events (e.g., "cc.job.event")
+    - `subjectNodeState`: Subject for node state updates (e.g., "cc.node.state")
+  - `nats`: NATS client connection configuration (optional)
+    - `address`: NATS server address (e.g., "nats://localhost:4222")
+    - `username`: Authentication username (optional)
+    - `password`: Authentication password (optional)
+    - `creds-file-path`: Path to NATS credentials file (optional)
 - **.env**: Environment variables (secrets like JWT keys)
   - Copy from `configs/env-template.txt`
   - NEVER commit this file
@@ -207,9 +219,87 @@ applied automatically on startup. Version tracking in `version` table.
 2. Increment `repository.Version`
 3. Test with fresh database and existing database
 
+## NATS API
+
+The backend supports a NATS-based API as an alternative to the REST API for job and node operations.
+
+### Setup
+
+1. Configure NATS client connection in `config.json`:
+   ```json
+   {
+     "nats": {
+       "address": "nats://localhost:4222",
+       "username": "user",
+       "password": "pass"
+     }
+   }
+   ```
+
+2. Configure API subjects in `config.json` under `main`:
+   ```json
+   {
+     "main": {
+       "apiSubjects": {
+         "subjectJobEvent": "cc.job.event",
+         "subjectNodeState": "cc.node.state"
+       }
+     }
+   }
+   ```
+
+### Message Format
+
+Messages use **InfluxDB line protocol** format with the following structure:
+
+#### Job Events
+
+**Start Job:**
+```
+job,function=start_job event="{\"jobId\":123,\"user\":\"alice\",\"cluster\":\"test\", ...}" 1234567890000000000
+```
+
+**Stop Job:**
+```
+job,function=stop_job event="{\"jobId\":123,\"cluster\":\"test\",\"startTime\":1234567890,\"stopTime\":1234571490,\"jobState\":\"completed\"}" 1234571490000000000
+```
+
+**Tags:**
+- `function`: Either `start_job` or `stop_job`
+
+**Fields:**
+- `event`: JSON payload containing job data (see REST API documentation for schema)
+
+#### Node State Updates
+
+```json
+{
+  "cluster": "testcluster",
+  "nodes": [
+    {
+      "hostname": "node001",
+      "states": ["allocated"],
+      "cpusAllocated": 8,
+      "memoryAllocated": 16384,
+      "gpusAllocated": 0,
+      "jobsRunning": 1
+    }
+  ]
+}
+```
+
+### Implementation Notes
+
+- NATS API mirrors REST API functionality but uses messaging
+- Job start/stop events are processed asynchronously
+- Duplicate job detection is handled (same as REST API)
+- All validation rules from REST API apply
+- Messages are logged; no responses are sent back to publishers
+- If NATS client is unavailable, API subscriptions are skipped (logged as warning)
+
 ## Dependencies
 
 - Go 1.24.0+ (check go.mod for exact version)
 - Node.js (for frontend builds)
 - SQLite 3 (only supported database)
-- Optional: NATS server for metric ingestion
+- Optional: NATS server for NATS API integration
diff --git a/README.md b/README.md
index a0352d1..468a12a 100644
--- a/README.md
+++ b/README.md
@@ -22,11 +22,12 @@ switching from PHP Symfony to a Golang based solution are explained
 ## Overview
 
 This is a Golang web backend for the ClusterCockpit job-specific performance
-monitoring framework. It provides a REST API for integrating ClusterCockpit with
-an HPC cluster batch system and external analysis scripts. Data exchange between
-the web front-end and the back-end is based on a GraphQL API. The web frontend
-is also served by the backend using [Svelte](https://svelte.dev/) components.
-Layout and styling are based on [Bootstrap 5](https://getbootstrap.com/) using
+monitoring framework. It provides a REST API and an optional NATS-based messaging
+API for integrating ClusterCockpit with an HPC cluster batch system and external
+analysis scripts. Data exchange between the web front-end and the back-end is
+based on a GraphQL API. The web frontend is also served by the backend using
+[Svelte](https://svelte.dev/) components. Layout and styling are based on
+[Bootstrap 5](https://getbootstrap.com/) using
 [Bootstrap Icons](https://icons.getbootstrap.com/).
 
 The backend uses [SQLite 3](https://sqlite.org/) as the relational SQL database.
@@ -35,6 +36,10 @@ databases, the only tested and supported setup is to use cc-metric-store as
 the metric data backend. Documentation on how to integrate ClusterCockpit with
 other time series databases will be added in the future.
 
+For real-time integration with HPC systems, the backend can subscribe to
+[NATS](https://nats.io/) subjects to receive job start/stop events and node
+state updates, providing an alternative to REST API polling.
+
 Completed batch jobs are stored in a file-based job archive according to
 [this specification](https://github.com/ClusterCockpit/cc-specifications/tree/master/job-archive).
 The backend supports authentication via local accounts, an external LDAP
@@ -130,27 +135,60 @@ ln -s ./var/job-archive
 
 ## Project file structure
 
+- [`.github/`](https://github.com/ClusterCockpit/cc-backend/tree/master/.github)
+  GitHub Actions workflows and dependabot configuration for CI/CD.
 - [`api/`](https://github.com/ClusterCockpit/cc-backend/tree/master/api)
   contains the API schema files for the REST and GraphQL APIs. The REST API is
   documented in the OpenAPI 3.0 format in
-  [./api/openapi.yaml](./api/openapi.yaml).
+  [./api/swagger.yaml](./api/swagger.yaml). The GraphQL schema is in
+  [./api/schema.graphqls](./api/schema.graphqls).
 - [`cmd/cc-backend`](https://github.com/ClusterCockpit/cc-backend/tree/master/cmd/cc-backend)
-  contains `main.go` for the main application.
+  contains the main application entry point and CLI implementation.
 - [`configs/`](https://github.com/ClusterCockpit/cc-backend/tree/master/configs)
   contains documentation about configuration and command line options and required
-  environment variables. A sample configuration file is provided.
-- [`docs/`](https://github.com/ClusterCockpit/cc-backend/tree/master/docs)
-  contains more in-depth documentation.
+  environment variables. Sample configuration files are provided.
 - [`init/`](https://github.com/ClusterCockpit/cc-backend/tree/master/init)
   contains an example of setting up systemd for production use.
 - [`internal/`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal)
   contains library source code that is not intended for use by others.
+  - [`api`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/api)
+    REST API handlers and NATS integration
+  - [`archiver`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/archiver)
+    Job archiving functionality
+  - [`auth`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/auth)
+    Authentication (local, LDAP, OIDC) and JWT token handling
+  - [`config`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/config)
+    Configuration management and validation
+  - [`graph`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/graph)
+    GraphQL schema and resolvers
+  - [`importer`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/importer)
+    Job data import and database initialization
+  - [`memorystore`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/memorystore)
+    In-memory metric data store with checkpointing
+  - [`metricdata`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/metricdata)
+    Metric data repository implementations (cc-metric-store, Prometheus)
+  - [`metricDataDispatcher`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/metricDataDispatcher)
+    Dispatches metric data loading to appropriate backends
+  - [`repository`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/repository)
+    Database repository layer for jobs and metadata
+  - [`routerConfig`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/routerConfig)
+    HTTP router configuration and middleware
+  - [`tagger`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/tagger)
+    Job classification and application detection
+  - [`taskmanager`](https://github.com/ClusterCockpit/cc-backend/tree/master/internal/taskmanager)
+    Background task management and scheduled jobs
 - [`pkg/`](https://github.com/ClusterCockpit/cc-backend/tree/master/pkg)
   contains Go packages that can be used by other projects.
+  - [`archive`](https://github.com/ClusterCockpit/cc-backend/tree/master/pkg/archive)
+    Job archive backend implementations (filesystem, S3)
+  - [`nats`](https://github.com/ClusterCockpit/cc-backend/tree/master/pkg/nats)
+    NATS client and message handling
 - [`tools/`](https://github.com/ClusterCockpit/cc-backend/tree/master/tools)
   Additional command line helper tools.
   - [`archive-manager`](https://github.com/ClusterCockpit/cc-backend/tree/master/tools/archive-manager)
-    Commands for getting infos about and existing job archive.
+    Commands for getting infos about an existing job archive.
+  - [`archive-migration`](https://github.com/ClusterCockpit/cc-backend/tree/master/tools/archive-migration)
+    Tool for migrating job archives between formats.
   - [`convert-pem-pubkey`](https://github.com/ClusterCockpit/cc-backend/tree/master/tools/convert-pem-pubkey)
     Tool to convert external pubkey for use in `cc-backend`.
   - [`gen-keypair`](https://github.com/ClusterCockpit/cc-backend/tree/master/tools/gen-keypair)
@@ -162,7 +200,7 @@ ln -s ./var/job-archive
   - [`frontend`](https://github.com/ClusterCockpit/cc-backend/tree/master/web/frontend)
     Svelte components and static assets for the frontend UI
   - [`templates`](https://github.com/ClusterCockpit/cc-backend/tree/master/web/templates)
-    Server-side Go templates
+    Server-side Go templates, including monitoring views
 - [`gqlgen.yml`](https://github.com/ClusterCockpit/cc-backend/blob/master/gqlgen.yml)
   Configures the behaviour and generation of
   [gqlgen](https://github.com/99designs/gqlgen).
diff --git a/configs/config-demo.json b/configs/config-demo.json
index 58366fb..aa38831 100644
--- a/configs/config-demo.json
+++ b/configs/config-demo.json
@@ -5,14 +5,9 @@
     "resampling": {
       "minimumPoints": 600,
       "trigger": 180,
-      "resolutions": [
-        240,
-        60
-      ]
+      "resolutions": [240, 60]
     },
-    "apiAllowedIPs": [
-      "*"
-    ],
+    "apiAllowedIPs": ["*"],
     "emission-constant": 317
   },
   "cron": {
@@ -103,4 +98,5 @@
       }
     ]
   }
-}
\ No newline at end of file
+}
+
diff --git a/configs/config.json b/configs/config.json
index 88a9e93..41d8eca 100644
--- a/configs/config.json
+++ b/configs/config.json
@@ -15,6 +15,10 @@
         240,
         60
       ]
+    },
+    "apiSubjects": {
+      "subjectJobEvent": "cc.job.event",
+      "subjectNodeState": "cc.node.state"
     }
   },
   "cron": {
diff --git a/internal/config/schema.go b/internal/config/schema.go
index b171f96..ff8d0c9 100644
--- a/internal/config/schema.go
+++ b/internal/config/schema.go
@@ -119,6 +119,21 @@ var configSchema = `
             }
           },
           "required": ["trigger", "resolutions"]
+        },
+        "apiSubjects": {
+          "description": "NATS subjects configuration for subscribing to job and node events.",
+          "type": "object",
+          "properties": {
+            "subjectJobEvent": {
+              "description": "NATS subject for job events (start_job, stop_job)",
+              "type": "string"
+            },
+            "subjectNodeState": {
+              "description": "NATS subject for node state updates",
+              "type": "string"
+            }
+          },
+          "required": ["subjectJobEvent", "subjectNodeState"]
         }
       },
       "required": ["apiAllowedIPs"]