cc-backend/internal/archiver

Archiver Package

The archiver package provides asynchronous job archiving functionality for ClusterCockpit. When jobs complete, their metric data is archived from the metric store to a persistent archive backend (filesystem, S3, SQLite, etc.).

Architecture

Producer-Consumer Pattern

┌──────────────┐     TriggerArchiving()      ┌───────────────┐
│  API Handler │  ───────────────────────▶   │ archiveChannel│
│ (Job Stop)   │                             │  (buffer: 128)│
└──────────────┘                             └───────┬───────┘
                                                     │
                   ┌─────────────────────────────────┘
                   │
                   ▼
         ┌──────────────────────┐
         │  archivingWorker()   │
         │   (goroutine)        │
         └──────────┬───────────┘
                    │
                    ▼
         1. Fetch job metadata
         2. Load metric data
         3. Calculate statistics
         4. Archive to backend
         5. Update database
         6. Call hooks

Components

  • archiveChannel: Buffered channel (128 jobs) for async communication
  • archivePending: WaitGroup tracking in-flight archiving operations
  • archivingWorker: Background goroutine processing archiving requests
  • shutdownCtx: Context for graceful cancellation during shutdown

Usage

Initialization

// Start archiver with context for shutdown control
ctx, cancel := context.WithCancel(context.Background())
defer cancel()

archiver.Start(jobRepository, ctx)

Archiving a Job

// Called automatically when a job completes
archiver.TriggerArchiving(job)

The function returns immediately as long as the channel buffer has free capacity (see Channel Buffer Size). The actual archiving happens in the background.

Graceful Shutdown

// Shutdown with 10 second timeout
if err := archiver.Shutdown(10 * time.Second); err != nil {
    log.Printf("Archiver shutdown timeout: %v", err)
}

Shutdown process:

  1. Closes the channel (no new jobs are accepted)
  2. Waits for pending jobs (up to the timeout)
  3. Cancels the context if the timeout is exceeded
  4. Waits for the worker to exit cleanly

Configuration

Channel Buffer Size

The archiving channel has a buffer of 128 jobs. If more than 128 jobs are queued simultaneously, TriggerArchiving() will block until space is available.

To adjust:

// In archiveWorker.go Start() function
archiveChannel = make(chan *schema.Job, 256) // Increase buffer
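
The blocking behavior of a full buffer can be seen in a small standalone example. The tryEnqueue helper is hypothetical and not part of the archiver package; it uses a non-blocking send only to make the "buffer full" state visible, whereas a plain send would block at that point.

```go
package main

import "fmt"

// tryEnqueue attempts a non-blocking send and reports whether it succeeded.
// It is an illustrative helper, not part of the archiver package.
func tryEnqueue(queue chan<- int, job int) bool {
	select {
	case queue <- job:
		return true
	default:
		return false // buffer full: a plain `queue <- job` would block here
	}
}

func main() {
	queue := make(chan int, 2) // small buffer so the effect is easy to see

	fmt.Println(tryEnqueue(queue, 1), tryEnqueue(queue, 2), tryEnqueue(queue, 3))
	// prints: true true false

	fmt.Println(len(queue), cap(queue)) // prints: 2 2
}
```

A larger buffer only postpones the point at which senders block; it does not increase the rate at which the worker drains the queue.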

Scope Selection

Archive data scopes are automatically selected based on job size:

  • Node scope: Always included
  • Core scope: Included for jobs with ≤8 nodes (reduces data volume for large jobs)
  • Accelerator scope: Included if job used accelerators (NumAcc > 0)

To adjust the node threshold:

// In archiver.go ArchiveJob() function
if job.NumNodes <= 16 { // Change from 8 to 16
    scopes = append(scopes, schema.MetricScopeCore)
}
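
Taken together, the scope rules above amount to a small pure function. The sketch below is an assumption-laden illustration: the Scope type, its constants, and selectScopes are hypothetical stand-ins for the cc-lib/schema types and the logic inside ArchiveJob(), with the node threshold passed in as a parameter.

```go
package main

import "fmt"

// Scope is a hypothetical stand-in for the metric scope type in cc-lib/schema.
type Scope string

const (
	ScopeNode        Scope = "node"
	ScopeCore        Scope = "core"
	ScopeAccelerator Scope = "accelerator"
)

// selectScopes applies the three rules described above:
// node scope always, core scope up to coreNodeLimit nodes,
// accelerator scope only if the job used accelerators.
func selectScopes(numNodes, numAcc, coreNodeLimit int) []Scope {
	scopes := []Scope{ScopeNode}
	if numNodes <= coreNodeLimit {
		scopes = append(scopes, ScopeCore)
	}
	if numAcc > 0 {
		scopes = append(scopes, ScopeAccelerator)
	}
	return scopes
}

func main() {
	fmt.Println(selectScopes(4, 0, 8))  // prints [node core]
	fmt.Println(selectScopes(64, 2, 8)) // prints [node accelerator]
}
```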

Resolution

Data is archived at the highest available resolution (typically 60s intervals). To change this, adjust the last argument of LoadData, which is the resolution in seconds:

// In archiver.go ArchiveJob() function
jobData, err := metricdispatch.LoadData(job, allMetrics, scopes, ctx, 300)
// 0   = highest available resolution
// 300 = 5-minute resolution

Error Handling

Automatic Retry

The archiver does not automatically retry failed archiving operations. If archiving fails:

  1. Error is logged
  2. Job is marked as MonitoringStatusArchivingFailed in database
  3. Worker continues processing other jobs

Manual Retry

To re-archive failed jobs, query for jobs with MonitoringStatusArchivingFailed and call TriggerArchiving() again.
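
A manual retry pass might look like the sketch below. Everything in it is illustrative: the Job type and the status constants are placeholders for the cc-lib/schema definitions (the numeric values are made up), how the failed jobs are queried from the database is left out, and the trigger callback stands in for archiver.TriggerArchiving.

```go
package main

import "fmt"

// Job is a hypothetical stand-in for schema.Job.
type Job struct {
	ID               int64
	MonitoringStatus int32
}

// Placeholder status values; the real constants live in cc-lib/schema and
// their numeric values are not assumed here.
const (
	statusArchivingFailed     int32 = 1
	statusArchivingSuccessful int32 = 2
)

// retryFailed calls trigger for every job marked as failed and returns how
// many jobs were re-queued. trigger stands in for archiver.TriggerArchiving.
func retryFailed(jobs []*Job, trigger func(*Job)) int {
	requeued := 0
	for _, j := range jobs {
		if j.MonitoringStatus == statusArchivingFailed {
			trigger(j)
			requeued++
		}
	}
	return requeued
}

func main() {
	jobs := []*Job{
		{ID: 1, MonitoringStatus: statusArchivingSuccessful},
		{ID: 2, MonitoringStatus: statusArchivingFailed},
	}
	n := retryFailed(jobs, func(j *Job) { fmt.Println("re-archiving job", j.ID) })
	fmt.Println(n, "job(s) re-queued")
}
```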

Performance Considerations

Single Worker Goroutine

The archiver uses a single worker goroutine. For high-throughput systems:

  • The channel buffer (128 jobs) absorbs bursts, so TriggerArchiving() blocks only when the buffer is full
  • Archiving is typically I/O bound (writing to storage)
  • A single worker avoids overwhelming the storage backend

Shutdown Timeout

Recommended timeout values:

  • Development: 5-10 seconds
  • Production: 10-30 seconds
  • High-load: 30-60 seconds

Choose based on:

  • Average archiving time per job
  • Storage backend latency
  • Acceptable shutdown delay

Monitoring

Logging

The archiver logs:

  • Info: Startup, shutdown, successful completions
  • Debug: Individual job archiving times
  • Error: Archiving failures with job ID and reason
  • Warn: Shutdown timeout exceeded

Metrics

Monitor these signals for archiver health:

  • Jobs with MonitoringStatusArchivingFailed
  • Time from job stop to successful archive
  • Shutdown timeout occurrences

Thread Safety

The exported functions have the following concurrency guarantees:

  • Start() - Call once
  • TriggerArchiving() - Safe to call from multiple goroutines
  • Shutdown() - Call once

Internal state is protected by:

  • Channel synchronization (archiveChannel)
  • WaitGroup for pending count (archivePending)
  • Context for cancellation (shutdownCtx)

Files

  • archiveWorker.go: Worker lifecycle, channel management, shutdown logic
  • archiver.go: Core archiving logic, metric loading, statistics calculation

Dependencies

  • internal/repository: Database operations for job metadata
  • internal/metricdispatch: Loading metric data from various backends
  • pkg/archive: Archive backend abstraction (filesystem, S3, SQLite)
  • cc-lib/schema: Job and metric data structures