
Importer Package

The importer package imports job data from archived job files into the ClusterCockpit database.

Overview

This package supports two primary import workflows:

  1. Bulk Database Initialization - Reinitialize the entire job database from archived jobs
  2. Individual Job Import - Import specific jobs from metadata/data file pairs

Both workflows enrich job metadata by calculating performance footprints and energy consumption metrics before persisting to the database.

Main Entry Points

InitDB()

Reinitializes the job database from all archived jobs.

if err := importer.InitDB(); err != nil {
    log.Fatal(err)
}

This function:

  • Flushes existing job, tag, and jobtag tables
  • Iterates through all jobs in the configured archive
  • Enriches each job with calculated metrics
  • Inserts jobs into the database in batched transactions (100 jobs per batch)
  • Continues past individual job failures, logging each error (see the sketch below)

Use Case: Initial database setup or a complete database rebuild from the archive.
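
The batch-and-continue pattern can be sketched as follows; the helper names (iterArchiveJobs, enrich, insertBatch) are hypothetical stand-ins for the package's internals, not its actual API:

const batchSize = 100

batch := make([]*schema.Job, 0, batchSize)
for job := range iterArchiveJobs() { // hypothetical iterator over archived jobs
    if err := enrich(job); err != nil { // hypothetical enrichment step
        log.Printf("skipping job %d: %v", job.JobID, err) // log and continue
        continue
    }
    batch = append(batch, job)
    if len(batch) == batchSize {
        insertBatch(batch) // hypothetical batched transaction
        batch = batch[:0]
    }
}
if len(batch) > 0 {
    insertBatch(batch) // flush the final partial batch
}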

HandleImportFlag(flag string)

Imports jobs from specified file pairs.

// Format: "<meta.json>:<data.json>[,<meta2.json>:<data2.json>,...]"
flag := "/path/to/meta.json:/path/to/data.json"
if err := importer.HandleImportFlag(flag); err != nil {
    log.Fatal(err)
}

This function:

  • Parses the comma-separated file pairs (see the sketch below)
  • Validates metadata and job data against schemas (if validation enabled)
  • Enriches each job with footprints and energy metrics
  • Imports jobs into both the archive and database
  • Fails fast on the first error

Use Case: Importing specific jobs from external sources or manual job additions.
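
A minimal sketch of the pair parsing and fail-fast behavior, assuming only the documented flag format (not the package's exact code):

for _, pair := range strings.Split(flag, ",") {
    files := strings.Split(pair, ":")
    if len(files) != 2 {
        return fmt.Errorf("invalid import flag format: %q", pair) // fail fast
    }
    metaFile, dataFile := files[0], files[1]
    // read and validate metaFile/dataFile, enrich, then import ...
}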

Job Enrichment

Both import workflows use enrichJobMetadata() to calculate:

Performance Footprints

Performance footprints are calculated from metric averages based on the subcluster configuration:

job.Footprint["mem_used_avg"] = 45.2  // GB
job.Footprint["cpu_load_avg"] = 0.87   // percentage

Energy Metrics

Energy consumption is calculated from power metrics using the formula:

Energy (kWh) = (Power (W) × Duration (s) / 3600) / 1000
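
As a worked example, a job averaging 500 W for one hour consumes (500 × 3600 / 3600) / 1000 = 0.5 kWh. The formula as a Go helper (energyKWh is a hypothetical name, not part of the package API):

func energyKWh(avgPowerWatts float64, durationSeconds int64) float64 {
    // W × s = J; divide by 3600 for Wh, then by 1000 for kWh
    return avgPowerWatts * float64(durationSeconds) / 3600.0 / 1000.0
}

// energyKWh(500, 3600) == 0.5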

For each energy metric:

job.EnergyFootprint["acc_power"] = 12.5  // kWh
job.Energy = 150.2  // Total energy in kWh

Note: Energy calculations for metrics with unit "energy" (Joules) are not yet implemented.

Data Validation

SanityChecks(job *schema.Job)

Validates job metadata before database insertion:

  • Cluster exists in configuration
  • Subcluster is valid (assigned automatically if missing)
  • Job state is valid
  • Resources and user fields are populated
  • Node counts and hardware thread counts are positive
  • Resource count matches declared node count
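
A typical call site, following the signature above:

if err := importer.SanityChecks(&job); err != nil {
    log.Fatalf("invalid job %d: %v", job.JobID, err)
}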

Normalization Utilities

The package includes utilities for normalizing metric values to appropriate SI prefixes:

Normalize(avg float64, prefix string)

Adjusts values and SI prefixes for readability:

factor, newPrefix := importer.Normalize(2048.0, "M")
// Converts 2048 MB → ~2.0 GB
// Returns the conversion factor and the new prefix "G"

This is useful for automatically scaling metrics (e.g., memory, storage) to human-readable units.
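
A usage sketch, assuming the returned factor is meant to be multiplied into the original value:

avg := 2048.0
factor, prefix := importer.Normalize(avg, "M")
scaled := avg * factor                   // ≈ 2.0, with prefix "G"
fmt.Printf("%.1f %sB\n", scaled, prefix) // "2.0 GB"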

Dependencies

  • github.com/ClusterCockpit/cc-backend/internal/repository - Database operations
  • github.com/ClusterCockpit/cc-backend/pkg/archive - Job archive access
  • github.com/ClusterCockpit/cc-lib/schema - Job schema definitions
  • github.com/ClusterCockpit/cc-lib/ccLogger - Logging
  • github.com/ClusterCockpit/cc-lib/ccUnits - SI unit handling

Error Handling

  • InitDB: Continues when individual jobs fail, logs each error, and returns a summary
  • HandleImportFlag: Fails fast, returning on the first error
  • Both functions log detailed error context for debugging

Performance

  • Transaction Batching: InitDB processes jobs in batches of 100 to reduce per-insert transaction overhead
  • Tag Caching: Tag IDs are cached during import to minimize database queries (see the sketch below)
  • Progress Reporting: InitDB prints progress updates during bulk operations
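
The tag-ID cache amounts to a map keyed by tag type and name, consulted before any database lookup. A hypothetical sketch, where lookupOrCreateTag stands in for the repository call:

tagIDs := make(map[string]int64) // key: "<type>:<name>"

getTagID := func(typ, name string) (int64, error) {
    key := typ + ":" + name
    if id, ok := tagIDs[key]; ok {
        return id, nil // cache hit: no database round trip
    }
    id, err := lookupOrCreateTag(typ, name) // hypothetical repository call
    if err != nil {
        return 0, err
    }
    tagIDs[key] = id
    return id, nil
}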