# Importer Package

The `importer` package provides functionality for importing job data into the ClusterCockpit database from archived job files.

## Overview

This package supports two primary import workflows:

1. **Bulk Database Initialization** - Reinitialize the entire job database from archived jobs
2. **Individual Job Import** - Import specific jobs from metadata/data file pairs

Both workflows enrich job metadata by calculating performance footprints and energy consumption metrics before persisting to the database.

## Main Entry Points
### InitDB()

Reinitializes the job database from all archived jobs.

```go
if err := importer.InitDB(); err != nil {
	log.Fatal(err)
}
```

This function:
- Flushes existing job, tag, and jobtag tables
- Iterates through all jobs in the configured archive
- Enriches each job with calculated metrics
- Inserts jobs into the database in batched transactions (100 jobs per batch)
- Continues on individual job failures, logging errors

**Use Case**: Initial database setup or complete database rebuild from archive.
### HandleImportFlag(flag string)

Imports jobs from specified file pairs.

```go
// Format: "<meta.json>:<data.json>[,<meta2.json>:<data2.json>,...]"
flag := "/path/to/meta.json:/path/to/data.json"
if err := importer.HandleImportFlag(flag); err != nil {
	log.Fatal(err)
}
```

This function:
- Parses the comma-separated file pairs
- Validates metadata and job data against schemas (if validation enabled)
- Enriches each job with footprints and energy metrics
- Imports jobs into both the archive and database
- Fails fast on the first error

**Use Case**: Importing specific jobs from external sources or manual job additions.
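The pair syntax can be taken apart with plain string operations. A minimal sketch of the format (illustration only, not the package's internal parsing; it reuses the `flag` value from the example above):

```go
for _, pair := range strings.Split(flag, ",") {
	parts := strings.SplitN(pair, ":", 2)
	if len(parts) != 2 {
		log.Fatalf("invalid pair %q, expected <meta.json>:<data.json>", pair)
	}
	metaFile, dataFile := parts[0], parts[1]
	fmt.Printf("meta: %s, data: %s\n", metaFile, dataFile)
}
```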
## Job Enrichment

Both import workflows use `enrichJobMetadata()` to calculate:

### Performance Footprints

Performance footprints are calculated from metric averages based on the subcluster configuration:

```go
job.Footprint["mem_used_avg"] = 45.2 // GB
job.Footprint["cpu_load_avg"] = 0.87 // percentage
```
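As a rough sketch of the idea, each footprint entry holds a job-wide statistic of a metric, e.g. the mean of the per-node averages, stored under a `<metric>_avg` key. The helper below is hypothetical and only illustrates that calculation; it is not part of the package:

```go
// footprintAvg stores the mean of the per-node averages of a metric
// under the "<metric>_avg" footprint key (hypothetical helper).
func footprintAvg(footprint map[string]float64, metric string, nodeAvgs []float64) {
	if len(nodeAvgs) == 0 {
		return
	}
	sum := 0.0
	for _, v := range nodeAvgs {
		sum += v
	}
	footprint[metric+"_avg"] = sum / float64(len(nodeAvgs))
}
```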
### Energy Metrics

Energy consumption is calculated from power metrics using the formula:

```
Energy (kWh) = (Power (W) × Duration (s) / 3600) / 1000
```

For each energy metric:
```go
job.EnergyFootprint["acc_power"] = 12.5 // kWh
job.Energy = 150.2 // Total energy in kWh
```
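Expressed in Go, the formula amounts to the following (example values; `powerAvg` and `duration` are assumed inputs for this sketch, not fields of the package):

```go
powerAvg := 3000.0      // average power in W (example value)
duration := int64(7200) // job duration in s (example value)

// Energy (kWh) = (Power (W) × Duration (s) / 3600) / 1000
energyKWh := (powerAvg * float64(duration) / 3600.0) / 1000.0 // = 6.0 kWh
```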
**Note**: Energy calculations for metrics with unit "energy" (Joules) are not yet implemented.

## Data Validation

### SanityChecks(job *schema.Job)

Validates job metadata before database insertion:

- Cluster exists in configuration
- Subcluster is valid (assigns if needed)
- Job state is valid
- Resources and user fields are populated
- Node counts and hardware thread counts are positive
- Resource count matches declared node count
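A usage sketch follows; the `error` return value is an assumption, as the signature above only documents the parameter:

```go
// job is assumed to be a *schema.Job populated from an archive or import file.
if err := importer.SanityChecks(job); err != nil {
	// Reject the job before any database insert is attempted.
	return err
}
```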
## Normalization Utilities

The package includes utilities for normalizing metric values to appropriate SI prefixes:

### Normalize(avg float64, prefix string)

Adjusts values and SI prefixes for readability:

```go
factor, newPrefix := importer.Normalize(2048.0, "M")
// Converts 2048 MB → ~2.0 GB
// Returns: factor for conversion, "G"
```

This is useful for automatically scaling metrics (e.g., memory, storage) to human-readable units.
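Assuming the returned factor is the multiplier that rescales the original value, applying it looks like this:

```go
value := 2048.0
factor, newPrefix := importer.Normalize(value, "M")
normalized := value * factor // ≈ 2.0 when newPrefix == "G"
```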
## Dependencies
|
||||
|
||||
- `github.com/ClusterCockpit/cc-backend/internal/repository` - Database operations
|
||||
- `github.com/ClusterCockpit/cc-backend/pkg/archive` - Job archive access
|
||||
- `github.com/ClusterCockpit/cc-lib/schema` - Job schema definitions
|
||||
- `github.com/ClusterCockpit/cc-lib/ccLogger` - Logging
|
||||
- `github.com/ClusterCockpit/cc-lib/ccUnits` - SI unit handling
|
||||
|
||||
## Error Handling
|
||||
|
||||
- **InitDB**: Continues processing on individual job failures, logs errors, returns summary
|
||||
- **HandleImportFlag**: Fails fast on first error, returns immediately
|
||||
- Both functions log detailed error context for debugging
|
||||
|
||||
## Performance

- **Transaction Batching**: InitDB processes jobs in batches of 100 for optimal database performance
- **Tag Caching**: Tag IDs are cached during import to minimize database queries
- **Progress Reporting**: InitDB prints progress updates during bulk operations
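A generic sketch of the batching pattern described above (the function, its parameters, and `insertOne` are placeholders for illustration; this is not the repository API the package actually uses):

```go
// batchInsert commits jobs in groups of 100 within one transaction each and
// reuses a tag-name → tag-ID cache across batches (uses database/sql and log).
func batchInsert(db *sql.DB, jobs []any,
	insertOne func(tx *sql.Tx, job any, tagCache map[string]int64) error) error {

	const batchSize = 100
	tagCache := make(map[string]int64) // tag name -> tag ID, reused across batches

	for start := 0; start < len(jobs); start += batchSize {
		end := start + batchSize
		if end > len(jobs) {
			end = len(jobs)
		}

		tx, err := db.Begin()
		if err != nil {
			return err
		}
		for _, job := range jobs[start:end] {
			if err := insertOne(tx, job, tagCache); err != nil {
				log.Printf("skipping job: %v", err) // log and continue, like InitDB
			}
		}
		if err := tx.Commit(); err != nil {
			return err
		}
	}
	return nil
}
```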