# Importer Package

The `importer` package provides functionality for importing job data into the ClusterCockpit database from archived job files.

## Overview

This package supports two primary import workflows:

1. **Bulk Database Initialization** - Reinitialize the entire job database from archived jobs
2. **Individual Job Import** - Import specific jobs from metadata/data file pairs

Both workflows enrich job metadata by calculating performance footprints and energy consumption metrics before persisting to the database.

## Main Entry Points

### InitDB()

Reinitializes the job database from all archived jobs.
```go
if err := importer.InitDB(); err != nil {
	log.Fatal(err)
}
```

This function:

- Flushes existing job, tag, and jobtag tables
- Iterates through all jobs in the configured archive
- Enriches each job with calculated metrics
- Inserts jobs into the database in batched transactions (100 jobs per batch), as sketched below
- Continues on individual job failures, logging errors
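
The batching pattern looks roughly like the following sketch. This is illustrative only: `Job`, `enrich`, and `insertBatch` are hypothetical stand-ins for `schema.Job`, the package's enrichment step, and the repository's transactional batch insert.

```go
package sketch

import "log"

// Job stands in for schema.Job in this sketch.
type Job struct{ ID int64 }

// enrich and insertBatch are hypothetical stand-ins for the package's
// metadata enrichment and the repository's transactional batch insert.
func enrich(j *Job) error          { return nil }
func insertBatch(jobs []Job) error { return nil }

// importAll mirrors InitDB's control flow: enrich each job, collect
// jobs into batches of 100, and keep going when a single job fails.
func importAll(jobs <-chan Job) {
	const batchSize = 100
	batch := make([]Job, 0, batchSize)
	failed := 0
	for job := range jobs {
		if err := enrich(&job); err != nil {
			log.Printf("skipping job %d: %v", job.ID, err) // continue on failure
			failed++
			continue
		}
		batch = append(batch, job)
		if len(batch) == batchSize {
			if err := insertBatch(batch); err != nil {
				log.Printf("batch insert failed: %v", err)
			}
			batch = batch[:0] // reuse the backing array for the next batch
		}
	}
	if len(batch) > 0 { // flush the final partial batch
		if err := insertBatch(batch); err != nil {
			log.Printf("final batch insert failed: %v", err)
		}
	}
	log.Printf("import finished, %d jobs skipped", failed)
}
```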

**Use Case**: Initial database setup or complete database rebuild from archive.

### HandleImportFlag(flag string)

Imports jobs from specified file pairs.
```go
// Format: "<meta.json>:<data.json>[,<meta2.json>:<data2.json>,...]"
flag := "/path/to/meta.json:/path/to/data.json"
if err := importer.HandleImportFlag(flag); err != nil {
	log.Fatal(err)
}
```
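
The pair format in the comment above reduces to two `strings.Split` passes; a minimal sketch of the parsing, with a hypothetical helper name and illustrative error wording:

```go
package sketch

import (
	"fmt"
	"strings"
)

// parsePairs splits the flag into (metaFile, dataFile) pairs. Hypothetical
// helper; HandleImportFlag does this splitting internally.
func parsePairs(flag string) ([][2]string, error) {
	var pairs [][2]string
	for _, raw := range strings.Split(flag, ",") {
		parts := strings.Split(raw, ":")
		if len(parts) != 2 {
			return nil, fmt.Errorf("invalid import flag format: %q", raw)
		}
		pairs = append(pairs, [2]string{parts[0], parts[1]})
	}
	return pairs, nil
}
```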

This function:

- Parses the comma-separated file pairs
- Validates metadata and job data against schemas (if validation is enabled)
- Enriches each job with footprints and energy metrics
- Imports jobs into both the archive and the database
- Fails fast on the first error

**Use Case**: Importing specific jobs from external sources or manual job additions.

## Job Enrichment

Both import workflows use `enrichJobMetadata()` to calculate:

### Performance Footprints

Performance footprints are calculated from metric averages based on the subcluster configuration:
```go
job.Footprint["mem_used_avg"] = 45.2 // GB
job.Footprint["cpu_load_avg"] = 0.87 // percentage
```
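
Assembling that map might look like the sketch below, under two assumptions: the subcluster configuration yields a list of footprint metric names, and per-metric averages are already computed. `buildFootprint` is a hypothetical helper, not the package's API.

```go
// buildFootprint records the average of each configured footprint metric.
// Hypothetical sketch; the real logic reads the metric list from the
// subcluster configuration.
func buildFootprint(averages map[string]float64, footprintMetrics []string) map[string]float64 {
	fp := make(map[string]float64, len(footprintMetrics))
	for _, name := range footprintMetrics {
		if avg, ok := averages[name]; ok {
			fp[name+"_avg"] = avg
		}
	}
	return fp
}
```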

### Energy Metrics

Energy consumption is calculated from power metrics using the formula:
```
Energy (kWh) = (Power (W) × Duration (s) / 3600) / 1000
```
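
As a worked example, a 500 W average draw over a one-hour (3600 s) job gives (500 × 3600 / 3600) / 1000 = 0.5 kWh. The conversion is a one-liner in Go:

```go
// energyKWh applies the formula above to an average power draw in watts
// and a job duration in seconds.
func energyKWh(powerWatts, durationSeconds float64) float64 {
	return (powerWatts * durationSeconds / 3600.0) / 1000.0
}
```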

For each energy metric:

```go
job.EnergyFootprint["acc_power"] = 12.5 // kWh
job.Energy = 150.2 // Total energy in kWh
```

**Note**: Energy calculations for metrics with unit "energy" (Joules) are not yet implemented.

## Data Validation

### SanityChecks(job *schema.Job)

Validates job metadata before database insertion (see the sketch after this list):
- Cluster exists in configuration
- Subcluster is valid (assigned automatically if missing)
- Job state is valid
- Resources and user fields are populated
- Node counts and hardware thread counts are positive
- Resource count matches the declared node count
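
Shaped as code, the checks look roughly like this simplified sketch; `jobMeta` is a hypothetical local stand-in for `schema.Job`, and `knownClusters` stands in for the cluster configuration lookup.

```go
package sketch

import (
	"errors"
	"fmt"
)

// jobMeta mirrors the schema.Job fields these checks touch; simplified,
// hypothetical types for the sketch.
type jobMeta struct {
	Cluster      string
	User         string
	Resources    []string // one entry per allocated node (simplified)
	NumNodes     int
	NumHWThreads int
}

// sanityChecks is a simplified sketch of the validation, not the
// package's exact code.
func sanityChecks(job *jobMeta, knownClusters map[string]bool) error {
	if !knownClusters[job.Cluster] {
		return fmt.Errorf("no such cluster: %q", job.Cluster)
	}
	if job.User == "" || len(job.Resources) == 0 {
		return errors.New("user and resources must be set")
	}
	if job.NumNodes <= 0 || job.NumHWThreads <= 0 {
		return errors.New("node and hardware thread counts must be positive")
	}
	if len(job.Resources) != job.NumNodes {
		return fmt.Errorf("resource count %d does not match node count %d",
			len(job.Resources), job.NumNodes)
	}
	return nil
}
```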

## Normalization Utilities

The package includes utilities for normalizing metric values to appropriate SI prefixes:

### Normalize(avg float64, prefix string)

Adjusts values and SI prefixes for readability:
```go
factor, newPrefix := importer.Normalize(2048.0, "M")
// Converts 2048 MB → ~2.0 GB
// Returns: factor for conversion, "G"
```
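
Assuming base-1000 steps between SI prefixes (the package delegates the actual unit handling to `ccUnits`), the scaling could be sketched as:

```go
// normalizeValue is a simplified, hypothetical stand-in for
// importer.Normalize: it steps the SI prefix up in base-1000 increments
// until the value drops below 1000, returning the accumulated
// conversion factor together with the new prefix.
func normalizeValue(avg float64, prefix string) (float64, string) {
	prefixes := []string{"", "K", "M", "G", "T", "P"}
	idx := 0
	for i, p := range prefixes {
		if p == prefix {
			idx = i
			break
		}
	}
	factor := 1.0
	for avg >= 1000.0 && idx < len(prefixes)-1 {
		avg /= 1000.0
		factor /= 1000.0
		idx++
	}
	return factor, prefixes[idx]
}
```

With the inputs above, `normalizeValue(2048.0, "M")` returns `(0.001, "G")`, and 2048 × 0.001 ≈ 2.0 GB.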

This is useful for automatically scaling metrics (e.g., memory, storage) to human-readable units.

## Dependencies
- `github.com/ClusterCockpit/cc-backend/internal/repository` - Database operations
- `github.com/ClusterCockpit/cc-backend/pkg/archive` - Job archive access
- `github.com/ClusterCockpit/cc-lib/schema` - Job schema definitions
- `github.com/ClusterCockpit/cc-lib/ccLogger` - Logging
- `github.com/ClusterCockpit/cc-lib/ccUnits` - SI unit handling

## Error Handling

- **InitDB**: Continues processing on individual job failures, logs errors, and returns a summary
- **HandleImportFlag**: Fails fast and returns on the first error
- Both functions log detailed error context for debugging

## Performance

- **Transaction Batching**: InitDB processes jobs in batches of 100 for optimal database performance
- **Tag Caching**: Tag IDs are cached during import to minimize database queries (see the sketch below)
- **Progress Reporting**: InitDB prints progress updates during bulk operations
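
The tag cache amounts to memoizing tag IDs in a map keyed by the tag's type and name; a hypothetical sketch, where the `lookup` callback stands in for the repository query:

```go
// tagCache memoizes tag IDs so a tag seen on many jobs hits the
// database only once. Hypothetical sketch, not the repository's code.
type tagCache struct {
	ids    map[string]int64
	lookup func(tagType, tagName string) (int64, error) // DB query stand-in
}

func (c *tagCache) id(tagType, tagName string) (int64, error) {
	key := tagType + ":" + tagName
	if id, ok := c.ids[key]; ok {
		return id, nil // cache hit: no database round trip
	}
	id, err := c.lookup(tagType, tagName)
	if err != nil {
		return 0, err
	}
	c.ids[key] = id
	return id, nil
}
```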