# Importer Package

The `importer` package provides functionality for importing job data into the ClusterCockpit database from archived job files.

## Overview

This package supports two primary import workflows:

1. **Bulk Database Initialization** - Reinitialize the entire job database from archived jobs
2. **Individual Job Import** - Import specific jobs from metadata/data file pairs

Both workflows enrich job metadata by calculating performance footprints and energy consumption metrics before persisting to the database.

## Main Entry Points
### InitDB()

Reinitializes the job database from all archived jobs.

```go
if err := importer.InitDB(); err != nil {
	log.Fatal(err)
}
```

This function:
- Flushes existing job, tag, and jobtag tables
- Iterates through all jobs in the configured archive
- Enriches each job with calculated metrics
- Inserts jobs into the database in batched transactions (100 jobs per batch)
- Continues on individual job failures, logging errors

**Use Case**: Initial database setup or complete database rebuild from archive.
### HandleImportFlag(flag string)

Imports jobs from specified file pairs.

```go
// Format: "<meta.json>:<data.json>[,<meta2.json>:<data2.json>,...]"
flag := "/path/to/meta.json:/path/to/data.json"
if err := importer.HandleImportFlag(flag); err != nil {
	log.Fatal(err)
}
```

This function:
- Parses the comma-separated file pairs
- Validates metadata and job data against schemas (if validation enabled)
- Enriches each job with footprints and energy metrics
- Imports jobs into both the archive and database
- Fails fast on the first error

**Use Case**: Importing specific jobs from external sources or manual job additions.
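The pair syntax can be taken apart with plain string operations. A minimal sketch of the format (illustration only, not the package's internal parsing; it reuses the `flag` value from the example above):

```go
for _, pair := range strings.Split(flag, ",") {
	parts := strings.SplitN(pair, ":", 2)
	if len(parts) != 2 {
		log.Fatalf("invalid pair %q, expected <meta.json>:<data.json>", pair)
	}
	metaFile, dataFile := parts[0], parts[1]
	fmt.Printf("meta: %s, data: %s\n", metaFile, dataFile)
}
```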
## Job Enrichment

Both import workflows use `enrichJobMetadata()` to calculate:

### Performance Footprints

Performance footprints are calculated from metric averages based on the subcluster configuration:

```go
job.Footprint["mem_used_avg"] = 45.2 // GB
job.Footprint["cpu_load_avg"] = 0.87 // percentage
```
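As a rough sketch of the idea, each footprint entry holds a job-wide statistic of a metric, e.g. the mean of the per-node averages, stored under a `<metric>_avg` key. The helper below is hypothetical and only illustrates that calculation; it is not part of the package:

```go
// footprintAvg stores the mean of the per-node averages of a metric
// under the "<metric>_avg" footprint key (hypothetical helper).
func footprintAvg(footprint map[string]float64, metric string, nodeAvgs []float64) {
	if len(nodeAvgs) == 0 {
		return
	}
	sum := 0.0
	for _, v := range nodeAvgs {
		sum += v
	}
	footprint[metric+"_avg"] = sum / float64(len(nodeAvgs))
}
```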
### Energy Metrics

Energy consumption is calculated from power metrics using the formula:

```
Energy (kWh) = (Power (W) × Duration (s) / 3600) / 1000
```

For each energy metric:
```go
job.EnergyFootprint["acc_power"] = 12.5 // kWh
job.Energy = 150.2 // Total energy in kWh
```
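Expressed in Go, the formula amounts to the following (example values; `powerAvg` and `duration` are assumed inputs for this sketch, not fields of the package):

```go
powerAvg := 3000.0      // average power in W (example value)
duration := int64(7200) // job duration in s (example value)

// Energy (kWh) = (Power (W) × Duration (s) / 3600) / 1000
energyKWh := (powerAvg * float64(duration) / 3600.0) / 1000.0 // = 6.0 kWh
```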
**Note**: Energy calculations for metrics with unit "energy" (Joules) are not yet implemented.

## Data Validation

### SanityChecks(job *schema.Job)

Validates job metadata before database insertion:

- Cluster exists in configuration
- Subcluster is valid (assigns if needed)
- Job state is valid
- Resources and user fields are populated
- Node counts and hardware thread counts are positive
- Resource count matches declared node count
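A usage sketch follows; the `error` return value is an assumption, as the signature above only documents the parameter:

```go
// job is assumed to be a *schema.Job populated from an archive or import file.
if err := importer.SanityChecks(job); err != nil {
	// Reject the job before any database insert is attempted.
	return err
}
```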
## Normalization Utilities

The package includes utilities for normalizing metric values to appropriate SI prefixes:

### Normalize(avg float64, prefix string)

Adjusts values and SI prefixes for readability:

```go
factor, newPrefix := importer.Normalize(2048.0, "M")
// Converts 2048 MB → ~2.0 GB
// Returns: factor for conversion, "G"
```

This is useful for automatically scaling metrics (e.g., memory, storage) to human-readable units.
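Assuming the returned factor is the multiplier that rescales the original value, applying it looks like this:

```go
value := 2048.0
factor, newPrefix := importer.Normalize(value, "M")
normalized := value * factor // ≈ 2.0 when newPrefix == "G"
```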
## Dependencies
|
||||
|
||||
- `github.com/ClusterCockpit/cc-backend/internal/repository` - Database operations
|
||||
- `github.com/ClusterCockpit/cc-backend/pkg/archive` - Job archive access
|
||||
- `github.com/ClusterCockpit/cc-lib/schema` - Job schema definitions
|
||||
- `github.com/ClusterCockpit/cc-lib/ccLogger` - Logging
|
||||
- `github.com/ClusterCockpit/cc-lib/ccUnits` - SI unit handling
|
||||
|
||||
## Error Handling
|
||||
|
||||
- **InitDB**: Continues processing on individual job failures, logs errors, returns summary
|
||||
- **HandleImportFlag**: Fails fast on first error, returns immediately
|
||||
- Both functions log detailed error context for debugging
|
||||
|
||||
## Performance

- **Transaction Batching**: InitDB processes jobs in batches of 100 for optimal database performance
- **Tag Caching**: Tag IDs are cached during import to minimize database queries
- **Progress Reporting**: InitDB prints progress updates during bulk operations
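A generic sketch of the batching pattern described above (the function, its parameters, and `insertOne` are placeholders for illustration; this is not the repository API the package actually uses):

```go
// batchInsert commits jobs in groups of 100 within one transaction each and
// reuses a tag-name → tag-ID cache across batches (uses database/sql and log).
func batchInsert(db *sql.DB, jobs []any,
	insertOne func(tx *sql.Tx, job any, tagCache map[string]int64) error) error {

	const batchSize = 100
	tagCache := make(map[string]int64) // tag name -> tag ID, reused across batches

	for start := 0; start < len(jobs); start += batchSize {
		end := start + batchSize
		if end > len(jobs) {
			end = len(jobs)
		}

		tx, err := db.Begin()
		if err != nil {
			return err
		}
		for _, job := range jobs[start:end] {
			if err := insertOne(tx, job, tagCache); err != nil {
				log.Printf("skipping job: %v", err) // log and continue, like InitDB
			}
		}
		if err := tx.Commit(); err != nil {
			return err
		}
	}
	return nil
}
```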