cc-backend/docs/Job-Archive.md

The job archive specifies an exchange format for job meta and performance metric
data. It consists of two parts:
* a [SQLite database schema](https://github.com/ClusterCockpit/cc-backend/wiki/Job-Archive#sqlite-database-schema) for job meta data and performance statistics
* a [JSON file format](https://github.com/ClusterCockpit/cc-backend/wiki/Job-Archive#json-file-format) together with a [directory hierarchy specification](https://github.com/ClusterCockpit/cc-backend/wiki/Job-Archive#directory-hierarchy-specification)

Because the specification is open, portable, simple and based on files, it can
be used to exchange job performance data for research and analysis purposes as
well as to archive job performance data robustly to disk.

# SQLite database schema
## Introduction

A SQLite 3 database schema is provided to standardize the job meta data
information in a portable way. The schema also includes optional columns for job
performance statistics (called a job performance footprint). The database acts
as a front end to filter and select subsets of job IDs, which are the keys to
the full job performance data stored in the directory hierarchy.

## Database schema

The schema includes 3 tables: the job table, a tag table and a jobtag table
representing the many-to-many relation between jobs and tags. The SQL schema is
specified in
[jobs-sqlite.sql](https://github.com/ClusterCockpit/cc-specifications/blob/master/schemas/jobs-sqlite.sql).
The columns, including the JSON datatypes, are documented in
[job-meta.schema.json](https://github.com/ClusterCockpit/cc-specifications/blob/master/datastructures/job-meta.schema.json).
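
The front-end role of the database can be sketched as follows. This is only an illustration and not the actual schema: the three-table layout (job, tag, jobtag) follows the description above, but every column name (`job_id`, `user_name`, `tag_name`) and all sample rows are assumptions made for this example. The authoritative definitions are in the linked SQL file.

```python
import sqlite3

# Hypothetical, heavily reduced schema. Only the three-table layout
# follows the text; all column names are assumptions for illustration.
SCHEMA = """
CREATE TABLE job (id INTEGER PRIMARY KEY, job_id INTEGER, user_name TEXT);
CREATE TABLE tag (id INTEGER PRIMARY KEY, tag_name TEXT);
CREATE TABLE jobtag (
    job_id INTEGER REFERENCES job(id),
    tag_id INTEGER REFERENCES tag(id),
    PRIMARY KEY (job_id, tag_id)
);
"""


def job_ids_with_tag(conn, tag_name):
    """Return the sorted job IDs of all jobs carrying the given tag."""
    rows = conn.execute(
        """
        SELECT job.job_id
        FROM job
        JOIN jobtag ON jobtag.job_id = job.id
        JOIN tag ON tag.id = jobtag.tag_id
        WHERE tag.tag_name = ?
        ORDER BY job.job_id
        """,
        (tag_name,),
    ).fetchall()
    return [row[0] for row in rows]


# Build an in-memory database with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO job VALUES (?, ?, ?)",
    [(1, 1034871, "alice"), (2, 1034872, "bob")],
)
conn.executemany(
    "INSERT INTO tag VALUES (?, ?)",
    [(1, "bandwidth-bound"), (2, "debug")],
)
conn.executemany("INSERT INTO jobtag VALUES (?, ?)", [(1, 1), (2, 1), (2, 2)])

selected = job_ids_with_tag(conn, "debug")
```

The resulting job IDs are then used to locate the full performance data on disk.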

# Directory hierarchy specification

## Specification

To limit the number of directories within a single directory, a tree approach is
used that splits the integer job ID into chunks of 1000. Usually 2 layers of
directories are sufficient, but the concept can be used for an arbitrary number
of layers.

For a 2 layer schema this can be achieved with (code example in Perl):
``` perl
# Integer division gives the first level, the remainder the second.
my $level1 = int($jobID / 1000);
my $level2 = $jobID % 1000;
my $dstPath = sprintf("%s/%s/%d/%03d", $trunk, $destdir, $level1, $level2);
```

## Example

For the job ID 1034871 the directory path is `./1034/871/`.
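
The same 2 layer split can be written in Python. This is a minimal sketch of the scheme described above; the function name `job_dir` and the `root` parameter are assumptions chosen for this example and are not part of the specification.

```python
def job_dir(job_id: int, root: str = ".") -> str:
    """Return the 2 layer archive directory for an integer job ID.

    The first level is the job ID divided by 1000 (integer division),
    the second level is the remainder, zero-padded to three digits.
    """
    level1, level2 = divmod(job_id, 1000)
    return f"{root}/{level1}/{level2:03d}/"


example_path = job_dir(1034871)
```

Here `example_path` is `./1034/871/`, matching the example above. Zero-padding the second level mirrors the `%03d` format of the Perl snippet, so a job ID such as 1034005 maps to `./1034/005/`.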

# JSON file format
## Overview

Every cluster must be configured in a `cluster.json` file.

The job data consists of two files:
* `meta.json`: Contains job meta information and job statistics.
* `data.json`: Contains complete job data with time series.

The format is specified as a [JSON schema](https://json-schema.org/) file. The
latest version of the JSON schema is part of the `cc-backend` source tree. For
external reference it is also available in a separate repository.

## Specification `cluster.json`

The JSON schema specification is available
[here](https://github.com/ClusterCockpit/cc-specifications/blob/master/datastructures/cluster.schema.json).

## Specification `meta.json`

The JSON schema specification is available
[here](https://github.com/RRZE-HPC/HPCJobDatabase/blob/master/json-schema/job-meta.schema.json).

## Specification `data.json`

The JSON schema specification is available
[here](https://github.com/RRZE-HPC/HPCJobDatabase/blob/master/json-schema/job-data.schema.json).
Metric time series data is stored with a fixed time step, which is set per
metric. If no value is available for a timestamp in a metric time series,
`null` is entered in its place.
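
The handling of missing values can be sketched as follows. The helper `to_fixed_step` and the field names `timestep` and `data` are assumptions made for this illustration; the actual field names are defined in the linked JSON schema.

```python
import json


def to_fixed_step(start, timestep, count, samples):
    """Place samples on a fixed time grid.

    `samples` maps timestamps to values. Grid points without a sample
    become None, which the json module serializes as null.
    """
    return [samples.get(start + i * timestep) for i in range(count)]


# Made-up measurements with a 60 second time step; the value at t=120 is missing.
samples = {0: 1.5, 60: 1.7, 180: 1.6}
series = to_fixed_step(start=0, timestep=60, count=4, samples=samples)

# Hypothetical container object; only the null handling is the point here.
encoded = json.dumps({"timestep": 60, "data": series})
```

Because every position in the list corresponds to exactly one time step, a reader can reconstruct the timestamps from the start time and the time step alone, and a `null` entry marks a gap without shifting the later values.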