# File-based archive specification for HPC jobs

This is a JSON-file-based exchange format for HPC job metadata and performance metric data.

It consists of two parts:
* a SQLite database schema for job metadata and performance statistics
* a JSON file format together with a directory hierarchy specification

An open, portable, and simple file-based specification makes it possible to
exchange job performance data for research and analysis purposes, and it
provides a robust way to archive job performance data on disk.

## Directory hierarchy and file specification

The job archive has one top-level directory per cluster, named after the
cluster. Every cluster directory must contain one file named `cluster.json`
describing the cluster. The JSON schema for this file is described here. Within
each cluster directory a three-level directory tree is used to organize job files.
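
As a rough illustration of the layout rule above, the following Python sketch lists the cluster directories of an archive and reports those that lack a `cluster.json` file. The function name `clusters_missing_description` and its `archive_root` argument are hypothetical and not part of the specification; the sketch also assumes that every top-level directory is a cluster directory.

```python
from pathlib import Path


def clusters_missing_description(archive_root):
    """Return the names of cluster directories that lack a cluster.json file.

    Assumption: every directory directly below archive_root is a cluster
    directory, as described in the specification.
    """
    root = Path(archive_root)
    missing = []
    for entry in sorted(root.iterdir()):
        # Plain files at the top level are ignored; only directories count.
        if entry.is_dir() and not (entry / "cluster.json").is_file():
            missing.append(entry.name)
    return missing
```

A check like this could be run before exchanging an archive to catch incomplete cluster directories early.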

To limit the number of entries within a single directory, a tree approach is
used that splits the integer job ID into chunks of 1000.

For a two-level scheme this can be achieved with (code example in Perl):

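```perl
# Split the integer job ID into two directory levels of at most 1000 entries each.
$level1 = int($jobID / 1000);   # leading digits, e.g. 1034 for job ID 1034871
$level2 = $jobID % 1000;        # last three digits, zero-padded by %03d below
$dstPath = sprintf("%s/%s/%d/%03d", $trunk, $destdir, $level1, $level2);
```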

The last directory level is the Unix epoch timestamp in seconds, which keeps
job directories distinct when job IDs overflow and are reused.

Example:

For the job ID 1034871 the directory path is `./1034/871/<timestamp>/`.
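
The same mapping can be sketched in Python, including the timestamp level. This is a minimal sketch, not a reference implementation: the function name `job_dir` is hypothetical, the returned path is relative to the cluster directory, and the timestamp used in the usage example is an arbitrary illustrative value.

```python
def job_dir(job_id, timestamp):
    """Build the three-level job directory path relative to the cluster directory.

    job_id    -- integer job ID
    timestamp -- Unix epoch timestamp in seconds (integer)
    """
    level1 = job_id // 1000   # leading digits of the job ID
    level2 = job_id % 1000    # last three digits of the job ID
    # The second level is zero-padded to three digits, matching %03d in the Perl example.
    return f"{level1}/{level2:03d}/{timestamp}"


# Usage with an arbitrary, illustrative timestamp:
print(job_dir(1034871, 1647613144))  # 1034/871/1647613144
```

Integer division (`//`) is used so that the first level is always a whole number, independent of how the result is later formatted.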

The job data consists of two files:

* meta.json: Contains job metadata and job statistics.
* data.json: Contains the complete job data, including time series.
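
Reading a job from the archive then amounts to loading these two files from the job directory. The sketch below assumes only the two file names given above; the function name `load_job` is hypothetical, and the content of both files is returned as parsed JSON without any schema validation.

```python
import json
from pathlib import Path


def load_job(job_dir):
    """Load meta.json and data.json from one job directory.

    Returns a (meta, data) tuple of parsed JSON values. No validation against
    the JSON schema is performed in this sketch.
    """
    job_path = Path(job_dir)
    with open(job_path / "meta.json", encoding="utf-8") as f:
        meta = json.load(f)
    with open(job_path / "data.json", encoding="utf-8") as f:
        data = json.load(f)
    return meta, data
```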

The JSON file format is specified as a JSON schema.

Metric time series data is stored with a fixed time step. The time step can be
set per metric. If no value is available for a timestamp in a metric time
series, `null` must be entered.
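
To illustrate the fixed time step and the `null` rule, the following Python sketch resamples sparse measurements onto a regular grid and serializes the result. The function name `to_fixed_step`, its parameters, and the sample values are hypothetical; the sketch shows only a bare list of values and does not reproduce the field layout defined by the JSON schema.

```python
import json


def to_fixed_step(samples, start, stop, step):
    """Place samples on a fixed-step time grid, using None for missing values.

    samples -- dict mapping integer timestamps (seconds) to measured values
    start   -- first timestamp of the grid (inclusive)
    stop    -- end of the grid (exclusive)
    step    -- time step of the metric in seconds
    """
    # dict.get returns None for timestamps without a measurement.
    return [samples.get(t) for t in range(start, stop, step)]


# The value for timestamp 60 is missing and becomes null in the JSON output.
series = to_fixed_step({0: 1.5, 120: 2.5}, start=0, stop=180, step=60)
print(json.dumps(series))  # [1.5, null, 2.5]
```

Because Python's `json` module serializes `None` as `null`, gaps keep their position in the series and the time step stays constant.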