# ClusterCockpit at NHR@FAU

NHR@FAU provides a production instance of ClusterCockpit for support personnel
and users. Authentication is handled via an LDAP directory as well as via our
HPC Portal (a homegrown account management platform) using JWT tokens.
You can find an overview of all clusters
[here](https://doc.nhr.fau.de/clusters/overview/).

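Both authenticators are configured in cc-backend's `config.json`. The snippet
below is only a rough sketch with placeholder values, not our production
configuration; the exact key names and available options are described in the
cc-backend configuration documentation.

```json
{
  "ldap": {
    "url": "ldaps://ldap.example.org",
    "user_base": "ou=people,dc=example,dc=org",
    "search_dn": "cn=monitoring,ou=services,dc=example,dc=org",
    "user_bind": "uid={username},ou=people,dc=example,dc=org",
    "user_filter": "(&(objectclass=posixAccount)(uid=*))"
  },
  "jwts": {
    "max-age": "2000h"
  }
}
```
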
Some systems run with job-exclusive nodes, while others have node sharing
enabled. There are CPU systems (Fritz, Meggie, Woody, TinyFat) as well as
GPU-accelerated clusters (Alex, TinyGPU).

NHR@FAU uses the following stack:

* `cc-metric-collector` as node agent
* `cc-metric-store` as a temporal metric time series cache. We use one
  instance for all clusters.
* `cc-backend`
* A homegrown Python script running on the management nodes that provides job
  metadata from Slurm (see the sketch below)
* The built-in SQLite database for job metadata and user data (currently 50 GB
  in size)
* Job archive without retention, using compressed data.json files (around
  700 GB)

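To give an idea of how job metadata reaches cc-backend, the sketch below
illustrates one possible approach; it is not the actual script. It reads
running jobs via `squeue --json` and announces them to an assumed cc-backend
REST endpoint using a JWT. The endpoint path, payload fields, and Slurm JSON
field names are assumptions and should be checked against the cc-backend REST
API documentation and your Slurm version.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a Slurm-to-ClusterCockpit job forwarder.

Not the actual NHR@FAU script. Endpoint path, payload fields, and the Slurm
JSON field names are assumptions and differ between versions.
"""
import json
import os
import subprocess

import requests

CC_BACKEND = os.environ.get("CC_BACKEND_URL", "https://monitoring.example.org")
CC_JWT = os.environ["CC_API_JWT"]  # API token issued by cc-backend


def running_jobs():
    """Return the list of jobs reported by Slurm (requires `squeue --json`)."""
    out = subprocess.run(["squeue", "--json"], capture_output=True, check=True)
    return json.loads(out.stdout)["jobs"]


def push_job_start(job):
    """Announce a started job to cc-backend (assumed endpoint and payload)."""
    payload = {
        "jobId": job["job_id"],
        "user": job["user_name"],
        "cluster": job["cluster"],
        "startTime": job["start_time"],
        "numNodes": job["node_count"],
        # Slurm reports a compressed hostlist (e.g. "node[01-04]"); a real
        # script would expand it, e.g. via `scontrol show hostnames`.
        "resources": [{"hostname": h} for h in job["nodes"].split(",")],
    }
    resp = requests.post(
        f"{CC_BACKEND}/api/jobs/start_job/",
        headers={"Authorization": f"Bearer {CC_JWT}"},
        json=payload,
        timeout=10,
    )
    resp.raise_for_status()


if __name__ == "__main__":
    for job in running_jobs():
        # Field format differs between Slurm versions (string vs. list).
        if "RUNNING" in str(job.get("job_state", "")):
            push_job_start(job)
```

A production version would also report job completion via the corresponding
stop endpoint and keep track of which jobs have already been announced.
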
Currently all APIs use plain HTTP, but we plan to switch to NATS for all
communication.
We also push the metric data to an InfluxDB instance for debugging purposes.
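
Pushing to both the `cc-metric-store` and InfluxDB happens through the
collectors' sink configuration. The `sinks.json` sketch below only illustrates
the idea with placeholder values; the available sink types and their options
are documented in `cc-metric-collector`, so the field names shown here should
be treated as assumptions.

```json
{
  "metric-store": {
    "type": "http",
    "url": "https://monitoring.example.org:8082/api/write",
    "jwt": "<JWT issued by cc-backend>"
  },
  "influxdb-debug": {
    "type": "influxdb",
    "host": "influx.example.org",
    "port": "8086",
    "database": "ccdebug",
    "ssl": true
  }
}
```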

The backend and metric store run on the same dedicated Dell server running
Ubuntu Linux:

* Two Intel Xeon(R) Platinum 8352Y processors with 32 cores each
* 512 GB of main memory
* An NVMe RAID with two 7 TB disks

This configuration is probably complete overkill, but we wanted to be on the
safe side.