# ClusterCockpit at NHR@FAU

NHR@FAU provides a production instance of ClusterCockpit for support personnel
and users. Authentication is via an LDAP directory as well as via our HPC Portal
(a homegrown account management platform) using JWT tokens.
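
As an illustration of the token-based login, here is a minimal sketch of minting
a JWT for cc-backend. The signing algorithm, key handling, and claim names are
assumptions (deployments may require asymmetric keys, e.g. EdDSA); consult the
cc-backend configuration for the values actually expected.

```python
# Hypothetical sketch: issue a short-lived JWT for a portal user.
# Assumes PyJWT (pip install pyjwt) and an HMAC shared secret; the real
# deployment may use asymmetric signing keys instead.
import time
import jwt  # PyJWT

SECRET = "replace-with-deployment-secret"  # assumption: shared HMAC secret

def mint_token(username: str, ttl_seconds: int = 3600) -> str:
    now = int(time.time())
    claims = {
        "sub": username,      # the HPC account name
        "iat": now,
        "exp": now + ttl_seconds,
    }
    return jwt.encode(claims, SECRET, algorithm="HS256")

if __name__ == "__main__":
    print(mint_token("hpcuser1"))
```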

You can find an overview of all clusters
[here](https://doc.nhr.fau.de/clusters/overview/).

Some systems run with job-exclusive nodes, others have node sharing enabled.
There are CPU systems (Fritz, Meggie, Woody, TinyFat) as well as GPU-accelerated
clusters (Alex, TinyGPU).

NHR@FAU uses the following stack:

* `cc-metric-collector` as the node agent
* `cc-metric-store` as the temporal metric time-series cache. We use one instance
  for all clusters.
* `cc-backend`
* A homegrown Python script running on the management nodes that provides job
  metadata from Slurm (see the sketch below)
* The built-in SQLite database for job metadata and user data (currently 50 GB)
* A job archive without retention, using compressed `data.json` files (around 700 GB)
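
The Slurm integration script mentioned above is site-specific and not published
here. As a rough sketch of the idea, the snippet below reads running jobs from
Slurm and forwards them to cc-backend's REST API. The endpoint path, payload
fields, Slurm output format, and token handling are assumptions and have to be
adapted to the cc-backend and Slurm versions in use.

```python
# Hypothetical sketch: forward running Slurm jobs to cc-backend.
# Endpoint path, payload fields, and authentication are assumptions.
from datetime import datetime
import subprocess
import requests

CC_BACKEND = "https://ccbackend.example.fau.de"  # placeholder URL
API_TOKEN = "replace-with-api-jwt"               # placeholder JWT

def running_jobs():
    """Yield (jobid, user, nnodes, start) for currently running jobs."""
    out = subprocess.run(
        ["sacct", "-X", "-n", "-P", "--state=RUNNING",
         "-o", "JobID,User,NNodes,Start"],
        capture_output=True, check=True, text=True)
    for line in out.stdout.splitlines():
        jobid, user, nnodes, start = line.split("|")
        yield jobid, user, int(nnodes), datetime.fromisoformat(start)

def report_job_start(jobid, user, nnodes, start):
    payload = {
        "jobId": int(jobid),
        "user": user,
        "cluster": "fritz",                      # example cluster name
        "numNodes": nnodes,
        "startTime": int(start.timestamp()),
    }
    resp = requests.post(f"{CC_BACKEND}/api/jobs/start_job/",
                         json=payload,
                         headers={"Authorization": f"Bearer {API_TOKEN}"})
    resp.raise_for_status()

if __name__ == "__main__":
    for job in running_jobs():
        report_job_start(*job)
```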

Currently all APIs use the regular HTTP protocol, but we plan to switch to NATS for
all communication.
We also push the metric data to an InfluxDB instance for debugging purposes.
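
For reference, that debugging push boils down to writing InfluxDB line protocol
over HTTP, roughly as sketched below. In our setup the forwarding is done by the
collector/store sinks, not by a script; URL, organisation, bucket, token, and
metric names here are placeholders.

```python
# Hypothetical sketch: write one metric sample to an InfluxDB 2.x instance
# via the HTTP line-protocol endpoint. All connection details are placeholders.
import time
import requests

INFLUX_URL = "https://influx.example.fau.de"   # placeholder
ORG = "nhr"                                    # placeholder organisation
BUCKET = "ccdebug"                             # placeholder bucket
TOKEN = "replace-with-influx-token"            # placeholder token

def write_sample(cluster: str, host: str, metric: str, value: float) -> None:
    # InfluxDB line protocol: measurement,tags fields timestamp
    line = (f"{metric},cluster={cluster},hostname={host} "
            f"value={value} {int(time.time())}")
    resp = requests.post(
        f"{INFLUX_URL}/api/v2/write",
        params={"org": ORG, "bucket": BUCKET, "precision": "s"},
        headers={"Authorization": f"Token {TOKEN}"},
        data=line)
    resp.raise_for_status()

if __name__ == "__main__":
    write_sample("fritz", "f0001", "cpu_load", 42.0)
```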

The backend and metric store run on the same dedicated Dell server running
Ubuntu Linux:

* Two Intel Xeon(R) Platinum 8352Y processors with 32 cores each
* 512 GB of main memory
* An NVMe RAID with two 7 TB disks

This configuration is probably complete overkill, but we wanted to be on the
safe side.