Rename folder and update config

2026-03-17 22:17:30 +01:00 · 2024-11-29 11:30:52 +01:00
parent 0d4e698846
commit be661e4e15
66 changed files with 10741 additions and 3284 deletions
--- a/nhr@fau/README.md
+++ b/nhr@fau/README.md
@@ -0,0 +1,37 @@
+# ClusterCockpit at NHR@FAU
+
+NHR@FAU provides a production instance of ClusterCockpit for support personnel
+and users. Authentication is handled either by an LDAP directory or by our HPC
+Portal (a homegrown account management platform) using JWTs.
+
+An overview of all clusters is available in the
+[NHR@FAU cluster documentation](https://doc.nhr.fau.de/clusters/overview/).
+
+Some systems run with job-exclusive nodes, while others have node sharing
+enabled. There are CPU systems (Fritz, Meggie, Woody, TinyFat) as well as
+GPU-accelerated clusters (Alex, TinyGPU).
+
+NHR@FAU uses the following stack:
+
+* `cc-metric-collector` as node agent
+* `cc-metric-store` as temporary cache for metric time series. We use one
+instance for all clusters.
+* `cc-backend`
+* A homegrown Python script running on the management nodes that provides job
+metadata from Slurm
+* Built-in SQLite database for job metadata and user data (currently 50 GB)
+* Job archive without retention using compressed data.json files (around 700 GB)
+
+Currently all APIs use regular HTTP, but we plan to switch to NATS for all
+communication.
+We also push the metric data to an InfluxDB instance for debugging purposes.
+
+The backend and metric store run on the same dedicated Dell server running
+Ubuntu Linux:
+
+* Two Intel Xeon Platinum 8352Y CPUs with 32 cores each
+* 512 GB of main memory
+* An NVMe RAID with two 7 TB disks
+
+This configuration is probably complete overkill, but we wanted to be on the
+safe side.