Merge pull request from ClusterCockpit/dev

Preconfigured and updated docker services for CC components
This commit is contained in:
Jan Eitzinger 2025-01-31 07:09:38 +01:00 committed by GitHub
commit a945a21bc1
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
32 changed files with 1336 additions and 1559 deletions

30
.env

@ -2,15 +2,6 @@
# CCBACKEND DEVEL DOCKER SETTINGS
########################################################################
########################################################################
# SLURM
########################################################################
SLURM_VERSION=22.05.6
ARCH=aarch64
MUNGE_UID=981
SLURM_UID=982
WORKER_UID=1000
########################################################################
# INFLUXDB
########################################################################
@ -22,27 +13,6 @@ INFLUXDB_BUCKET=ClusterCockpit
# Whether or not to check SSL Cert in Symfony Client, Default: false
INFLUXDB_SSL=false
########################################################################
# MARIADB
########################################################################
MARIADB_ROOT_PASSWORD=root
MARIADB_DATABASE=ClusterCockpit
MARIADB_USER=clustercockpit
MARIADB_PASSWORD=clustercockpit
MARIADB_PORT=3306
#########################################
# LDAP
########################################################################
LDAP_ADMIN_PASSWORD=mashup
LDAP_ORGANISATION=NHR@FAU
LDAP_DOMAIN=rrze.uni-erlangen.de
########################################################################
# PHPMyAdmin
########################################################################
PHPMYADMIN_PORT=8081
########################################################################
# INTERNAL SETTINGS
########################################################################

5
.gitignore vendored

@ -3,6 +3,11 @@ data/job-archive/**
data/influxdb
data/sqldata
data/cc-metric-store
data/cc-metric-store-source
data/ldap
data/mariadb
data/slurm
data
cc-backend
cc-backend/**
.vscode

187
README.md Normal file → Executable file

@ -1,74 +1,175 @@
# cc-docker
This is a `docker-compose` setup that provides a quick-start environment for ClusterCockpit development and testing, using `cc-backend`.
A number of services is readily available as docker container (nats, cc-metric-store, InfluxDB, LDAP), or easily added by manual configuration (MySQL).
A number of services are readily available as docker containers (nats, cc-metric-store, InfluxDB, LDAP, SLURM), or easily added by manual configuration (MariaDB).
It includes the following containers:
* nats (Default)
* cc-metric-store (Default)
* influxdb (Default)
* openldap (Default)
* mysql (Optional)
* mariadb (Optional)
* phpmyadmin (Optional)
|Service full name|docker service name|port|
| --- | --- | --- |
|Slurm Controller service|slurmctld|6817|
|Slurm Database service|slurmdbd|6819|
|Slurm Rest service with JWT authentication|slurmrestd|6820|
|Slurm Worker|node01|6818|
|MariaDB service|mariadb|3306|
|InfluxDB service|influxdb|8086|
|NATS service|nats|4222, 6222, 8222|
|cc-metric-store service|cc-metric-store|8084|
|OpenLDAP|openldap|389, 636|
The setup comes with fixture data for a Job archive, cc-metric-store checkpoints, InfluxDB, MySQL, and a LDAP user directory.
The setup comes with fixture data for a job archive, cc-metric-store checkpoints, InfluxDB, MariaDB, and an LDAP user directory.
## Known Issues
## Prerequisites
* `docker-compose` installed on Ubuntu (18.04, 20.04) via `apt-get` can not correctly parse `docker-compose.yml` due to version differences. Install latest version of `docker-compose` from https://docs.docker.com/compose/install/ instead.
* You need to ensure that no other web server is running on ports 8080 (cc-backend), 8081 (phpmyadmin), 8084 (cc-metric-store), 8086 (InfluxDB), 4222 and 8222 (Nats), or 3306 (MySQL). If one or more ports are already in use, you have to adapt the related config accordingly.
* Existing VPN connections sometimes cause problems with docker. If `docker-compose` does not start up correctly, try disabling any active VPN connection. Refer to https://stackoverflow.com/questions/45692255/how-make-openvpn-work-with-docker for further information.
For all the docker services to work correctly, you will need the following tools installed:
## Configuration Templates
1. `docker` and `docker-compose`
2. `golang` (for compiling cc-metric-store)
3. `perl` (for migrateTimestamps.pl) with the Cpanel::JSON::XS, Data::Dumper, Time::Piece, Sort::Versions, and File::Slurp Perl modules.
4. `npm` (for cc-backend)
5. `make` (for building the slurm base image)
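A quick way to verify that these tools are available (a sketch; the Perl one-liner only checks that the listed modules load):
```
docker --version && docker-compose --version
go version
perl -MCpanel::JSON::XS -MData::Dumper -MTime::Piece -MSort::Versions -MFile::Slurp -e 'print "Perl modules OK\n"'
npm --version && make --version
```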
Located in `./templates`
* `docker-compose.yml.default`: Docker-Compose file to setup cc-metric-store, InfluxDB, MariaDB, PhpMyadmin, and LDAP containers (Default). Used in `setupDev.sh`.
* `docker-compose.yml.mysql`: Docker-Compose configuration template if MySQL is desired instead of MariaDB.
* `env.default`: Environment variables for setup with cc-metric-store, InfluxDB, MariaDB, PhpMyadmin, and LDAP containers (Default). Used in `setupDev.sh`.
* `env.mysql`: Additional environment variables required if MySQL is desired instead of MariaDB.
It is also recommended to add your user to the docker group, since the setupDev.sh script assumes that `docker` and `docker-compose` can be run without `sudo`.
You can use:
```
sudo groupadd docker
sudo usermod -aG docker $USER
# restart after adding your user to the docker group
sudo shutdown -r -t 0
```
Note: You can install all these dependencies via predefined installation steps in `prerequisite_installation_script.sh`.
If you are using a different Linux distribution, you will have to adapt `prerequisite_installation_script.sh` as well as `setupDev.sh`.
## Setup
1. Clone the `cc-backend` repository into the chosen base folder: `$> git clone https://github.com/ClusterCockpit/cc-backend.git`
2. Run `$ ./setupDev.sh`: **NOTICE** The script will download files of a total size of 338MB (mostly for the InfluxDB data).
2. Run `$ ./setupDev.sh`: **NOTICE** The script will download files of a total size of 338MB (mostly for the cc-metric-store data).
3. The setup-script launches the supporting container stack in the background automatically if everything went well. Run `$> ./cc-backend/cc-backend` to start `cc-backend.`
3. The setup-script launches the supporting container stack in the background automatically if everything went well. Run `$> ./cc-backend/cc-backend -server -dev` to start `cc-backend`.
4. By default, you can access `cc-backend` in your browser at `http://localhost:8080`. You can shut down the cc-backend server by pressing `CTRL-C`; remember to also shut down all containers via `$> docker-compose down` afterwards.
5. You can restart the containers with: `$> docker-compose up -d`. The full setup sequence is also sketched below.
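The steps above, collected into one shell sequence (a sketch, assuming you start inside the cc-docker checkout):
```
git clone https://github.com/ClusterCockpit/cc-backend.git
./setupDev.sh                            # downloads fixture data, builds and starts the containers
./cc-backend/cc-backend -server -dev     # cc-backend is then reachable at http://localhost:8080
# when done (after stopping cc-backend with CTRL-C):
docker-compose down
```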
## Post-Setup Adjustment for using `influxdb`
When using `influxdb` as a metric database, one must adjust the following files:
* `cc-backend/var/job-archive/emmy/cluster.json`
* `cc-backend/var/job-archive/woody/cluster.json`
In the JSON, replace the content of the `metricDataRepository` entry (by default configured for `cc-metric-store`) with:
```
"metricDataRepository": {
"kind": "influxdb",
"url": "http://localhost:8086",
"token": "egLfcf7fx0FESqFYU3RpAAbj",
"bucket": "ClusterCockpit",
"org": "ClusterCockpit",
"skiptls": false
}
```
## Usage
## Credentials for logging into ClusterCockpit
Credentials for the preconfigured demo user are:
* User: `demo`
* Password: `AdminDev`
* Password: `demo`
Credentials for the preconfigured LDAP user are:
* User: `ldapuser`
* Password: `ldapuser`
You can also log in as a regular user using any credentials from the LDAP user directory at `./data/ldap/users.ldif`.
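To inspect the LDAP directory directly, a minimal sketch (assuming `ldapsearch` is available inside the `openldap` container and using the admin password `mashup` from `docker-compose.yml`):
```
$ docker exec -it openldap ldapsearch -x -H ldap://localhost \
    -b "ou=users,dc=example,dc=com" \
    -D "cn=admin,dc=example,dc=com" -w mashup
```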
## Preconfigured setup between docker services and ClusterCockpit components
Once you have cloned the cc-backend repository and executed `setupDev.sh`, the script copies a preconfigured `config.json` from `misc/config.json` over `cc-backend/config.json`, which cc-backend uses once you start the server.
The preconfigured config.json attaches to:
#### 1. MariaDB docker service on port 3306 (database: ccbackend)
#### 2. OpenLDAP docker service on port 389
#### 3. cc-metric-store docker service on port 8084
cc-metric-store also has a preconfigured `config.json` in `cc-metric-store/config.json`, which attaches to the NATS docker service on port 4222 and subscribes to the subject 'hpc-nats'.
Basically, all the ClusterCockpit components and the docker services attach to each other like lego pieces.
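For reference, a trimmed excerpt of the preconfigured `misc/config.json` (shown in full further down in this diff) with those three attachment points:
```
{
  "db-driver": "mysql",
  "db": "root:root@tcp(0.0.0.0:3306)/ccbackend",
  "ldap": {
    "url": "ldap://0.0.0.0",
    "user_base": "ou=users,dc=example,dc=com"
  },
  "clusters": [
    {
      "metricDataRepository": {
        "kind": "cc-metric-store",
        "url": "http://0.0.0.0:8084"
      }
    }
  ]
}
```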
## Docker commands to access the services
> Note: You need to be in the cc-docker directory in order to execute any docker command.
You can view all docker processes running on any of the VM instances by using this command:
```
$ docker ps
```
If you want to manually access any of the running docker services, you have to run a **`bash`** shell inside the corresponding container.
> **`Example`**: If you want to run slurm commands like `sinfo`, `squeue`, or `scontrol` on the slurm controller, you cannot access it directly.
You need to **`bash`** into the running service by using the following command:
```
$ docker exec -it <docker service name> bash
#example
$ docker exec -it slurmctld bash
#or
$ docker exec -it mariadb bash
```
Once you have started a **`bash`** shell in a docker service, you can execute any service-related commands there.
For ClusterCockpit development, however, the exposed ports are usually all you need: access any docker service via `localhost:<port>`, and configure `cc-backend/config.json` based on these services and ports.
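For example, assuming `nc` (netcat) is installed on the host, you can quickly check whether a service port is reachable:
```
$ nc -z localhost 8084 && echo "cc-metric-store reachable"
$ nc -z localhost 3306 && echo "mariadb reachable"
```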
## Slurm setup in cc-docker
### 1. Slurm controller
Currently, the slurm controller is aware of the single node that we have set up in our mini cluster, i.e. node01.
In order to execute slurm commands, you need to **`bash`** into the **`slurmctld`** docker service.
```
$ docker exec -it slurmctld bash
```
Then you can run slurm controller commands. A few examples (output omitted) are:
```
$ sinfo
or
$ squeue
or
$ scontrol show nodes
```
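If node01 is up, you can also submit a quick test job from inside the `slurmctld` container; a minimal sketch (the `debug` partition is the default one defined in `slurm/controller/slurm.conf`):
```
$ srun -N1 -p debug hostname
$ squeue
```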
### 2. Slurm REST service
You do not need to **`bash`** into the slurmrestd service; you can directly access the REST API via `localhost:6820`. A simple example of how to query the slurm REST API with `curl` is given in `misc/curl_slurmrestd.sh`.
You can use `curl_slurmrestd.sh` directly with a never-expiring JWT token (found in `data/slurm/secret/jwt_token.txt`).
You may also use that never-expiring token directly from the file for any of your own custom `curl` commands.
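For example, a sketch based on `misc/curl_slurmrestd.sh` (the `ping` endpoint and the `v0.0.39` API version are the ones referenced in that script):
```
SLURM_JWT=$(cat data/slurm/secret/jwt_token.txt)
curl -s 'http://localhost:6820/slurm/v0.0.39/ping' \
  -H "X-SLURM-USER-NAME: root" \
  -H "X-SLURM-USER-TOKEN: $SLURM_JWT"
```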
## Known Issues
* `docker-compose` installed on Ubuntu (18.04, 20.04) via `apt-get` cannot correctly parse `docker-compose.yml` due to version differences. Install the latest version of `docker-compose` from https://docs.docker.com/compose/install/ instead.
* You need to ensure that no other web server is running on ports 8080 (cc-backend), 8084 (cc-metric-store), 8086 (InfluxDB), 4222 and 8222 (NATS), or 3306 (MariaDB). If one or more ports are already in use, you have to adapt the related config accordingly.
* Existing VPN connections sometimes cause problems with docker. If `docker-compose` does not start up correctly, try disabling any active VPN connection. Refer to https://stackoverflow.com/questions/45692255/how-make-openvpn-work-with-docker for further information.
## Docker services and restarting the services
You can find all the docker services in `docker-compose.yml`. Feel free to modify it.
Whenever you modify it, please use
```
$ docker compose down
```
in order to shut down all the services in all the VMs (maininstance, nodeinstance, nodeinstance2), and then start all the services again by using
```
$ docker compose up
```
TODO: Update job archive and all other metric data.
The job archive with 1867 jobs originates from the second half of 2020.
Roughly 2700 jobs from the first week of 2021 are loaded with data from InfluxDB.
Some views of ClusterCockpit (e.g. the Users view) show the last week or month.
To show some data there you have to set the filter to time periods with jobs (August 2020 to January 2021).
To show some data there you have to set the filter to time periods with jobs (August 2020 to January 2021).

@ -1,10 +1,12 @@
FROM golang:1.17
FROM golang:1.22.4
RUN apt-get update
RUN apt-get -y install git
RUN rm -rf /cc-metric-store
RUN git clone https://github.com/ClusterCockpit/cc-metric-store.git /cc-metric-store
RUN cd /cc-metric-store && go build
RUN cd /cc-metric-store && go build ./cmd/cc-metric-store
# Reactivate when latest commit is available
#RUN go get -d -v github.com/ClusterCockpit/cc-metric-store

@ -1,28 +1,201 @@
{
"metrics": {
"clock": { "frequency": 60, "aggregation": null, "scope": "node" },
"cpi": { "frequency": 60, "aggregation": null, "scope": "node" },
"cpu_load": { "frequency": 60, "aggregation": null, "scope": "node" },
"flops_any": { "frequency": 60, "aggregation": null, "scope": "node" },
"flops_dp": { "frequency": 60, "aggregation": null, "scope": "node" },
"flops_sp": { "frequency": 60, "aggregation": null, "scope": "node" },
"ib_bw": { "frequency": 60, "aggregation": null, "scope": "node" },
"lustre_bw": { "frequency": 60, "aggregation": null, "scope": "node" },
"mem_bw": { "frequency": 60, "aggregation": null, "scope": "node" },
"mem_used": { "frequency": 60, "aggregation": null, "scope": "node" },
"rapl_power": { "frequency": 60, "aggregation": null, "scope": "node" }
"debug_metric": {
"frequency": 60,
"aggregation": "avg"
},
"clock": {
"frequency": 60,
"aggregation": "avg"
},
"cpu_idle": {
"frequency": 60,
"aggregation": "avg"
},
"cpu_iowait": {
"frequency": 60,
"aggregation": "avg"
},
"cpu_irq": {
"frequency": 60,
"aggregation": "avg"
},
"cpu_system": {
"frequency": 60,
"aggregation": "avg"
},
"cpu_user": {
"frequency": 60,
"aggregation": "avg"
},
"nv_mem_util": {
"frequency": 60,
"aggregation": "avg"
},
"nv_temp": {
"frequency": 60,
"aggregation": "avg"
},
"nv_sm_clock": {
"frequency": 60,
"aggregation": "avg"
},
"acc_utilization": {
"frequency": 60,
"aggregation": "avg"
},
"acc_mem_used": {
"frequency": 60,
"aggregation": "sum"
},
"acc_power": {
"frequency": 60,
"aggregation": "sum"
},
"flops_any": {
"frequency": 60,
"aggregation": "sum"
},
"flops_dp": {
"frequency": 60,
"aggregation": "sum"
},
"flops_sp": {
"frequency": 60,
"aggregation": "sum"
},
"ib_recv": {
"frequency": 60,
"aggregation": "sum"
},
"ib_xmit": {
"frequency": 60,
"aggregation": "sum"
},
"ib_recv_pkts": {
"frequency": 60,
"aggregation": "sum"
},
"ib_xmit_pkts": {
"frequency": 60,
"aggregation": "sum"
},
"cpu_power": {
"frequency": 60,
"aggregation": "sum"
},
"core_power": {
"frequency": 60,
"aggregation": "sum"
},
"mem_power": {
"frequency": 60,
"aggregation": "sum"
},
"ipc": {
"frequency": 60,
"aggregation": "avg"
},
"cpu_load": {
"frequency": 60,
"aggregation": null
},
"lustre_close": {
"frequency": 60,
"aggregation": null
},
"lustre_open": {
"frequency": 60,
"aggregation": null
},
"lustre_statfs": {
"frequency": 60,
"aggregation": null
},
"lustre_read_bytes": {
"frequency": 60,
"aggregation": null
},
"lustre_write_bytes": {
"frequency": 60,
"aggregation": null
},
"net_bw": {
"frequency": 60,
"aggregation": null
},
"file_bw": {
"frequency": 60,
"aggregation": null
},
"mem_bw": {
"frequency": 60,
"aggregation": "sum"
},
"mem_cached": {
"frequency": 60,
"aggregation": null
},
"mem_used": {
"frequency": 60,
"aggregation": null
},
"net_bytes_in": {
"frequency": 60,
"aggregation": null
},
"net_bytes_out": {
"frequency": 60,
"aggregation": null
},
"nfs4_read": {
"frequency": 60,
"aggregation": null
},
"nfs4_total": {
"frequency": 60,
"aggregation": null
},
"nfs4_write": {
"frequency": 60,
"aggregation": null
},
"vectorization_ratio": {
"frequency": 60,
"aggregation": "avg"
}
},
"checkpoints": {
"interval": 100000000000,
"interval": "12h",
"directory": "/data/checkpoints",
"restore": 100000000000
"restore": "48h"
},
"archive": {
"interval": 100000000000,
"interval": "50h",
"directory": "/data/archive"
},
"retention-in-memory": 100000000000,
"http-api-address": "0.0.0.0:8081",
"nats": "nats://cc-nats:4222",
"http-api": {
"address": "0.0.0.0:8084",
"https-cert-file": null,
"https-key-file": null
},
"retention-in-memory": "48h",
"nats": [
{
"address": "nats://nats:4222",
"username": "root",
"password": "root",
"subscriptions": [
{
"subscribe-to": "hpc-nats",
"cluster-tag": "fritz"
},
{
"subscribe-to": "hpc-nats",
"cluster-tag": "alex"
}
]
}
],
"jwt-public-key": "kzfYrYy+TzpanWZHJ5qSdMj5uKUWgq74BWhQG6copP0="
}
}

@ -1,34 +0,0 @@
#!/usr/bin/env bash
if [ -d symfony ]; then
echo "Data already initialized!"
echo -n "Perform a fresh initialisation? [yes to proceed / no to exit] "
read -r answer
if [ "$answer" == "yes" ]; then
echo "Cleaning directories ..."
rm -rf symfony
rm -rf job-archive
rm -rf influxdb/data/*
rm -rf sqldata/*
echo "done."
else
echo "Aborting ..."
exit
fi
fi
mkdir symfony
wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/job-archive_stable.tar.xz
tar xJf job-archive_stable.tar.xz
rm ./job-archive_stable.tar.xz
# 101 is the uid and gid of the user and group www-data in the cc-php container running php-fpm.
# For a demo with no new jobs it is enough to give www read permissions on that directory.
# echo "This script needs to chown the job-archive directory so that the application can write to it:"
# sudo chown -R 82:82 ./job-archive
mkdir -p influxdb/data
wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/influxdbv2-data_stable.tar.xz
cd influxdb/data
tar xJf ../../influxdbv2-data_stable.tar.xz
rm ../../influxdbv2-data_stable.tar.xz

File diff suppressed because it is too large Load Diff

@ -1,5 +0,0 @@
[mysqld]
innodb_buffer_pool_size=4096M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
max_allowed_packet=16M

@ -1,48 +0,0 @@
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=snowflake
SlurmctldHost=slurmctld
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurm/d
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/affinity
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
# LOGGING AND ACCOUNTING
AccountingStorageHost=slurmdb
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreFlags=job_script,job_comment,job_env,job_extra
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#
# COMPUTE NODES
NodeName=node0[1-2] CPUs=1 State=UNKNOWN
PartitionName=main Nodes=ALL Default=YES MaxTime=INFINITE State=UP

139
dataGenerationScript.sh Executable file

@ -0,0 +1,139 @@
#!/bin/bash
echo ""
echo "|--------------------------------------------------------------------------------------|"
echo "| This is Data generation script for docker services |"
echo "| Starting file required by docker services in data/ |"
echo "|--------------------------------------------------------------------------------------|"
# Download unedited checkpoint files to ./data/cc-metric-store-source/checkpoints
# After this, migrateTimestamps.pl will run from setupDev.sh. This will update the timestamps
# for all the checkpoint files so that they can be read by cc-metric-store.
# cc-metric-store only reads data up to a certain age (e.g. the last 48 hours).
# These checkpoint files have timestamps older than 48 hours and need to be updated by
# migrateTimestamps.pl, which is automatically invoked from setupDev.sh.
if [ ! -d data/cc-metric-store-source ]; then
mkdir -p data/cc-metric-store-source/checkpoints
cd data/cc-metric-store-source/checkpoints
wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/cc-metric-store-checkpoints.tar.xz
tar xf cc-metric-store-checkpoints.tar.xz
rm cc-metric-store-checkpoints.tar.xz
cd ../../../
else
echo "'data/cc-metric-store-source' already exists!"
fi
# A simple initialization file for the mariadb docker service.
# Required because only one database can be specified per docker service via environment variables.
# This file creates the additional database needed by cc-backend.
# This file is automatically picked up by mariadb after the docker service starts.
if [ ! -d data/mariadb ]; then
mkdir -p data/mariadb
cat > data/mariadb/01.databases.sql <<EOF
CREATE DATABASE IF NOT EXISTS \`ccbackend\`;
EOF
else
echo "'data/mariadb' already exists!"
fi
# A simple configuration file for the openldap docker service.
# Creates a simple user 'ldapuser' with password 'ldapuser'.
# This file is automatically picked up by openldap after the docker service starts.
if [ ! -d data/ldap ]; then
mkdir -p data/ldap
cat > data/ldap/add_users.ldif <<EOF
dn: ou=users,dc=example,dc=com
objectClass: organizationalUnit
ou: users
dn: uid=ldapuser,ou=users,dc=example,dc=com
objectClass: inetOrgPerson
objectClass: posixAccount
objectClass: top
cn: Ldap User
sn: User
uid: ldapuser
uidNumber: 1
gidNumber: 1
homeDirectory: /home/ldapuser
userPassword: {SSHA}sQRqFQtuiupej7J/rbrQrTwYEHDduV+N
EOF
else
echo "'data/ldap' already exists!"
fi
# A custom entrypoint script for the nats docker service.
# Required because we need to execute custom commands after the nats docker service starts.
# This file is automatically executed when the nats docker service starts.
# After the service starts, an infinite while loop publishes data for the 'fritz' and 'alex' clusters
# to the subject 'hpc-nats' every minute. Random data is generated only for node-level metrics, not hardware-level metrics.
if [ ! -d data/nats ]; then
mkdir -p data/nats
cat > data/nats/docker-entrypoint.sh <<EOF
#!/bin/sh
set -e
# Start NATS server in the background
nats-server --user root --pass root --http_port 8222 &
# Wait for NATS to be ready
until nc -z 0.0.0.0 4222; do
echo "Waiting for NATS to start..."
sleep 1
done
echo "NATS is up and running. Executing custom script..."
apk add curl
curl -sf https://binaries.nats.dev/nats-io/natscli/nats@latest | sh
# This is a dummy data generation loop, that inserts data for given nodes at 1 min interval
while true; do
# Timestamp in seconds
timestamp="\$(date '+%s')"
# Generate data for alex cluster. Push to sample_alex.txt
for metric in cpu_irq cpu_load mem_cached net_bytes_in cpu_user cpu_idle nfs4_read mem_used nfs4_write nfs4_total ib_xmit ib_xmit_pkts net_bytes_out cpu_iowait ib_recv cpu_system ib_recv_pkts; do
for hostname in a0603 a0903 a0832 a0329 a0702 a0122 a1624 a0731 a0224 a0704 a0631 a0225 a0222 a0427 a0603 a0429 a0833 a0705 a0901 a0601 a0227 a0804 a0322 a0226 a0126 a0129 a0605 a0801 a0934; do
echo "\$metric,cluster=alex,hostname=\$hostname,type=node value=\$((1 + RANDOM % 100)).0 \$timestamp" >>sample_alex.txt
done
done
# Nats client will publish the data from sample_alex.txt to 'hpc-nats' subject on this nats server
./nats pub hpc-nats "\$(cat sample_alex.txt)" -s nats://0.0.0.0:4222 --user root --password root
# Generate data for fritz cluster. Push to sample_fritz.txt
for metric in cpu_irq cpu_load mem_cached net_bytes_in cpu_user cpu_idle nfs4_read mem_used nfs4_write nfs4_total ib_xmit ib_xmit_pkts net_bytes_out cpu_iowait ib_recv cpu_system ib_recv_pkts; do
for hostname in f0201 f0202 f0203 f0204 f0205 f0206 f0207 f0208 f0209 f0210 f0211 f0212 f0213 f0214 f0215 f0217 f0218 f0219 f0220 f0221 f0222 f0223 f0224 f0225 f0226 f0227 f0228 f0229; do
echo "\$metric,cluster=fritz,hostname=\$hostname,type=node value=\$((1 + RANDOM % 100)).0 \$timestamp" >>sample_fritz.txt
done
done
# Nats client will publish the data from sample_fritz.txt to 'hpc-nats' subject on this nats server
./nats pub hpc-nats "\$(cat sample_fritz.txt)" -s nats://0.0.0.0:4222 --user root --password root
rm sample_alex.txt
rm sample_fritz.txt
sleep 1m
done
EOF
else
echo "'data/nats' already exists!"
fi
# prepare folders for influxdb2
if [ ! -d data/influxdb ]; then
mkdir -p data/influxdb/data
mkdir -p data/influxdb/config
else
echo "'data/influxdb' already exists!"
fi
echo ""
echo "|--------------------------------------------------------------------------------------|"
echo "| Finished generating relevant files for docker services in data/ |"
echo "|--------------------------------------------------------------------------------------|"

114
docker-compose.yml Normal file → Executable file

@ -3,15 +3,19 @@ services:
container_name: nats
image: nats:alpine
ports:
- "4222:4222"
- "8222:8222"
- "0.0.0.0:4222:4222"
- "0.0.0.0:8222:8222"
- "0.0.0.0:6222:6222"
volumes:
- ${DATADIR}/nats:/data
entrypoint: ["/bin/sh", "/data/docker-entrypoint.sh"]
cc-metric-store:
container_name: cc-metric-store
build:
context: ./cc-metric-store
ports:
- "8084:8084"
- "0.0.0.0:8084:8084"
volumes:
- ${DATADIR}/cc-metric-store:/data
depends_on:
@ -19,8 +23,8 @@ services:
influxdb:
container_name: influxdb
image: influxdb
command: ["--reporting-disabled"]
image: influxdb:latest
command: ["--reporting-disabled", "--log-level=debug"]
environment:
DOCKER_INFLUXDB_INIT_MODE: setup
DOCKER_INFLUXDB_INIT_USERNAME: devel
@ -30,7 +34,7 @@ services:
DOCKER_INFLUXDB_INIT_RETENTION: 100w
DOCKER_INFLUXDB_INIT_ADMIN_TOKEN: ${INFLUXDB_ADMIN_TOKEN}
ports:
- "127.0.0.1:${INFLUXDB_PORT}:8086"
- "0.0.0.0:8086:8086"
volumes:
- ${DATADIR}/influxdb/data:/var/lib/influxdb2
- ${DATADIR}/influxdb/config:/etc/influxdb2
@ -40,9 +44,15 @@ services:
image: osixia/openldap:1.5.0
command: --copy-service --loglevel debug
environment:
- LDAP_ADMIN_PASSWORD=${LDAP_ADMIN_PASSWORD}
- LDAP_ORGANISATION=${LDAP_ORGANISATION}
- LDAP_DOMAIN=${LDAP_DOMAIN}
- LDAP_ADMIN_PASSWORD=mashup
- LDAP_ORGANISATION=Example Organization
- LDAP_DOMAIN=example.com
- LDAP_LOGGING=true
- LDAP_CONNECTION=default
- LDAP_CONNECTIONS=default
- LDAP_DEFAULT_HOSTS=0.0.0.0
ports:
- "0.0.0.0:389:389"
volumes:
- ${DATADIR}/ldap:/container/service/slapd/assets/config/bootstrap/ldif/custom
@ -51,36 +61,18 @@ services:
image: mariadb:latest
command: ["--default-authentication-plugin=mysql_native_password"]
environment:
MARIADB_ROOT_PASSWORD: ${MARIADB_ROOT_PASSWORD}
MARIADB_ROOT_PASSWORD: root
MARIADB_DATABASE: slurm_acct_db
MARIADB_USER: slurm
MARIADB_PASSWORD: demo
ports:
- "127.0.0.1:${MARIADB_PORT}:3306"
- "0.0.0.0:3306:3306"
volumes:
- ${DATADIR}/mariadb:/etc/mysql/conf.d
# - ${DATADIR}/sql-init:/docker-entrypoint-initdb.d
- ${DATADIR}/mariadb:/docker-entrypoint-initdb.d
cap_add:
- SYS_NICE
# mysql:
# container_name: mysql
# image: mysql:8.0.22
# command: ["--default-authentication-plugin=mysql_native_password"]
# environment:
# MYSQL_ROOT_PASSWORD: ${MYSQL_ROOT_PASSWORD}
# MYSQL_DATABASE: ${MYSQL_DATABASE}
# MYSQL_USER: ${MYSQL_USER}
# MYSQL_PASSWORD: ${MYSQL_PASSWORD}
# ports:
# - "127.0.0.1:${MYSQL_PORT}:3306"
# # volumes:
# # - ${DATADIR}/sql-init:/docker-entrypoint-initdb.d
# # - ${DATADIR}/sqldata:/var/lib/mysql
# cap_add:
# - SYS_NICE
slurm-controller:
slurmctld:
container_name: slurmctld
hostname: slurmctld
build:
@ -89,40 +81,66 @@ services:
volumes:
- ${DATADIR}/slurm/home:/home
- ${DATADIR}/slurm/secret:/.secret
- ./slurm/controller/slurm.conf:/home/config/slurm.conf
- /etc/timezone:/etc/timezone:ro
- /etc/localtime:/etc/localtime:ro
- ${DATADIR}/slurm/state:/var/lib/slurm/d
ports:
- "6817:6817"
slurm-database:
container_name: slurmdb
hostname: slurmdb
slurmdbd:
container_name: slurmdbd
hostname: slurmdbd
build:
context: ./slurm/database
depends_on:
- mariadb
- slurm-controller
- slurmctld
privileged: true
volumes:
- ${DATADIR}/slurm/home:/home
- ${DATADIR}/slurm/secret:/.secret
- ./slurm/database/slurmdbd.conf:/home/config/slurmdbd.conf
- /etc/timezone:/etc/timezone:ro
- /etc/localtime:/etc/localtime:ro
ports:
- "6819:6819"
slurm-worker01:
node01:
container_name: node01
hostname: node01
build:
context: ./slurm/worker
depends_on:
- slurm-controller
- slurmctld
privileged: true
volumes:
- ${DATADIR}/slurm/home:/home
- ${DATADIR}/slurm/secret:/.secret
- ./slurm/worker/cgroup.conf:/home/config/cgroup.conf
- ./slurm/controller/slurm.conf:/home/config/slurm.conf
- /etc/timezone:/etc/timezone:ro
- /etc/localtime:/etc/localtime:ro
ports:
- "6818:6818"
# slurm-worker02:
# container_name: node02
# hostname: node02
# build:
# context: ./slurm/worker
# depends_on:
# - slurm-controller
# privileged: true
# volumes:
# - ${DATADIR}/slurm/home:/home
# - ${DATADIR}/slurm/secret:/.secret
slurmrestd:
container_name: slurmrestd
hostname: slurmrestd
build:
context: ./slurm/rest
environment:
- SLURM_JWT=daemon
- SLURMRESTD_DEBUG=9
depends_on:
- slurmctld
privileged: true
volumes:
- ${DATADIR}/slurm/home:/home
- ${DATADIR}/slurm/secret:/.secret
- ./slurm/controller/slurm.conf:/home/config/slurm.conf
- ./slurm/rest/slurmrestd.conf:/home/config/slurmrestd.conf
- /etc/timezone:/etc/timezone:ro
- /etc/localtime:/etc/localtime:ro
ports:
- "6820:6820"

@ -1,5 +0,0 @@
SLURM_VERSION=22.05.6
ARCH=aarch64
MUNGE_UID=981
SLURM_UID=982
WORKER_UID=1000

@ -9,7 +9,6 @@ use File::Slurp;
use Data::Dumper;
use Time::Piece;
use Sort::Versions;
use REST::Client;
### JOB-ARCHIVE
my $localtime = localtime;
@ -19,80 +18,80 @@ my $archiveSrc = './data/job-archive-source';
my @ArchiveClusters;
# Get clusters by job-archive/$subfolder
opendir my $dh, $archiveSrc or die "can't open directory: $!";
while ( readdir $dh ) {
chomp; next if $_ eq '.' or $_ eq '..' or $_ eq 'job-archive';
# opendir my $dh, $archiveSrc or die "can't open directory: $!";
# while ( readdir $dh ) {
# chomp; next if $_ eq '.' or $_ eq '..' or $_ eq 'job-archive' or $_ eq 'version.txt';
my $cluster = $_;
push @ArchiveClusters, $cluster;
}
# my $cluster = $_;
# push @ArchiveClusters, $cluster;
# }
# start for jobarchive
foreach my $cluster ( @ArchiveClusters ) {
print "Starting to update start- and stoptimes in job-archive for $cluster\n";
# # start for jobarchive
# foreach my $cluster ( @ArchiveClusters ) {
# print "Starting to update start- and stoptimes in job-archive for $cluster\n";
opendir my $dhLevel1, "$archiveSrc/$cluster" or die "can't open directory: $!";
while ( readdir $dhLevel1 ) {
chomp; next if $_ eq '.' or $_ eq '..';
my $level1 = $_;
# opendir my $dhLevel1, "$archiveSrc/$cluster" or die "can't open directory: $!";
# while ( readdir $dhLevel1 ) {
# chomp; next if $_ eq '.' or $_ eq '..';
# my $level1 = $_;
if ( -d "$archiveSrc/$cluster/$level1" ) {
opendir my $dhLevel2, "$archiveSrc/$cluster/$level1" or die "can't open directory: $!";
while ( readdir $dhLevel2 ) {
chomp; next if $_ eq '.' or $_ eq '..';
my $level2 = $_;
my $jobSource = "$archiveSrc/$cluster/$level1/$level2";
my $jobTarget = "$archiveTarget/$cluster/$level1/$level2/";
my $jobOrigin = $jobSource;
# check if files are directly accessible (old format) else get subfolders as file and update path
if ( ! -e "$jobSource/meta.json") {
my @folders = read_dir($jobSource);
if (!@folders) {
next;
}
# Only use first subfolder for now TODO
$jobSource = "$jobSource/".$folders[0];
}
# check if subfolder contains file, else remove source and skip
if ( ! -e "$jobSource/meta.json") {
# rmtree $jobOrigin;
next;
}
# if ( -d "$archiveSrc/$cluster/$level1" ) {
# opendir my $dhLevel2, "$archiveSrc/$cluster/$level1" or die "can't open directory: $!";
# while ( readdir $dhLevel2 ) {
# chomp; next if $_ eq '.' or $_ eq '..';
# my $level2 = $_;
# my $jobSource = "$archiveSrc/$cluster/$level1/$level2";
# my $jobTarget = "$archiveTarget/$cluster/$level1/$level2/";
# my $jobOrigin = $jobSource;
# # check if files are directly accessible (old format) else get subfolders as file and update path
# if ( ! -e "$jobSource/meta.json") {
# my @folders = read_dir($jobSource);
# if (!@folders) {
# next;
# }
# # Only use first subfolder for now TODO
# $jobSource = "$jobSource/".$folders[0];
# }
# # check if subfolder contains file, else remove source and skip
# if ( ! -e "$jobSource/meta.json") {
# # rmtree $jobOrigin;
# next;
# }
my $rawstr = read_file("$jobSource/meta.json");
my $json = decode_json($rawstr);
# my $rawstr = read_file("$jobSource/meta.json");
# my $json = decode_json($rawstr);
# NOTE Start meta.json iteration here
# my $random_number = int(rand(UPPERLIMIT)) + LOWERLIMIT;
# Set new startTime: Between 5 days and 1 day before now
# # NOTE Start meta.json iteration here
# # my $random_number = int(rand(UPPERLIMIT)) + LOWERLIMIT;
# # Set new startTime: Between 5 days and 1 day before now
# Remove id from attributes
$json->{startTime} = $epochtime - (int(rand(432000)) + 86400);
$json->{stopTime} = $json->{startTime} + $json->{duration};
# # Remove id from attributes
# $json->{startTime} = $epochtime - (int(rand(432000)) + 86400);
# $json->{stopTime} = $json->{startTime} + $json->{duration};
# Add starttime subfolder to target path
$jobTarget .= $json->{startTime};
# # Add starttime subfolder to target path
# $jobTarget .= $json->{startTime};
# target is not directory
if ( not -d $jobTarget ){
# print "Writing files\n";
# print "$cluster/$level1/$level2\n";
make_path($jobTarget);
# # target is not directory
# if ( not -d $jobTarget ){
# # print "Writing files\n";
# # print "$cluster/$level1/$level2\n";
# make_path($jobTarget);
my $outstr = encode_json($json);
write_file("$jobTarget/meta.json", $outstr);
# my $outstr = encode_json($json);
# write_file("$jobTarget/meta.json", $outstr);
my $datstr = read_file("$jobSource/data.json");
write_file("$jobTarget/data.json", $datstr);
} else {
# rmtree $jobSource;
}
}
}
}
}
print "Done for job-archive\n";
sleep(1);
# my $datstr = read_file("$jobSource/data.json.gz");
# write_file("$jobTarget/data.json.gz", $datstr);
# } else {
# # rmtree $jobSource;
# }
# }
# }
# }
# }
# print "Done for job-archive\n";
# sleep(1);
## CHECKPOINTS
chomp(my $checkpointStart=`date --date 'TZ="Europe/Berlin" 0:00 7 days ago' +%s`);

77
misc/config.json Normal file

@ -0,0 +1,77 @@
{
"addr": "127.0.0.1:8080",
"short-running-jobs-duration": 300,
"archive": {
"kind": "file",
"path": "./var/job-archive"
},
"jwts": {
"max-age": "2000h"
},
"db-driver": "mysql",
"db": "root:root@tcp(0.0.0.0:3306)/ccbackend",
"ldap": {
"url": "ldap://0.0.0.0",
"user_base": "ou=users,dc=example,dc=com",
"search_dn": "cn=admin,dc=example,dc=com",
"user_bind": "uid={username},ou=users,dc=example,dc=com",
"user_filter": "(&(objectclass=posixAccount))",
"syncUserOnLogin": true
},
"enable-resampling": {
"trigger": 30,
"resolutions": [
600,
300,
120,
60
]
},
"emission-constant": 317,
"clusters": [
{
"name": "fritz",
"metricDataRepository": {
"kind": "cc-metric-store",
"url": "http://0.0.0.0:8084",
"token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"
},
"filterRanges": {
"numNodes": {
"from": 1,
"to": 64
},
"duration": {
"from": 0,
"to": 86400
},
"startTime": {
"from": "2022-01-01T00:00:00Z",
"to": null
}
}
},
{
"name": "alex",
"metricDataRepository": {
"kind": "cc-metric-store",
"url": "http://0.0.0.0:8084",
"token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"
},
"filterRanges": {
"numNodes": {
"from": 1,
"to": 64
},
"duration": {
"from": 0,
"to": 86400
},
"startTime": {
"from": "2022-01-01T00:00:00Z",
"to": null
}
}
}
]
}

3
misc/curl_slurmrestd.sh Executable file

@ -0,0 +1,3 @@
SLURM_JWT=$(cat data/slurm/secret/jwt_token.txt)
curl -X 'GET' -v 'http://localhost:6820/slurm/v0.0.39/node/node01' --location --silent --show-error -H "X-SLURM-USER-NAME: root" -H "X-SLURM-USER-TOKEN: $SLURM_JWT"
# curl -v --unix-socket data/slurm/tmp/slurmrestd.socket 'http://localhost:6820/slurm/v0.0.39/ping'

27
misc/jwt_verifier.py Normal file

@ -0,0 +1,27 @@
#!/usr/bin/env python3
import sys
import os
import pprint
import json
import time
from datetime import datetime, timedelta, timezone
from jwt import JWT
from jwt.jwa import HS256
from jwt.jwk import jwk_from_dict
from jwt.utils import b64decode,b64encode
if len(sys.argv) != 2:
sys.exit("verify_jwt.py [JWT Token]");
with open("data/slurm/secret/jwt_hs256.key", "rb") as f:
priv_key = f.read()
signing_key = jwk_from_dict({
'kty': 'oct',
'k': b64encode(priv_key)
})
a = JWT()
b = a.decode(sys.argv[1], signing_key, algorithms=["HS256"])
print(b)
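# Example usage (a sketch; run from the cc-docker root so that the hard-coded key path above
# resolves, and pass the token written by the slurmctld entrypoint):
#   python3 misc/jwt_verifier.py "$(cat data/slurm/secret/jwt_token.txt)"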

@ -0,0 +1,40 @@
#!/bin/bash -l
sudo apt-get update
sudo apt-get upgrade -f -y
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources:
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -f -y gcc
sudo apt-get install -f -y npm
sudo apt-get install -f -y make
sudo apt-get install -f -y gh
sudo apt-get install -f -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo apt-get install -f -y docker-compose
sudo apt install perl -f -y libdatetime-perl libjson-perl
sudo apt-get install -f -y golang-go
sudo cpan Cpanel::JSON::XS
sudo cpan File::Slurp
sudo cpan Data::Dumper
sudo cpan Time::Piece
sudo cpan Sort::Versions
sudo groupadd docker
sudo usermod -aG docker ubuntu
sudo shutdown -r -t 0

@ -1,48 +1,42 @@
#!/bin/bash
echo ""
echo "|--------------------------------------------------------------------------------------|"
echo "| Welcome to cc-docker automatic deployment script. |"
echo "| Make sure you have sudo rights to run docker services |"
echo "| This script assumes that docker command is added to sudo group |"
echo "| This means that docker commands do not explicitly require |"
echo "| 'sudo' keyword to run. You can use this following command: |"
echo "| |"
echo "| > sudo groupadd docker |"
echo "| > sudo usermod -aG docker $USER |"
echo "| |"
echo "| This will add docker to the sudo usergroup and all the docker |"
echo "| command will run as sudo by default without requiring |"
echo "| 'sudo' keyword. |"
echo "|--------------------------------------------------------------------------------------|"
echo ""
# Check cc-backend, touch job.db if exists
# Check whether cc-backend exists
if [ ! -d cc-backend ]; then
echo "'cc-backend' not yet prepared! Please clone cc-backend repository before starting this script."
echo -n "Stopped."
exit
else
cd cc-backend
if [ ! -d var ]; then
mkdir var
touch var/job.db
else
echo "'cc-backend/var' exists. Cautiously exiting."
echo -n "Stopped."
exit
fi
fi
# Download unedited job-archive to ./data/job-archive-source
if [ ! -d data/job-archive-source ]; then
cd data
wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/job-archive-demo.tar
tar xf job-archive-demo.tar
mv ./job-archive ./job-archive-source
rm ./job-archive-demo.tar
cd ..
else
echo "'data/job-archive-source' already exists!"
# Create the data directory if it does not exist.
# It contains all the mount points required by the docker services
# and their static files.
if [ ! -d data ]; then
mkdir -m777 data
fi
# Download unedited checkpoint files to ./data/cc-metric-store-source/checkpoints
if [ ! -d data/cc-metric-store-source ]; then
mkdir -p data/cc-metric-store-source/checkpoints
cd data/cc-metric-store-source/checkpoints
wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/cc-metric-store-checkpoints.tar.xz
tar xf cc-metric-store-checkpoints.tar.xz
rm cc-metric-store-checkpoints.tar.xz
cd ../../../
else
echo "'data/cc-metric-store-source' already exists!"
fi
# Invoke dataGenerationScript.sh, which generates the static files
# required by the docker services after startup.
chmod u+x dataGenerationScript.sh
./dataGenerationScript.sh
# Update timestamps
# Update the timestamps for all the checkpoints in data/cc-metric-store-source
# and dump the rewritten files into data/cc-metric-store.
perl ./migrateTimestamps.pl
# Create archive folder for rewritten ccms checkpoints
@ -51,32 +45,54 @@ if [ ! -d data/cc-metric-store/archive ]; then
fi
# cleanup sources
# rm -r ./data/job-archive-source
# rm -r ./data/cc-metric-store-source
# prepare folders for influxdb2
if [ ! -d data/influxdb ]; then
mkdir -p data/influxdb/data
mkdir -p data/influxdb/config/influx-configs
else
echo "'data/influxdb' already exists!"
if [ -d data/cc-metric-store-source ]; then
rm -r data/cc-metric-store-source
fi
# Check for the .env file and docker-compose.yml, copy them from the templates if not present, and build the docker services
if [ ! -f .env ]; then
cp templates/env.default ./.env
fi
# Just in case the user forgot to manually shut down the docker services.
docker-compose down
docker-compose down --remove-orphans
if [ ! -f docker-compose.yml ]; then
cp templates/docker-compose.yml.default ./docker-compose.yml
fi
# This automatically builds the base docker image for slurm.
# All the slurm docker services in docker-compose.yml refer to
# the base image built from this directory.
cd slurm/base/
make
cd ../..
# Build and start all the docker services from docker-compose.yml.
docker-compose build
./cc-backend/cc-backend --init-db --add-user demo:admin:AdminDev
docker-compose up -d
cd cc-backend
if [ ! -d var ]; then
wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/job-archive-demo.tar
tar xf job-archive-demo.tar
rm ./job-archive-demo.tar
cp ./configs/env-template.txt .env
cp -f ../misc/config.json config.json
make
./cc-backend -migrate-db
./cc-backend --init-db --add-user demo:admin:demo
cd ..
else
cd ..
echo "'cc-backend/var' exists. Cautiously exiting."
fi
echo ""
echo "|--------------------------------------------------------------------------------------|"
echo "| Check logs for each slurm service by using these commands: |"
echo "| docker-compose logs slurmctld |"
echo "| docker-compose logs slurmdbd |"
echo "| docker-compose logs slurmrestd |"
echo "| docker-compose logs node01 |"
echo "|======================================================================================|"
echo "| Setup complete, containers are up by default: Shut down with 'docker-compose down'. |"
echo "| Use './cc-backend/cc-backend -server' to start cc-backend. |"
echo "| Use scripts in /scripts to load data into influx or mariadb. |"
echo "|--------------------------------------------------------------------------------------|"
echo ""
echo "Setup complete, containers are up by default: Shut down with 'docker-compose down'."
echo "Use './cc-backend/cc-backend' to start cc-backend."
echo "Use scripts in /scripts to load data into influx or mariadb."
# ./cc-backend/cc-backend

@ -1,41 +1,39 @@
FROM rockylinux:8
MAINTAINER Jan Eitzinger <jan.eitzinger@fau.de>
LABEL org.opencontainers.image.authors="jan.eitzinger@fau.de"
ENV SLURM_VERSION=22.05.6
ENV ARCH=aarch64
ENV SLURM_VERSION=24.05.3
ENV HTTP_PARSER_VERSION=2.8.0
RUN yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm -y
RUN yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
RUN ARCH=$(uname -m) && yum install -y https://rpmfind.net/linux/almalinux/8.10/PowerTools/x86_64/os/Packages/http-parser-devel-2.8.0-9.el8.$ARCH.rpm
RUN groupadd -g 981 munge \
&& useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u 981 -g munge -s /sbin/nologin munge \
&& groupadd -g 982 slurm \
&& useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 982 -g slurm -s /bin/bash slurm \
&& groupadd -g 1000 worker \
&& useradd -m -c "Workflow user" -d /home/worker -u 1000 -g worker -s /bin/bash worker
&& groupadd -g 1000 slurm \
&& useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 1000 -g slurm -s /bin/bash slurm \
&& groupadd -g 982 worker \
&& useradd -m -c "Workflow user" -d /home/worker -u 982 -g worker -s /bin/bash worker
RUN yum install -y munge munge-libs
RUN dnf --enablerepo=powertools install munge-devel -y
RUN yum install rng-tools -y
RUN yum install -y munge munge-libs rng-tools \
python3 gcc openssl openssl-devel \
openssh-server openssh-clients dbus-devel \
pam-devel numactl numactl-devel hwloc sudo \
lua readline-devel ncurses-devel man2html \
autoconf automake json-c-devel libjwt-devel \
libibmad libibumad rpm-build perl-ExtUtils-MakeMaker.noarch rpm-build make wget
RUN yum install -y python3 gcc openssl openssl-devel \
openssh-server openssh-clients dbus-devel \
pam-devel numactl numactl-devel hwloc sudo \
lua readline-devel ncurses-devel man2html \
libibmad libibumad rpm-build perl-ExtUtils-MakeMaker.noarch rpm-build make wget
RUN dnf --enablerepo=powertools install -y munge-devel rrdtool-devel lua-devel hwloc-devel mariadb-server mariadb-devel
RUN dnf --enablerepo=powertools install rrdtool-devel lua-devel hwloc-devel rpm-build -y
RUN dnf install mariadb-server mariadb-devel -y
RUN mkdir /usr/local/slurm-tmp
RUN cd /usr/local/slurm-tmp
RUN wget https://download.schedmd.com/slurm/slurm-${SLURM_VERSION}.tar.bz2
RUN rpmbuild -ta slurm-${SLURM_VERSION}.tar.bz2
RUN mkdir -p /usr/local/slurm-tmp \
&& cd /usr/local/slurm-tmp \
&& wget https://download.schedmd.com/slurm/slurm-${SLURM_VERSION}.tar.bz2 \
&& rpmbuild -ta --with slurmrestd --with jwt slurm-${SLURM_VERSION}.tar.bz2
WORKDIR /root/rpmbuild/RPMS/${ARCH}
RUN yum -y --nogpgcheck localinstall \
slurm-${SLURM_VERSION}-1.el8.${ARCH}.rpm \
slurm-perlapi-${SLURM_VERSION}-1.el8.${ARCH}.rpm \
slurm-slurmctld-${SLURM_VERSION}-1.el8.${ARCH}.rpm
WORKDIR /
RUN ARCH=$(uname -m) \
&& yum -y --nogpgcheck localinstall \
/root/rpmbuild/RPMS/$ARCH/slurm-${SLURM_VERSION}*.$ARCH.rpm \
/root/rpmbuild/RPMS/$ARCH/slurm-perlapi-${SLURM_VERSION}*.$ARCH.rpm \
/root/rpmbuild/RPMS/$ARCH/slurm-slurmctld-${SLURM_VERSION}*.$ARCH.rpm
VOLUME ["/home", "/.secret"]
# 22: SSH
@ -43,4 +41,5 @@ VOLUME ["/home", "/.secret"]
# 6817: SlurmCtlD
# 6818: SlurmD
# 6819: SlurmDBD
EXPOSE 22 6817 6818 6819
# 6820: SlurmRestD
EXPOSE 22 6817 6818 6819 6820

@ -1,6 +1,6 @@
include ../../.env
IMAGE = clustercockpit/slurm.base
SLURM_VERSION = 24.05.3
.PHONY: build clean
build:

@ -1,5 +1,5 @@
FROM clustercockpit/slurm.base:22.05.6
MAINTAINER Jan Eitzinger <jan.eitzinger@fau.de>
FROM clustercockpit/slurm.base:24.05.3
LABEL org.opencontainers.image.authors="jan.eitzinger@fau.de"
# clean up
RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
@ -7,4 +7,5 @@ RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
&& rm -rf /var/cache/yum
COPY docker-entrypoint.sh /docker-entrypoint.sh
CMD ["/usr/sbin/init"]
ENTRYPOINT ["/docker-entrypoint.sh"]

@ -1,23 +1,43 @@
#!/usr/bin/env bash
set -e
# Determine the system architecture dynamically
ARCH=$(uname -m)
SLURM_VERSION="24.05.3"
SLURM_JWT=daemon
SLURMRESTD_SECURITY=disable_user_check
_delete_secrets() {
if [ -f /.secret/munge.key ]; then
echo "Removing secrets"
sudo rm -rf /.secret/munge.key
sudo rm -rf /.secret/worker-secret.tar.gz
sudo rm -rf /.secret/setup-worker-ssh.sh
sudo rm -rf /.secret/jwt_hs256.key
sudo rm -rf /.secret/jwt_token.txt
echo "Done removing secrets"
ls /.secret/
fi
}
# start sshd server
_sshd_host() {
if [ ! -d /var/run/sshd ]; then
mkdir /var/run/sshd
ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -N ''
fi
echo "Starting sshd"
/usr/sbin/sshd
if [ ! -d /var/run/sshd ]; then
mkdir /var/run/sshd
ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -N ''
fi
echo "Starting sshd"
/usr/sbin/sshd
}
# setup worker ssh to be passwordless
_ssh_worker() {
if [[ ! -d /home/worker ]]; then
if [[ ! -d /home/worker ]]; then
mkdir -p /home/worker
chown -R worker:worker /home/worker
fi
cat > /home/worker/setup-worker-ssh.sh <<EOF2
cat >/home/worker/setup-worker-ssh.sh <<EOF2
mkdir -p ~/.ssh
chmod 0700 ~/.ssh
ssh-keygen -b 2048 -t rsa -f ~/.ssh/id_rsa -q -N "" -C "$(whoami)@$(hostname)-$(date -I)"
@ -41,7 +61,7 @@ EOF2
# start munge and generate key
_munge_start() {
echo "Starting munge"
echo "Starting munge"
chown -R munge: /etc/munge /var/lib/munge /var/log/munge /var/run/munge
chmod 0700 /etc/munge
chmod 0711 /var/lib/munge
@ -50,9 +70,9 @@ _munge_start() {
/sbin/create-munge-key -f
rngd -r /dev/urandom
/usr/sbin/create-munge-key -r -f
sh -c "dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key"
sh -c "dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key"
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
chmod 600 /etc/munge/munge.key
sudo -u munge /sbin/munged
munge -n
munge -n | unmunge
@ -61,31 +81,97 @@ _munge_start() {
# copy secrets to /.secret directory for other nodes
_copy_secrets() {
cp /home/worker/worker-secret.tar.gz /.secret/worker-secret.tar.gz
cp /home/worker/setup-worker-ssh.sh /.secret/setup-worker-ssh.sh
cp /etc/munge/munge.key /.secret/munge.key
rm -f /home/worker/worker-secret.tar.gz
rm -f /home/worker/setup-worker-ssh.sh
while [ ! -f /home/worker/worker-secret.tar.gz ]; do
echo -n "."
sleep 1
done
cp /home/worker/worker-secret.tar.gz /.secret/worker-secret.tar.gz
cp /home/worker/setup-worker-ssh.sh /.secret/setup-worker-ssh.sh
cp /etc/munge/munge.key /.secret/munge.key
rm -f /home/worker/worker-secret.tar.gz
rm -f /home/worker/setup-worker-ssh.sh
}
_openssl_jwt_key() {
mkdir -p /var/spool/slurm/statesave
dd if=/dev/random of=/var/spool/slurm/statesave/jwt_hs256.key bs=32 count=1
chown slurm:slurm /var/spool/slurm/statesave/jwt_hs256.key
chmod 0600 /var/spool/slurm/statesave/jwt_hs256.key
chown slurm:slurm /var/spool/slurm/statesave
chmod 0755 /var/spool/slurm/statesave
cp /var/spool/slurm/statesave/jwt_hs256.key /.secret/jwt_hs256.key
chmod 777 /.secret/jwt_hs256.key
}
_generate_jwt_token() {
secret_key=$(cat /var/spool/slurm/statesave/jwt_hs256.key)
start_time=$(date +%s)
exp_time=$((start_time + 100000000))
base64url() {
# Don't wrap, make URL-safe, delete trailer.
base64 -w 0 | tr '+/' '-_' | tr -d '='
}
jwt_header=$(echo -n '{"alg":"HS256","typ":"JWT"}' | base64url)
jwt_claims=$(cat <<EOF |
{
"sun": "root",
"exp": $exp_time,
"iat": $start_time
}
EOF
jq -Mcj '.' | base64url)
# jq -Mcj => Monochrome output, compact output, join lines
jwt_signature=$(echo -n "${jwt_header}.${jwt_claims}" |
openssl dgst -sha256 -hmac "$secret_key" -binary | base64url)
# Use the same colours as jwt.io, more-or-less.
echo "$(tput setaf 1)${jwt_header}$(tput sgr0).$(tput setaf 5)${jwt_claims}$(tput sgr0).$(tput setaf 6)${jwt_signature}$(tput sgr0)"
jwt="${jwt_header}.${jwt_claims}.${jwt_signature}"
echo $jwt | cat >/.secret/jwt_token.txt
chmod 777 /.secret/jwt_token.txt
}
# run slurmctld
_slurmctld() {
cd /root/rpmbuild/RPMS/aarch64
yum -y --nogpgcheck localinstall slurm-22.05.6-1.el8.aarch64.rpm \
slurm-perlapi-22.05.6-1.el8.aarch64.rpm \
slurm-slurmd-22.05.6-1.el8.aarch64.rpm \
slurm-torque-22.05.6-1.el8.aarch64.rpm \
slurm-slurmctld-22.05.6-1.el8.aarch64.rpm
cd /root/rpmbuild/RPMS/$ARCH
yum -y --nogpgcheck localinstall slurm-$SLURM_VERSION*.$ARCH.rpm \
slurm-perlapi-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmd-$SLURM_VERSION*.$ARCH.rpm \
slurm-torque-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmctld-$SLURM_VERSION*.$ARCH.rpm
echo "checking for slurmdbd.conf"
while [ ! -f /.secret/slurmdbd.conf ]; do
echo -n "."
echo "."
sleep 1
done
echo ""
mkdir -p /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm /etc/slurm
chown -R slurm: /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
mkdir -p /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm /etc/slurm /var/run/slurm/d /var/run/slurm/ctld /var/lib/slurm/d /var/lib/slurm/ctld
chown -R slurm: /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm /var/spool /var/lib /var/run/slurm/d /var/run/slurm/ctld /var/lib/slurm/d /var/lib/slurm/ctld
mkdir -p /etc/config
chown -R slurm: /etc/config
touch /var/log/slurmctld.log
chown slurm: /var/log/slurmctld.log
chown -R slurm: /var/log/slurmctld.log
touch /var/log/slurmd.log
chown -R slurm: /var/log/slurmd.log
touch /var/lib/slurm/d/job_state
chown -R slurm: /var/lib/slurm/d/job_state
touch /var/lib/slurm/d/fed_mgr_state
chown -R slurm: /var/lib/slurm/d/fed_mgr_state
touch /var/run/slurm/d/slurmctld.pid
chown -R slurm: /var/run/slurm/d/slurmctld.pid
touch /var/run/slurm/d/slurmd.pid
chown -R slurm: /var/run/slurm/d/slurmd.pid
if [[ ! -f /home/config/slurm.conf ]]; then
echo "### Missing slurm.conf ###"
exit
@ -95,15 +181,43 @@ _slurmctld() {
chown slurm: /etc/slurm/slurm.conf
chmod 600 /etc/slurm/slurm.conf
fi
sacctmgr -i add cluster "snowflake"
sudo yum install -y nc
sudo yum install -y procps
sudo yum install -y iputils
sudo yum install -y lsof
sudo yum install -y jq
_openssl_jwt_key
if [ ! -f /.secret/jwt_hs256.key ]; then
echo "### Missing jwt.key ###"
exit 1
else
cp /.secret/jwt_hs256.key /etc/config/jwt_hs256.key
chown slurm: /etc/config/jwt_hs256.key
chmod 0600 /etc/config/jwt_hs256.key
fi
_generate_jwt_token
while ! nc -z slurmdbd 6819; do
echo "Waiting for slurmdbd to be ready..."
sleep 2
done
sacctmgr -i add cluster name=linux
sleep 2s
echo "Starting slurmctld"
echo "Starting slurmctld"
cp -f /etc/slurm/slurm.conf /.secret/
/usr/sbin/slurmctld
/usr/sbin/slurmctld -Dvv
echo "Started slurmctld"
}
### main ###
_delete_secrets
_sshd_host
_ssh_worker
_munge_start
_copy_secrets

108
slurm/controller/slurm.conf Normal file

@ -0,0 +1,108 @@
# slurm.conf
#
# See the slurm.conf man page for more information.
#
ClusterName=linux
ControlMachine=slurmctld
ControlAddr=slurmctld
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/lib/slurm/d
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm/d/slurmctld.pid
SlurmdPidFile=/var/run/slurm/d/slurmd.pid
ProctrackType=proctrack/linuxproc
AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/var/spool/slurm/statesave/jwt_hs256.key
#PluginDir=
#CacheGroups=0
#FirstJobId=
ReturnToService=0
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/affinity
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
# SelectType=select/con_res
SelectTypeParameters=CR_CPU_Memory
# FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=6
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=6
SlurmdLogFile=/var/log/slurm/slurmd.log
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/jobcomp.log
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherType=jobacct_gather/cgroup
#ProctrackType=proctrack/cgroup
JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdbd
AccountingStoragePort=6819
#AccountingStorageLoc=slurm_acct_db
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
PartitionName=DEFAULT Nodes=node01
PartitionName=debug Nodes=node01 Default=YES MaxTime=INFINITE State=UP
# # COMPUTE NODES
# NodeName=c[1-2] RealMemory=1000 State=UNKNOWN
NodeName=node01 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1
# #
# # PARTITIONS
# PartitionName=normal Default=yes Nodes=c[1-2] Priority=50 DefMemPerCPU=500 Shared=NO MaxNodes=2 MaxTime=5-00:00:00 DefaultTime=5-00:00:00 State=UP
#PrEpPlugins=pika

@ -1,5 +1,5 @@
FROM clustercockpit/slurm.base:22.05.6
MAINTAINER Jan Eitzinger <jan.eitzinger@fau.de>
FROM clustercockpit/slurm.base:24.05.3
LABEL org.opencontainers.image.authors="jan.eitzinger@fau.de"
# clean up
RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
@ -7,4 +7,5 @@ RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
&& rm -rf /var/cache/yum
COPY docker-entrypoint.sh /docker-entrypoint.sh
CMD ["/usr/sbin/init"]
ENTRYPOINT ["/docker-entrypoint.sh"]

@ -1,6 +1,10 @@
#!/usr/bin/env bash
set -e
# Determine the system architecture dynamically
ARCH=$(uname -m)
SLURM_VERSION="24.05.3"
SLURM_JWT=daemon
SLURM_ACCT_DB_SQL=/slurm_acct_db.sql
# start sshd server
@ -48,12 +52,16 @@ _wait_for_worker() {
# run slurmdbd
_slurmdbd() {
cd /root/rpmbuild/RPMS/aarch64
yum -y --nogpgcheck localinstall slurm-22.05.6-1.el8.aarch64.rpm \
slurm-perlapi-22.05.6-1.el8.aarch64.rpm \
slurm-slurmdbd-22.05.6-1.el8.aarch64.rpm
cd /root/rpmbuild/RPMS/$ARCH
yum -y --nogpgcheck localinstall slurm-$SLURM_VERSION*.$ARCH.rpm \
slurm-perlapi-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmdbd-$SLURM_VERSION*.$ARCH.rpm
mkdir -p /var/spool/slurm/d /var/log/slurm /etc/slurm
chown slurm: /var/spool/slurm/d /var/log/slurm
chown -R slurm: /var/spool/slurm/d /var/log/slurm
mkdir -p /etc/config
chown -R slurm: /etc/config
if [[ ! -f /home/config/slurmdbd.conf ]]; then
echo "### Missing slurmdbd.conf ###"
exit
@ -62,10 +70,31 @@ _slurmdbd() {
cp /home/config/slurmdbd.conf /etc/slurm/slurmdbd.conf
chown slurm: /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
cp /etc/slurm/slurmdbd.conf /.secret/slurmdbd.conf
fi
echo "checking for jwt.key"
while [ ! -f /.secret/jwt_hs256.key ]; do
echo "."
sleep 1
done
mkdir -p /var/spool/slurm/statesave
chown slurm:slurm /var/spool/slurm/statesave
chmod 0755 /var/spool/slurm/statesave
cp /.secret/jwt_hs256.key /var/spool/slurm/statesave/jwt_hs256.key
chown slurm: /var/spool/slurm/statesave/jwt_hs256.key
chmod 0600 /var/spool/slurm/statesave/jwt_hs256.key
echo ""
sudo yum install -y nc procps iputils
echo "Starting slurmdbd"
cp /etc/slurm/slurmdbd.conf /.secret/slurmdbd.conf
/usr/sbin/slurmdbd
/usr/sbin/slurmdbd -Dvv
echo "Started slurmdbd"
}
### main ###
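
After this entrypoint has brought slurmdbd up on port 6819, a hedged way to confirm it is reachable and that the controller's `sacctmgr -i add cluster` call went through (service names are assumptions based on this setup):

```bash
# slurmdbd listens on 6819 inside its container; nc is installed by the script above.
docker-compose exec slurmdbd nc -z localhost 6819 && echo "slurmdbd is listening"
# The controller container has slurm.conf pointing at slurmdbd, so sacctmgr works from there.
docker-compose exec slurmctld sacctmgr -n -p show cluster
```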

@ -1,3 +1,8 @@
#
# Example slurmdbd.conf file.
#
# See the slurmdbd.conf man page for more information.
#
# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
@ -8,16 +13,19 @@
#
# Authentication info
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
#
#AuthInfo=/var/run/munge/munge.socket.2
AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/var/spool/slurm/statesave/jwt_hs256.key
# slurmDBD info
DbdAddr=slurmdb
DbdHost=slurmdb
DbdAddr=slurmdbd
DbdHost=slurmdbd
DbdPort=6819
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=4
#DefaultQOS=normal,standby
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
# PidFile=/var/run/slurmdbd/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
@ -25,7 +33,6 @@ PidFile=/var/run/slurmdbd.pid
# Database info
StorageType=accounting_storage/mysql
StorageHost=mariadb
StoragePort=3306
StoragePass=demo
StorageUser=slurm
StoragePass=demo
StorageLoc=slurm_acct_db
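
The storage settings above point slurmdbd at the `mariadb` service with user `slurm`, password `demo`, and database `slurm_acct_db`. A hedged check that the accounting database was actually created (the client binary may be `mysql` or `mariadb`, depending on the image):

```bash
docker-compose exec mariadb mysql -uslurm -pdemo -e 'SHOW DATABASES LIKE "slurm_acct_db"'
```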

@ -1,5 +1,5 @@
FROM clustercockpit/slurm.base:22.05.6
MAINTAINER Jan Eitzinger <jan.eitzinger@fau.de>
FROM clustercockpit/slurm.base:24.05.3
LABEL org.opencontainers.image.authors="jan.eitzinger@fau.de"
# clean up
RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
@ -7,4 +7,5 @@ RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
&& rm -rf /var/cache/yum
COPY docker-entrypoint.sh /docker-entrypoint.sh
CMD ["/usr/sbin/init"]
ENTRYPOINT ["/docker-entrypoint.sh"]

@ -1,108 +1,142 @@
#!/usr/bin/env bash
set -e
# Determine the system architecture dynamically
ARCH=$(uname -m)
SLURM_VERSION="24.05.3"
# SLURMRESTD="/tmp/slurmrestd.socket"
SLURM_JWT=daemon
# start sshd server
_sshd_host() {
if [ ! -d /var/run/sshd ]; then
mkdir /var/run/sshd
ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -N ''
fi
/usr/sbin/sshd
}
# setup worker ssh to be passwordless
_ssh_worker() {
if [[ ! -d /home/worker ]]; then
mkdir -p /home/worker
chown -R worker:worker /home/worker
if [ ! -d /var/run/sshd ]; then
mkdir /var/run/sshd
ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -N ''
fi
cat > /home/worker/setup-worker-ssh.sh <<EOF2
mkdir -p ~/.ssh
chmod 0700 ~/.ssh
ssh-keygen -b 2048 -t rsa -f ~/.ssh/id_rsa -q -N "" -C "$(whoami)@$(hostname)-$(date -I)"
cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
chmod 0640 ~/.ssh/authorized_keys
cat >> ~/.ssh/config <<EOF
Host *
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
LogLevel QUIET
EOF
chmod 0644 ~/.ssh/config
cd ~/
tar -czvf ~/worker-secret.tar.gz .ssh
cd -
EOF2
chmod +x /home/worker/setup-worker-ssh.sh
chown worker: /home/worker/setup-worker-ssh.sh
sudo -u worker /home/worker/setup-worker-ssh.sh
/usr/sbin/sshd
}
# start munge and generate key
_munge_start() {
# start munge using existing key
_munge_start_using_key() {
if [ ! -f /.secret/munge.key ]; then
echo -n "checking for munge.key"
while [ ! -f /.secret/munge.key ]; do
echo -n "."
sleep 1
done
echo ""
fi
cp /.secret/munge.key /etc/munge/munge.key
chown -R munge: /etc/munge /var/lib/munge /var/log/munge /var/run/munge
chmod 0700 /etc/munge
chmod 0711 /var/lib/munge
chmod 0700 /var/log/munge
chmod 0755 /var/run/munge
/sbin/create-munge-key -f
rngd -r /dev/urandom
/usr/sbin/create-munge-key -r -f
sh -c "dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key"
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
sudo -u munge /sbin/munged
munge -n
munge -n | unmunge
remunge
}
# copy secrets to /.secret directory for other nodes
_copy_secrets() {
cp /home/worker/worker-secret.tar.gz /.secret/worker-secret.tar.gz
cp /home/worker/setup-worker-ssh.sh /.secret/setup-worker-ssh.sh
cp /etc/munge/munge.key /.secret/munge.key
rm -f /home/worker/worker-secret.tar.gz
rm -f /home/worker/setup-worker-ssh.sh
_enable_slurmrestd() {
cat >/usr/lib/systemd/system/slurmrestd.service <<EOF
[Unit]
Description=Slurm REST daemon
After=network-online.target slurmctld.service
Wants=network-online.target
ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmrestd
EnvironmentFile=-/etc/default/slurmrestd
# slurmrestd should not run as root or the slurm user.
# Please either use the -u and -g options in /etc/sysconfig/slurmrestd or
# /etc/default/slurmrestd, or explicitly set the User and Group in this file
# to an unprivileged user to run as.
User=slurm
Restart=always
RestartSec=5
# Group=
# Default to listen on both socket and slurmrestd port
ExecStart=/usr/sbin/slurmrestd -f /etc/config/slurmrestd.conf -a rest_auth/jwt $SLURMRESTD_OPTIONS -vvvvvv -s dbv0.0.39,v0.0.39 0.0.0.0:6820
# Enable auth/jwt by default; comment out the line to disable it for slurmrestd
Environment="SLURM_JWT=daemon"
ExecReload=/bin/kill -HUP $MAINPID
[Install]
WantedBy=multi-user.target
EOF
}
# run slurmctld
_slurmctld() {
cd /root/rpmbuild/RPMS/aarch64
yum -y --nogpgcheck localinstall slurm-22.05.6-1.el8.aarch64.rpm \
slurm-perlapi-22.05.6-1.el8.aarch64.rpm \
slurm-slurmd-22.05.6-1.el8.aarch64.rpm \
slurm-torque-22.05.6-1.el8.aarch64.rpm \
slurm-slurmctld-22.05.6-1.el8.aarch64.rpm \
slurm-slurmrestd-22.05.6-1.el8.aarch64.rpm
# run slurmrestd
_slurmrestd() {
cd /root/rpmbuild/RPMS/$ARCH
yum -y --nogpgcheck localinstall slurm-$SLURM_VERSION*.$ARCH.rpm \
slurm-perlapi-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmd-$SLURM_VERSION*.$ARCH.rpm \
slurm-torque-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmctld-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmrestd-$SLURM_VERSION*.$ARCH.rpm
echo -n "checking for slurmdbd.conf"
while [ ! -f /.secret/slurmdbd.conf ]; do
echo -n "."
sleep 1
done
echo ""
mkdir -p /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm /etc/slurm
chown -R slurm: /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
touch /var/log/slurmctld.log
chown slurm: /var/log/slurmctld.log
if [[ ! -f /home/config/slurm.conf ]]; then
mkdir -p /etc/config /var/spool/slurm /var/spool/slurm/restd /var/spool/slurm/restd/rest /var/run/slurm
chown -R slurm: /etc/config /var/spool/slurm /var/spool/slurm/restd /var/spool/slurm/restd/rest /var/run/slurm
chmod 755 /var/run/slurm
touch /var/log/slurmrestd.log
chown slurm: /var/log/slurmrestd.log
if [[ ! -f /home/config/slurmrestd.conf ]]; then
echo "### Missing slurm.conf ###"
exit
else
echo "### use provided slurm.conf ###"
cp /home/config/slurm.conf /etc/slurm/slurm.conf
echo "### use provided slurmrestd.conf ###"
cp /home/config/slurmrestd.conf /etc/config/slurmrestd.conf
cp /home/config/slurm.conf /etc/config/slurm.conf
fi
sacctmgr -i add cluster "snowflake"
echo "checking for jwt.key"
while [ ! -f /.secret/jwt_hs256.key ]; do
echo "."
sleep 1
done
sudo yum install -y nc
sudo yum install -y procps
sudo yum install -y iputils
sudo yum install -y lsof
sudo yum install -y socat
mkdir -p /var/spool/slurm/statesave
chown slurm:slurm /var/spool/slurm/statesave
chmod 0755 /var/spool/slurm/statesave
cp /.secret/jwt_hs256.key /var/spool/slurm/statesave/jwt_hs256.key
chown slurm: /var/spool/slurm/statesave/jwt_hs256.key
chmod 0400 /var/spool/slurm/statesave/jwt_hs256.key
echo ""
sleep 2s
/usr/sbin/slurmctld
cp -f /etc/slurm/slurm.conf /.secret/
echo "Starting slurmrestd"
# _enable_slurmrestd
# sudo ln -s /usr/lib/systemd/system/slurmrestd.service /etc/systemd/system/multi-user.target.wants/slurmrestd.service
SLURMRESTD_SECURITY=disable_user_check SLURMRESTD_DEBUG=9 /usr/sbin/slurmrestd -f /etc/config/slurmrestd.conf -a rest_auth/jwt -s dbv0.0.39,v0.0.39 -u slurm 0.0.0.0:6820
echo "Started slurmrestd"
}
### main ###
_sshd_host
_ssh_worker
_munge_start
_copy_secrets
_slurmctld
_munge_start_using_key
_slurmrestd
tail -f /dev/null
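
With slurmrestd bound to `0.0.0.0:6820` and the `v0.0.39` OpenAPI plugins loaded, the service can be exercised end to end with a JWT. This is a hedged sketch; it assumes port 6820 is published to the host for the slurmrestd service:

```bash
# Mint a token on the controller, then call the REST API on the published port.
export $(docker-compose exec -T slurmctld scontrol token username=slurm lifespan=3600)
curl -s \
  -H "X-SLURM-USER-NAME: slurm" \
  -H "X-SLURM-USER-TOKEN: $SLURM_JWT" \
  http://localhost:6820/slurm/v0.0.39/diag | head
```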

@ -0,0 +1,4 @@
#
# Example slurmrestd.conf file.
#
include /etc/config/slurm.conf

@ -1,5 +1,5 @@
FROM clustercockpit/slurm.base:22.05.6
MAINTAINER Jan Eitzinger <jan.eitzinger@fau.de>
FROM clustercockpit/slurm.base:24.05.3
LABEL org.opencontainers.image.authors="jan.eitzinger@fau.de"
# clean up
RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
@ -8,4 +8,5 @@ RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
WORKDIR /home/worker
COPY docker-entrypoint.sh /docker-entrypoint.sh
CMD ["/usr/sbin/init"]
ENTRYPOINT ["/docker-entrypoint.sh"]

@ -1,3 +1,4 @@
CgroupPlugin=disabled
ConstrainCores=yes
ConstrainDevices=no
ConstrainRAMSpace=yes
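
With `CgroupPlugin=disabled` here (and `TaskPlugin=task/affinity` plus `ProctrackType=proctrack/linuxproc` in slurm.conf), the `Constrain*` settings above are not enforced inside the unprivileged worker container, which keeps the development setup simple. A hedged smoke test that node01 actually runs jobs once everything is up:

```bash
# Submit a trivial job from the controller to the debug partition served by node01.
docker-compose exec slurmctld srun -N1 -p debug hostname
```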

@ -1,6 +1,10 @@
#!/usr/bin/env bash
set -e
# Determine the system architecture dynamically
ARCH=$(uname -m)
SLURM_VERSION="24.05.3"
# start sshd server
_sshd_host() {
if [ ! -d /var/run/sshd ]; then
@ -12,6 +16,10 @@ _sshd_host() {
# start munge using existing key
_munge_start_using_key() {
sudo yum install -y nc
sudo yum install -y procps
sudo yum install -y iputils
echo -n "cheking for munge.key"
while [ ! -f /.secret/munge.key ]; do
echo -n "."
@ -32,49 +40,67 @@ _munge_start_using_key() {
# wait for worker user in shared /home volume
_wait_for_worker() {
echo "checking for id_rsa.pub"
if [ ! -f /home/worker/.ssh/id_rsa.pub ]; then
echo -n "checking for id_rsa.pub"
echo "checking for id_rsa.pub"
while [ ! -f /home/worker/.ssh/id_rsa.pub ]; do
echo -n "."
sleep 1
done
echo ""
fi
echo "done checking for id_rsa.pub"
}
_start_dbus() {
dbus-uuidgen > /var/lib/dbus/machine-id
mkdir -p /var/run/dbus
dbus-daemon --config-file=/usr/share/dbus-1/system.conf --print-address
dbus-uuidgen >/var/lib/dbus/machine-id
mkdir -p /var/run/dbus
dbus-daemon --config-file=/usr/share/dbus-1/system.conf --print-address
}
# run slurmd
_slurmd() {
cd /root/rpmbuild/RPMS/aarch64
yum -y --nogpgcheck localinstall slurm-22.05.6-1.el8.aarch64.rpm \
slurm-perlapi-22.05.6-1.el8.aarch64.rpm \
slurm-slurmd-22.05.6-1.el8.aarch64.rpm \
slurm-torque-22.05.6-1.el8.aarch64.rpm
if [ ! -f /.secret/slurm.conf ]; then
echo -n "checking for slurm.conf"
while [ ! -f /.secret/slurm.conf ]; do
echo -n "."
sleep 1
done
echo ""
fi
mkdir -p /var/spool/slurm/d /etc/slurm
chown slurm: /var/spool/slurm/d
cp /home/config/cgroup.conf /etc/slurm/cgroup.conf
chown slurm: /etc/slurm/cgroup.conf
chmod 600 /etc/slurm/cgroup.conf
cp /home/config/slurm.conf /etc/slurm/slurm.conf
chown slurm: /etc/slurm/slurm.conf
chmod 600 /etc/slurm/slurm.conf
touch /var/log/slurmd.log
chown slurm: /var/log/slurmd.log
echo -n "Starting slurmd"
/usr/sbin/slurmd
cd /root/rpmbuild/RPMS/$ARCH
yum -y --nogpgcheck localinstall slurm-$SLURM_VERSION*.$ARCH.rpm \
slurm-perlapi-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmd-$SLURM_VERSION*.$ARCH.rpm \
slurm-torque-$SLURM_VERSION*.$ARCH.rpm
echo "checking for slurm.conf"
if [ ! -f /.secret/slurm.conf ]; then
echo "checking for slurm.conf"
while [ ! -f /.secret/slurm.conf ]; do
echo -n "."
sleep 1
done
echo ""
fi
echo "found slurm.conf"
# sudo yum install -y nc
# sudo yum install -y procps
# sudo yum install -y iputils
mkdir -p /var/spool/slurm/d /etc/slurm /var/run/slurm/d /var/log/slurm
chown slurm: /var/spool/slurm/d /var/run/slurm/d /var/log/slurm
cp /home/config/cgroup.conf /etc/slurm/cgroup.conf
chown slurm: /etc/slurm/cgroup.conf
chmod 600 /etc/slurm/cgroup.conf
cp /home/config/slurm.conf /etc/slurm/slurm.conf
chown slurm: /etc/slurm/slurm.conf
chmod 600 /etc/slurm/slurm.conf
touch /var/log/slurm/slurmd.log
chown slurm: /var/log/slurm/slurmd.log
touch /var/run/slurm/d/slurmd.pid
chmod 600 /var/run/slurm/d/slurmd.pid
chown slurm: /var/run/slurm/d/slurmd.pid
echo "Starting slurmd"
/usr/sbin/slurmstepd infinity &
/usr/sbin/slurmd -Dvv
echo "Started slurmd"
}
### main ###