Merge pull request #2 from ClusterCockpit/dev

Preconfigured and updated docker services for CC components
This commit is contained in:
Jan Eitzinger 2025-01-31 07:09:38 +01:00 committed by GitHub
commit a945a21bc1
32 changed files with 1336 additions and 1559 deletions

30
.env

@ -2,15 +2,6 @@
# CCBACKEND DEVEL DOCKER SETTINGS # CCBACKEND DEVEL DOCKER SETTINGS
######################################################################## ########################################################################
########################################################################
# SLURM
########################################################################
SLURM_VERSION=22.05.6
ARCH=aarch64
MUNGE_UID=981
SLURM_UID=982
WORKER_UID=1000
######################################################################## ########################################################################
# INFLUXDB # INFLUXDB
######################################################################## ########################################################################
@ -22,27 +13,6 @@ INFLUXDB_BUCKET=ClusterCockpit
# Whether or not to check SSL Cert in Symfony Client, Default: false # Whether or not to check SSL Cert in Symfony Client, Default: false
INFLUXDB_SSL=false INFLUXDB_SSL=false
########################################################################
# MARIADB
########################################################################
MARIADB_ROOT_PASSWORD=root
MARIADB_DATABASE=ClusterCockpit
MARIADB_USER=clustercockpit
MARIADB_PASSWORD=clustercockpit
MARIADB_PORT=3306
#########################################
# LDAP
########################################################################
LDAP_ADMIN_PASSWORD=mashup
LDAP_ORGANISATION=NHR@FAU
LDAP_DOMAIN=rrze.uni-erlangen.de
########################################################################
# PHPMyAdmin
########################################################################
PHPMYADMIN_PORT=8081
######################################################################## ########################################################################
# INTERNAL SETTINGS # INTERNAL SETTINGS
######################################################################## ########################################################################

5
.gitignore vendored

@ -3,6 +3,11 @@ data/job-archive/**
data/influxdb data/influxdb
data/sqldata data/sqldata
data/cc-metric-store data/cc-metric-store
data/cc-metric-store-source
data/ldap
data/mariadb
data/slurm
data
cc-backend cc-backend
cc-backend/** cc-backend/**
.vscode .vscode

187
README.md Normal file → Executable file

@ -1,74 +1,175 @@
# cc-docker # cc-docker
# cc-docker # cc-docker
This is a `docker-compose` setup which provides a quick-to-start environment for ClusterCockpit development and testing, using `cc-backend`. This is a `docker-compose` setup which provides a quick-to-start environment for ClusterCockpit development and testing, using `cc-backend`.
A number of services are readily available as docker containers (nats, cc-metric-store, InfluxDB, LDAP), or easily added by manual configuration (MySQL). A number of services are readily available as docker containers (nats, cc-metric-store, InfluxDB, LDAP, SLURM), or easily added by manual configuration (MariaDB).
It includes the following containers: It includes the following containers:
* nats (Default) |Service full name|docker service name|port|
* cc-metric-store (Default) | --- | --- | --- |
* influxdb (Default) |Slurm Controller service|slurmctld|6817|
* openldap (Default) |Slurm Database service|slurmdbd|6819|
* mysql (Optional) |Slurm Rest service with JWT authentication|slurmrestd|6820|
* mariadb (Optional) |Slurm Worker|node01|6818|
* phpmyadmin (Optional) |MariaDB service|mariadb|3306|
|InfluxDB service|influxdb|8086|
|NATS service|nats|4222, 6222, 8222|
|cc-metric-store service|cc-metric-store|8084|
|OpenLDAP|openldap|389, 636|
The setup comes with fixture data for a Job archive, cc-metric-store checkpoints, InfluxDB, MySQL, and a LDAP user directory. The setup comes with fixture data for a Job archive, cc-metric-store checkpoints, InfluxDB, MariaDB, and a LDAP user directory.
## Known Issues ## Prerequisites
* `docker-compose` installed on Ubuntu (18.04, 20.04) via `apt-get` can not correctly parse `docker-compose.yml` due to version differences. Install latest version of `docker-compose` from https://docs.docker.com/compose/install/ instead. For all the docker services to work correctly, you will need the following tools installed:
* You need to ensure that no other web server is running on ports 8080 (cc-backend), 8081 (phpmyadmin), 8084 (cc-metric-store), 8086 (nfluxDB), 4222 and 8222 (Nats), or 3306 (MySQL). If one or more ports are already in use, you habe to adapt the related config accordingly.
* Existing VPN connections sometimes cause problems with docker. If `docker-compose` does not start up correctly, try disabling any active VPN connection. Refer to https://stackoverflow.com/questions/45692255/how-make-openvpn-work-with-docker for further information.
## Configuration Templates 1. `docker` and `docker-compose`
2. `golang` (for compiling cc-metric-store)
3. `perl` (for migrateTimestamps.pl) with the Cpanel::JSON::XS, Data::Dumper, Time::Piece, Sort::Versions and File::Slurp Perl modules.
4. `npm` (for cc-backend)
5. `make` (for building slurm base image)
Located in `./templates` It is also recommended to add your user to the docker group, since the setupDev.sh script assumes that docker and docker-compose commands can be run without an explicit `sudo`.
* `docker-compose.yml.default`: Docker-Compose file to setup cc-metric-store, InfluxDB, MariaDB, PhpMyadmin, and LDAP containers (Default). Used in `setupDev.sh`.
* `docker-compose.yml.mysql`: Docker-Compose configuration template if MySQL is desired instead of MariaDB. You can use:
* `env.default`: Environment variables for setup with cc-metric-store, InfluxDB, MariaDB, PhpMyadmin, and LDAP containers (Default). Used in `setupDev.sh`.
* `env.mysql`: Additional environment variables required if MySQL is desired instead of MariaDB. ```
sudo groupadd docker
sudo usermod -aG docker $USER
# restart after adding your user to the docker group
sudo shutdown -r -t 0
```
Note: You can install all these dependencies via predefined installation steps in `prerequisite_installation_script.sh`.
If you are using a different Linux distribution, you will have to adapt `prerequisite_installation_script.sh` as well as `setupDev.sh`.
## Setup ## Setup
1. Clone `cc-backend` repository in chosen base folder: `$> git clone https://github.com/ClusterCockpit/cc-backend.git` 1. Clone `cc-backend` repository in chosen base folder: `$> git clone https://github.com/ClusterCockpit/cc-backend.git`
2. Run `$ ./setupDev.sh`: **NOTICE** The script will download files of a total size of 338MB (mostly for the InfluxDB data). 2. Run `$ ./setupDev.sh`: **NOTICE** The script will download files of a total size of 338MB (mostly for the cc-metric-store data).
3. The setup-script launches the supporting container stack in the background automatically if everything went well. Run `$> ./cc-backend/cc-backend` to start `cc-backend.` 3. The setup-script launches the supporting container stack in the background automatically if everything went well. Run `$> ./cc-backend/cc-backend -server -dev` to start `cc-backend`.
4. By default, you can access `cc-backend` in your browser at `http://localhost:8080`. You can shut down the cc-backend server by pressing `CTRL-C`, remember to also shut down all containers via `$> docker-compose down` afterwards. 4. By default, you can access `cc-backend` in your browser at `http://localhost:8080`. You can shut down the cc-backend server by pressing `CTRL-C`, remember to also shut down all containers via `$> docker-compose down` afterwards.
5. You can restart the containers with: `$> docker-compose up -d`. 5. You can restart the containers with: `$> docker-compose up -d`.
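If you prefer a single copy-paste block, the setup steps above condense to the following sketch (assuming the prerequisites from the previous section are already installed):
```
# clone cc-backend into the cc-docker base folder
git clone https://github.com/ClusterCockpit/cc-backend.git

# generate fixture data, build and start the container stack, initialize cc-backend
./setupDev.sh

# start cc-backend in development mode, then open http://localhost:8080
./cc-backend/cc-backend -server -dev

# when finished: stop cc-backend with CTRL-C and shut down the containers
docker-compose down
```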
## Post-Setup Adjustment for using `influxdb` ## Credentials for logging into ClusterCockpit
When using `influxdb` as a metric database, one must adjust the following files:
* `cc-backend/var/job-archive/emmy/cluster.json`
* `cc-backend/var/job-archive/woody/cluster.json`
In the JSON, exchange the content of the `metricDataRepository`-Entry (By default configured for `cc-metric-store`) with:
```
"metricDataRepository": {
"kind": "influxdb",
"url": "http://localhost:8086",
"token": "egLfcf7fx0FESqFYU3RpAAbj",
"bucket": "ClusterCockpit",
"org": "ClusterCockpit",
"skiptls": false
}
```
## Usage
Credentials for the preconfigured demo user are: Credentials for the preconfigured demo user are:
* User: `demo` * User: `demo`
* Password: `AdminDev` * Password: `demo`
Credentials for the preconfigured LDAP user are:
* User: `ldapuser`
* Password: `ldapuser`
You can also login as regular user using any credential in the LDAP user directory at `./data/ldap/users.ldif`. You can also login as regular user using any credential in the LDAP user directory at `./data/ldap/users.ldif`.
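As a quick check that the LDAP fixtures were loaded, you can query the directory from the host. This is only a sketch: it assumes the OpenLDAP defaults from `docker-compose.yml` (admin password `mashup`, domain `example.com`) and that the `ldapsearch` client is installed locally.
```
ldapsearch -x -H ldap://localhost:389 \
  -D "cn=admin,dc=example,dc=com" -w mashup \
  -b "ou=users,dc=example,dc=com" "(objectClass=posixAccount)" uid
```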
## Preconfigured setup between docker services and ClusterCockpit components
Once you have cloned the cc-backend repo and run `setupDev.sh`, the script copies the preconfigured `misc/config.json` over `cc-backend/config.json`; this is the configuration cc-backend uses once you start the server.
The preconfigured config.json attaches to:
1. MariaDB docker service on port 3306 (database: ccbackend)
2. OpenLDAP docker service on port 389
3. cc-metric-store docker service on port 8084
cc-metric-store also has a preconfigured `config.json` in `cc-metric-store/config.json`, which attaches to the NATS docker service on port 4222 and subscribes to the topic 'hpc-nats'.
Basically, all the ClusterCockpit components and the docker services plug into each other like Lego pieces.
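A minimal sanity check that these preconfigured ports are reachable once the stack is up (a sketch, assuming `nc` is available on the host; ports taken from the service table above):
```
for port in 3306 389 8084 4222; do
  nc -z localhost "$port" && echo "port $port open" || echo "port $port closed"
done
```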
## Docker commands to access the services
> Note: You need to be in the cc-docker directory in order to execute any docker command
You can view all running docker services by using this command:
```
$ docker ps
```
Now that you can see the docker services, if you want to access any of them manually, you have to run a **`bash`** shell inside the running container.
> **`Example`**: If you want to run slurm commands like `sinfo`, `squeue` or `scontrol` on the slurm controller, you cannot access it directly from the host.
You need to **`bash`** into the running service by using the following command:
```
$ docker exec -it <docker service name> bash
#example
$ docker exec -it slurmctld bash
#or
$ docker exec -it mariadb bash
```
Once you have started a **`bash`** shell in a docker service, you can execute any service-related commands in that shell.
For ClusterCockpit development, however, you only need the exposed ports to access these docker services: use `localhost:<port>` when accessing any docker service from the host, and configure `cc-backend/config.json` based on these services and ports.
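To see which host ports the running containers actually expose, you can ask docker directly, for example:
```
docker ps --format "table {{.Names}}\t{{.Ports}}"
```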
## Slurm setup in cc-docker
### 1. Slurm controller
Currently the slurm controller is aware of the one node that we have set up in our mini cluster, i.e. node01.
In order to execute slurm commands, you may need to **`bash`** into the **`slurmctld`** docker service.
```
$ docker exec -it slurmctld bash
```
Then you can run slurm controller commands. A few examples (output omitted):
```
$ sinfo
or
$ squeue
or
$ scontrol show nodes
```
### 2. Slurm rest service
You do not need to **`bash`** into the slurmrestd service; you can access the REST API directly via localhost:6820. A simple example of how to curl the slurm REST API is given in `curl_slurmrestd.sh`.
You can use `curl_slurmrestd.sh` directly; it reads a never-expiring JWT token from `data/slurm/secret/jwt_token.txt`.
You may also use that token directly from the file in any of your own curl commands.
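For reference, the request performed by `curl_slurmrestd.sh` looks like this (the token file is generated by the slurmctld container on startup):
```
SLURM_JWT=$(cat data/slurm/secret/jwt_token.txt)
curl -s 'http://localhost:6820/slurm/v0.0.39/node/node01' \
  -H "X-SLURM-USER-NAME: root" \
  -H "X-SLURM-USER-TOKEN: $SLURM_JWT"
```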
## Known Issues
* `docker-compose` installed on Ubuntu (18.04, 20.04) via `apt-get` can not correctly parse `docker-compose.yml` due to version differences. Install latest version of `docker-compose` from https://docs.docker.com/compose/install/ instead.
* You need to ensure that no other web server is running on ports 8080 (cc-backend), 8084 (cc-metric-store), 8086 (InfluxDB), 4222 and 8222 (Nats), or 3306 (MariaDB). If one or more ports are already in use, you have to adapt the related config accordingly.
* Existing VPN connections sometimes cause problems with docker. If `docker-compose` does not start up correctly, try disabling any active VPN connection. Refer to https://stackoverflow.com/questions/45692255/how-make-openvpn-work-with-docker for further information.
## Docker services and restarting the services
You can find all the docker services in `docker-compose.yml`. Feel free to modify it.
Whenever you modify it, please use
```
$ docker compose down
```
in order to shut down all the services and then start them all again by using
```
$ docker compose up
```
TODO: Update job archive and all other metric data. TODO: Update job archive and all other metric data.
The job archive with 1867 jobs originates from the second half of 2020. The job archive with 1867 jobs originates from the second half of 2020.
Roughly 2700 jobs from the first week of 2021 are loaded with data from InfluxDB. Roughly 2700 jobs from the first week of 2021 are loaded with data from InfluxDB.
Some views of ClusterCockpit (e.g. the Users view) show the last week or month. Some views of ClusterCockpit (e.g. the Users view) show the last week or month.
To show some data there you have to set the filter to time periods with jobs (August 2020 to January 2021). To show some data there you have to set the filter to time periods with jobs (August 2020 to January 2021).

View File

@ -1,10 +1,12 @@
FROM golang:1.17 FROM golang:1.22.4
RUN apt-get update RUN apt-get update
RUN apt-get -y install git RUN apt-get -y install git
RUN rm -rf /cc-metric-store
RUN git clone https://github.com/ClusterCockpit/cc-metric-store.git /cc-metric-store RUN git clone https://github.com/ClusterCockpit/cc-metric-store.git /cc-metric-store
RUN cd /cc-metric-store && go build RUN cd /cc-metric-store && go build ./cmd/cc-metric-store
# Reactivate when latest commit is available # Reactivate when latest commit is available
#RUN go get -d -v github.com/ClusterCockpit/cc-metric-store #RUN go get -d -v github.com/ClusterCockpit/cc-metric-store

View File

@ -1,28 +1,201 @@
{ {
"metrics": { "metrics": {
"clock": { "frequency": 60, "aggregation": null, "scope": "node" }, "debug_metric": {
"cpi": { "frequency": 60, "aggregation": null, "scope": "node" }, "frequency": 60,
"cpu_load": { "frequency": 60, "aggregation": null, "scope": "node" }, "aggregation": "avg"
"flops_any": { "frequency": 60, "aggregation": null, "scope": "node" }, },
"flops_dp": { "frequency": 60, "aggregation": null, "scope": "node" }, "clock": {
"flops_sp": { "frequency": 60, "aggregation": null, "scope": "node" }, "frequency": 60,
"ib_bw": { "frequency": 60, "aggregation": null, "scope": "node" }, "aggregation": "avg"
"lustre_bw": { "frequency": 60, "aggregation": null, "scope": "node" }, },
"mem_bw": { "frequency": 60, "aggregation": null, "scope": "node" }, "cpu_idle": {
"mem_used": { "frequency": 60, "aggregation": null, "scope": "node" }, "frequency": 60,
"rapl_power": { "frequency": 60, "aggregation": null, "scope": "node" } "aggregation": "avg"
},
"cpu_iowait": {
"frequency": 60,
"aggregation": "avg"
},
"cpu_irq": {
"frequency": 60,
"aggregation": "avg"
},
"cpu_system": {
"frequency": 60,
"aggregation": "avg"
},
"cpu_user": {
"frequency": 60,
"aggregation": "avg"
},
"nv_mem_util": {
"frequency": 60,
"aggregation": "avg"
},
"nv_temp": {
"frequency": 60,
"aggregation": "avg"
},
"nv_sm_clock": {
"frequency": 60,
"aggregation": "avg"
},
"acc_utilization": {
"frequency": 60,
"aggregation": "avg"
},
"acc_mem_used": {
"frequency": 60,
"aggregation": "sum"
},
"acc_power": {
"frequency": 60,
"aggregation": "sum"
},
"flops_any": {
"frequency": 60,
"aggregation": "sum"
},
"flops_dp": {
"frequency": 60,
"aggregation": "sum"
},
"flops_sp": {
"frequency": 60,
"aggregation": "sum"
},
"ib_recv": {
"frequency": 60,
"aggregation": "sum"
},
"ib_xmit": {
"frequency": 60,
"aggregation": "sum"
},
"ib_recv_pkts": {
"frequency": 60,
"aggregation": "sum"
},
"ib_xmit_pkts": {
"frequency": 60,
"aggregation": "sum"
},
"cpu_power": {
"frequency": 60,
"aggregation": "sum"
},
"core_power": {
"frequency": 60,
"aggregation": "sum"
},
"mem_power": {
"frequency": 60,
"aggregation": "sum"
},
"ipc": {
"frequency": 60,
"aggregation": "avg"
},
"cpu_load": {
"frequency": 60,
"aggregation": null
},
"lustre_close": {
"frequency": 60,
"aggregation": null
},
"lustre_open": {
"frequency": 60,
"aggregation": null
},
"lustre_statfs": {
"frequency": 60,
"aggregation": null
},
"lustre_read_bytes": {
"frequency": 60,
"aggregation": null
},
"lustre_write_bytes": {
"frequency": 60,
"aggregation": null
},
"net_bw": {
"frequency": 60,
"aggregation": null
},
"file_bw": {
"frequency": 60,
"aggregation": null
},
"mem_bw": {
"frequency": 60,
"aggregation": "sum"
},
"mem_cached": {
"frequency": 60,
"aggregation": null
},
"mem_used": {
"frequency": 60,
"aggregation": null
},
"net_bytes_in": {
"frequency": 60,
"aggregation": null
},
"net_bytes_out": {
"frequency": 60,
"aggregation": null
},
"nfs4_read": {
"frequency": 60,
"aggregation": null
},
"nfs4_total": {
"frequency": 60,
"aggregation": null
},
"nfs4_write": {
"frequency": 60,
"aggregation": null
},
"vectorization_ratio": {
"frequency": 60,
"aggregation": "avg"
}
}, },
"checkpoints": { "checkpoints": {
"interval": 100000000000, "interval": "12h",
"directory": "/data/checkpoints", "directory": "/data/checkpoints",
"restore": 100000000000 "restore": "48h"
}, },
"archive": { "archive": {
"interval": 100000000000, "interval": "50h",
"directory": "/data/archive" "directory": "/data/archive"
}, },
"retention-in-memory": 100000000000, "http-api": {
"http-api-address": "0.0.0.0:8081", "address": "0.0.0.0:8084",
"nats": "nats://cc-nats:4222", "https-cert-file": null,
"https-key-file": null
},
"retention-in-memory": "48h",
"nats": [
{
"address": "nats://nats:4222",
"username": "root",
"password": "root",
"subscriptions": [
{
"subscribe-to": "hpc-nats",
"cluster-tag": "fritz"
},
{
"subscribe-to": "hpc-nats",
"cluster-tag": "alex"
}
]
}
],
"jwt-public-key": "kzfYrYy+TzpanWZHJ5qSdMj5uKUWgq74BWhQG6copP0=" "jwt-public-key": "kzfYrYy+TzpanWZHJ5qSdMj5uKUWgq74BWhQG6copP0="
} }

View File

@ -1,34 +0,0 @@
#!/usr/bin/env bash
if [ -d symfony ]; then
echo "Data already initialized!"
echo -n "Perform a fresh initialisation? [yes to proceed / no to exit] "
read -r answer
if [ "$answer" == "yes" ]; then
echo "Cleaning directories ..."
rm -rf symfony
rm -rf job-archive
rm -rf influxdb/data/*
rm -rf sqldata/*
echo "done."
else
echo "Aborting ..."
exit
fi
fi
mkdir symfony
wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/job-archive_stable.tar.xz
tar xJf job-archive_stable.tar.xz
rm ./job-archive_stable.tar.xz
# 101 is the uid and gid of the user and group www-data in the cc-php container running php-fpm.
# For a demo with no new jobs it is enough to give www read permissions on that directory.
# echo "This script needs to chown the job-archive directory so that the application can write to it:"
# sudo chown -R 82:82 ./job-archive
mkdir -p influxdb/data
wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/influxdbv2-data_stable.tar.xz
cd influxdb/data
tar xJf ../../influxdbv2-data_stable.tar.xz
rm ../../influxdbv2-data_stable.tar.xz

File diff suppressed because it is too large

View File

@ -1,5 +0,0 @@
[mysqld]
innodb_buffer_pool_size=4096M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
max_allowed_packet=16M

View File

@ -1,48 +0,0 @@
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=snowflake
SlurmctldHost=slurmctld
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurm/d
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/affinity
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
# LOGGING AND ACCOUNTING
AccountingStorageHost=slurmdb
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreFlags=job_script,job_comment,job_env,job_extra
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#
# COMPUTE NODES
NodeName=node0[1-2] CPUs=1 State=UNKNOWN
PartitionName=main Nodes=ALL Default=YES MaxTime=INFINITE State=UP

139
dataGenerationScript.sh Executable file

@ -0,0 +1,139 @@
#!/bin/bash
echo ""
echo "|--------------------------------------------------------------------------------------|"
echo "| This is Data generation script for docker services |"
echo "| Starting file required by docker services in data/ |"
echo "|--------------------------------------------------------------------------------------|"
# Download unedited checkpoint files to ./data/cc-metric-store-source/checkpoints
# After this, migrateTimestamps.pl will run from setupDev.sh. This will update the timestamps
# for all the checkpoint files, which then can be read by cc-metric-store.
# cc-metric-store only reads data up to a certain age, e.g. the last 48 hours.
# These checkpoint files have timestamps older than 48 hours and need to be updated by
# migrateTimestamps.pl, which is automatically invoked from setupDev.sh.
if [ ! -d data/cc-metric-store-source ]; then
mkdir -p data/cc-metric-store-source/checkpoints
cd data/cc-metric-store-source/checkpoints
wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/cc-metric-store-checkpoints.tar.xz
tar xf cc-metric-store-checkpoints.tar.xz
rm cc-metric-store-checkpoints.tar.xz
cd ../../../
else
echo "'data/cc-metric-store-source' already exists!"
fi
# A simple configuration file for mariadb docker service.
# Required because you can specify only one database per docker service.
# This file mentions the database to be created for cc-backend.
# This file is automatically picked up by mariadb after the docker service starts.
if [ ! -d data/mariadb ]; then
mkdir -p data/mariadb
cat > data/mariadb/01.databases.sql <<EOF
CREATE DATABASE IF NOT EXISTS \`ccbackend\`;
EOF
else
echo "'data/mariadb' already exists!"
fi
# A simple configuration file for openldap docker service.
# Creates a simple user 'ldapuser' with password 'ldapuser'.
# This file is automatically picked up by openldap after the docker service starts.
if [ ! -d data/ldap ]; then
mkdir -p data/ldap
cat > data/ldap/add_users.ldif <<EOF
dn: ou=users,dc=example,dc=com
objectClass: organizationalUnit
ou: users
dn: uid=ldapuser,ou=users,dc=example,dc=com
objectClass: inetOrgPerson
objectClass: posixAccount
objectClass: top
cn: Ldap User
sn: User
uid: ldapuser
uidNumber: 1
gidNumber: 1
homeDirectory: /home/ldapuser
userPassword: {SSHA}sQRqFQtuiupej7J/rbrQrTwYEHDduV+N
EOF
else
echo "'data/ldap' already exists!"
fi
# A simple configuration file for nats docker service.
# Required because we need to execute custom commands after nats docker service starts.
# This file is automatically executed when the nats docker service starts.
# After the docker service starts, an infinite while loop publishes data for the 'fritz' and 'alex' clusters
# to the subject 'hpc-nats' every minute. Random data is generated only for node level metrics, not hardware level metrics.
if [ ! -d data/nats ]; then
mkdir -p data/nats
cat > data/nats/docker-entrypoint.sh <<EOF
#!/bin/sh
set -e
# Start NATS server in the background
nats-server --user root --pass root --http_port 8222 &
# Wait for NATS to be ready
until nc -z 0.0.0.0 4222; do
echo "Waiting for NATS to start..."
sleep 1
done
echo "NATS is up and running. Executing custom script..."
apk add curl
curl -sf https://binaries.nats.dev/nats-io/natscli/nats@latest | sh
# This is a dummy data generation loop, that inserts data for given nodes at 1 min interval
while true; do
# Timestamp in seconds
timestamp="\$(date '+%s')"
# Generate data for alex cluster. Push to sample_alex.txt
for metric in cpu_irq cpu_load mem_cached net_bytes_in cpu_user cpu_idle nfs4_read mem_used nfs4_write nfs4_total ib_xmit ib_xmit_pkts net_bytes_out cpu_iowait ib_recv cpu_system ib_recv_pkts; do
for hostname in a0603 a0903 a0832 a0329 a0702 a0122 a1624 a0731 a0224 a0704 a0631 a0225 a0222 a0427 a0603 a0429 a0833 a0705 a0901 a0601 a0227 a0804 a0322 a0226 a0126 a0129 a0605 a0801 a0934; do
echo "\$metric,cluster=alex,hostname=\$hostname,type=node value=\$((1 + RANDOM % 100)).0 \$timestamp" >>sample_alex.txt
done
done
# Nats client will publish the data from sample_alex.txt to 'hpc-nats' subject on this nats server
./nats pub hpc-nats "\$(cat sample_alex.txt)" -s nats://0.0.0.0:4222 --user root --password root
# Generate data for fritz cluster. Push to sample_fritz.txt
for metric in cpu_irq cpu_load mem_cached net_bytes_in cpu_user cpu_idle nfs4_read mem_used nfs4_write nfs4_total ib_xmit ib_xmit_pkts net_bytes_out cpu_iowait ib_recv cpu_system ib_recv_pkts; do
for hostname in f0201 f0202 f0203 f0204 f0205 f0206 f0207 f0208 f0209 f0210 f0211 f0212 f0213 f0214 f0215 f0217 f0218 f0219 f0220 f0221 f0222 f0223 f0224 f0225 f0226 f0227 f0228 f0229; do
echo "\$metric,cluster=fritz,hostname=\$hostname,type=node value=\$((1 + RANDOM % 100)).0 \$timestamp" >>sample_fritz.txt
done
done
# Nats client will publish the data from sample_fritz.txt to 'hpc-nats' subject on this nats server
./nats pub hpc-nats "\$(cat sample_fritz.txt)" -s nats://0.0.0.0:4222 --user root --password root
rm sample_alex.txt
rm sample_fritz.txt
sleep 1m
done
EOF
else
echo "'data/nats' already exists!"
fi
# prepare folders for influxdb
if [ ! -d data/influxdb ]; then
mkdir -p data/influxdb/data
mkdir -p data/influxdb/config
else
echo "'data/influxdb' already exists!"
fi
echo ""
echo "|--------------------------------------------------------------------------------------|"
echo "| Finished generating relevant files for docker services in data/ |"
echo "|--------------------------------------------------------------------------------------|"

114
docker-compose.yml Normal file → Executable file

@ -3,15 +3,19 @@ services:
container_name: nats container_name: nats
image: nats:alpine image: nats:alpine
ports: ports:
- "4222:4222" - "0.0.0.0:4222:4222"
- "8222:8222" - "0.0.0.0:8222:8222"
- "0.0.0.0:6222:6222"
volumes:
- ${DATADIR}/nats:/data
entrypoint: ["/bin/sh", "/data/docker-entrypoint.sh"]
cc-metric-store: cc-metric-store:
container_name: cc-metric-store container_name: cc-metric-store
build: build:
context: ./cc-metric-store context: ./cc-metric-store
ports: ports:
- "8084:8084" - "0.0.0.0:8084:8084"
volumes: volumes:
- ${DATADIR}/cc-metric-store:/data - ${DATADIR}/cc-metric-store:/data
depends_on: depends_on:
@ -19,8 +23,8 @@ services:
influxdb: influxdb:
container_name: influxdb container_name: influxdb
image: influxdb image: influxdb:latest
command: ["--reporting-disabled"] command: ["--reporting-disabled", "--log-level=debug"]
environment: environment:
DOCKER_INFLUXDB_INIT_MODE: setup DOCKER_INFLUXDB_INIT_MODE: setup
DOCKER_INFLUXDB_INIT_USERNAME: devel DOCKER_INFLUXDB_INIT_USERNAME: devel
@ -30,7 +34,7 @@ services:
DOCKER_INFLUXDB_INIT_RETENTION: 100w DOCKER_INFLUXDB_INIT_RETENTION: 100w
DOCKER_INFLUXDB_INIT_ADMIN_TOKEN: ${INFLUXDB_ADMIN_TOKEN} DOCKER_INFLUXDB_INIT_ADMIN_TOKEN: ${INFLUXDB_ADMIN_TOKEN}
ports: ports:
- "127.0.0.1:${INFLUXDB_PORT}:8086" - "0.0.0.0:8086:8086"
volumes: volumes:
- ${DATADIR}/influxdb/data:/var/lib/influxdb2 - ${DATADIR}/influxdb/data:/var/lib/influxdb2
- ${DATADIR}/influxdb/config:/etc/influxdb2 - ${DATADIR}/influxdb/config:/etc/influxdb2
@ -40,9 +44,15 @@ services:
image: osixia/openldap:1.5.0 image: osixia/openldap:1.5.0
command: --copy-service --loglevel debug command: --copy-service --loglevel debug
environment: environment:
- LDAP_ADMIN_PASSWORD=${LDAP_ADMIN_PASSWORD} - LDAP_ADMIN_PASSWORD=mashup
- LDAP_ORGANISATION=${LDAP_ORGANISATION} - LDAP_ORGANISATION=Example Organization
- LDAP_DOMAIN=${LDAP_DOMAIN} - LDAP_DOMAIN=example.com
- LDAP_LOGGING=true
- LDAP_CONNECTION=default
- LDAP_CONNECTIONS=default
- LDAP_DEFAULT_HOSTS=0.0.0.0
ports:
- "0.0.0.0:389:389"
volumes: volumes:
- ${DATADIR}/ldap:/container/service/slapd/assets/config/bootstrap/ldif/custom - ${DATADIR}/ldap:/container/service/slapd/assets/config/bootstrap/ldif/custom
@ -51,36 +61,18 @@ services:
image: mariadb:latest image: mariadb:latest
command: ["--default-authentication-plugin=mysql_native_password"] command: ["--default-authentication-plugin=mysql_native_password"]
environment: environment:
MARIADB_ROOT_PASSWORD: ${MARIADB_ROOT_PASSWORD} MARIADB_ROOT_PASSWORD: root
MARIADB_DATABASE: slurm_acct_db MARIADB_DATABASE: slurm_acct_db
MARIADB_USER: slurm MARIADB_USER: slurm
MARIADB_PASSWORD: demo MARIADB_PASSWORD: demo
ports: ports:
- "127.0.0.1:${MARIADB_PORT}:3306" - "0.0.0.0:3306:3306"
volumes: volumes:
- ${DATADIR}/mariadb:/etc/mysql/conf.d - ${DATADIR}/mariadb:/docker-entrypoint-initdb.d
# - ${DATADIR}/sql-init:/docker-entrypoint-initdb.d
cap_add: cap_add:
- SYS_NICE - SYS_NICE
# mysql: slurmctld:
# container_name: mysql
# image: mysql:8.0.22
# command: ["--default-authentication-plugin=mysql_native_password"]
# environment:
# MYSQL_ROOT_PASSWORD: ${MYSQL_ROOT_PASSWORD}
# MYSQL_DATABASE: ${MYSQL_DATABASE}
# MYSQL_USER: ${MYSQL_USER}
# MYSQL_PASSWORD: ${MYSQL_PASSWORD}
# ports:
# - "127.0.0.1:${MYSQL_PORT}:3306"
# # volumes:
# # - ${DATADIR}/sql-init:/docker-entrypoint-initdb.d
# # - ${DATADIR}/sqldata:/var/lib/mysql
# cap_add:
# - SYS_NICE
slurm-controller:
container_name: slurmctld container_name: slurmctld
hostname: slurmctld hostname: slurmctld
build: build:
@ -89,40 +81,66 @@ services:
volumes: volumes:
- ${DATADIR}/slurm/home:/home - ${DATADIR}/slurm/home:/home
- ${DATADIR}/slurm/secret:/.secret - ${DATADIR}/slurm/secret:/.secret
- ./slurm/controller/slurm.conf:/home/config/slurm.conf
- /etc/timezone:/etc/timezone:ro
- /etc/localtime:/etc/localtime:ro
- ${DATADIR}/slurm/state:/var/lib/slurm/d
ports:
- "6817:6817"
slurm-database: slurmdbd:
container_name: slurmdb container_name: slurmdbd
hostname: slurmdb hostname: slurmdbd
build: build:
context: ./slurm/database context: ./slurm/database
depends_on: depends_on:
- mariadb - mariadb
- slurm-controller - slurmctld
privileged: true privileged: true
volumes: volumes:
- ${DATADIR}/slurm/home:/home - ${DATADIR}/slurm/home:/home
- ${DATADIR}/slurm/secret:/.secret - ${DATADIR}/slurm/secret:/.secret
- ./slurm/database/slurmdbd.conf:/home/config/slurmdbd.conf
- /etc/timezone:/etc/timezone:ro
- /etc/localtime:/etc/localtime:ro
ports:
- "6819:6819"
slurm-worker01: node01:
container_name: node01 container_name: node01
hostname: node01 hostname: node01
build: build:
context: ./slurm/worker context: ./slurm/worker
depends_on: depends_on:
- slurm-controller - slurmctld
privileged: true privileged: true
volumes: volumes:
- ${DATADIR}/slurm/home:/home - ${DATADIR}/slurm/home:/home
- ${DATADIR}/slurm/secret:/.secret - ${DATADIR}/slurm/secret:/.secret
- ./slurm/worker/cgroup.conf:/home/config/cgroup.conf
- ./slurm/controller/slurm.conf:/home/config/slurm.conf
- /etc/timezone:/etc/timezone:ro
- /etc/localtime:/etc/localtime:ro
ports:
- "6818:6818"
# slurm-worker02: slurmrestd:
# container_name: node02 container_name: slurmrestd
# hostname: node02 hostname: slurmrestd
# build: build:
# context: ./slurm/worker context: ./slurm/rest
# depends_on: environment:
# - slurm-controller - SLURM_JWT=daemon
# privileged: true - SLURMRESTD_DEBUG=9
# volumes: depends_on:
# - ${DATADIR}/slurm/home:/home - slurmctld
# - ${DATADIR}/slurm/secret:/.secret privileged: true
volumes:
- ${DATADIR}/slurm/home:/home
- ${DATADIR}/slurm/secret:/.secret
- ./slurm/controller/slurm.conf:/home/config/slurm.conf
- ./slurm/rest/slurmrestd.conf:/home/config/slurmrestd.conf
- /etc/timezone:/etc/timezone:ro
- /etc/localtime:/etc/localtime:ro
ports:
- "6820:6820"

View File

@ -1,5 +0,0 @@
SLURM_VERSION=22.05.6
ARCH=aarch64
MUNGE_UID=981
SLURM_UID=982
WORKER_UID=1000

View File

@ -9,7 +9,6 @@ use File::Slurp;
use Data::Dumper; use Data::Dumper;
use Time::Piece; use Time::Piece;
use Sort::Versions; use Sort::Versions;
use REST::Client;
### JOB-ARCHIVE ### JOB-ARCHIVE
my $localtime = localtime; my $localtime = localtime;
@ -19,80 +18,80 @@ my $archiveSrc = './data/job-archive-source';
my @ArchiveClusters; my @ArchiveClusters;
# Get clusters by job-archive/$subfolder # Get clusters by job-archive/$subfolder
opendir my $dh, $archiveSrc or die "can't open directory: $!"; # opendir my $dh, $archiveSrc or die "can't open directory: $!";
while ( readdir $dh ) { # while ( readdir $dh ) {
chomp; next if $_ eq '.' or $_ eq '..' or $_ eq 'job-archive'; # chomp; next if $_ eq '.' or $_ eq '..' or $_ eq 'job-archive' or $_ eq 'version.txt';
my $cluster = $_; # my $cluster = $_;
push @ArchiveClusters, $cluster; # push @ArchiveClusters, $cluster;
} # }
# start for jobarchive # # start for jobarchive
foreach my $cluster ( @ArchiveClusters ) { # foreach my $cluster ( @ArchiveClusters ) {
print "Starting to update start- and stoptimes in job-archive for $cluster\n"; # print "Starting to update start- and stoptimes in job-archive for $cluster\n";
opendir my $dhLevel1, "$archiveSrc/$cluster" or die "can't open directory: $!"; # opendir my $dhLevel1, "$archiveSrc/$cluster" or die "can't open directory: $!";
while ( readdir $dhLevel1 ) { # while ( readdir $dhLevel1 ) {
chomp; next if $_ eq '.' or $_ eq '..'; # chomp; next if $_ eq '.' or $_ eq '..';
my $level1 = $_; # my $level1 = $_;
if ( -d "$archiveSrc/$cluster/$level1" ) { # if ( -d "$archiveSrc/$cluster/$level1" ) {
opendir my $dhLevel2, "$archiveSrc/$cluster/$level1" or die "can't open directory: $!"; # opendir my $dhLevel2, "$archiveSrc/$cluster/$level1" or die "can't open directory: $!";
while ( readdir $dhLevel2 ) { # while ( readdir $dhLevel2 ) {
chomp; next if $_ eq '.' or $_ eq '..'; # chomp; next if $_ eq '.' or $_ eq '..';
my $level2 = $_; # my $level2 = $_;
my $jobSource = "$archiveSrc/$cluster/$level1/$level2"; # my $jobSource = "$archiveSrc/$cluster/$level1/$level2";
my $jobTarget = "$archiveTarget/$cluster/$level1/$level2/"; # my $jobTarget = "$archiveTarget/$cluster/$level1/$level2/";
my $jobOrigin = $jobSource; # my $jobOrigin = $jobSource;
# check if files are directly accessible (old format) else get subfolders as file and update path # # check if files are directly accessible (old format) else get subfolders as file and update path
if ( ! -e "$jobSource/meta.json") { # if ( ! -e "$jobSource/meta.json") {
my @folders = read_dir($jobSource); # my @folders = read_dir($jobSource);
if (!@folders) { # if (!@folders) {
next; # next;
} # }
# Only use first subfolder for now TODO # # Only use first subfolder for now TODO
$jobSource = "$jobSource/".$folders[0]; # $jobSource = "$jobSource/".$folders[0];
} # }
# check if subfolder contains file, else remove source and skip # # check if subfolder contains file, else remove source and skip
if ( ! -e "$jobSource/meta.json") { # if ( ! -e "$jobSource/meta.json") {
# rmtree $jobOrigin; # # rmtree $jobOrigin;
next; # next;
} # }
my $rawstr = read_file("$jobSource/meta.json"); # my $rawstr = read_file("$jobSource/meta.json");
my $json = decode_json($rawstr); # my $json = decode_json($rawstr);
# NOTE Start meta.json iteration here # # NOTE Start meta.json iteration here
# my $random_number = int(rand(UPPERLIMIT)) + LOWERLIMIT; # # my $random_number = int(rand(UPPERLIMIT)) + LOWERLIMIT;
# Set new startTime: Between 5 days and 1 day before now # # Set new startTime: Between 5 days and 1 day before now
# Remove id from attributes # # Remove id from attributes
$json->{startTime} = $epochtime - (int(rand(432000)) + 86400); # $json->{startTime} = $epochtime - (int(rand(432000)) + 86400);
$json->{stopTime} = $json->{startTime} + $json->{duration}; # $json->{stopTime} = $json->{startTime} + $json->{duration};
# Add starttime subfolder to target path # # Add starttime subfolder to target path
$jobTarget .= $json->{startTime}; # $jobTarget .= $json->{startTime};
# target is not directory # # target is not directory
if ( not -d $jobTarget ){ # if ( not -d $jobTarget ){
# print "Writing files\n"; # # print "Writing files\n";
# print "$cluster/$level1/$level2\n"; # # print "$cluster/$level1/$level2\n";
make_path($jobTarget); # make_path($jobTarget);
my $outstr = encode_json($json); # my $outstr = encode_json($json);
write_file("$jobTarget/meta.json", $outstr); # write_file("$jobTarget/meta.json", $outstr);
my $datstr = read_file("$jobSource/data.json"); # my $datstr = read_file("$jobSource/data.json.gz");
write_file("$jobTarget/data.json", $datstr); # write_file("$jobTarget/data.json.gz", $datstr);
} else { # } else {
# rmtree $jobSource; # # rmtree $jobSource;
} # }
} # }
} # }
} # }
} # }
print "Done for job-archive\n"; # print "Done for job-archive\n";
sleep(1); # sleep(1);
## CHECKPOINTS ## CHECKPOINTS
chomp(my $checkpointStart=`date --date 'TZ="Europe/Berlin" 0:00 7 days ago' +%s`); chomp(my $checkpointStart=`date --date 'TZ="Europe/Berlin" 0:00 7 days ago' +%s`);

77
misc/config.json Normal file

@ -0,0 +1,77 @@
{
"addr": "127.0.0.1:8080",
"short-running-jobs-duration": 300,
"archive": {
"kind": "file",
"path": "./var/job-archive"
},
"jwts": {
"max-age": "2000h"
},
"db-driver": "mysql",
"db": "root:root@tcp(0.0.0.0:3306)/ccbackend",
"ldap": {
"url": "ldap://0.0.0.0",
"user_base": "ou=users,dc=example,dc=com",
"search_dn": "cn=admin,dc=example,dc=com",
"user_bind": "uid={username},ou=users,dc=example,dc=com",
"user_filter": "(&(objectclass=posixAccount))",
"syncUserOnLogin": true
},
"enable-resampling": {
"trigger": 30,
"resolutions": [
600,
300,
120,
60
]
},
"emission-constant": 317,
"clusters": [
{
"name": "fritz",
"metricDataRepository": {
"kind": "cc-metric-store",
"url": "http://0.0.0.0:8084",
"token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"
},
"filterRanges": {
"numNodes": {
"from": 1,
"to": 64
},
"duration": {
"from": 0,
"to": 86400
},
"startTime": {
"from": "2022-01-01T00:00:00Z",
"to": null
}
}
},
{
"name": "alex",
"metricDataRepository": {
"kind": "cc-metric-store",
"url": "http://0.0.0.0:8084",
"token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"
},
"filterRanges": {
"numNodes": {
"from": 1,
"to": 64
},
"duration": {
"from": 0,
"to": 86400
},
"startTime": {
"from": "2022-01-01T00:00:00Z",
"to": null
}
}
}
]
}

3
misc/curl_slurmrestd.sh Executable file

@ -0,0 +1,3 @@
SLURM_JWT=$(cat data/slurm/secret/jwt_token.txt)
curl -X 'GET' -v 'http://localhost:6820/slurm/v0.0.39/node/node01' --location --silent --show-error -H "X-SLURM-USER-NAME: root" -H "X-SLURM-USER-TOKEN: $SLURM_JWT"
# curl -v --unix-socket data/slurm/tmp/slurmrestd.socket 'http://localhost:6820/slurm/v0.0.39/ping'

27
misc/jwt_verifier.py Normal file

@ -0,0 +1,27 @@
#!/usr/bin/env python3
import sys
import os
import pprint
import json
import time
from datetime import datetime, timedelta, timezone
from jwt import JWT
from jwt.jwa import HS256
from jwt.jwk import jwk_from_dict
from jwt.utils import b64decode,b64encode
if len(sys.argv) != 2:
sys.exit("verify_jwt.py [JWT Token]");
with open("data/slurm/secret/jwt_hs256.key", "rb") as f:
priv_key = f.read()
signing_key = jwk_from_dict({
'kty': 'oct',
'k': b64encode(priv_key)
})
a = JWT()
b = a.decode(sys.argv[1], signing_key, algorithms=["HS256"])
print(b)

View File

@ -0,0 +1,40 @@
#!/bin/bash -l
sudo apt-get update
sudo apt-get upgrade -f -y
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add the repository to Apt sources:
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
$(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -f -y gcc
sudo apt-get install -f -y npm
sudo apt-get install -f -y make
sudo apt-get install -f -y gh
sudo apt-get install -f -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo apt-get install -f -y docker-compose
sudo apt install perl -f -y libdatetime-perl libjson-perl
sudo apt-get install -f -y golang-go
sudo cpan Cpanel::JSON::XS
sudo cpan File::Slurp
sudo cpan Data::Dumper
sudo cpan Time::Piece
sudo cpan Sort::Versions
sudo groupadd docker
sudo usermod -aG docker ubuntu
sudo shutdown -r -t 0

View File

@ -1,48 +1,42 @@
#!/bin/bash #!/bin/bash
echo ""
echo "|--------------------------------------------------------------------------------------|"
echo "| Welcome to cc-docker automatic deployment script. |"
echo "| Make sure you have sudo rights to run docker services |"
echo "| This script assumes that docker command is added to sudo group |"
echo "| This means that docker commands do not explicitly require |"
echo "| 'sudo' keyword to run. You can use this following command: |"
echo "| |"
echo "| > sudo groupadd docker |"
echo "| > sudo usermod -aG docker $USER |"
echo "| |"
echo "| This will add docker to the sudo usergroup and all the docker |"
echo "| command will run as sudo by default without requiring |"
echo "| 'sudo' keyword. |"
echo "|--------------------------------------------------------------------------------------|"
echo ""
# Check cc-backend, touch job.db if exists # Check cc-backend if exists
if [ ! -d cc-backend ]; then if [ ! -d cc-backend ]; then
echo "'cc-backend' not yet prepared! Please clone cc-backend repository before starting this script." echo "'cc-backend' not yet prepared! Please clone cc-backend repository before starting this script."
echo -n "Stopped." echo -n "Stopped."
exit exit
else
cd cc-backend
if [ ! -d var ]; then
mkdir var
touch var/job.db
else
echo "'cc-backend/var' exists. Cautiously exiting."
echo -n "Stopped."
exit
fi
fi fi
# Creates the data directory if it does not exist.
# Download unedited job-archive to ./data/job-archive-source # Contains all the mount points required by all the docker services
if [ ! -d data/job-archive-source ]; then # and their static files.
cd data if [ ! -d data ]; then
wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/job-archive-demo.tar mkdir -m777 data
tar xf job-archive-demo.tar
mv ./job-archive ./job-archive-source
rm ./job-archive-demo.tar
cd ..
else
echo "'data/job-archive-source' already exists!"
fi fi
# Download unedited checkpoint files to ./data/cc-metric-store-source/checkpoints # Invokes the dataGenerationScript.sh, which then populates the required
if [ ! -d data/cc-metric-store-source ]; then # static files by the docker services. These static files are required by docker services after startup.
mkdir -p data/cc-metric-store-source/checkpoints chmod u+x dataGenerationScript.sh
cd data/cc-metric-store-source/checkpoints ./dataGenerationScript.sh
wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/cc-metric-store-checkpoints.tar.xz
tar xf cc-metric-store-checkpoints.tar.xz
rm cc-metric-store-checkpoints.tar.xz
cd ../../../
else
echo "'data/cc-metric-store-source' already exists!"
fi
# Update timestamps # Update timestamps for all the checkpoints in data/cc-metric-store-source
# and dumps new files in data/cc-metric-store.
perl ./migrateTimestamps.pl perl ./migrateTimestamps.pl
# Create archive folder for rewritten ccms checkpoints # Create archive folder for rewritten ccms checkpoints
@ -51,32 +45,54 @@ if [ ! -d data/cc-metric-store/archive ]; then
fi fi
# cleanup sources # cleanup sources
# rm -r ./data/job-archive-source if [ -d data/cc-metric-store-source ]; then
# rm -r ./data/cc-metric-store-source rm -r data/cc-metric-store-source
# prepare folders for influxdb2
if [ ! -d data/influxdb ]; then
mkdir -p data/influxdb/data
mkdir -p data/influxdb/config/influx-configs
else
echo "'data/influxdb' already exists!"
fi fi
# Check dotenv-file and docker-compose-yml, copy accordingly if not present and build docker services # Just in case the user forgot to manually shut down the docker services.
if [ ! -d .env ]; then docker-compose down
cp templates/env.default ./.env docker-compose down --remove-orphans
fi
if [ ! -d docker-compose.yml ]; then # This automatically builds the base docker image for slurm.
cp templates/docker-compose.yml.default ./docker-compose.yml # All the slurm docker services in docker-compose.yml refer to
fi # the base image created from this directory.
cd slurm/base/
make
cd ../..
# Starts all the docker services from docker-compose.yml.
docker-compose build docker-compose build
./cc-backend/cc-backend --init-db --add-user demo:admin:AdminDev
docker-compose up -d docker-compose up -d
cd cc-backend
if [ ! -d var ]; then
wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/job-archive-demo.tar
tar xf job-archive-demo.tar
rm ./job-archive-demo.tar
cp ./configs/env-template.txt .env
cp -f ../misc/config.json config.json
make
./cc-backend -migrate-db
./cc-backend --init-db --add-user demo:admin:demo
cd ..
else
cd ..
echo "'cc-backend/var' exists. Cautiously exiting."
fi
echo ""
echo "|--------------------------------------------------------------------------------------|"
echo "| Check logs for each slurm service by using these commands: |"
echo "| docker-compose logs slurmctld |"
echo "| docker-compose logs slurmdbd |"
echo "| docker-compose logs slurmrestd |"
echo "| docker-compose logs node01 |"
echo "|======================================================================================|"
echo "| Setup complete, containers are up by default: Shut down with 'docker-compose down'. |"
echo "| Use './cc-backend/cc-backend -server' to start cc-backend. |"
echo "| Use scripts in /scripts to load data into influx or mariadb. |"
echo "|--------------------------------------------------------------------------------------|"
echo "" echo ""
echo "Setup complete, containers are up by default: Shut down with 'docker-compose down'."
echo "Use './cc-backend/cc-backend' to start cc-backend."
echo "Use scripts in /scripts to load data into influx or mariadb."
# ./cc-backend/cc-backend

View File

@ -1,41 +1,39 @@
FROM rockylinux:8 FROM rockylinux:8
MAINTAINER Jan Eitzinger <jan.eitzinger@fau.de> LABEL org.opencontainers.image.authors="jan.eitzinger@fau.de"
ENV SLURM_VERSION=22.05.6 ENV SLURM_VERSION=24.05.3
ENV ARCH=aarch64 ENV HTTP_PARSER_VERSION=2.8.0
RUN yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm -y RUN yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
RUN ARCH=$(uname -m) && yum install -y https://rpmfind.net/linux/almalinux/8.10/PowerTools/x86_64/os/Packages/http-parser-devel-2.8.0-9.el8.$ARCH.rpm
RUN groupadd -g 981 munge \ RUN groupadd -g 981 munge \
&& useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u 981 -g munge -s /sbin/nologin munge \ && useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u 981 -g munge -s /sbin/nologin munge \
&& groupadd -g 982 slurm \ && groupadd -g 1000 slurm \
&& useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 982 -g slurm -s /bin/bash slurm \ && useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 1000 -g slurm -s /bin/bash slurm \
&& groupadd -g 1000 worker \ && groupadd -g 982 worker \
&& useradd -m -c "Workflow user" -d /home/worker -u 1000 -g worker -s /bin/bash worker && useradd -m -c "Workflow user" -d /home/worker -u 982 -g worker -s /bin/bash worker
RUN yum install -y munge munge-libs RUN yum install -y munge munge-libs rng-tools \
RUN dnf --enablerepo=powertools install munge-devel -y python3 gcc openssl openssl-devel \
RUN yum install rng-tools -y openssh-server openssh-clients dbus-devel \
pam-devel numactl numactl-devel hwloc sudo \
lua readline-devel ncurses-devel man2html \
autoconf automake json-c-devel libjwt-devel \
libibmad libibumad rpm-build perl-ExtUtils-MakeMaker.noarch rpm-build make wget
RUN yum install -y python3 gcc openssl openssl-devel \ RUN dnf --enablerepo=powertools install -y munge-devel rrdtool-devel lua-devel hwloc-devel mariadb-server mariadb-devel
openssh-server openssh-clients dbus-devel \
pam-devel numactl numactl-devel hwloc sudo \
lua readline-devel ncurses-devel man2html \
libibmad libibumad rpm-build perl-ExtUtils-MakeMaker.noarch rpm-build make wget
RUN dnf --enablerepo=powertools install rrdtool-devel lua-devel hwloc-devel rpm-build -y RUN mkdir -p /usr/local/slurm-tmp \
RUN dnf install mariadb-server mariadb-devel -y && cd /usr/local/slurm-tmp \
RUN mkdir /usr/local/slurm-tmp && wget https://download.schedmd.com/slurm/slurm-${SLURM_VERSION}.tar.bz2 \
RUN cd /usr/local/slurm-tmp && rpmbuild -ta --with slurmrestd --with jwt slurm-${SLURM_VERSION}.tar.bz2
RUN wget https://download.schedmd.com/slurm/slurm-${SLURM_VERSION}.tar.bz2
RUN rpmbuild -ta slurm-${SLURM_VERSION}.tar.bz2
WORKDIR /root/rpmbuild/RPMS/${ARCH} RUN ARCH=$(uname -m) \
RUN yum -y --nogpgcheck localinstall \ && yum -y --nogpgcheck localinstall \
slurm-${SLURM_VERSION}-1.el8.${ARCH}.rpm \ /root/rpmbuild/RPMS/$ARCH/slurm-${SLURM_VERSION}*.$ARCH.rpm \
slurm-perlapi-${SLURM_VERSION}-1.el8.${ARCH}.rpm \ /root/rpmbuild/RPMS/$ARCH/slurm-perlapi-${SLURM_VERSION}*.$ARCH.rpm \
slurm-slurmctld-${SLURM_VERSION}-1.el8.${ARCH}.rpm /root/rpmbuild/RPMS/$ARCH/slurm-slurmctld-${SLURM_VERSION}*.$ARCH.rpm
WORKDIR /
VOLUME ["/home", "/.secret"] VOLUME ["/home", "/.secret"]
# 22: SSH # 22: SSH
@ -43,4 +41,5 @@ VOLUME ["/home", "/.secret"]
# 6817: SlurmCtlD # 6817: SlurmCtlD
# 6818: SlurmD # 6818: SlurmD
# 6819: SlurmDBD # 6819: SlurmDBD
EXPOSE 22 6817 6818 6819 # 6820: SlurmRestD
EXPOSE 22 6817 6818 6819 6820

View File

@ -1,6 +1,6 @@
include ../../.env include ../../.env
IMAGE = clustercockpit/slurm.base IMAGE = clustercockpit/slurm.base
SLURM_VERSION = 24.05.3
.PHONY: build clean .PHONY: build clean
build: build:

View File

@ -1,5 +1,5 @@
FROM clustercockpit/slurm.base:22.05.6 FROM clustercockpit/slurm.base:24.05.3
MAINTAINER Jan Eitzinger <jan.eitzinger@fau.de> LABEL org.opencontainers.image.authors="jan.eitzinger@fau.de"
# clean up # clean up
RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \ RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
@ -7,4 +7,5 @@ RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
&& rm -rf /var/cache/yum && rm -rf /var/cache/yum
COPY docker-entrypoint.sh /docker-entrypoint.sh COPY docker-entrypoint.sh /docker-entrypoint.sh
CMD ["/usr/sbin/init"]
ENTRYPOINT ["/docker-entrypoint.sh"] ENTRYPOINT ["/docker-entrypoint.sh"]

View File

@ -1,23 +1,43 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -e set -e
# Determine the system architecture dynamically
ARCH=$(uname -m)
SLURM_VERSION="24.05.3"
SLURM_JWT=daemon
SLURMRESTD_SECURITY=disable_user_check
_delete_secrets() {
if [ -f /.secret/munge.key ]; then
echo "Removing secrets"
sudo rm -rf /.secret/munge.key
sudo rm -rf /.secret/worker-secret.tar.gz
sudo rm -rf /.secret/setup-worker-ssh.sh
sudo rm -rf /.secret/jwt_hs256.key
sudo rm -rf /.secret/jwt_token.txt
echo "Done removing secrets"
ls /.secret/
fi
}
# start sshd server # start sshd server
_sshd_host() { _sshd_host() {
if [ ! -d /var/run/sshd ]; then if [ ! -d /var/run/sshd ]; then
mkdir /var/run/sshd mkdir /var/run/sshd
ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -N '' ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -N ''
fi fi
echo "Starting sshd" echo "Starting sshd"
/usr/sbin/sshd /usr/sbin/sshd
} }
# setup worker ssh to be passwordless # setup worker ssh to be passwordless
_ssh_worker() { _ssh_worker() {
if [[ ! -d /home/worker ]]; then if [[ ! -d /home/worker ]]; then
mkdir -p /home/worker mkdir -p /home/worker
chown -R worker:worker /home/worker chown -R worker:worker /home/worker
fi fi
cat > /home/worker/setup-worker-ssh.sh <<EOF2 cat >/home/worker/setup-worker-ssh.sh <<EOF2
mkdir -p ~/.ssh mkdir -p ~/.ssh
chmod 0700 ~/.ssh chmod 0700 ~/.ssh
ssh-keygen -b 2048 -t rsa -f ~/.ssh/id_rsa -q -N "" -C "$(whoami)@$(hostname)-$(date -I)" ssh-keygen -b 2048 -t rsa -f ~/.ssh/id_rsa -q -N "" -C "$(whoami)@$(hostname)-$(date -I)"
@ -41,7 +61,7 @@ EOF2
# start munge and generate key # start munge and generate key
_munge_start() { _munge_start() {
echo "Starting munge" echo "Starting munge"
chown -R munge: /etc/munge /var/lib/munge /var/log/munge /var/run/munge chown -R munge: /etc/munge /var/lib/munge /var/log/munge /var/run/munge
chmod 0700 /etc/munge chmod 0700 /etc/munge
chmod 0711 /var/lib/munge chmod 0711 /var/lib/munge
@ -50,9 +70,9 @@ _munge_start() {
/sbin/create-munge-key -f /sbin/create-munge-key -f
rngd -r /dev/urandom rngd -r /dev/urandom
/usr/sbin/create-munge-key -r -f /usr/sbin/create-munge-key -r -f
sh -c "dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key" sh -c "dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key"
chown munge: /etc/munge/munge.key chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key chmod 600 /etc/munge/munge.key
sudo -u munge /sbin/munged sudo -u munge /sbin/munged
munge -n munge -n
munge -n | unmunge munge -n | unmunge
@ -61,31 +81,97 @@ _munge_start() {
# copy secrets to /.secret directory for other nodes # copy secrets to /.secret directory for other nodes
_copy_secrets() { _copy_secrets() {
cp /home/worker/worker-secret.tar.gz /.secret/worker-secret.tar.gz while [ ! -f /home/worker/worker-secret.tar.gz ]; do
cp /home/worker/setup-worker-ssh.sh /.secret/setup-worker-ssh.sh echo -n "."
cp /etc/munge/munge.key /.secret/munge.key sleep 1
rm -f /home/worker/worker-secret.tar.gz done
rm -f /home/worker/setup-worker-ssh.sh cp /home/worker/worker-secret.tar.gz /.secret/worker-secret.tar.gz
cp /home/worker/setup-worker-ssh.sh /.secret/setup-worker-ssh.sh
cp /etc/munge/munge.key /.secret/munge.key
rm -f /home/worker/worker-secret.tar.gz
rm -f /home/worker/setup-worker-ssh.sh
}
_openssl_jwt_key() {
mkdir -p /var/spool/slurm/statesave
dd if=/dev/random of=/var/spool/slurm/statesave/jwt_hs256.key bs=32 count=1
chown slurm:slurm /var/spool/slurm/statesave/jwt_hs256.key
chmod 0600 /var/spool/slurm/statesave/jwt_hs256.key
chown slurm:slurm /var/spool/slurm/statesave
chmod 0755 /var/spool/slurm/statesave
cp /var/spool/slurm/statesave/jwt_hs256.key /.secret/jwt_hs256.key
chmod 777 /.secret/jwt_hs256.key
}
_generate_jwt_token() {
secret_key=$(cat /var/spool/slurm/statesave/jwt_hs256.key)
start_time=$(date +%s)
exp_time=$((start_time + 100000000))
base64url() {
# Don't wrap, make URL-safe, delete trailer.
base64 -w 0 | tr '+/' '-_' | tr -d '='
}
jwt_header=$(echo -n '{"alg":"HS256","typ":"JWT"}' | base64url)
jwt_claims=$(cat <<EOF |
{
"sun": "root",
"exp": $exp_time,
"iat": $start_time
}
EOF
jq -Mcj '.' | base64url)
# jq -Mcj => Monochrome output, compact output, join lines
jwt_signature=$(echo -n "${jwt_header}.${jwt_claims}" |
openssl dgst -sha256 -hmac "$secret_key" -binary | base64url)
# Use the same colours as jwt.io, more-or-less.
echo "$(tput setaf 1)${jwt_header}$(tput sgr0).$(tput setaf 5)${jwt_claims}$(tput sgr0).$(tput setaf 6)${jwt_signature}$(tput sgr0)"
jwt="${jwt_header}.${jwt_claims}.${jwt_signature}"
echo $jwt | cat >/.secret/jwt_token.txt
chmod 777 /.secret/jwt_token.txt
} }
# run slurmctld # run slurmctld
_slurmctld() { _slurmctld() {
cd /root/rpmbuild/RPMS/aarch64 cd /root/rpmbuild/RPMS/$ARCH
yum -y --nogpgcheck localinstall slurm-22.05.6-1.el8.aarch64.rpm \
slurm-perlapi-22.05.6-1.el8.aarch64.rpm \ yum -y --nogpgcheck localinstall slurm-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmd-22.05.6-1.el8.aarch64.rpm \ slurm-perlapi-$SLURM_VERSION*.$ARCH.rpm \
slurm-torque-22.05.6-1.el8.aarch64.rpm \ slurm-slurmd-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmctld-22.05.6-1.el8.aarch64.rpm slurm-torque-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmctld-$SLURM_VERSION*.$ARCH.rpm
echo "checking for slurmdbd.conf" echo "checking for slurmdbd.conf"
while [ ! -f /.secret/slurmdbd.conf ]; do while [ ! -f /.secret/slurmdbd.conf ]; do
echo -n "." echo "."
sleep 1 sleep 1
done done
echo "" echo ""
mkdir -p /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm /etc/slurm mkdir -p /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm /etc/slurm /var/run/slurm/d /var/run/slurm/ctld /var/lib/slurm/d /var/lib/slurm/ctld
chown -R slurm: /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm chown -R slurm: /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm /var/spool /var/lib /var/run/slurm/d /var/run/slurm/ctld /var/lib/slurm/d /var/lib/slurm/ctld
mkdir -p /etc/config
chown -R slurm: /etc/config
touch /var/log/slurmctld.log touch /var/log/slurmctld.log
chown slurm: /var/log/slurmctld.log chown -R slurm: /var/log/slurmctld.log
touch /var/log/slurmd.log
chown -R slurm: /var/log/slurmd.log
touch /var/lib/slurm/d/job_state
chown -R slurm: /var/lib/slurm/d/job_state
touch /var/lib/slurm/d/fed_mgr_state
chown -R slurm: /var/lib/slurm/d/fed_mgr_state
touch /var/run/slurm/d/slurmctld.pid
chown -R slurm: /var/run/slurm/d/slurmctld.pid
touch /var/run/slurm/d/slurmd.pid
chown -R slurm: /var/run/slurm/d/slurmd.pid
if [[ ! -f /home/config/slurm.conf ]]; then if [[ ! -f /home/config/slurm.conf ]]; then
echo "### Missing slurm.conf ###" echo "### Missing slurm.conf ###"
exit exit
@ -95,15 +181,43 @@ _slurmctld() {
chown slurm: /etc/slurm/slurm.conf chown slurm: /etc/slurm/slurm.conf
chmod 600 /etc/slurm/slurm.conf chmod 600 /etc/slurm/slurm.conf
fi fi
sacctmgr -i add cluster "snowflake"
sudo yum install -y nc
sudo yum install -y procps
sudo yum install -y iputils
sudo yum install -y lsof
sudo yum install -y jq
_openssl_jwt_key
if [ ! -f /.secret/jwt_hs256.key ]; then
echo "### Missing jwt.key ###"
exit 1
else
cp /.secret/jwt_hs256.key /etc/config/jwt_hs256.key
chown slurm: /etc/config/jwt_hs256.key
chmod 0600 /etc/config/jwt_hs256.key
fi
_generate_jwt_token
while ! nc -z slurmdbd 6819; do
echo "Waiting for slurmdbd to be ready..."
sleep 2
done
sacctmgr -i add cluster name=linux
sleep 2s sleep 2s
echo "Starting slurmctld" echo "Starting slurmctld"
cp -f /etc/slurm/slurm.conf /.secret/ cp -f /etc/slurm/slurm.conf /.secret/
/usr/sbin/slurmctld /usr/sbin/slurmctld -Dvv
echo "Started slurmctld"
} }
### main ### ### main ###
_delete_secrets
_sshd_host _sshd_host
_ssh_worker _ssh_worker
_munge_start _munge_start
_copy_secrets _copy_secrets
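A quick way to confirm the controller came up is to exec into it; a hedged sketch, assuming the compose service is named slurmctld as in the slurm.conf that follows:
docker-compose exec slurmctld scontrol ping   # should report something like "Slurmctld(primary) ... is UP"
docker-compose exec slurmctld sinfo           # should list the debug partition and node01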
108
slurm/controller/slurm.conf Normal file
View File
@ -0,0 +1,108 @@
# slurm.conf
#
# See the slurm.conf man page for more information.
#
ClusterName=linux
ControlMachine=slurmctld
ControlAddr=slurmctld
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/lib/slurm/d
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm/d/slurmctld.pid
SlurmdPidFile=/var/run/slurm/d/slurmd.pid
ProctrackType=proctrack/linuxproc
AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/var/spool/slurm/statesave/jwt_hs256.key
#PluginDir=
#CacheGroups=0
#FirstJobId=
ReturnToService=0
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/affinity
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
# SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=6
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=6
SlurmdLogFile=/var/log/slurm/slurmd.log
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/jobcomp.log
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherType=jobacct_gather/cgroup
#ProctrackType=proctrack/cgroup
JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdbd
AccountingStoragePort=6819
#AccountingStorageLoc=slurm_acct_db
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
PartitionName=DEFAULT Nodes=node01
PartitionName=debug Nodes=node01 Default=YES MaxTime=INFINITE State=UP
# # COMPUTE NODES
# NodeName=c[1-2] RealMemory=1000 State=UNKNOWN
NodeName=node01 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1
# #
# # PARTITIONS
# PartitionName=normal Default=yes Nodes=c[1-2] Priority=50 DefMemPerCPU=500 Shared=NO MaxNodes=2 MaxTime=5-00:00:00 DefaultTime=5-00:00:00 State=UP
#PrEpPlugins=pika
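To exercise this configuration, a minimal sketch (assuming the controller, slurmdbd, and node01 containers are all up) run inside the controller container:
srun -p debug -N1 hostname        # should print the worker's hostname (node01, if the compose hostname matches)
squeue                            # empty again once the job has finished
sacct -X --format=JobID,JobName,Partition,State,ExitCode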
View File
@ -1,5 +1,5 @@
FROM clustercockpit/slurm.base:22.05.6 FROM clustercockpit/slurm.base:24.05.3
MAINTAINER Jan Eitzinger <jan.eitzinger@fau.de> LABEL org.opencontainers.image.authors="jan.eitzinger@fau.de"
# clean up # clean up
RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \ RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
@ -7,4 +7,5 @@ RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
&& rm -rf /var/cache/yum && rm -rf /var/cache/yum
COPY docker-entrypoint.sh /docker-entrypoint.sh COPY docker-entrypoint.sh /docker-entrypoint.sh
CMD ["/usr/sbin/init"]
ENTRYPOINT ["/docker-entrypoint.sh"] ENTRYPOINT ["/docker-entrypoint.sh"]
View File
@ -1,6 +1,10 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -e set -e
# Determine the system architecture dynamically
ARCH=$(uname -m)
SLURM_VERSION="24.05.3"
SLURM_JWT=daemon
SLURM_ACCT_DB_SQL=/slurm_acct_db.sql SLURM_ACCT_DB_SQL=/slurm_acct_db.sql
# start sshd server # start sshd server
@ -48,12 +52,16 @@ _wait_for_worker() {
# run slurmdbd # run slurmdbd
_slurmdbd() { _slurmdbd() {
cd /root/rpmbuild/RPMS/aarch64 cd /root/rpmbuild/RPMS/$ARCH
yum -y --nogpgcheck localinstall slurm-22.05.6-1.el8.aarch64.rpm \ yum -y --nogpgcheck localinstall slurm-$SLURM_VERSION*.$ARCH.rpm \
slurm-perlapi-22.05.6-1.el8.aarch64.rpm \ slurm-perlapi-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmdbd-22.05.6-1.el8.aarch64.rpm slurm-slurmdbd-$SLURM_VERSION*.$ARCH.rpm
mkdir -p /var/spool/slurm/d /var/log/slurm /etc/slurm mkdir -p /var/spool/slurm/d /var/log/slurm /etc/slurm
chown slurm: /var/spool/slurm/d /var/log/slurm chown -R slurm: /var/spool/slurm/d /var/log/slurm
mkdir -p /etc/config
chown -R slurm: /etc/config
if [[ ! -f /home/config/slurmdbd.conf ]]; then if [[ ! -f /home/config/slurmdbd.conf ]]; then
echo "### Missing slurmdbd.conf ###" echo "### Missing slurmdbd.conf ###"
exit exit
@ -62,10 +70,31 @@ _slurmdbd() {
cp /home/config/slurmdbd.conf /etc/slurm/slurmdbd.conf cp /home/config/slurmdbd.conf /etc/slurm/slurmdbd.conf
chown slurm: /etc/slurm/slurmdbd.conf chown slurm: /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf chmod 600 /etc/slurm/slurmdbd.conf
cp /etc/slurm/slurmdbd.conf /.secret/slurmdbd.conf
fi fi
echo "checking for jwt.key"
while [ ! -f /.secret/jwt_hs256.key ]; do
echo "."
sleep 1
done
mkdir -p /var/spool/slurm/statesave
chown slurm:slurm /var/spool/slurm/statesave
chmod 0755 /var/spool/slurm/statesave
cp /.secret/jwt_hs256.key /var/spool/slurm/statesave/jwt_hs256.key
chown slurm: /var/spool/slurm/statesave/jwt_hs256.key
chmod 0600 /var/spool/slurm/statesave/jwt_hs256.key
echo ""
sudo yum install -y nc
sudo yum install -y procps
sudo yum install -y iputils
echo "Starting slurmdbd" echo "Starting slurmdbd"
cp /etc/slurm/slurmdbd.conf /.secret/slurmdbd.conf /usr/sbin/slurmdbd -Dvv
/usr/sbin/slurmdbd echo "Started slurmdbd"
} }
### main ### ### main ###
View File
@ -1,3 +1,8 @@
#
# Example slurmdbd.conf file.
#
# See the slurmdbd.conf man page for more information.
#
# Archive info # Archive info
#ArchiveJobs=yes #ArchiveJobs=yes
#ArchiveDir="/tmp" #ArchiveDir="/tmp"
@ -8,16 +13,19 @@
# #
# Authentication info # Authentication info
AuthType=auth/munge AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2 #AuthInfo=/var/run/munge/munge.socket.2
# AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/var/spool/slurm/statesave/jwt_hs256.key
# slurmDBD info # slurmDBD info
DbdAddr=slurmdb DbdAddr=slurmdbd
DbdHost=slurmdb DbdHost=slurmdbd
DbdPort=6819 DbdPort=6819
SlurmUser=slurm SlurmUser=slurm
#MessageTimeout=300
DebugLevel=4 DebugLevel=4
#DefaultQOS=normal,standby
LogFile=/var/log/slurm/slurmdbd.log LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid # PidFile=/var/run/slurmdbd/slurmdbd.pid
#PluginDir=/usr/lib/slurm #PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs #PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes #TrackWCKey=yes
@ -25,7 +33,6 @@ PidFile=/var/run/slurmdbd.pid
# Database info # Database info
StorageType=accounting_storage/mysql StorageType=accounting_storage/mysql
StorageHost=mariadb StorageHost=mariadb
StoragePort=3306
StoragePass=demo
StorageUser=slurm StorageUser=slurm
StoragePass=demo
StorageLoc=slurm_acct_db StorageLoc=slurm_acct_db
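Once slurmdbd can reach the mariadb service with these credentials, the accounting link can be sanity-checked from the controller; a hedged sketch:
sacctmgr show cluster                                     # should list the "linux" cluster added by the controller entrypoint
sacctmgr list associations format=Cluster,Account,User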
View File
@ -1,5 +1,5 @@
FROM clustercockpit/slurm.base:22.05.6 FROM clustercockpit/slurm.base:24.05.3
MAINTAINER Jan Eitzinger <jan.eitzinger@fau.de> LABEL org.opencontainers.image.authors="jan.eitzinger@fau.de"
# clean up # clean up
RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \ RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
@ -7,4 +7,5 @@ RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
&& rm -rf /var/cache/yum && rm -rf /var/cache/yum
COPY docker-entrypoint.sh /docker-entrypoint.sh COPY docker-entrypoint.sh /docker-entrypoint.sh
CMD ["/usr/sbin/init"]
ENTRYPOINT ["/docker-entrypoint.sh"] ENTRYPOINT ["/docker-entrypoint.sh"]
View File
@ -1,108 +1,142 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -e set -e
# Determine the system architecture dynamically
ARCH=$(uname -m)
SLURM_VERSION="24.05.3"
# SLURMRESTD="/tmp/slurmrestd.socket"
SLURM_JWT=daemon
# start sshd server # start sshd server
_sshd_host() { _sshd_host() {
if [ ! -d /var/run/sshd ]; then if [ ! -d /var/run/sshd ]; then
mkdir /var/run/sshd mkdir /var/run/sshd
ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -N '' ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -N ''
fi
/usr/sbin/sshd
}
# setup worker ssh to be passwordless
_ssh_worker() {
if [[ ! -d /home/worker ]]; then
mkdir -p /home/worker
chown -R worker:worker /home/worker
fi fi
cat > /home/worker/setup-worker-ssh.sh <<EOF2 /usr/sbin/sshd
mkdir -p ~/.ssh
chmod 0700 ~/.ssh
ssh-keygen -b 2048 -t rsa -f ~/.ssh/id_rsa -q -N "" -C "$(whoami)@$(hostname)-$(date -I)"
cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
chmod 0640 ~/.ssh/authorized_keys
cat >> ~/.ssh/config <<EOF
Host *
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
LogLevel QUIET
EOF
chmod 0644 ~/.ssh/config
cd ~/
tar -czvf ~/worker-secret.tar.gz .ssh
cd -
EOF2
chmod +x /home/worker/setup-worker-ssh.sh
chown worker: /home/worker/setup-worker-ssh.sh
sudo -u worker /home/worker/setup-worker-ssh.sh
} }
# start munge and generate key # start munge using existing key
_munge_start() { _munge_start_using_key() {
if [ ! -f /.secret/munge.key ]; then
echo -n "checking for munge.key"
while [ ! -f /.secret/munge.key ]; do
echo -n "."
sleep 1
done
echo ""
fi
cp /.secret/munge.key /etc/munge/munge.key
chown -R munge: /etc/munge /var/lib/munge /var/log/munge /var/run/munge chown -R munge: /etc/munge /var/lib/munge /var/log/munge /var/run/munge
chmod 0700 /etc/munge chmod 0700 /etc/munge
chmod 0711 /var/lib/munge chmod 0711 /var/lib/munge
chmod 0700 /var/log/munge chmod 0700 /var/log/munge
chmod 0755 /var/run/munge chmod 0755 /var/run/munge
/sbin/create-munge-key -f
rngd -r /dev/urandom
/usr/sbin/create-munge-key -r -f
sh -c "dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key"
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
sudo -u munge /sbin/munged sudo -u munge /sbin/munged
munge -n munge -n
munge -n | unmunge munge -n | unmunge
remunge remunge
} }
# copy secrets to /.secret directory for other nodes _enable_slurmrestd() {
_copy_secrets() {
cp /home/worker/worker-secret.tar.gz /.secret/worker-secret.tar.gz cat >/usr/lib/systemd/system/slurmrestd.service <<EOF
cp /home/worker/setup-worker-ssh.sh /.secret/setup-worker-ssh.sh [Unit]
cp /etc/munge/munge.key /.secret/munge.key Description=Slurm REST daemon
rm -f /home/worker/worker-secret.tar.gz After=network-online.target slurmctld.service
rm -f /home/worker/setup-worker-ssh.sh Wants=network-online.target
ConditionPathExists=/etc/slurm/slurm.conf
[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmrestd
EnvironmentFile=-/etc/default/slurmrestd
# slurmrestd should not run as root or the slurm user.
# Please either use the -u and -g options in /etc/sysconfig/slurmrestd or
# /etc/default/slurmrestd, or explicitly set the User and Group in this file
# to an unprivileged user to run as.
User=slurm
Restart=always
RestartSec=5
# Group=
# Default to listen on both socket and slurmrestd port
ExecStart=/usr/sbin/slurmrestd -f /etc/config/slurmrestd.conf -a rest_auth/jwt $SLURMRESTD_OPTIONS -vvvvvv -s dbv0.0.39,v0.0.39 0.0.0.0:6820
# Enable auth/jwt be default, comment out the line to disable it for slurmrestd
Environment="SLURM_JWT=daemon"
ExecReload=/bin/kill -HUP $MAINPID
[Install]
WantedBy=multi-user.target
EOF
} }
# run slurmctld # run slurmrestd
_slurmctld() { _slurmrestd() {
cd /root/rpmbuild/RPMS/aarch64 cd /root/rpmbuild/RPMS/$ARCH
yum -y --nogpgcheck localinstall slurm-22.05.6-1.el8.aarch64.rpm \ yum -y --nogpgcheck localinstall slurm-$SLURM_VERSION*.$ARCH.rpm \
slurm-perlapi-22.05.6-1.el8.aarch64.rpm \ slurm-perlapi-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmd-22.05.6-1.el8.aarch64.rpm \ slurm-slurmd-$SLURM_VERSION*.$ARCH.rpm \
slurm-torque-22.05.6-1.el8.aarch64.rpm \ slurm-torque-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmctld-22.05.6-1.el8.aarch64.rpm \ slurm-slurmctld-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmrestd-22.05.6-1.el8.aarch64.rpm slurm-slurmrestd-$SLURM_VERSION*.$ARCH.rpm
echo -n "checking for slurmdbd.conf" echo -n "checking for slurmdbd.conf"
while [ ! -f /.secret/slurmdbd.conf ]; do while [ ! -f /.secret/slurmdbd.conf ]; do
echo -n "." echo -n "."
sleep 1 sleep 1
done done
echo "" echo ""
mkdir -p /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm /etc/slurm
chown -R slurm: /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm mkdir -p /etc/config /var/spool/slurm /var/spool/slurm/restd /var/spool/slurm/restd/rest /var/run/slurm
touch /var/log/slurmctld.log chown -R slurm: /etc/config /var/spool/slurm /var/spool/slurm/restd /var/spool/slurm/restd/rest /var/run/slurm
chown slurm: /var/log/slurmctld.log chmod 755 /var/run/slurm
if [[ ! -f /home/config/slurm.conf ]]; then
touch /var/log/slurmrestd.log
chown slurm: /var/log/slurmrestd.log
if [[ ! -f /home/config/slurmrestd.conf ]]; then
echo "### Missing slurm.conf ###" echo "### Missing slurm.conf ###"
exit exit
else else
echo "### use provided slurm.conf ###" echo "### use provided slurmrestd.conf ###"
cp /home/config/slurm.conf /etc/slurm/slurm.conf cp /home/config/slurmrestd.conf /etc/config/slurmrestd.conf
cp /home/config/slurm.conf /etc/config/slurm.conf
fi fi
sacctmgr -i add cluster "snowflake"
echo "checking for jwt.key"
while [ ! -f /.secret/jwt_hs256.key ]; do
echo "."
sleep 1
done
sudo yum install -y nc
sudo yum install -y procps
sudo yum install -y iputils
sudo yum install -y lsof
sudo yum install -y socat
mkdir -p /var/spool/slurm/statesave
chown slurm:slurm /var/spool/slurm/statesave
chmod 0755 /var/spool/slurm/statesave
cp /.secret/jwt_hs256.key /var/spool/slurm/statesave/jwt_hs256.key
chown slurm: /var/spool/slurm/statesave/jwt_hs256.key
chmod 0400 /var/spool/slurm/statesave/jwt_hs256.key
echo ""
sleep 2s sleep 2s
/usr/sbin/slurmctld echo "Starting slurmrestd"
cp -f /etc/slurm/slurm.conf /.secret/ # _enable_slurmrestd
# sudo ln -s /usr/lib/systemd/system/slurmrestd.service /etc/systemd/system/multi-user.target.wants/slurmrestd.service
SLURMRESTD_SECURITY=disable_user_check SLURMRESTD_DEBUG=9 /usr/sbin/slurmrestd -f /etc/config/slurmrestd.conf -a rest_auth/jwt -s dbv0.0.39,v0.0.39 -u slurm 0.0.0.0:6820
echo "Started slurmrestd"
} }
### main ### ### main ###
_sshd_host _sshd_host
_ssh_worker _munge_start_using_key
_munge_start _slurmrestd
_copy_secrets
_slurmctld
tail -f /dev/null tail -f /dev/null
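With the token from /.secret/jwt_token.txt, the REST endpoint started above can be probed using the v0.0.39 plugin selected in ExecStart. A hedged example (the reachable host/port mapping depends on the compose file):
export SLURM_JWT=$(cat /.secret/jwt_token.txt)
curl -s -H "X-SLURM-USER-NAME: root" -H "X-SLURM-USER-TOKEN: ${SLURM_JWT}" \
     http://localhost:6820/slurm/v0.0.39/diag | jq .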
View File
@ -0,0 +1,4 @@
#
# Example slurmdbd.conf file.
#
include /etc/config/slurm.conf
View File
@ -1,5 +1,5 @@
FROM clustercockpit/slurm.base:22.05.6 FROM clustercockpit/slurm.base:24.05.3
MAINTAINER Jan Eitzinger <jan.eitzinger@fau.de> LABEL org.opencontainers.image.authors="jan.eitzinger@fau.de"
# clean up # clean up
RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \ RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
@ -8,4 +8,5 @@ RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
WORKDIR /home/worker WORKDIR /home/worker
COPY docker-entrypoint.sh /docker-entrypoint.sh COPY docker-entrypoint.sh /docker-entrypoint.sh
CMD ["/usr/sbin/init"]
ENTRYPOINT ["/docker-entrypoint.sh"] ENTRYPOINT ["/docker-entrypoint.sh"]
View File
@ -1,3 +1,4 @@
CgroupPlugin=disabled
ConstrainCores=yes ConstrainCores=yes
ConstrainDevices=no ConstrainDevices=no
ConstrainRAMSpace=yes ConstrainRAMSpace=yes
View File
@ -1,6 +1,10 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -e set -e
# Determine the system architecture dynamically
ARCH=$(uname -m)
SLURM_VERSION="24.05.3"
# start sshd server # start sshd server
_sshd_host() { _sshd_host() {
if [ ! -d /var/run/sshd ]; then if [ ! -d /var/run/sshd ]; then
@ -12,6 +16,10 @@ _sshd_host() {
# start munge using existing key # start munge using existing key
_munge_start_using_key() { _munge_start_using_key() {
sudo yum install -y nc
sudo yum install -y procps
sudo yum install -y iputils
echo -n "cheking for munge.key" echo -n "cheking for munge.key"
while [ ! -f /.secret/munge.key ]; do while [ ! -f /.secret/munge.key ]; do
echo -n "." echo -n "."
@ -32,49 +40,67 @@ _munge_start_using_key() {
# wait for worker user in shared /home volume # wait for worker user in shared /home volume
_wait_for_worker() { _wait_for_worker() {
echo "checking for id_rsa.pub"
if [ ! -f /home/worker/.ssh/id_rsa.pub ]; then if [ ! -f /home/worker/.ssh/id_rsa.pub ]; then
echo -n "checking for id_rsa.pub" echo "checking for id_rsa.pub"
while [ ! -f /home/worker/.ssh/id_rsa.pub ]; do while [ ! -f /home/worker/.ssh/id_rsa.pub ]; do
echo -n "." echo -n "."
sleep 1 sleep 1
done done
echo "" echo ""
fi fi
echo "done checking for id_rsa.pub"
} }
_start_dbus() { _start_dbus() {
dbus-uuidgen > /var/lib/dbus/machine-id dbus-uuidgen >/var/lib/dbus/machine-id
mkdir -p /var/run/dbus mkdir -p /var/run/dbus
dbus-daemon --config-file=/usr/share/dbus-1/system.conf --print-address dbus-daemon --config-file=/usr/share/dbus-1/system.conf --print-address
} }
# run slurmd # run slurmd
_slurmd() { _slurmd() {
cd /root/rpmbuild/RPMS/aarch64 cd /root/rpmbuild/RPMS/$ARCH
yum -y --nogpgcheck localinstall slurm-22.05.6-1.el8.aarch64.rpm \ yum -y --nogpgcheck localinstall slurm-$SLURM_VERSION*.$ARCH.rpm \
slurm-perlapi-22.05.6-1.el8.aarch64.rpm \ slurm-perlapi-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmd-22.05.6-1.el8.aarch64.rpm \ slurm-slurmd-$SLURM_VERSION*.$ARCH.rpm \
slurm-torque-22.05.6-1.el8.aarch64.rpm slurm-torque-$SLURM_VERSION*.$ARCH.rpm
if [ ! -f /.secret/slurm.conf ]; then
echo -n "checking for slurm.conf" echo "checking for slurm.conf"
while [ ! -f /.secret/slurm.conf ]; do if [ ! -f /.secret/slurm.conf ]; then
echo -n "." echo "checking for slurm.conf"
sleep 1 while [ ! -f /.secret/slurm.conf ]; do
done echo -n "."
echo "" sleep 1
fi done
mkdir -p /var/spool/slurm/d /etc/slurm echo ""
chown slurm: /var/spool/slurm/d fi
cp /home/config/cgroup.conf /etc/slurm/cgroup.conf echo "found slurm.conf"
chown slurm: /etc/slurm/cgroup.conf
chmod 600 /etc/slurm/cgroup.conf # sudo yum install -y nc
cp /home/config/slurm.conf /etc/slurm/slurm.conf # sudo yum install -y procps
chown slurm: /etc/slurm/slurm.conf # sudo yum install -y iputils
chmod 600 /etc/slurm/slurm.conf
touch /var/log/slurmd.log mkdir -p /var/spool/slurm/d /etc/slurm /var/run/slurm/d /var/log/slurm
chown slurm: /var/log/slurmd.log chown slurm: /var/spool/slurm/d /var/run/slurm/d /var/log/slurm
echo -n "Starting slurmd" cp /home/config/cgroup.conf /etc/slurm/cgroup.conf
/usr/sbin/slurmd chown slurm: /etc/slurm/cgroup.conf
chmod 600 /etc/slurm/cgroup.conf
cp /home/config/slurm.conf /etc/slurm/slurm.conf
chown slurm: /etc/slurm/slurm.conf
chmod 600 /etc/slurm/slurm.conf
touch /var/log/slurm/slurmd.log
chown slurm: /var/log/slurm/slurmd.log
touch /var/run/slurm/d/slurmd.pid
chmod 600 /var/run/slurm/d/slurmd.pid
chown slurm: /var/run/slurm/d/slurmd.pid
echo "Starting slurmd"
/usr/sbin/slurmstepd infinity &
/usr/sbin/slurmd -Dvv
echo "Started slurmd"
} }
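Once this entrypoint has started slurmd, the registration can be verified from the controller side; a sketch, assuming the node name node01 from slurm.conf:
scontrol show node node01    # State should settle at IDLE once the node has registered
sinfo -N -l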
### main ### ### main ###