Mirror of https://github.com/ClusterCockpit/cc-docker.git (synced 2025-03-15 19:35:56 +01:00)

Merge pull request #2 from ClusterCockpit/dev
Preconfigured and updated docker services for CC components

Commit: a945a21bc1
.env | 30
@@ -2,15 +2,6 @@
# CCBACKEND DEVEL DOCKER SETTINGS
########################################################################

########################################################################
# SLURM
########################################################################
SLURM_VERSION=22.05.6
ARCH=aarch64
MUNGE_UID=981
SLURM_UID=982
WORKER_UID=1000

########################################################################
# INFLUXDB
########################################################################

@@ -22,27 +13,6 @@ INFLUXDB_BUCKET=ClusterCockpit
# Whether or not to check SSL Cert in Symfony Client, Default: false
INFLUXDB_SSL=false

########################################################################
# MARIADB
########################################################################
MARIADB_ROOT_PASSWORD=root
MARIADB_DATABASE=ClusterCockpit
MARIADB_USER=clustercockpit
MARIADB_PASSWORD=clustercockpit
MARIADB_PORT=3306

#########################################
# LDAP
########################################################################
LDAP_ADMIN_PASSWORD=mashup
LDAP_ORGANISATION=NHR@FAU
LDAP_DOMAIN=rrze.uni-erlangen.de

########################################################################
# PHPMyAdmin
########################################################################
PHPMYADMIN_PORT=8081

########################################################################
# INTERNAL SETTINGS
########################################################################

.gitignore (vendored) | 5
@@ -3,6 +3,11 @@ data/job-archive/**
data/influxdb
data/sqldata
data/cc-metric-store
data/cc-metric-store-source
data/ldap
data/mariadb
data/slurm
data
cc-backend
cc-backend/**
.vscode

README.md | 185 (Normal file → Executable file)
@@ -1,72 +1,173 @@
# cc-docker

This is a `docker-compose` setup that provides a quick-start environment for ClusterCockpit development and testing, using `cc-backend`.
A number of services are readily available as docker containers (nats, cc-metric-store, InfluxDB, LDAP), or easily added by manual configuration (MySQL).
A number of services are readily available as docker containers (nats, cc-metric-store, InfluxDB, LDAP, SLURM), or easily added by manual configuration (MariaDB).

It includes the following containers:
* nats (Default)
* cc-metric-store (Default)
* influxdb (Default)
* openldap (Default)
* mysql (Optional)
* mariadb (Optional)
* phpmyadmin (Optional)

|Service full name|docker service name|port|
| --- | --- | --- |
|Slurm Controller service|slurmctld|6817|
|Slurm Database service|slurmdbd|6819|
|Slurm Rest service with JWT authentication|slurmrestd|6820|
|Slurm Worker|node01|6818|
|MariaDB service|mariadb|3306|
|InfluxDB service|influxdb|8086|
|NATS service|nats|4222, 6222, 8222|
|cc-metric-store service|cc-metric-store|8084|
|OpenLDAP|openldap|389, 636|

The setup comes with fixture data for a Job archive, cc-metric-store checkpoints, InfluxDB, MySQL, and an LDAP user directory.
The setup comes with fixture data for a Job archive, cc-metric-store checkpoints, InfluxDB, MariaDB, and an LDAP user directory.

## Known Issues
## Prerequisites

* `docker-compose` installed on Ubuntu (18.04, 20.04) via `apt-get` cannot correctly parse `docker-compose.yml` due to version differences. Install the latest version of `docker-compose` from https://docs.docker.com/compose/install/ instead.
* You need to ensure that no other web server is running on ports 8080 (cc-backend), 8081 (phpmyadmin), 8084 (cc-metric-store), 8086 (InfluxDB), 4222 and 8222 (NATS), or 3306 (MySQL). If one or more ports are already in use, you have to adapt the related config accordingly.
* Existing VPN connections sometimes cause problems with docker. If `docker-compose` does not start up correctly, try disabling any active VPN connection. Refer to https://stackoverflow.com/questions/45692255/how-make-openvpn-work-with-docker for further information.
For all the docker services to work correctly, you will need the following tools installed:

## Configuration Templates
1. `docker` and `docker-compose`
2. `golang` (for compiling cc-metric-store)
3. `perl` (for migrateTimestamp.pl) with the Cpanel::JSON::XS, Data::Dumper, Time::Piece, Sort::Versions, and File::Slurp modules (see the `cpan` sketch after this list)
4. `npm` (for cc-backend)
5. `make` (for building the slurm base image)
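
The Perl modules from item 3 can be installed via `cpan`; a minimal sketch, with the module list taken from `scripts/prerequisite_installation_script.sh`:

```
# Install the Perl modules needed by migrateTimestamp.pl
# (may prompt for CPAN configuration on first run)
sudo cpan Cpanel::JSON::XS Data::Dumper Time::Piece Sort::Versions File::Slurp
```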

Located in `./templates`:
* `docker-compose.yml.default`: Docker-Compose file to set up cc-metric-store, InfluxDB, MariaDB, PhpMyAdmin, and LDAP containers (Default). Used in `setupDev.sh`.
* `docker-compose.yml.mysql`: Docker-Compose configuration template if MySQL is desired instead of MariaDB.
* `env.default`: Environment variables for a setup with cc-metric-store, InfluxDB, MariaDB, PhpMyAdmin, and LDAP containers (Default). Used in `setupDev.sh`.
* `env.mysql`: Additional environment variables required if MySQL is desired instead of MariaDB.

It is also recommended to add your user to the `docker` group, since the `setupDev.sh` script assumes that the `docker` and `docker-compose` commands can be run without an explicit `sudo`.

You can use:

```
sudo groupadd docker
sudo usermod -aG docker $USER

# reboot after adding your user to the docker group
sudo shutdown -r -t 0
```

Note: You can install all these dependencies via the predefined installation steps in `prerequisite_installation_script.sh`.

If you are using a different Linux flavor, you will have to adapt `prerequisite_installation_script.sh` as well as `setupDev.sh`.

## Setup

1. Clone the `cc-backend` repository into the chosen base folder: `$> git clone https://github.com/ClusterCockpit/cc-backend.git`

2. Run `$ ./setupDev.sh`: **NOTICE** The script will download files of a total size of 338MB (mostly for the InfluxDB data).
2. Run `$ ./setupDev.sh`: **NOTICE** The script will download files of a total size of 338MB (mostly for the cc-metric-store data).

3. The setup script launches the supporting container stack in the background automatically if everything went well. Run `$> ./cc-backend/cc-backend` to start `cc-backend`.
3. The setup script launches the supporting container stack in the background automatically if everything went well. Run `$> ./cc-backend/cc-backend -server -dev` to start `cc-backend`.

4. By default, you can access `cc-backend` in your browser at `http://localhost:8080`. You can shut down the cc-backend server by pressing `CTRL-C`; remember to also shut down all containers via `$> docker-compose down` afterwards.

5. You can restart the containers with: `$> docker-compose up -d`.

## Post-Setup Adjustment for using `influxdb`

When using `influxdb` as a metric database, one must adjust the following files:
* `cc-backend/var/job-archive/emmy/cluster.json`
* `cc-backend/var/job-archive/woody/cluster.json`

In the JSON, exchange the content of the `metricDataRepository` entry (by default configured for `cc-metric-store`) with:
```
"metricDataRepository": {
    "kind": "influxdb",
    "url": "http://localhost:8086",
    "token": "egLfcf7fx0FESqFYU3RpAAbj",
    "bucket": "ClusterCockpit",
    "org": "ClusterCockpit",
    "skiptls": false
}
```

## Usage
## Credentials for logging into ClusterCockpit

Credentials for the preconfigured demo user are:
* User: `demo`
* Password: `AdminDev`
* Password: `demo`

Credentials for the preconfigured LDAP user are:
* User: `ldapuser`
* Password: `ldapuser`

You can also log in as a regular user using any credentials from the LDAP user directory at `./data/ldap/users.ldif`.
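
To verify that the LDAP container actually serves this directory, you can query it from the host; a minimal sketch, assuming the admin credentials preconfigured in `docker-compose.yml` (`cn=admin,dc=example,dc=com`, password `mashup`):

```
# Look up the demo user in the openldap container
ldapsearch -x -H ldap://localhost:389 \
  -D "cn=admin,dc=example,dc=com" -w mashup \
  -b "ou=users,dc=example,dc=com" "(uid=ldapuser)"
```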

## Preconfigured setup between docker services and ClusterCockpit components

Once you have cloned the cc-backend repository and executed `setupDev.sh`, the script copies a preconfigured `config.json` from `misc/config.json` over `cc-backend/config.json`; cc-backend uses this file once you start the server.
The preconfigured config.json attaches to:
#### 1. MariaDB docker service on port 3306 (database: ccbackend)
#### 2. OpenLDAP docker service on port 389
#### 3. cc-metric-store docker service on port 8084

cc-metric-store also has a preconfigured `config.json` in `cc-metric-store/config.json`, which attaches to the NATS docker service on port 4222 and subscribes to the subject 'hpc-nats'.
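
To watch the metric lines cc-metric-store consumes, you can subscribe to the same subject with the NATS CLI; a sketch using the root credentials from `docker-compose.yml`:

```
# Print everything published to 'hpc-nats' (CTRL-C to stop)
nats sub hpc-nats -s nats://localhost:4222 --user root --password root
```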

Basically, all the ClusterCockpit components and the docker services attach to each other like Lego pieces.

## Docker commands to access the services

> Note: You need to be in the cc-docker directory in order to execute any docker command.

You can view all docker processes running on any of the VM instances by using this command:

```
$ docker ps
```

Now that you can see the docker services, if you want to access one of them manually, you have to run the **`bash`** command in that running service.

> **`Example`**: You want to run slurm commands like `sinfo`, `squeue`, or `scontrol` on the slurm controller; you cannot access it directly.

You need to **`bash`** into the running service by using the following command:

```
$ docker exec -it <docker service name> bash

#example
$ docker exec -it slurmctld bash

#or
$ docker exec -it mariadb bash
```

Once you start a **`bash`** shell in any docker service, you may execute any service-related commands in that **`bash`**.

But for ClusterCockpit development, you only need the ports to access these docker services: use `localhost:<port>` when trying to reach any of them, and configure `cc-backend/config.json` based on these services and ports.
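
For reference, this is the `metricDataRepository` entry that the preconfigured `misc/config.json` (added by this commit) uses to reach the cc-metric-store container; the token placeholder stands for the demo JWT contained in that file:

```
"metricDataRepository": {
    "kind": "cc-metric-store",
    "url": "http://0.0.0.0:8084",
    "token": "<demo JWT from misc/config.json>"
}
```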

## Slurm setup in cc-docker

### 1. Slurm controller

Currently the slurm controller is aware of the single node set up in our mini cluster, i.e. node01.

In order to execute slurm commands, you may need to **`bash`** into the **`slurmctld`** docker service.

```
$ docker exec -it slurmctld bash
```

Then you can run slurm controller commands. A few examples (output omitted) are:

```
$ sinfo

or

$ squeue

or

$ scontrol show nodes
```

### 2. Slurm rest service

You do not need to **`bash`** into the slurmrestd service, but can directly access the REST API via localhost:6820. A simple example of how to query the slurm REST API with `curl` is given in `curl_slurmrestd.sh`.

You can directly use `curl_slurmrestd.sh` with a never-expiring JWT token (found in `data/slurm/secret/jwt_token.txt`).

You may also use the never-expiring token directly from the file for any of your custom `curl` commands.
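
A minimal sketch of such a custom call, mirroring `misc/curl_slurmrestd.sh` (the `/ping` endpoint is the one referenced in that script):

```
SLURM_JWT=$(cat data/slurm/secret/jwt_token.txt)
curl -s 'http://localhost:6820/slurm/v0.0.39/ping' \
  -H "X-SLURM-USER-NAME: root" -H "X-SLURM-USER-TOKEN: $SLURM_JWT"
```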

## Known Issues

* `docker-compose` installed on Ubuntu (18.04, 20.04) via `apt-get` cannot correctly parse `docker-compose.yml` due to version differences. Install the latest version of `docker-compose` from https://docs.docker.com/compose/install/ instead.
* You need to ensure that no other web server is running on ports 8080 (cc-backend), 8084 (cc-metric-store), 8086 (InfluxDB), 4222 and 8222 (NATS), or 3306 (MariaDB). If one or more ports are already in use, you have to adapt the related config accordingly.
* Existing VPN connections sometimes cause problems with docker. If `docker-compose` does not start up correctly, try disabling any active VPN connection. Refer to https://stackoverflow.com/questions/45692255/how-make-openvpn-work-with-docker for further information.
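
A quick way to check whether any of these ports is already taken (a sketch; `ss` is part of iproute2 on most Linux distributions):

```
# Show listening sockets on the ports used by the cc-docker services
sudo ss -tlnp | grep -E ':(8080|8084|8086|4222|8222|3306)\b'
```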

## Docker services and restarting the services

You can find all the docker services in `docker-compose.yml`. Feel free to modify it.

Whenever you modify it, please use

```
$ docker compose down
```

in order to shut down all the services on all the VMs (maininstance, nodeinstance, nodeinstance2), and then start all the services again by using

```
$ docker compose up
```

TODO: Update job archive and all other metric data.
The job archive with 1867 jobs originates from the second half of 2020.
Roughly 2700 jobs from the first week of 2021 are loaded with data from InfluxDB.

@@ -1,10 +1,12 @@
FROM golang:1.17
FROM golang:1.22.4

RUN apt-get update
RUN apt-get -y install git

RUN rm -rf /cc-metric-store

RUN git clone https://github.com/ClusterCockpit/cc-metric-store.git /cc-metric-store
RUN cd /cc-metric-store && go build
RUN cd /cc-metric-store && go build ./cmd/cc-metric-store

# Reactivate when latest commit is available
#RUN go get -d -v github.com/ClusterCockpit/cc-metric-store

@@ -1,28 +1,201 @@
{
    "metrics": {
        "clock": { "frequency": 60, "aggregation": null, "scope": "node" },
        "cpi": { "frequency": 60, "aggregation": null, "scope": "node" },
        "cpu_load": { "frequency": 60, "aggregation": null, "scope": "node" },
        "flops_any": { "frequency": 60, "aggregation": null, "scope": "node" },
        "flops_dp": { "frequency": 60, "aggregation": null, "scope": "node" },
        "flops_sp": { "frequency": 60, "aggregation": null, "scope": "node" },
        "ib_bw": { "frequency": 60, "aggregation": null, "scope": "node" },
        "lustre_bw": { "frequency": 60, "aggregation": null, "scope": "node" },
        "mem_bw": { "frequency": 60, "aggregation": null, "scope": "node" },
        "mem_used": { "frequency": 60, "aggregation": null, "scope": "node" },
        "rapl_power": { "frequency": 60, "aggregation": null, "scope": "node" }
        "debug_metric": { "frequency": 60, "aggregation": "avg" },
        "clock": { "frequency": 60, "aggregation": "avg" },
        "cpu_idle": { "frequency": 60, "aggregation": "avg" },
        "cpu_iowait": { "frequency": 60, "aggregation": "avg" },
        "cpu_irq": { "frequency": 60, "aggregation": "avg" },
        "cpu_system": { "frequency": 60, "aggregation": "avg" },
        "cpu_user": { "frequency": 60, "aggregation": "avg" },
        "nv_mem_util": { "frequency": 60, "aggregation": "avg" },
        "nv_temp": { "frequency": 60, "aggregation": "avg" },
        "nv_sm_clock": { "frequency": 60, "aggregation": "avg" },
        "acc_utilization": { "frequency": 60, "aggregation": "avg" },
        "acc_mem_used": { "frequency": 60, "aggregation": "sum" },
        "acc_power": { "frequency": 60, "aggregation": "sum" },
        "flops_any": { "frequency": 60, "aggregation": "sum" },
        "flops_dp": { "frequency": 60, "aggregation": "sum" },
        "flops_sp": { "frequency": 60, "aggregation": "sum" },
        "ib_recv": { "frequency": 60, "aggregation": "sum" },
        "ib_xmit": { "frequency": 60, "aggregation": "sum" },
        "ib_recv_pkts": { "frequency": 60, "aggregation": "sum" },
        "ib_xmit_pkts": { "frequency": 60, "aggregation": "sum" },
        "cpu_power": { "frequency": 60, "aggregation": "sum" },
        "core_power": { "frequency": 60, "aggregation": "sum" },
        "mem_power": { "frequency": 60, "aggregation": "sum" },
        "ipc": { "frequency": 60, "aggregation": "avg" },
        "cpu_load": { "frequency": 60, "aggregation": null },
        "lustre_close": { "frequency": 60, "aggregation": null },
        "lustre_open": { "frequency": 60, "aggregation": null },
        "lustre_statfs": { "frequency": 60, "aggregation": null },
        "lustre_read_bytes": { "frequency": 60, "aggregation": null },
        "lustre_write_bytes": { "frequency": 60, "aggregation": null },
        "net_bw": { "frequency": 60, "aggregation": null },
        "file_bw": { "frequency": 60, "aggregation": null },
        "mem_bw": { "frequency": 60, "aggregation": "sum" },
        "mem_cached": { "frequency": 60, "aggregation": null },
        "mem_used": { "frequency": 60, "aggregation": null },
        "net_bytes_in": { "frequency": 60, "aggregation": null },
        "net_bytes_out": { "frequency": 60, "aggregation": null },
        "nfs4_read": { "frequency": 60, "aggregation": null },
        "nfs4_total": { "frequency": 60, "aggregation": null },
        "nfs4_write": { "frequency": 60, "aggregation": null },
        "vectorization_ratio": { "frequency": 60, "aggregation": "avg" }
    },
    "checkpoints": {
        "interval": 100000000000,
        "interval": "12h",
        "directory": "/data/checkpoints",
        "restore": 100000000000
        "restore": "48h"
    },
    "archive": {
        "interval": 100000000000,
        "interval": "50h",
        "directory": "/data/archive"
    },
    "retention-in-memory": 100000000000,
    "http-api-address": "0.0.0.0:8081",
    "nats": "nats://cc-nats:4222",
    "http-api": {
        "address": "0.0.0.0:8084",
        "https-cert-file": null,
        "https-key-file": null
    },
    "retention-in-memory": "48h",
    "nats": [
        {
            "address": "nats://nats:4222",
            "username": "root",
            "password": "root",
            "subscriptions": [
                { "subscribe-to": "hpc-nats", "cluster-tag": "fritz" },
                { "subscribe-to": "hpc-nats", "cluster-tag": "alex" }
            ]
        }
    ],
    "jwt-public-key": "kzfYrYy+TzpanWZHJ5qSdMj5uKUWgq74BWhQG6copP0="
}

data/init.sh | 34
@@ -1,34 +0,0 @@
#!/usr/bin/env bash

if [ -d symfony ]; then
    echo "Data already initialized!"
    echo -n "Perform a fresh initialisation? [yes to proceed / no to exit] "
    read -r answer
    if [ "$answer" == "yes" ]; then
        echo "Cleaning directories ..."
        rm -rf symfony
        rm -rf job-archive
        rm -rf influxdb/data/*
        rm -rf sqldata/*
        echo "done."
    else
        echo "Aborting ..."
        exit
    fi
fi

mkdir symfony
wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/job-archive_stable.tar.xz
tar xJf job-archive_stable.tar.xz
rm ./job-archive_stable.tar.xz

# 101 is the uid and gid of the user and group www-data in the cc-php container running php-fpm.
# For a demo with no new jobs it is enough to give www read permissions on that directory.
# echo "This script needs to chown the job-archive directory so that the application can write to it:"
# sudo chown -R 82:82 ./job-archive

mkdir -p influxdb/data
wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/influxdbv2-data_stable.tar.xz
cd influxdb/data
tar xJf ../../influxdbv2-data_stable.tar.xz
rm ../../influxdbv2-data_stable.tar.xz

data/ldap/users.ldif | 1027
File diff suppressed because it is too large.
@@ -1,5 +0,0 @@
[mysqld]
innodb_buffer_pool_size=4096M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
max_allowed_packet=16M

@@ -1,48 +0,0 @@
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=snowflake
SlurmctldHost=slurmctld
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
SlurmdSpoolDir=/var/spool/slurm/d
StateSaveLocation=/var/spool/slurm/ctld
SwitchType=switch/none
TaskPlugin=task/affinity
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
# LOGGING AND ACCOUNTING
AccountingStorageHost=slurmdb
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageUser=slurm
AccountingStoreFlags=job_script,job_comment,job_env,job_extra
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#
# COMPUTE NODES
NodeName=node0[1-2] CPUs=1 State=UNKNOWN
PartitionName=main Nodes=ALL Default=YES MaxTime=INFINITE State=UP

dataGenerationScript.sh | 139 (new executable file)
|
||||
#!/bin/bash
|
||||
echo ""
|
||||
echo "|--------------------------------------------------------------------------------------|"
|
||||
echo "| This is Data generation script for docker services |"
|
||||
echo "| Starting file required by docker services in data/ |"
|
||||
echo "|--------------------------------------------------------------------------------------|"
|
||||
|
||||
# Download unedited checkpoint files to ./data/cc-metric-store-source/checkpoints
|
||||
# After this, migrateTimestamp.pl will run from setupDev.sh. This will update the timestamps
|
||||
# for all the checkpoint files, which then can be read by cc-metric-store.
|
||||
# cc-metric-store reads only data upto certain time, like 48 hours of data.
|
||||
# These checkpoint files have timestamp older than 48 hours and needs to be updated with
|
||||
# migrateTimestamp.pl file, which will be automatically invoked from setupDev.sh.
|
||||
if [ ! -d data/cc-metric-store-source ]; then
|
||||
mkdir -p data/cc-metric-store-source/checkpoints
|
||||
cd data/cc-metric-store-source/checkpoints
|
||||
wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/cc-metric-store-checkpoints.tar.xz
|
||||
tar xf cc-metric-store-checkpoints.tar.xz
|
||||
rm cc-metric-store-checkpoints.tar.xz
|
||||
cd ../../../
|
||||
else
|
||||
echo "'data/cc-metric-store-source' already exists!"
|
||||
fi
|
||||
|
||||
# A simple configuration file for mariadb docker service.
|
||||
# Required because you can specify only one database per docker service.
|
||||
# This file mentions the database to be created for cc-backend.
|
||||
# This file automatically picked by mariadb after the docker service starts.
|
||||
if [ ! -d data/mariadb ]; then
|
||||
mkdir -p data/mariadb
|
||||
cat > data/mariadb/01.databases.sql <<EOF
|
||||
CREATE DATABASE IF NOT EXISTS \`ccbackend\`;
|
||||
EOF
|
||||
else
|
||||
echo "'data/mariadb' already exists!"
|
||||
fi
|
||||
|
||||
# A simple configuration file for openldap docker service.
|
||||
# Creates a simple user 'ldapuser' with password 'ldapuser'.
|
||||
# This file automatically picked by openldap after the docker service starts.
|
||||
if [ ! -d data/ldap ]; then
|
||||
mkdir -p data/ldap
|
||||
cat > data/ldap/add_users.ldif <<EOF
|
||||
dn: ou=users,dc=example,dc=com
|
||||
objectClass: organizationalUnit
|
||||
ou: users
|
||||
|
||||
dn: uid=ldapuser,ou=users,dc=example,dc=com
|
||||
objectClass: inetOrgPerson
|
||||
objectClass: posixAccount
|
||||
objectClass: top
|
||||
cn: Ldap User
|
||||
sn: User
|
||||
uid: ldapuser
|
||||
uidNumber: 1
|
||||
gidNumber: 1
|
||||
homeDirectory: /home/ldapuser
|
||||
userPassword: {SSHA}sQRqFQtuiupej7J/rbrQrTwYEHDduV+N
|
||||
EOF
|
||||
|
||||
else
|
||||
echo "'data/ldap' already exists!"
|
||||
fi
|
||||
|
||||
# A simple configuration file for nats docker service.
|
||||
# Required because we need to execute custom commands after nats docker service starts.
|
||||
# This file automatically executed when the nats docker service starts.
|
||||
# After docker service starts, there is an infinite while loop that publises data for 'fritz' and 'alex' cluster
|
||||
# to subject 'hpc-nats' every 1 minute. Random data is generated only for node level metrics, not hardware level metrics.
|
||||
if [ ! -d data/nats ]; then
|
||||
mkdir -p data/nats
|
||||
cat > data/nats/docker-entrypoint.sh <<EOF
|
||||
#!/bin/sh
|
||||
set -e
|
||||
|
||||
# Start NATS server in the background
|
||||
nats-server --user root --pass root --http_port 8222 &
|
||||
|
||||
# Wait for NATS to be ready
|
||||
until nc -z 0.0.0.0 4222; do
|
||||
echo "Waiting for NATS to start..."
|
||||
sleep 1
|
||||
done
|
||||
|
||||
echo "NATS is up and running. Executing custom script..."
|
||||
|
||||
apk add curl
|
||||
curl -sf https://binaries.nats.dev/nats-io/natscli/nats@latest | sh
|
||||
|
||||
# This is a dummy data generation loop, that inserts data for given nodes at 1 min interval
|
||||
while true; do
|
||||
|
||||
# Timestamp in seconds
|
||||
timestamp="\$(date '+%s')"
|
||||
|
||||
# Generate data for alex cluster. Push to sample_alex.txt
|
||||
for metric in cpu_irq cpu_load mem_cached net_bytes_in cpu_user cpu_idle nfs4_read mem_used nfs4_write nfs4_total ib_xmit ib_xmit_pkts net_bytes_out cpu_iowait ib_recv cpu_system ib_recv_pkts; do
|
||||
for hostname in a0603 a0903 a0832 a0329 a0702 a0122 a1624 a0731 a0224 a0704 a0631 a0225 a0222 a0427 a0603 a0429 a0833 a0705 a0901 a0601 a0227 a0804 a0322 a0226 a0126 a0129 a0605 a0801 a0934; do
|
||||
echo "\$metric,cluster=alex,hostname=\$hostname,type=node value=\$((1 + RANDOM % 100)).0 \$timestamp" >>sample_alex.txt
|
||||
done
|
||||
done
|
||||
|
||||
# Nats client will publish the data from sample_alex.txt to 'hpc-nats' subject on this nats server
|
||||
./nats pub hpc-nats "\$(cat sample_alex.txt)" -s nats://0.0.0.0:4222 --user root --password root
|
||||
|
||||
# Generate data for fritz cluster. Push to sample_fritz.txt
|
||||
for metric in cpu_irq cpu_load mem_cached net_bytes_in cpu_user cpu_idle nfs4_read mem_used nfs4_write nfs4_total ib_xmit ib_xmit_pkts net_bytes_out cpu_iowait ib_recv cpu_system ib_recv_pkts; do
|
||||
for hostname in f0201 f0202 f0203 f0204 f0205 f0206 f0207 f0208 f0209 f0210 f0211 f0212 f0213 f0214 f0215 f0217 f0218 f0219 f0220 f0221 f0222 f0223 f0224 f0225 f0226 f0227 f0228 f0229; do
|
||||
echo "\$metric,cluster=fritz,hostname=\$hostname,type=node value=\$((1 + RANDOM % 100)).0 \$timestamp" >>sample_fritz.txt
|
||||
done
|
||||
done
|
||||
|
||||
# Nats client will publish the data from sample_fritz.txt to 'hpc-nats' subject on this nats server
|
||||
./nats pub hpc-nats "\$(cat sample_fritz.txt)" -s nats://0.0.0.0:4222 --user root --password root
|
||||
|
||||
rm sample_alex.txt
|
||||
rm sample_fritz.txt
|
||||
|
||||
sleep 1m
|
||||
|
||||
done
|
||||
EOF
|
||||
|
||||
else
|
||||
echo "'data/nats' already exists!"
|
||||
fi
|
||||
|
||||
# prepare folders for influxdb3
|
||||
if [ ! -d data/influxdb ]; then
|
||||
mkdir -p data/influxdb/data
|
||||
mkdir -p data/influxdb/config
|
||||
else
|
||||
echo "'data/influxdb' already exists!"
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo "|--------------------------------------------------------------------------------------|"
|
||||
echo "| Finished generating relevant files for docker services in data/ |"
|
||||
echo "|--------------------------------------------------------------------------------------|"
|
docker-compose.yml | 114 (Normal file → Executable file)
@@ -3,15 +3,19 @@ services:
    container_name: nats
    image: nats:alpine
    ports:
      - "4222:4222"
      - "8222:8222"
      - "0.0.0.0:4222:4222"
      - "0.0.0.0:8222:8222"
      - "0.0.0.0:6222:6222"
    volumes:
      - ${DATADIR}/nats:/data
    entrypoint: ["/bin/sh", "/data/docker-entrypoint.sh"]

  cc-metric-store:
    container_name: cc-metric-store
    build:
      context: ./cc-metric-store
    ports:
      - "8084:8084"
      - "0.0.0.0:8084:8084"
    volumes:
      - ${DATADIR}/cc-metric-store:/data
    depends_on:
@@ -19,8 +23,8 @@ services:

  influxdb:
    container_name: influxdb
    image: influxdb
    command: ["--reporting-disabled"]
    image: influxdb:latest
    command: ["--reporting-disabled", "--log-level=debug"]
    environment:
      DOCKER_INFLUXDB_INIT_MODE: setup
      DOCKER_INFLUXDB_INIT_USERNAME: devel
@@ -30,7 +34,7 @@ services:
      DOCKER_INFLUXDB_INIT_RETENTION: 100w
      DOCKER_INFLUXDB_INIT_ADMIN_TOKEN: ${INFLUXDB_ADMIN_TOKEN}
    ports:
      - "127.0.0.1:${INFLUXDB_PORT}:8086"
      - "0.0.0.0:8086:8086"
    volumes:
      - ${DATADIR}/influxdb/data:/var/lib/influxdb2
      - ${DATADIR}/influxdb/config:/etc/influxdb2
@@ -40,9 +44,15 @@ services:
    image: osixia/openldap:1.5.0
    command: --copy-service --loglevel debug
    environment:
      - LDAP_ADMIN_PASSWORD=${LDAP_ADMIN_PASSWORD}
      - LDAP_ORGANISATION=${LDAP_ORGANISATION}
      - LDAP_DOMAIN=${LDAP_DOMAIN}
      - LDAP_ADMIN_PASSWORD=mashup
      - LDAP_ORGANISATION=Example Organization
      - LDAP_DOMAIN=example.com
      - LDAP_LOGGING=true
      - LDAP_CONNECTION=default
      - LDAP_CONNECTIONS=default
      - LDAP_DEFAULT_HOSTS=0.0.0.0
    ports:
      - "0.0.0.0:389:389"
    volumes:
      - ${DATADIR}/ldap:/container/service/slapd/assets/config/bootstrap/ldif/custom

@@ -51,36 +61,18 @@ services:
    image: mariadb:latest
    command: ["--default-authentication-plugin=mysql_native_password"]
    environment:
      MARIADB_ROOT_PASSWORD: ${MARIADB_ROOT_PASSWORD}
      MARIADB_ROOT_PASSWORD: root
      MARIADB_DATABASE: slurm_acct_db
      MARIADB_USER: slurm
      MARIADB_PASSWORD: demo
    ports:
      - "127.0.0.1:${MARIADB_PORT}:3306"
      - "0.0.0.0:3306:3306"
    volumes:
      - ${DATADIR}/mariadb:/etc/mysql/conf.d
      # - ${DATADIR}/sql-init:/docker-entrypoint-initdb.d
      - ${DATADIR}/mariadb:/docker-entrypoint-initdb.d
    cap_add:
      - SYS_NICE

  # mysql:
  #   container_name: mysql
  #   image: mysql:8.0.22
  #   command: ["--default-authentication-plugin=mysql_native_password"]
  #   environment:
  #     MYSQL_ROOT_PASSWORD: ${MYSQL_ROOT_PASSWORD}
  #     MYSQL_DATABASE: ${MYSQL_DATABASE}
  #     MYSQL_USER: ${MYSQL_USER}
  #     MYSQL_PASSWORD: ${MYSQL_PASSWORD}
  #   ports:
  #     - "127.0.0.1:${MYSQL_PORT}:3306"
  #   # volumes:
  #   #   - ${DATADIR}/sql-init:/docker-entrypoint-initdb.d
  #   #   - ${DATADIR}/sqldata:/var/lib/mysql
  #   cap_add:
  #     - SYS_NICE

  slurm-controller:
  slurmctld:
    container_name: slurmctld
    hostname: slurmctld
    build:
@@ -89,40 +81,66 @@ services:
    volumes:
      - ${DATADIR}/slurm/home:/home
      - ${DATADIR}/slurm/secret:/.secret
      - ./slurm/controller/slurm.conf:/home/config/slurm.conf
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
      - ${DATADIR}/slurm/state:/var/lib/slurm/d
    ports:
      - "6817:6817"

  slurm-database:
    container_name: slurmdb
    hostname: slurmdb
  slurmdbd:
    container_name: slurmdbd
    hostname: slurmdbd
    build:
      context: ./slurm/database
    depends_on:
      - mariadb
      - slurm-controller
      - slurmctld
    privileged: true
    volumes:
      - ${DATADIR}/slurm/home:/home
      - ${DATADIR}/slurm/secret:/.secret
      - ./slurm/database/slurmdbd.conf:/home/config/slurmdbd.conf
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    ports:
      - "6819:6819"

  slurm-worker01:
  node01:
    container_name: node01
    hostname: node01
    build:
      context: ./slurm/worker
    depends_on:
      - slurm-controller
      - slurmctld
    privileged: true
    volumes:
      - ${DATADIR}/slurm/home:/home
      - ${DATADIR}/slurm/secret:/.secret
      - ./slurm/worker/cgroup.conf:/home/config/cgroup.conf
      - ./slurm/controller/slurm.conf:/home/config/slurm.conf
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    ports:
      - "6818:6818"

  # slurm-worker02:
  #   container_name: node02
  #   hostname: node02
  #   build:
  #     context: ./slurm/worker
  #   depends_on:
  #     - slurm-controller
  #   privileged: true
  #   volumes:
  #     - ${DATADIR}/slurm/home:/home
  #     - ${DATADIR}/slurm/secret:/.secret
  slurmrestd:
    container_name: slurmrestd
    hostname: slurmrestd
    build:
      context: ./slurm/rest
    environment:
      - SLURM_JWT=daemon
      - SLURMRESTD_DEBUG=9
    depends_on:
      - slurmctld
    privileged: true
    volumes:
      - ${DATADIR}/slurm/home:/home
      - ${DATADIR}/slurm/secret:/.secret
      - ./slurm/controller/slurm.conf:/home/config/slurm.conf
      - ./slurm/rest/slurmrestd.conf:/home/config/slurmrestd.conf
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    ports:
      - "6820:6820"

@@ -1,5 +0,0 @@
SLURM_VERSION=22.05.6
ARCH=aarch64
MUNGE_UID=981
SLURM_UID=982
WORKER_UID=1000

@@ -9,7 +9,6 @@ use File::Slurp;
use Data::Dumper;
use Time::Piece;
use Sort::Versions;
use REST::Client;

### JOB-ARCHIVE
my $localtime = localtime;
@@ -19,80 +18,80 @@ my $archiveSrc = './data/job-archive-source';
my @ArchiveClusters;

# Get clusters by job-archive/$subfolder
opendir my $dh, $archiveSrc or die "can't open directory: $!";
while ( readdir $dh ) {
chomp; next if $_ eq '.' or $_ eq '..' or $_ eq 'job-archive';
# opendir my $dh, $archiveSrc or die "can't open directory: $!";
# while ( readdir $dh ) {
#     chomp; next if $_ eq '.' or $_ eq '..' or $_ eq 'job-archive' or $_ eq 'version.txt';

my $cluster = $_;
push @ArchiveClusters, $cluster;
}
#     my $cluster = $_;
#     push @ArchiveClusters, $cluster;
# }

# start for jobarchive
foreach my $cluster ( @ArchiveClusters ) {
print "Starting to update start- and stoptimes in job-archive for $cluster\n";
# # start for jobarchive
# foreach my $cluster ( @ArchiveClusters ) {
#     print "Starting to update start- and stoptimes in job-archive for $cluster\n";

opendir my $dhLevel1, "$archiveSrc/$cluster" or die "can't open directory: $!";
while ( readdir $dhLevel1 ) {
chomp; next if $_ eq '.' or $_ eq '..';
my $level1 = $_;
#     opendir my $dhLevel1, "$archiveSrc/$cluster" or die "can't open directory: $!";
#     while ( readdir $dhLevel1 ) {
#         chomp; next if $_ eq '.' or $_ eq '..';
#         my $level1 = $_;

if ( -d "$archiveSrc/$cluster/$level1" ) {
opendir my $dhLevel2, "$archiveSrc/$cluster/$level1" or die "can't open directory: $!";
while ( readdir $dhLevel2 ) {
chomp; next if $_ eq '.' or $_ eq '..';
my $level2 = $_;
my $jobSource = "$archiveSrc/$cluster/$level1/$level2";
my $jobTarget = "$archiveTarget/$cluster/$level1/$level2/";
my $jobOrigin = $jobSource;
# check if files are directly accessible (old format) else get subfolders as file and update path
if ( ! -e "$jobSource/meta.json") {
my @folders = read_dir($jobSource);
if (!@folders) {
next;
}
# Only use first subfolder for now TODO
$jobSource = "$jobSource/".$folders[0];
}
# check if subfolder contains file, else remove source and skip
if ( ! -e "$jobSource/meta.json") {
# rmtree $jobOrigin;
next;
}
#         if ( -d "$archiveSrc/$cluster/$level1" ) {
#             opendir my $dhLevel2, "$archiveSrc/$cluster/$level1" or die "can't open directory: $!";
#             while ( readdir $dhLevel2 ) {
#                 chomp; next if $_ eq '.' or $_ eq '..';
#                 my $level2 = $_;
#                 my $jobSource = "$archiveSrc/$cluster/$level1/$level2";
#                 my $jobTarget = "$archiveTarget/$cluster/$level1/$level2/";
#                 my $jobOrigin = $jobSource;
#                 # check if files are directly accessible (old format) else get subfolders as file and update path
#                 if ( ! -e "$jobSource/meta.json") {
#                     my @folders = read_dir($jobSource);
#                     if (!@folders) {
#                         next;
#                     }
#                     # Only use first subfolder for now TODO
#                     $jobSource = "$jobSource/".$folders[0];
#                 }
#                 # check if subfolder contains file, else remove source and skip
#                 if ( ! -e "$jobSource/meta.json") {
#                     # rmtree $jobOrigin;
#                     next;
#                 }

my $rawstr = read_file("$jobSource/meta.json");
my $json = decode_json($rawstr);
#                 my $rawstr = read_file("$jobSource/meta.json");
#                 my $json = decode_json($rawstr);

# NOTE Start meta.json iteration here
# my $random_number = int(rand(UPPERLIMIT)) + LOWERLIMIT;
# Set new startTime: Between 5 days and 1 day before now
#                 # NOTE Start meta.json iteration here
#                 # my $random_number = int(rand(UPPERLIMIT)) + LOWERLIMIT;
#                 # Set new startTime: Between 5 days and 1 day before now

# Remove id from attributes
$json->{startTime} = $epochtime - (int(rand(432000)) + 86400);
$json->{stopTime} = $json->{startTime} + $json->{duration};
#                 # Remove id from attributes
#                 $json->{startTime} = $epochtime - (int(rand(432000)) + 86400);
#                 $json->{stopTime} = $json->{startTime} + $json->{duration};

# Add starttime subfolder to target path
$jobTarget .= $json->{startTime};
#                 # Add starttime subfolder to target path
#                 $jobTarget .= $json->{startTime};

# target is not directory
if ( not -d $jobTarget ){
# print "Writing files\n";
# print "$cluster/$level1/$level2\n";
make_path($jobTarget);
#                 # target is not directory
#                 if ( not -d $jobTarget ){
#                     # print "Writing files\n";
#                     # print "$cluster/$level1/$level2\n";
#                     make_path($jobTarget);

my $outstr = encode_json($json);
write_file("$jobTarget/meta.json", $outstr);
#                     my $outstr = encode_json($json);
#                     write_file("$jobTarget/meta.json", $outstr);

my $datstr = read_file("$jobSource/data.json");
write_file("$jobTarget/data.json", $datstr);
} else {
# rmtree $jobSource;
}
}
}
}
}
print "Done for job-archive\n";
sleep(1);
#                     my $datstr = read_file("$jobSource/data.json.gz");
#                     write_file("$jobTarget/data.json.gz", $datstr);
#                 } else {
#                     # rmtree $jobSource;
#                 }
#             }
#         }
#     }
# }
# print "Done for job-archive\n";
# sleep(1);

## CHECKPOINTS
chomp(my $checkpointStart=`date --date 'TZ="Europe/Berlin" 0:00 7 days ago' +%s`);

misc/config.json | 77 (new file)
@@ -0,0 +1,77 @@
{
    "addr": "127.0.0.1:8080",
    "short-running-jobs-duration": 300,
    "archive": {
        "kind": "file",
        "path": "./var/job-archive"
    },
    "jwts": {
        "max-age": "2000h"
    },
    "db-driver": "mysql",
    "db": "root:root@tcp(0.0.0.0:3306)/ccbackend",
    "ldap": {
        "url": "ldap://0.0.0.0",
        "user_base": "ou=users,dc=example,dc=com",
        "search_dn": "cn=admin,dc=example,dc=com",
        "user_bind": "uid={username},ou=users,dc=example,dc=com",
        "user_filter": "(&(objectclass=posixAccount))",
        "syncUserOnLogin": true
    },
    "enable-resampling": {
        "trigger": 30,
        "resolutions": [600, 300, 120, 60]
    },
    "emission-constant": 317,
    "clusters": [
        {
            "name": "fritz",
            "metricDataRepository": {
                "kind": "cc-metric-store",
                "url": "http://0.0.0.0:8084",
                "token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"
            },
            "filterRanges": {
                "numNodes": { "from": 1, "to": 64 },
                "duration": { "from": 0, "to": 86400 },
                "startTime": { "from": "2022-01-01T00:00:00Z", "to": null }
            }
        },
        {
            "name": "alex",
            "metricDataRepository": {
                "kind": "cc-metric-store",
                "url": "http://0.0.0.0:8084",
                "token": "eyJ0eXAiOiJKV1QiLCJhbGciOiJFZERTQSJ9.eyJ1c2VyIjoiYWRtaW4iLCJyb2xlcyI6WyJST0xFX0FETUlOIiwiUk9MRV9BTkFMWVNUIiwiUk9MRV9VU0VSIl19.d-3_3FZTsadPjDEdsWrrQ7nS0edMAR4zjl-eK7rJU3HziNBfI9PDHDIpJVHTNN5E5SlLGLFXctWyKAkwhXL-Dw"
            },
            "filterRanges": {
                "numNodes": { "from": 1, "to": 64 },
                "duration": { "from": 0, "to": 86400 },
                "startTime": { "from": "2022-01-01T00:00:00Z", "to": null }
            }
        }
    ]
}

misc/curl_slurmrestd.sh | 3 (new executable file)
@@ -0,0 +1,3 @@
SLURM_JWT=$(cat data/slurm/secret/jwt_token.txt)
curl -X 'GET' -v 'http://localhost:6820/slurm/v0.0.39/node/node01' --location --silent --show-error -H "X-SLURM-USER-NAME: root" -H "X-SLURM-USER-TOKEN: $SLURM_JWT"
# curl -v --unix-socket data/slurm/tmp/slurmrestd.socket 'http://localhost:6820/slurm/v0.0.39/ping'

misc/jwt_verifier.py | 27 (new file)
@@ -0,0 +1,27 @@
#!/usr/bin/env python3
import sys
import os
import pprint
import json
import time
from datetime import datetime, timedelta, timezone

from jwt import JWT
from jwt.jwa import HS256
from jwt.jwk import jwk_from_dict
from jwt.utils import b64decode,b64encode

if len(sys.argv) != 2:
    sys.exit("verify_jwt.py [JWT Token]");

with open("data/slurm/secret/jwt_hs256.key", "rb") as f:
    priv_key = f.read()

signing_key = jwk_from_dict({
    'kty': 'oct',
    'k': b64encode(priv_key)
})

a = JWT()
b = a.decode(sys.argv[1], signing_key, algorithms=["HS256"])
print(b)
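
A usage sketch (run from the cc-docker root, since the script reads the signing key via the relative path `data/slurm/secret/jwt_hs256.key`):

```
python3 misc/jwt_verifier.py "$(cat data/slurm/secret/jwt_token.txt)"
```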
scripts/prerequisite_installation_script.sh | 40 (new file)
@@ -0,0 +1,40 @@
#!/bin/bash -l

sudo apt-get update
sudo apt-get upgrade -f -y

# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

sudo apt-get install -f -y gcc
sudo apt-get install -f -y npm
sudo apt-get install -f -y make
sudo apt-get install -f -y gh
sudo apt-get install -f -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo apt-get install -f -y docker-compose
sudo apt install perl -f -y libdatetime-perl libjson-perl
sudo apt-get install -f -y golang-go

sudo cpan Cpanel::JSON::XS
sudo cpan File::Slurp
sudo cpan Data::Dumper
sudo cpan Time::Piece
sudo cpan Sort::Versions

sudo groupadd docker
sudo usermod -aG docker ubuntu

sudo shutdown -r -t 0

setupDev.sh | 126
@@ -1,48 +1,42 @@
#!/bin/bash
echo ""
echo "|--------------------------------------------------------------------------------------|"
echo "| Welcome to the cc-docker automatic deployment script. |"
echo "| Make sure you have sudo rights to run docker services. |"
echo "| This script assumes that the docker command is added to the sudo group. |"
echo "| This means that docker commands do not explicitly require the |"
echo "| 'sudo' keyword to run. You can use the following commands: |"
echo "| |"
echo "| > sudo groupadd docker |"
echo "| > sudo usermod -aG docker $USER |"
echo "| |"
echo "| This will add docker to the sudo usergroup and all docker |"
echo "| commands will run as sudo by default without requiring the |"
echo "| 'sudo' keyword. |"
echo "|--------------------------------------------------------------------------------------|"
echo ""

# Check cc-backend, touch job.db if exists
# Check if cc-backend exists
if [ ! -d cc-backend ]; then
    echo "'cc-backend' not yet prepared! Please clone the cc-backend repository before starting this script."
    echo -n "Stopped."
    exit
else
    cd cc-backend
    if [ ! -d var ]; then
        mkdir var
        touch var/job.db
    else
        echo "'cc-backend/var' exists. Cautiously exiting."
        echo -n "Stopped."
        exit
    fi
fi

# Download unedited job-archive to ./data/job-archive-source
if [ ! -d data/job-archive-source ]; then
    cd data
    wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/job-archive-demo.tar
    tar xf job-archive-demo.tar
    mv ./job-archive ./job-archive-source
    rm ./job-archive-demo.tar
    cd ..
else
    echo "'data/job-archive-source' already exists!"
# Creates the data directory if it does not exist.
# It contains all the mount points required by all the docker services
# and their static files.
if [ ! -d data ]; then
    mkdir -m777 data
fi

# Download unedited checkpoint files to ./data/cc-metric-store-source/checkpoints
if [ ! -d data/cc-metric-store-source ]; then
    mkdir -p data/cc-metric-store-source/checkpoints
    cd data/cc-metric-store-source/checkpoints
    wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/cc-metric-store-checkpoints.tar.xz
    tar xf cc-metric-store-checkpoints.tar.xz
    rm cc-metric-store-checkpoints.tar.xz
    cd ../../../
else
    echo "'data/cc-metric-store-source' already exists!"
fi
# Invokes dataGenerationScript.sh, which then populates the static files
# required by the docker services. These files are needed by the services after startup.
chmod u+x dataGenerationScript.sh
./dataGenerationScript.sh

# Update timestamps
# Update timestamps for all the checkpoints in data/cc-metric-store-source
# and dump the new files in data/cc-metric-store.
perl ./migrateTimestamps.pl

# Create archive folder for rewritten ccms checkpoints
@@ -51,32 +45,54 @@ if [ ! -d data/cc-metric-store/archive ]; then
fi

# cleanup sources
# rm -r ./data/job-archive-source
# rm -r ./data/cc-metric-store-source

# prepare folders for influxdb2
if [ ! -d data/influxdb ]; then
    mkdir -p data/influxdb/data
    mkdir -p data/influxdb/config/influx-configs
else
    echo "'data/influxdb' already exists!"
if [ -d data/cc-metric-store-source ]; then
    rm -r data/cc-metric-store-source
fi

# Check dotenv-file and docker-compose-yml, copy accordingly if not present and build docker services
if [ ! -d .env ]; then
    cp templates/env.default ./.env
fi
# Just in case the user forgot to manually shut down the docker services.
docker-compose down
docker-compose down --remove-orphans

if [ ! -d docker-compose.yml ]; then
    cp templates/docker-compose.yml.default ./docker-compose.yml
fi
# This automatically builds the base docker image for slurm.
# All the slurm docker services in docker-compose.yml refer to
# the base image created from this directory.
cd slurm/base/
make
cd ../..

# Starts all the docker services from docker-compose.yml.
docker-compose build
./cc-backend/cc-backend --init-db --add-user demo:admin:AdminDev
docker-compose up -d

cd cc-backend
if [ ! -d var ]; then
    wget https://hpc-mover.rrze.uni-erlangen.de/HPC-Data/0x7b58aefb/eig7ahyo6fo2bais0ephuf2aitohv1ai/job-archive-demo.tar
    tar xf job-archive-demo.tar
    rm ./job-archive-demo.tar

    cp ./configs/env-template.txt .env
    cp -f ../misc/config.json config.json

    make

    ./cc-backend -migrate-db
    ./cc-backend --init-db --add-user demo:admin:demo
    cd ..
else
    cd ..
    echo "'cc-backend/var' exists. Cautiously exiting."
fi

echo ""
echo "|--------------------------------------------------------------------------------------|"
echo "| Check logs for each slurm service by using these commands: |"
echo "| docker-compose logs slurmctld |"
echo "| docker-compose logs slurmdbd |"
echo "| docker-compose logs slurmrestd |"
echo "| docker-compose logs node01 |"
echo "|======================================================================================|"
echo "| Setup complete, containers are up by default: Shut down with 'docker-compose down'. |"
echo "| Use './cc-backend/cc-backend -server' to start cc-backend. |"
echo "| Use scripts in /scripts to load data into influx or mariadb. |"
echo "|--------------------------------------------------------------------------------------|"
echo ""
echo "Setup complete, containers are up by default: Shut down with 'docker-compose down'."
echo "Use './cc-backend/cc-backend' to start cc-backend."
echo "Use scripts in /scripts to load data into influx or mariadb."
# ./cc-backend/cc-backend

@@ -1,41 +1,39 @@
FROM rockylinux:8
MAINTAINER Jan Eitzinger <jan.eitzinger@fau.de>
LABEL org.opencontainers.image.authors="jan.eitzinger@fau.de"

ENV SLURM_VERSION=22.05.6
ENV ARCH=aarch64
ENV SLURM_VERSION=24.05.3
ENV HTTP_PARSER_VERSION=2.8.0

RUN yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm -y
RUN yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
RUN ARCH=$(uname -m) && yum install -y https://rpmfind.net/linux/almalinux/8.10/PowerTools/x86_64/os/Packages/http-parser-devel-2.8.0-9.el8.$ARCH.rpm

RUN groupadd -g 981 munge \
    && useradd -m -c "MUNGE Uid 'N' Gid Emporium" -d /var/lib/munge -u 981 -g munge -s /sbin/nologin munge \
    && groupadd -g 982 slurm \
    && useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 982 -g slurm -s /bin/bash slurm \
    && groupadd -g 1000 worker \
    && useradd -m -c "Workflow user" -d /home/worker -u 1000 -g worker -s /bin/bash worker
    && groupadd -g 1000 slurm \
    && useradd -m -c "Slurm workload manager" -d /var/lib/slurm -u 1000 -g slurm -s /bin/bash slurm \
    && groupadd -g 982 worker \
    && useradd -m -c "Workflow user" -d /home/worker -u 982 -g worker -s /bin/bash worker

RUN yum install -y munge munge-libs
RUN dnf --enablerepo=powertools install munge-devel -y
RUN yum install rng-tools -y
RUN yum install -y munge munge-libs rng-tools \
    python3 gcc openssl openssl-devel \
    openssh-server openssh-clients dbus-devel \
    pam-devel numactl numactl-devel hwloc sudo \
    lua readline-devel ncurses-devel man2html \
    autoconf automake json-c-devel libjwt-devel \
    libibmad libibumad rpm-build perl-ExtUtils-MakeMaker.noarch rpm-build make wget

RUN yum install -y python3 gcc openssl openssl-devel \
    openssh-server openssh-clients dbus-devel \
    pam-devel numactl numactl-devel hwloc sudo \
    lua readline-devel ncurses-devel man2html \
    libibmad libibumad rpm-build perl-ExtUtils-MakeMaker.noarch rpm-build make wget
RUN dnf --enablerepo=powertools install -y munge-devel rrdtool-devel lua-devel hwloc-devel mariadb-server mariadb-devel

RUN dnf --enablerepo=powertools install rrdtool-devel lua-devel hwloc-devel rpm-build -y
RUN dnf install mariadb-server mariadb-devel -y
RUN mkdir /usr/local/slurm-tmp
RUN cd /usr/local/slurm-tmp
RUN wget https://download.schedmd.com/slurm/slurm-${SLURM_VERSION}.tar.bz2
RUN rpmbuild -ta slurm-${SLURM_VERSION}.tar.bz2
RUN mkdir -p /usr/local/slurm-tmp \
    && cd /usr/local/slurm-tmp \
    && wget https://download.schedmd.com/slurm/slurm-${SLURM_VERSION}.tar.bz2 \
    && rpmbuild -ta --with slurmrestd --with jwt slurm-${SLURM_VERSION}.tar.bz2

WORKDIR /root/rpmbuild/RPMS/${ARCH}
RUN yum -y --nogpgcheck localinstall \
    slurm-${SLURM_VERSION}-1.el8.${ARCH}.rpm \
    slurm-perlapi-${SLURM_VERSION}-1.el8.${ARCH}.rpm \
    slurm-slurmctld-${SLURM_VERSION}-1.el8.${ARCH}.rpm
WORKDIR /
RUN ARCH=$(uname -m) \
    && yum -y --nogpgcheck localinstall \
    /root/rpmbuild/RPMS/$ARCH/slurm-${SLURM_VERSION}*.$ARCH.rpm \
    /root/rpmbuild/RPMS/$ARCH/slurm-perlapi-${SLURM_VERSION}*.$ARCH.rpm \
    /root/rpmbuild/RPMS/$ARCH/slurm-slurmctld-${SLURM_VERSION}*.$ARCH.rpm

VOLUME ["/home", "/.secret"]
# 22: SSH
@@ -43,4 +41,5 @@ VOLUME ["/home", "/.secret"]
# 6817: SlurmCtlD
# 6818: SlurmD
# 6819: SlurmDBD
EXPOSE 22 6817 6818 6819
# 6820: SlurmRestD
EXPOSE 22 6817 6818 6819 6820

@@ -1,6 +1,6 @@
include ../../.env
IMAGE = clustercockpit/slurm.base

SLURM_VERSION = 24.05.3
.PHONY: build clean

build:

@ -1,5 +1,5 @@
FROM clustercockpit/slurm.base:22.05.6
MAINTAINER Jan Eitzinger <jan.eitzinger@fau.de>
FROM clustercockpit/slurm.base:24.05.3
LABEL org.opencontainers.image.authors="jan.eitzinger@fau.de"

# clean up
RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
@ -7,4 +7,5 @@ RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
&& rm -rf /var/cache/yum

COPY docker-entrypoint.sh /docker-entrypoint.sh
CMD ["/usr/sbin/init"]
ENTRYPOINT ["/docker-entrypoint.sh"]
@ -1,6 +1,26 @@
#!/usr/bin/env bash
set -e

# Determine the system architecture dynamically
ARCH=$(uname -m)
SLURM_VERSION="24.05.3"
SLURM_JWT=daemon
SLURMRESTD_SECURITY=disable_user_check

_delete_secrets() {
if [ -f /.secret/munge.key ]; then
echo "Removing secrets"
sudo rm -rf /.secret/munge.key
sudo rm -rf /.secret/worker-secret.tar.gz
sudo rm -rf /.secret/setup-worker-ssh.sh
sudo rm -rf /.secret/jwt_hs256.key
sudo rm -rf /.secret/jwt_token.txt

echo "Done removing secrets"
ls /.secret/
fi
}

# start sshd server
_sshd_host() {
if [ ! -d /var/run/sshd ]; then
@ -17,7 +37,7 @@ _ssh_worker() {
mkdir -p /home/worker
chown -R worker:worker /home/worker
fi
cat > /home/worker/setup-worker-ssh.sh <<EOF2
cat >/home/worker/setup-worker-ssh.sh <<EOF2
mkdir -p ~/.ssh
chmod 0700 ~/.ssh
ssh-keygen -b 2048 -t rsa -f ~/.ssh/id_rsa -q -N "" -C "$(whoami)@$(hostname)-$(date -I)"
@ -52,7 +72,7 @@ _munge_start() {
/usr/sbin/create-munge-key -r -f
sh -c "dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key"
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
chmod 600 /etc/munge/munge.key
sudo -u munge /sbin/munged
munge -n
munge -n | unmunge
@ -61,6 +81,10 @@ _munge_start() {

# copy secrets to /.secret directory for other nodes
_copy_secrets() {
while [ ! -f /home/worker/worker-secret.tar.gz ]; do
echo -n "."
sleep 1
done
cp /home/worker/worker-secret.tar.gz /.secret/worker-secret.tar.gz
cp /home/worker/setup-worker-ssh.sh /.secret/setup-worker-ssh.sh
cp /etc/munge/munge.key /.secret/munge.key
@ -68,24 +92,86 @@ _copy_secrets() {
rm -f /home/worker/setup-worker-ssh.sh
}

_openssl_jwt_key() {

mkdir -p /var/spool/slurm/statesave
dd if=/dev/random of=/var/spool/slurm/statesave/jwt_hs256.key bs=32 count=1
chown slurm:slurm /var/spool/slurm/statesave/jwt_hs256.key
chmod 0600 /var/spool/slurm/statesave/jwt_hs256.key
chown slurm:slurm /var/spool/slurm/statesave
chmod 0755 /var/spool/slurm/statesave
cp /var/spool/slurm/statesave/jwt_hs256.key /.secret/jwt_hs256.key
chmod 777 /.secret/jwt_hs256.key
}
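Note: Slurm's auth/jwt plugin expects a raw HS256 key, and `dd ... bs=32 count=1` writes exactly 32 random bytes. A one-line sanity check (sketch):

    # should print 32
    wc -c < /var/spool/slurm/statesave/jwt_hs256.key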

_generate_jwt_token() {

secret_key=$(cat /var/spool/slurm/statesave/jwt_hs256.key)
start_time=$(date +%s)
exp_time=$((start_time + 100000000))
base64url() {
# Don't wrap, make URL-safe, delete trailer.
base64 -w 0 | tr '+/' '-_' | tr -d '='
}

jwt_header=$(echo -n '{"alg":"HS256","typ":"JWT"}' | base64url)

jwt_claims=$(cat <<EOF |
{
"sun": "root",
"exp": $exp_time,
"iat": $start_time
}
EOF
jq -Mcj '.' | base64url)
# jq -Mcj => Monochrome output, compact output, join lines

jwt_signature=$(echo -n "${jwt_header}.${jwt_claims}" |
openssl dgst -sha256 -hmac "$secret_key" -binary | base64url)

# Use the same colours as jwt.io, more-or-less.
echo "$(tput setaf 1)${jwt_header}$(tput sgr0).$(tput setaf 5)${jwt_claims}$(tput sgr0).$(tput setaf 6)${jwt_signature}$(tput sgr0)"

jwt="${jwt_header}.${jwt_claims}.${jwt_signature}"

echo $jwt | cat >/.secret/jwt_token.txt
chmod 777 /.secret/jwt_token.txt
}
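Note: the token written to /.secret/jwt_token.txt can be exercised against slurmrestd on port 6820 once the stack is up. A minimal smoke test, assuming the port is reachable from the host and jq is installed:

    SLURM_JWT=$(cat /.secret/jwt_token.txt)
    curl -s -H "X-SLURM-USER-NAME: root" \
         -H "X-SLURM-USER-TOKEN: ${SLURM_JWT}" \
         http://localhost:6820/slurm/v0.0.39/diag | jq .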

# run slurmctld
_slurmctld() {
cd /root/rpmbuild/RPMS/aarch64
yum -y --nogpgcheck localinstall slurm-22.05.6-1.el8.aarch64.rpm \
slurm-perlapi-22.05.6-1.el8.aarch64.rpm \
slurm-slurmd-22.05.6-1.el8.aarch64.rpm \
slurm-torque-22.05.6-1.el8.aarch64.rpm \
slurm-slurmctld-22.05.6-1.el8.aarch64.rpm
cd /root/rpmbuild/RPMS/$ARCH

yum -y --nogpgcheck localinstall slurm-$SLURM_VERSION*.$ARCH.rpm \
slurm-perlapi-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmd-$SLURM_VERSION*.$ARCH.rpm \
slurm-torque-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmctld-$SLURM_VERSION*.$ARCH.rpm
echo "checking for slurmdbd.conf"
while [ ! -f /.secret/slurmdbd.conf ]; do
echo -n "."
echo "."
sleep 1
done
echo ""
mkdir -p /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm /etc/slurm
chown -R slurm: /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
mkdir -p /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm /etc/slurm /var/run/slurm/d /var/run/slurm/ctld /var/lib/slurm/d /var/lib/slurm/ctld
chown -R slurm: /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm /var/spool /var/lib /var/run/slurm/d /var/run/slurm/ctld /var/lib/slurm/d /var/lib/slurm/ctld
mkdir -p /etc/config
chown -R slurm: /etc/config

touch /var/log/slurmctld.log
chown slurm: /var/log/slurmctld.log
chown -R slurm: /var/log/slurmctld.log
touch /var/log/slurmd.log
chown -R slurm: /var/log/slurmd.log

touch /var/lib/slurm/d/job_state
chown -R slurm: /var/lib/slurm/d/job_state
touch /var/lib/slurm/d/fed_mgr_state
chown -R slurm: /var/lib/slurm/d/fed_mgr_state
touch /var/run/slurm/d/slurmctld.pid
chown -R slurm: /var/run/slurm/d/slurmctld.pid
touch /var/run/slurm/d/slurmd.pid
chown -R slurm: /var/run/slurm/d/slurmd.pid

if [[ ! -f /home/config/slurm.conf ]]; then
echo "### Missing slurm.conf ###"
exit
@ -95,15 +181,43 @@ _slurmctld() {
chown slurm: /etc/slurm/slurm.conf
chmod 600 /etc/slurm/slurm.conf
fi
sacctmgr -i add cluster "snowflake"

sudo yum install -y nc
sudo yum install -y procps
sudo yum install -y iputils
sudo yum install -y lsof
sudo yum install -y jq

_openssl_jwt_key

if [ ! -f /.secret/jwt_hs256.key ]; then
echo "### Missing jwt.key ###"
exit 1
else
cp /.secret/jwt_hs256.key /etc/config/jwt_hs256.key
chown slurm: /etc/config/jwt_hs256.key
chmod 0600 /etc/config/jwt_hs256.key
fi

_generate_jwt_token

while ! nc -z slurmdbd 6819; do
echo "Waiting for slurmdbd to be ready..."
sleep 2
done

sacctmgr -i add cluster name=linux
sleep 2s
echo "Starting slurmctld"
cp -f /etc/slurm/slurm.conf /.secret/
/usr/sbin/slurmctld
/usr/sbin/slurmctld -Dvv
echo "Started slurmctld"
}

### main ###
_delete_secrets
_sshd_host

_ssh_worker
_munge_start
_copy_secrets

108
slurm/controller/slurm.conf
Normal file
@ -0,0 +1,108 @@
# slurm.conf
#
# See the slurm.conf man page for more information.
#
ClusterName=linux
ControlMachine=slurmctld
ControlAddr=slurmctld
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/lib/slurm/d
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurm/d/slurmctld.pid
SlurmdPidFile=/var/run/slurm/d/slurmd.pid
ProctrackType=proctrack/linuxproc
AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/var/spool/slurm/statesave/jwt_hs256.key
#PluginDir=
#CacheGroups=0
#FirstJobId=
ReturnToService=0
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
TaskPlugin=task/affinity
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
#SchedulerPort=
#SchedulerRootFilter=
# SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
# FastSchedule=1
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=6
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=6
SlurmdLogFile=/var/log/slurm/slurmd.log
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/jobcomp.log
#
# ACCOUNTING
JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherType=jobacct_gather/cgroup
#ProctrackType=proctrack/cgroup

JobAcctGatherFrequency=30
#
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdbd
AccountingStoragePort=6819
#AccountingStorageLoc=slurm_acct_db
#AccountingStoragePass=
#AccountingStorageUser=
#

# COMPUTE NODES
PartitionName=DEFAULT Nodes=node01
PartitionName=debug Nodes=node01 Default=YES MaxTime=INFINITE State=UP

# # COMPUTE NODES
# NodeName=c[1-2] RealMemory=1000 State=UNKNOWN
NodeName=node01 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1

# #
# # PARTITIONS
# PartitionName=normal Default=yes Nodes=c[1-2] Priority=50 DefMemPerCPU=500 Shared=NO MaxNodes=2 MaxTime=5-00:00:00 DefaultTime=5-00:00:00 State=UP

#PrEpPlugins=pika
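Note: once slurmctld is running with this file, the JWT and node settings can be checked from inside the controller container with standard Slurm tools (a sketch):

    # confirm auth/jwt is active and node01 registered
    scontrol show config | grep -i AuthAlt
    sinfo -N -l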
@ -1,5 +1,5 @@
FROM clustercockpit/slurm.base:22.05.6
MAINTAINER Jan Eitzinger <jan.eitzinger@fau.de>
FROM clustercockpit/slurm.base:24.05.3
LABEL org.opencontainers.image.authors="jan.eitzinger@fau.de"

# clean up
RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
@ -7,4 +7,5 @@ RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
&& rm -rf /var/cache/yum

COPY docker-entrypoint.sh /docker-entrypoint.sh
CMD ["/usr/sbin/init"]
ENTRYPOINT ["/docker-entrypoint.sh"]
@ -1,6 +1,10 @@
#!/usr/bin/env bash
set -e

# Determine the system architecture dynamically
ARCH=$(uname -m)
SLURM_VERSION="24.05.3"
SLURM_JWT=daemon
SLURM_ACCT_DB_SQL=/slurm_acct_db.sql

# start sshd server
@ -48,12 +52,16 @@ _wait_for_worker() {

# run slurmdbd
_slurmdbd() {
cd /root/rpmbuild/RPMS/aarch64
yum -y --nogpgcheck localinstall slurm-22.05.6-1.el8.aarch64.rpm \
slurm-perlapi-22.05.6-1.el8.aarch64.rpm \
slurm-slurmdbd-22.05.6-1.el8.aarch64.rpm
cd /root/rpmbuild/RPMS/$ARCH
yum -y --nogpgcheck localinstall slurm-$SLURM_VERSION*.$ARCH.rpm \
slurm-perlapi-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmdbd-$SLURM_VERSION*.$ARCH.rpm
mkdir -p /var/spool/slurm/d /var/log/slurm /etc/slurm
chown slurm: /var/spool/slurm/d /var/log/slurm
chown -R slurm: /var/spool/slurm/d /var/log/slurm

mkdir -p /etc/config
chown -R slurm: /etc/config

if [[ ! -f /home/config/slurmdbd.conf ]]; then
echo "### Missing slurmdbd.conf ###"
exit
@ -62,10 +70,31 @@ _slurmdbd() {
cp /home/config/slurmdbd.conf /etc/slurm/slurmdbd.conf
chown slurm: /etc/slurm/slurmdbd.conf
chmod 600 /etc/slurm/slurmdbd.conf
fi
echo "Starting slurmdbd"
cp /etc/slurm/slurmdbd.conf /.secret/slurmdbd.conf
/usr/sbin/slurmdbd
fi

echo "checking for jwt.key"
while [ ! -f /.secret/jwt_hs256.key ]; do
echo "."
sleep 1
done

mkdir -p /var/spool/slurm/statesave
chown slurm:slurm /var/spool/slurm/statesave
chmod 0755 /var/spool/slurm/statesave
cp /.secret/jwt_hs256.key /var/spool/slurm/statesave/jwt_hs256.key
chown slurm: /var/spool/slurm/statesave/jwt_hs256.key
chmod 0600 /var/spool/slurm/statesave/jwt_hs256.key

echo ""

sudo yum install -y nc
sudo yum install -y procps
sudo yum install -y iputils

echo "Starting slurmdbd"
/usr/sbin/slurmdbd -Dvv
echo "Started slurmdbd"
}

### main ###
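Note: after slurmdbd is up, the cluster added by the controller (`sacctmgr -i add cluster name=linux`) should be visible through the accounting layer. A quick check, assuming munge is running in the container:

    sacctmgr show cluster format=Cluster,ControlHost,ControlPort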
@ -1,3 +1,8 @@
#
# Example slurmdbd.conf file.
#
# See the slurmdbd.conf man page for more information.
#
# Archive info
#ArchiveJobs=yes
#ArchiveDir="/tmp"
@ -8,16 +13,19 @@
#
# Authentication info
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
#
#AuthInfo=/var/run/munge/munge.socket.2
AuthAltTypes=auth/jwt
AuthAltParameters=jwt_key=/var/spool/slurm/statesave/jwt_hs256.key
# slurmDBD info
DbdAddr=slurmdb
DbdHost=slurmdb
DbdAddr=slurmdbd
DbdHost=slurmdbd
DbdPort=6819
SlurmUser=slurm
#MessageTimeout=300
DebugLevel=4
#DefaultQOS=normal,standby
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
# PidFile=/var/run/slurmdbd/slurmdbd.pid
#PluginDir=/usr/lib/slurm
#PrivateData=accounts,users,usage,jobs
#TrackWCKey=yes
@ -25,7 +33,6 @@ PidFile=/var/run/slurmdbd.pid
# Database info
StorageType=accounting_storage/mysql
StorageHost=mariadb
StoragePort=3306
StoragePass=demo
StorageUser=slurm
StoragePass=demo
StorageLoc=slurm_acct_db
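Note: these Storage* values must match the MariaDB service from the compose setup (host mariadb, port 3306). A hedged connectivity check, assuming a mysql client is available in the container:

    # lists the accounting tables once slurmdbd has created them
    mysql -h mariadb -P 3306 -u slurm -pdemo slurm_acct_db -e 'SHOW TABLES;'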
@ -1,5 +1,5 @@
FROM clustercockpit/slurm.base:22.05.6
MAINTAINER Jan Eitzinger <jan.eitzinger@fau.de>
FROM clustercockpit/slurm.base:24.05.3
LABEL org.opencontainers.image.authors="jan.eitzinger@fau.de"

# clean up
RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
@ -7,4 +7,5 @@ RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
&& rm -rf /var/cache/yum

COPY docker-entrypoint.sh /docker-entrypoint.sh
CMD ["/usr/sbin/init"]
ENTRYPOINT ["/docker-entrypoint.sh"]
@ -1,6 +1,12 @@
#!/usr/bin/env bash
set -e

# Determine the system architecture dynamically
ARCH=$(uname -m)
SLURM_VERSION="24.05.3"
# SLURMRESTD="/tmp/slurmrestd.socket"
SLURM_JWT=daemon

# start sshd server
_sshd_host() {
if [ ! -d /var/run/sshd ]; then
@ -10,99 +16,127 @@ _sshd_host() {
/usr/sbin/sshd
}

# setup worker ssh to be passwordless
_ssh_worker() {
if [[ ! -d /home/worker ]]; then
mkdir -p /home/worker
chown -R worker:worker /home/worker
# start munge using existing key
_munge_start_using_key() {
if [ ! -f /.secret/munge.key ]; then
echo -n "checking for munge.key"
while [ ! -f /.secret/munge.key ]; do
echo -n "."
sleep 1
done
echo ""
fi
cat > /home/worker/setup-worker-ssh.sh <<EOF2
mkdir -p ~/.ssh
chmod 0700 ~/.ssh
ssh-keygen -b 2048 -t rsa -f ~/.ssh/id_rsa -q -N "" -C "$(whoami)@$(hostname)-$(date -I)"
cat ~/.ssh/id_rsa.pub > ~/.ssh/authorized_keys
chmod 0640 ~/.ssh/authorized_keys
cat >> ~/.ssh/config <<EOF
Host *
StrictHostKeyChecking no
UserKnownHostsFile /dev/null
LogLevel QUIET
EOF
chmod 0644 ~/.ssh/config
cd ~/
tar -czvf ~/worker-secret.tar.gz .ssh
cd -
EOF2
chmod +x /home/worker/setup-worker-ssh.sh
chown worker: /home/worker/setup-worker-ssh.sh
sudo -u worker /home/worker/setup-worker-ssh.sh
}

# start munge and generate key
_munge_start() {
cp /.secret/munge.key /etc/munge/munge.key
chown -R munge: /etc/munge /var/lib/munge /var/log/munge /var/run/munge
chmod 0700 /etc/munge
chmod 0711 /var/lib/munge
chmod 0700 /var/log/munge
chmod 0755 /var/run/munge
/sbin/create-munge-key -f
rngd -r /dev/urandom
/usr/sbin/create-munge-key -r -f
sh -c "dd if=/dev/urandom bs=1 count=1024 > /etc/munge/munge.key"
chown munge: /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
sudo -u munge /sbin/munged
munge -n
munge -n | unmunge
remunge
}

# copy secrets to /.secret directory for other nodes
_copy_secrets() {
cp /home/worker/worker-secret.tar.gz /.secret/worker-secret.tar.gz
cp /home/worker/setup-worker-ssh.sh /.secret/setup-worker-ssh.sh
cp /etc/munge/munge.key /.secret/munge.key
rm -f /home/worker/worker-secret.tar.gz
rm -f /home/worker/setup-worker-ssh.sh
_enable_slurmrestd() {

cat >/usr/lib/systemd/system/slurmrestd.service <<EOF
[Unit]
Description=Slurm REST daemon
After=network-online.target slurmctld.service
Wants=network-online.target
ConditionPathExists=/etc/slurm/slurm.conf

[Service]
Type=simple
EnvironmentFile=-/etc/sysconfig/slurmrestd
EnvironmentFile=-/etc/default/slurmrestd
# slurmrestd should not run as root or the slurm user.
# Please either use the -u and -g options in /etc/sysconfig/slurmrestd or
# /etc/default/slurmrestd, or explicitly set the User and Group in this file
# to an unprivileged user to run as.
User=slurm
Restart=always
RestartSec=5
# Group=
# Default to listen on both socket and slurmrestd port
ExecStart=/usr/sbin/slurmrestd -f /etc/config/slurmrestd.conf -a rest_auth/jwt $SLURMRESTD_OPTIONS -vvvvvv -s dbv0.0.39,v0.0.39 0.0.0.0:6820
# Enable auth/jwt by default; comment out the line to disable it for slurmrestd
Environment="SLURM_JWT=daemon"
ExecReload=/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

EOF
}
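Note: this unit file is written but, as the commented-out `ln -s` further down shows, slurmrestd is ultimately started directly rather than via systemd. If the unit route were used, wiring it in would look roughly like this (sketch, assuming systemd is PID 1 in the container):

    systemctl daemon-reload
    systemctl enable --now slurmrestd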

# run slurmctld
_slurmctld() {
cd /root/rpmbuild/RPMS/aarch64
yum -y --nogpgcheck localinstall slurm-22.05.6-1.el8.aarch64.rpm \
slurm-perlapi-22.05.6-1.el8.aarch64.rpm \
slurm-slurmd-22.05.6-1.el8.aarch64.rpm \
slurm-torque-22.05.6-1.el8.aarch64.rpm \
slurm-slurmctld-22.05.6-1.el8.aarch64.rpm \
slurm-slurmrestd-22.05.6-1.el8.aarch64.rpm
# run slurmrestd
_slurmrestd() {
cd /root/rpmbuild/RPMS/$ARCH
yum -y --nogpgcheck localinstall slurm-$SLURM_VERSION*.$ARCH.rpm \
slurm-perlapi-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmd-$SLURM_VERSION*.$ARCH.rpm \
slurm-torque-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmctld-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmrestd-$SLURM_VERSION*.$ARCH.rpm

echo -n "checking for slurmdbd.conf"
while [ ! -f /.secret/slurmdbd.conf ]; do
echo -n "."
sleep 1
done
echo ""
mkdir -p /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm /etc/slurm
chown -R slurm: /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
touch /var/log/slurmctld.log
chown slurm: /var/log/slurmctld.log
if [[ ! -f /home/config/slurm.conf ]]; then

mkdir -p /etc/config /var/spool/slurm /var/spool/slurm/restd /var/spool/slurm/restd/rest /var/run/slurm
chown -R slurm: /etc/config /var/spool/slurm /var/spool/slurm/restd /var/spool/slurm/restd/rest /var/run/slurm
chmod 755 /var/run/slurm

touch /var/log/slurmrestd.log
chown slurm: /var/log/slurmrestd.log

if [[ ! -f /home/config/slurmrestd.conf ]]; then
echo "### Missing slurm.conf ###"
exit
else
echo "### use provided slurm.conf ###"
cp /home/config/slurm.conf /etc/slurm/slurm.conf
echo "### use provided slurmrestd.conf ###"
cp /home/config/slurmrestd.conf /etc/config/slurmrestd.conf
cp /home/config/slurm.conf /etc/config/slurm.conf
fi
sacctmgr -i add cluster "snowflake"

echo "checking for jwt.key"
while [ ! -f /.secret/jwt_hs256.key ]; do
echo "."
sleep 1
done

sudo yum install -y nc
sudo yum install -y procps
sudo yum install -y iputils
sudo yum install -y lsof
sudo yum install -y socat

mkdir -p /var/spool/slurm/statesave
chown slurm:slurm /var/spool/slurm/statesave
chmod 0755 /var/spool/slurm/statesave
cp /.secret/jwt_hs256.key /var/spool/slurm/statesave/jwt_hs256.key
chown slurm: /var/spool/slurm/statesave/jwt_hs256.key
chmod 0400 /var/spool/slurm/statesave/jwt_hs256.key

echo ""

sleep 2s
/usr/sbin/slurmctld
cp -f /etc/slurm/slurm.conf /.secret/
echo "Starting slurmrestd"
# _enable_slurmrestd
# sudo ln -s /usr/lib/systemd/system/slurmrestd.service /etc/systemd/system/multi-user.target.wants/slurmrestd.service

SLURMRESTD_SECURITY=disable_user_check SLURMRESTD_DEBUG=9 /usr/sbin/slurmrestd -f /etc/config/slurmrestd.conf -a rest_auth/jwt -s dbv0.0.39,v0.0.39 -u slurm 0.0.0.0:6820
echo "Started slurmrestd"
}
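Note: a minimal liveness probe for the REST daemon started above, assuming the controller has already written a token to /.secret (sketch):

    curl -s -H "X-SLURM-USER-NAME: root" \
         -H "X-SLURM-USER-TOKEN: $(cat /.secret/jwt_token.txt)" \
         http://localhost:6820/slurm/v0.0.39/ping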

### main ###
_sshd_host
_ssh_worker
_munge_start
_copy_secrets
_slurmctld
_munge_start_using_key
_slurmrestd

tail -f /dev/null
4
slurm/rest/slurmrestd.conf
Normal file
@ -0,0 +1,4 @@
#
# Example slurmrestd.conf file.
#
include /etc/config/slurm.conf
@ -1,5 +1,5 @@
FROM clustercockpit/slurm.base:22.05.6
MAINTAINER Jan Eitzinger <jan.eitzinger@fau.de>
FROM clustercockpit/slurm.base:24.05.3
LABEL org.opencontainers.image.authors="jan.eitzinger@fau.de"

# clean up
RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \
@ -8,4 +8,5 @@ RUN rm -f /root/rpmbuild/RPMS/slurm-*.rpm \

WORKDIR /home/worker
COPY docker-entrypoint.sh /docker-entrypoint.sh
CMD ["/usr/sbin/init"]
ENTRYPOINT ["/docker-entrypoint.sh"]
@ -1,3 +1,4 @@
CgroupPlugin=disabled
ConstrainCores=yes
ConstrainDevices=no
ConstrainRAMSpace=yes
@ -1,6 +1,10 @@
#!/usr/bin/env bash
set -e

# Determine the system architecture dynamically
ARCH=$(uname -m)
SLURM_VERSION="24.05.3"

# start sshd server
_sshd_host() {
if [ ! -d /var/run/sshd ]; then
@ -12,6 +16,10 @@ _sshd_host() {

# start munge using existing key
_munge_start_using_key() {
sudo yum install -y nc
sudo yum install -y procps
sudo yum install -y iputils

echo -n "checking for munge.key"
while [ ! -f /.secret/munge.key ]; do
echo -n "."
@ -32,49 +40,67 @@ _munge_start_using_key() {

# wait for worker user in shared /home volume
_wait_for_worker() {
echo "checking for id_rsa.pub"
if [ ! -f /home/worker/.ssh/id_rsa.pub ]; then
echo -n "checking for id_rsa.pub"
echo "checking for id_rsa.pub"
while [ ! -f /home/worker/.ssh/id_rsa.pub ]; do
echo -n "."
sleep 1
done
echo ""
fi
echo "done checking for id_rsa.pub"

}

_start_dbus() {
dbus-uuidgen > /var/lib/dbus/machine-id
dbus-uuidgen >/var/lib/dbus/machine-id
mkdir -p /var/run/dbus
dbus-daemon --config-file=/usr/share/dbus-1/system.conf --print-address
}

# run slurmd
_slurmd() {
cd /root/rpmbuild/RPMS/aarch64
yum -y --nogpgcheck localinstall slurm-22.05.6-1.el8.aarch64.rpm \
slurm-perlapi-22.05.6-1.el8.aarch64.rpm \
slurm-slurmd-22.05.6-1.el8.aarch64.rpm \
slurm-torque-22.05.6-1.el8.aarch64.rpm
cd /root/rpmbuild/RPMS/$ARCH
yum -y --nogpgcheck localinstall slurm-$SLURM_VERSION*.$ARCH.rpm \
slurm-perlapi-$SLURM_VERSION*.$ARCH.rpm \
slurm-slurmd-$SLURM_VERSION*.$ARCH.rpm \
slurm-torque-$SLURM_VERSION*.$ARCH.rpm

echo "checking for slurm.conf"
if [ ! -f /.secret/slurm.conf ]; then
echo -n "checking for slurm.conf"
echo "checking for slurm.conf"
while [ ! -f /.secret/slurm.conf ]; do
echo -n "."
sleep 1
done
echo ""
fi
mkdir -p /var/spool/slurm/d /etc/slurm
chown slurm: /var/spool/slurm/d
echo "found slurm.conf"

# sudo yum install -y nc
# sudo yum install -y procps
# sudo yum install -y iputils

mkdir -p /var/spool/slurm/d /etc/slurm /var/run/slurm/d /var/log/slurm
chown slurm: /var/spool/slurm/d /var/run/slurm/d /var/log/slurm
cp /home/config/cgroup.conf /etc/slurm/cgroup.conf
chown slurm: /etc/slurm/cgroup.conf
chmod 600 /etc/slurm/cgroup.conf
cp /home/config/slurm.conf /etc/slurm/slurm.conf
chown slurm: /etc/slurm/slurm.conf
chmod 600 /etc/slurm/slurm.conf
touch /var/log/slurmd.log
chown slurm: /var/log/slurmd.log
echo -n "Starting slurmd"
/usr/sbin/slurmd
touch /var/log/slurm/slurmd.log
chown slurm: /var/log/slurm/slurmd.log

touch /var/run/slurm/d/slurmd.pid
chmod 600 /var/run/slurm/d/slurmd.pid
chown slurm: /var/run/slurm/d/slurmd.pid

echo "Starting slurmd"
/usr/sbin/slurmstepd infinity &
/usr/sbin/slurmd -Dvv
echo "Started slurmd"
}

### main ###
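Note: once slurmd registers, node01 should appear in the debug partition defined in slurm.conf. A quick check from any container with the Slurm client tools (sketch):

    # node01 should show up as idle
    sinfo -N -l
    scontrol show node node01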