Joachim Meyer
e38e6fa5b9
Don't start on JobStatus change & stop on RUNNING -> IDLE.
This handles edge cases where a startd died and the job is requeued after a timeout of 1 hour.
The IDLE -> RUNNING transition then already has all other attributes (like RemoteHost) set from the last execution.
We should wait for `CpusProvisioned`, though, so that all required values are up to date.
Additionally, since timed-out jobs are not held but just requeued, also stop jobs when the state changes from `RUNNING` to `IDLE`.
2022-12-19 11:24:24 +01:00
Joachim Meyer
6128b58cbd
Fix AssignedGPUs parsing.
2022-12-16 13:42:04 +01:00
Joachim Meyer
80619b6154
Limit verbosity a bit (SCHEDD_DEBUG=D_FULLDEBUG).
2022-12-16 09:35:10 +01:00
Joachim Meyer
b530d9034e
Restructure.
2022-12-15 16:20:26 +01:00
Joachim Meyer
a3ca962d84
Add Schedd plugin to synch with CC.
This should be much more reliable, albeit more prone to crashing an HTCondor component (the schedd) if there's a bug.
2022-12-15 16:13:45 +01:00
Joachim Meyer
d83f263dba
Also stop jobs that ended with shadow exception.
2022-11-29 15:48:05 +01:00
Joachim Meyer
6175affa55
ULOG_JOB_RECONNECT_FAILED is a stop reason.
2022-11-16 13:07:01 +01:00
Joachim Meyer
09334ab4f1
Offset ArrayJobId with submitnode id.
2022-11-09 14:27:53 +01:00
Joachim Meyer
9571f3cda6
Don't check outdated cc job data.
2022-11-09 11:47:59 +01:00
Joachim Meyer
9e96f65977
Handle held / requeued jobs.
Requires cc-backend patch proposed in:
https://github.com/ClusterCockpit/cc-backend/issues/30
(Upstream treats startTime values as non-unique if they fall within the same 24 hours.)
2022-11-09 10:30:04 +01:00
Joachim Meyer
21cdece420
Use value from actually used schema.
2022-11-08 17:45:47 +01:00
Joachim Meyer
253784d94d
If no ToE, use eventtime.
2022-11-08 17:40:49 +01:00
Joachim Meyer
35c6ee3b47
Disable event 4.
2022-11-08 17:27:51 +01:00
Joachim Meyer
fe641ca357
Fix file name
2022-11-08 16:47:59 +01:00
Joachim Meyer
308df9907e
Start revamping to use the HTCondor EventLog instead of Slurm
2022-11-04 16:25:48 +01:00
Michael Schwarz
c9aa4095fe
Add systemd service and timer to start this script every minute
2022-09-06 15:02:28 +02:00
Michael Schwarz
57593358a2
Ignore tagged jobs
2022-08-30 11:00:25 +02:00
Michael Schwarz
631ed6c8b6
Small bugfix: there might be failed jobs without a step
2022-08-30 11:00:04 +02:00
Michael Schwarz
483bc0da1d
Fix some layout issues in Readme.md
2022-08-25 16:09:22 +02:00
Michael Schwarz
54fbc4fa93
Initial commit
2022-08-25 15:38:06 +02:00
oscarminus
84d49f2807
Initial commit
2022-08-25 15:30:56 +02:00