25 Commits

Author SHA1 Message Date
Joachim Meyer 4259611495 GPU map needs all 8 leading 0s.. 2023-02-08 09:10:25 +01:00
Joachim Meyer 063713d18c We have the gpu map for that... :/ 2023-02-08 08:21:01 +01:00
Joachim Meyer 4c22c4c6eb Need all 8 zeros for cc-metric-collector. 2023-02-07 17:01:34 +01:00
Joachim Meyer 019bfa5ee9 JobCurrentStartDate & EnteredCurrentStatus differ.
When a job starts, the two timestamps differ by about 1 s. A JobCurrentStartDate read at the end therefore does not match the EnteredCurrentStatus recorded at the start, and CC fails to stop the right job.
2022-12-20 17:15:01 +01:00
Joachim Meyer e38e6fa5b9 Don't start on JobStatus change & stop on R -> IDLE.
This handles edge cases where a startd died and the job is requeued after a timeout of 1 hr.
The IDLE -> RUNNING transition then already has all other attributes (like RemoteHost) set from the last execution.
We should wait for `CpusProvisioned`, though, so that all required values are already up-to-date.

Additionally, since timed-out jobs are not held, but instead just requeued, stop jobs as well when the state changes from `RUNNING` to `IDLE`.
2022-12-19 11:24:24 +01:00
Joachim Meyer 6128b58cbd Fix AssignedGPUs parsing. 2022-12-16 13:42:04 +01:00
Joachim Meyer 80619b6154 Limit verbosity a bit (SCHEDD_DEBUG=D_FULLDEBUG). 2022-12-16 09:35:10 +01:00
Joachim Meyer b530d9034e Restructure. 2022-12-15 16:20:26 +01:00
Joachim Meyer a3ca962d84 Add Schedd plugin to sync with CC.
This should be much more reliable, although a bug in it is more likely to crash an HTCondor component (the schedd)...
2022-12-15 16:13:45 +01:00
Joachim Meyer d83f263dba Also stop jobs that ended with shadow exception. 2022-11-29 15:48:05 +01:00
Joachim Meyer 6175affa55 ULOG_JOB_RECONNECT_FAILED is a stop reason. 2022-11-16 13:07:01 +01:00
Joachim Meyer 09334ab4f1 Offset ArrayJobId with submitnode id.. 2022-11-09 14:27:53 +01:00
Joachim Meyer 9571f3cda6 Don't check outdated cc job data. 2022-11-09 11:47:59 +01:00
Joachim Meyer 9e96f65977 Handle held / requeued jobs.
Requires cc-backend patch proposed in:
https://github.com/ClusterCockpit/cc-backend/issues/30
(Upstream treats start times as non-unique if they fall within the same 24 hrs.)
2022-11-09 10:30:04 +01:00
Joachim Meyer 21cdece420 Use value from actually used schema. 2022-11-08 17:45:47 +01:00
Joachim Meyer 253784d94d If no ToE, use eventtime. 2022-11-08 17:40:49 +01:00
Joachim Meyer 35c6ee3b47 Disable event 4. 2022-11-08 17:27:51 +01:00
Joachim Meyer fe641ca357 Fix file name 2022-11-08 16:47:59 +01:00
Joachim Meyer 308df9907e Start revamping to use the HTCondor EventLog instead of Slurm 2022-11-04 16:25:48 +01:00
Michael Schwarz c9aa4095fe Add systemd service and timer to start this script every minute 2022-09-06 15:02:28 +02:00
Michael Schwarz 57593358a2 Ignore tagged jobs 2022-08-30 11:00:25 +02:00
Michael Schwarz 631ed6c8b6 Little bugfix, there might be failed jobs without a step 2022-08-30 11:00:04 +02:00
Michael Schwarz 483bc0da1d Fix some layout issues in Readme.md 2022-08-25 16:09:22 +02:00
Michael Schwarz 54fbc4fa93 Initial commit 2022-08-25 15:38:06 +02:00
oscarminus 84d49f2807 Initial commit 2022-08-25 15:30:56 +02:00