Commit Graph

32 Commits

Author SHA1 Message Date
Joachim Meyer
e7f54aa1b7 Don't monitor jobs with +NoMonitoring=true. 2024-07-03 14:06:58 -07:00
Joachim Meyer
0a70035977 Add MIT license file. 2023-07-20 11:16:38 +02:00
Joachim Meyer
6cc171d7aa Cleanup unused systemd files. 2023-07-20 11:12:24 +02:00
Joachim Meyer
8c28b2f287 Add timeout state for 12hr preempted jobs. 2023-07-06 09:33:24 +02:00
Joachim Meyer
e70d377047 Make gpu map script return the normalized node names already. 2023-06-29 10:01:19 +02:00
Joachim Meyer
7a0c41e378 Make stop script more robust.
And handle pagination.
2023-05-08 09:40:50 +02:00
Joachim Meyer
89f6440e0f Add new script to stop jobs that were (for some reason) not stopped in CC. 2023-03-22 10:51:00 +01:00
Joachim Meyer
4259611495 GPU map needs all 8 leading 0s.. 2023-02-08 09:10:25 +01:00
Joachim Meyer
063713d18c We have the gpu map for that... :/ 2023-02-08 08:21:01 +01:00
Joachim Meyer
4c22c4c6eb Need all 8 zeros for cc-metric-collector. 2023-02-07 17:01:34 +01:00
Joachim Meyer
019bfa5ee9 JobCurrentStartDate & EnteredCurrentStatus differ.
When a job starts, the two values differ by 1s or so, so using JobCurrentStartDate at the end does not match the EnteredCurrentStatus used at the start, and CC fails to stop the right job.
2022-12-20 17:15:01 +01:00
Joachim Meyer
e38e6fa5b9 Don't start on JobStatus change & stop on RUNNING -> IDLE.
This handles edge cases where a startd died and the job is requeued after a timeout of 1hr.
The IDLE -> RUNNING transition then already has all other attributes (like RemoteHost) set from the last execution.
We should wait for `CpusProvisioned`, though, so that all required values are already up-to-date.

Additionally, since timed-out jobs are not held, but instead just requeued, stop jobs as well when the state changes from `RUNNING` to `IDLE`.
2022-12-19 11:24:24 +01:00
Joachim Meyer
6128b58cbd Fix AssignedGPUs parsing. 2022-12-16 13:42:04 +01:00
Joachim Meyer
80619b6154 Limit verbosity a bit (SCHEDD_DEBUG=D_FULLDEBUG). 2022-12-16 09:35:10 +01:00
Joachim Meyer
b530d9034e Restructure. 2022-12-15 16:20:26 +01:00
Joachim Meyer
a3ca962d84 Add Schedd plugin to sync with CC.
This should be much more reliable, albeit more prone to crashing an HTCondor component (the schedd) if there's a bug...
2022-12-15 16:13:45 +01:00
Joachim Meyer
d83f263dba Also stop jobs that ended with shadow exception. 2022-11-29 15:48:05 +01:00
Joachim Meyer
6175affa55 ULOG_JOB_RECONNECT_FAILED is a stop reason. 2022-11-16 13:07:01 +01:00
Joachim Meyer
09334ab4f1 Offset ArrayJobId with submitnode id.. 2022-11-09 14:27:53 +01:00
Joachim Meyer
9571f3cda6 Don't check outdated cc job data. 2022-11-09 11:47:59 +01:00
Joachim Meyer
9e96f65977 Handle held / requeued jobs.
Requires cc-backend patch proposed in:
https://github.com/ClusterCockpit/cc-backend/issues/30
(Upstream assumes startTime to be non-unique if they happened in the same 24hrs).
2022-11-09 10:30:04 +01:00
Joachim Meyer
21cdece420 Use value from actually used schema. 2022-11-08 17:45:47 +01:00
Joachim Meyer
253784d94d If no ToE, use eventtime. 2022-11-08 17:40:49 +01:00
Joachim Meyer
35c6ee3b47 Disable event 4. 2022-11-08 17:27:51 +01:00
Joachim Meyer
fe641ca357 Fix file name 2022-11-08 16:47:59 +01:00
Joachim Meyer
308df9907e Start revamping to use htcondor EventLog not slurm 2022-11-04 16:25:48 +01:00
Michael Schwarz
c9aa4095fe Add systemd service and timer to start this script every minute 2022-09-06 15:02:28 +02:00
Michael Schwarz
57593358a2 Ignore tagged jobs 2022-08-30 11:00:25 +02:00
Michael Schwarz
631ed6c8b6 Little bugfix, there might be failed jobs without a step 2022-08-30 11:00:04 +02:00
Michael Schwarz
483bc0da1d Fix some layout issues in Readme.md 2022-08-25 16:09:22 +02:00
Michael Schwarz
54fbc4fa93 Initial commit 2022-08-25 15:38:06 +02:00
oscarminus
84d49f2807 Initial commit 2022-08-25 15:30:56 +02:00
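
The start/stop rule described in commit e38e6fa5b9 can be illustrated with a minimal sketch. This is not the plugin's code: the function name `on_attribute_update`, the plain-dict job representation, and the idea that updates arrive one attribute at a time are all assumptions made for illustration. It also assumes that `CpusProvisioned` is written after the job has entered RUNNING. Only the JobStatus codes (1 = IDLE, 2 = RUNNING) are standard HTCondor values.

```python
from typing import Any, Optional

# Standard HTCondor JobStatus codes.
IDLE, RUNNING = 1, 2


def on_attribute_update(job: dict, name: str, value: Any) -> Optional[str]:
    """Apply one attribute update and return "start", "stop", or None.

    Hypothetical helper: `job` is a plain dict standing in for a job ClassAd.
    """
    old_status = job.get("JobStatus")
    job[name] = value

    if name == "JobStatus":
        # Timed-out jobs are requeued, not held, so RUNNING -> IDLE is a stop.
        if old_status == RUNNING and value == IDLE:
            return "stop"
        # Never start on the status change itself: after a requeue, attributes
        # such as RemoteHost may still hold values from the last execution.
        return None

    # Start only once CpusProvisioned is written for a running job
    # (assumed to happen after the other attributes are up to date).
    if name == "CpusProvisioned" and job.get("JobStatus") == RUNNING:
        return "start"

    return None
```

For a requeued job that still carries a stale `RemoteHost`, the IDLE -> RUNNING update yields no action, the following `CpusProvisioned` update yields "start", and a later RUNNING -> IDLE update yields "stop".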