Commit Graph

10 Commits

Author SHA1 Message Date
Joachim Meyer
e7f54aa1b7 Don't monitor jobs with +NoMonitoring=true. 2024-07-03 14:06:58 -07:00
Joachim Meyer
8c28b2f287 Add timeout state for 12hr preempted jobs. 2023-07-06 09:33:24 +02:00
Joachim Meyer
e70d377047 Make gpu map script return the normalized node names already. 2023-06-29 10:01:19 +02:00
Joachim Meyer
063713d18c We have the gpu map for that... :/ 2023-02-08 08:21:01 +01:00
Joachim Meyer
4c22c4c6eb Need all 8 zeros for cc-metric-collector. 2023-02-07 17:01:34 +01:00
Joachim Meyer
019bfa5ee9 JobCurrentStartDate & EnteredCurrentStatus differ.
When a job starts, the two values differ by about 1s, so the JobCurrentStartDate used at the end does not match the EnteredCurrentStatus used at the start, and CC fails to stop the right job.
2022-12-20 17:15:01 +01:00
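
The mismatch described in this commit can be illustrated with a minimal sketch. It assumes the monitoring backend identifies a running job by the pair of job ID and start time. `FakeMonitor`, the job ID, and the timestamps are hypothetical stand-ins, not code or values from this repository.

```python
class FakeMonitor:
    """Hypothetical stand-in for a backend that keys running jobs by (job_id, start_time)."""

    def __init__(self):
        self.running = set()

    def start(self, job_id, start_time):
        self.running.add((job_id, start_time))

    def stop(self, job_id, start_time):
        # Only succeeds when the start time matches the one sent at start.
        key = (job_id, start_time)
        if key in self.running:
            self.running.remove(key)
            return True
        return False


monitor = FakeMonitor()

# Illustrative values: the two attributes differ by one second.
entered_current_status = 1671552900
job_current_start_date = 1671552901

monitor.start("1234.0", entered_current_status)

# Stopping with the other attribute misses the job.
mismatched = monitor.stop("1234.0", job_current_start_date)

# Stopping with the attribute used at start finds it.
matched = monitor.stop("1234.0", entered_current_status)
```

Whichever attribute is chosen, using the same one for both the start and the stop request avoids the mismatch.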
Joachim Meyer
e38e6fa5b9 Don't start on JobStatus chng & stop on R -> IDLE.
This handles edge cases where a startd died and the job is requeued after a timeout of 1hr.
On the IDLE -> RUNNING transition, all other attributes (like RemoteHost) are then already set from the last execution.
We should wait for `CpusProvisioned`, though, so that all required values are already up-to-date.

Additionally, since timed-out jobs are not held, but instead just requeued, stop jobs as well when the state changes from `RUNNING` to `IDLE`.
2022-12-19 11:24:24 +01:00
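
The rule described in this commit could be sketched roughly as follows. This is an illustration only: it assumes each job update arrives as a single attribute change with its old and new value, and `decide_action` is a hypothetical helper, not a function from this repository. The numeric codes are the standard HTCondor `JobStatus` values (1 = idle, 2 = running).

```python
IDLE = 1     # HTCondor JobStatus code for an idle job
RUNNING = 2  # HTCondor JobStatus code for a running job


def decide_action(attribute, old_value, new_value):
    """Return "start", "stop", or None for one attribute update (hypothetical event shape)."""
    # Start only once CpusProvisioned is set, so that attributes left over
    # from a previous execution (like RemoteHost) have been refreshed.
    if attribute == "CpusProvisioned":
        return "start"
    # Timed-out jobs are requeued rather than held, so RUNNING -> IDLE
    # also has to stop the job.
    if attribute == "JobStatus" and old_value == RUNNING and new_value == IDLE:
        return "stop"
    # Any other update, including IDLE -> RUNNING, triggers nothing by itself.
    return None
```

For example, `decide_action("JobStatus", IDLE, RUNNING)` yields no action, while a later `CpusProvisioned` update for the same job yields `"start"`.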
Joachim Meyer
6128b58cbd Fix AssignedGPUs parsing. 2022-12-16 13:42:04 +01:00
Joachim Meyer
80619b6154 Limit verbosity a bit (SCHEDD_DEBUG=D_FULLDEBUG). 2022-12-16 09:35:10 +01:00
Joachim Meyer
b530d9034e Restructure. 2022-12-15 16:20:26 +01:00