Commit Graph

2710 Commits

Author SHA1 Message Date
Christoph Kluge
bb6915771d fix: clarify title 2026-03-18 13:23:33 +01:00
8b0881fb17 Exclude down nodes from HealthCheck
Entire-Checkpoint: 0c3347168c79
2026-03-18 11:20:12 +01:00
Christoph Kluge
33beb3c806 fix: simplify stats query condition
- caused expensive subquery without need in frontend
2026-03-18 11:07:57 +01:00
c1d51959d5 Change determineState to enforce priority order
Make an exception: if a node is idle + down, the final state is idle

Entire-Checkpoint: 92c797737df8
2026-03-18 10:57:06 +01:00
3328d2ca11 Update go version in CLAUDE.md 2026-03-18 10:37:32 +01:00
8f10eba771 Extend CLAUDE.md
Entire-Checkpoint: 17cdf997acff
2026-03-18 10:05:09 +01:00
c449996559 Add context to log message
Entire-Checkpoint: 55d95cdef0d4
2026-03-18 09:43:41 +01:00
51ae2a5d10 Remove tracked .entire/metadata/ files from git
These conversation transcript files were committed before the gitignore
rule existed. They are now properly ignored via .entire/.gitignore.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-18 07:10:01 +01:00
6ebc9e88fa Add more context information to auth failed log
Entire-Checkpoint: 2187cd89cb78
2026-03-18 06:56:01 +01:00
8b132ed7f8 fix: Blocking ReceiveNats call
Entire-Checkpoint: 38a235c86ceb
2026-03-18 06:47:45 +01:00
bf1a8a174e fix: Shard WAL consumer for higher throughput
Entire-Checkpoint: e583b7b11439
2026-03-18 06:32:14 +01:00
50aed595cf fix: metricstore NATS contention
Entire-Checkpoint: 7e68050cab59
2026-03-18 06:14:15 +01:00
33bc19c732 Upgrade cc-lib 2026-03-18 05:52:58 +01:00
045f81f985 Prepare release v1.5.2
Entire-Checkpoint: 9286f4c43ab5
2026-03-18 05:31:49 +01:00
d46e6371fc Add log about checkpoint archiving
Entire-Checkpoint: bf29af79b268
2026-03-18 05:22:39 +01:00
02f82c2c0b fix: Prevent memory spikes in parquet writer for metricstore move policy
Entire-Checkpoint: 4a675b8352a2
2026-03-18 05:08:37 +01:00
3314b8e284 Ignore ErrNoRows error. Include calling function in log.
Entire-Checkpoint: 20746187d135
2026-03-16 20:09:44 +01:00
6855d62bf2 Make log in scanRow more descriptive. No log for common no rows error
Entire-Checkpoint: 858b34ef56b8
2026-03-16 20:03:27 +01:00
7f3eb443d9 Include calling function in error message
Entire-Checkpoint: a4948d0fe7a3
2026-03-16 15:42:38 +01:00
bab6eb4c3a Convert Warn message on missing metrics to Debug level 2026-03-16 15:35:24 +01:00
09d0ba71d2 Provide identical nodestate functionality in NATS API
Entire-Checkpoint: 3a40b75edd68
2026-03-16 12:13:14 +01:00
df93dbed63 Add busyTimeout config setting
Entire-Checkpoint: 81097a6c52a2
2026-03-16 11:30:21 +01:00
e4f3fa9ba0 Wrap SyncJobs in transaction
Entire-Checkpoint: d4f6c79a8dc1
2026-03-16 11:25:49 +01:00
51517f8031 Reduce insert pressure in db. Increase sqlite timeout value
Entire-Checkpoint: a1e2931d4deb
2026-03-16 11:17:47 +01:00
0aad8f01c8 Upgrade cc-lib
Fixes panic in AddNodeScope

Entire-Checkpoint: afef27e07ec9
2026-03-16 08:55:56 +01:00
973ca87bd1 Extend known issues in ReleaseNotes 2026-03-15 07:02:54 +01:00
045311eec0 Prepare release 1.5.1
Entire-Checkpoint: baed7fbee099
2026-03-13 17:30:03 +01:00
e38396a081 Upgrade dependencies. Rebuild GraphQL.
Entire-Checkpoint: f770853c9fa0
2026-03-13 17:22:34 +01:00
e83bd2babd Consolidate migrations
Entire-Checkpoint: a3dba4105838
2026-03-13 17:14:13 +01:00
Christoph Kluge
ba366d0d72 use inline literals in simple queries, add downgrade optimize 2026-03-13 15:16:19 +01:00
f15f1452cc Inline jobstate literal in query
Entire-Checkpoint: 35f06df74b51
2026-03-13 15:16:07 +01:00
df2a13def2 Merge branch 'hotfix' of github.com:ClusterCockpit/cc-backend into hotfix 2026-03-13 14:34:11 +01:00
d586fe4b43 Optimize usage dashboard: partial indexes, request cache, parallel histograms
- Add migration 14: partial covering indexes WHERE job_state='running'
  for user/project/subcluster groupings (tiny B-tree vs full table)
- Inline literal state value in BuildWhereClause so SQLite matches
  partial indexes instead of parameterized placeholders
- Add per-request statsGroupCache (sync.Once per filter+groupBy key)
  so identical grouped stats queries execute only once per GQL operation
- Parallelize 4 histogram queries in AddHistograms using errgroup
- Consolidate frontend from 6 GQL aliases to 2, sort+slice top-10
  client-side via $derived

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 5b26a6e5ff10
2026-03-13 14:31:37 +01:00
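
The per-request cache mentioned in d586fe4b43 (one sync.Once per filter+groupBy key) can be illustrated with a minimal sketch. This is not the code from the commit: the names statsCache, groupEntry, and get, the string key format, and the []string result type are all hypothetical and chosen only to show the pattern of running an identical grouped query once per operation.

```go
package main

import (
	"fmt"
	"sync"
)

// groupEntry holds the result of one grouped stats query (hypothetical type).
// The sync.Once guarantees the loader runs at most once for this entry.
type groupEntry struct {
	once sync.Once
	rows []string
	err  error
}

// statsCache is a hypothetical per-request cache; one instance would be
// created per GraphQL operation and discarded afterwards.
type statsCache struct {
	mu      sync.Mutex
	entries map[string]*groupEntry
}

func newStatsCache() *statsCache {
	return &statsCache{entries: make(map[string]*groupEntry)}
}

// get returns the result for key, calling load only on the first request.
// Concurrent callers with the same key block until the first load finishes.
func (c *statsCache) get(key string, load func() ([]string, error)) ([]string, error) {
	c.mu.Lock()
	e, ok := c.entries[key]
	if !ok {
		e = &groupEntry{}
		c.entries[key] = e
	}
	c.mu.Unlock()

	e.once.Do(func() { e.rows, e.err = load() })
	return e.rows, e.err
}

func main() {
	cache := newStatsCache()
	calls := 0
	load := func() ([]string, error) {
		calls++ // only ever executed inside once.Do, so no data race
		return []string{"alice", "bob"}, nil
	}

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Assumed key format: filter and groupBy joined into one string.
			if _, err := cache.get("running|user", load); err != nil {
				panic(err)
			}
		}()
	}
	wg.Wait()
	fmt.Println("loader calls:", calls)
}
```

The mutex only protects the map lookup; the potentially slow query runs outside the lock, so requests for different keys do not serialize behind each other.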
Christoph Kluge
bc214f6cea add nullsafes to frontend 2026-03-13 14:20:45 +01:00
cbe46c3524 Merge branch 'hotfix' of github.com:ClusterCockpit/cc-backend into hotfix 2026-03-13 13:17:34 +01:00
0037d969b2 Consolidate UsageDash into single GraphQL query
Merge three separate queries (topJobsQuery, topNodesQuery, topAccsQuery)
into one topStatsQuery with 6 aliased jobsStatistics fields, reducing
3 HTTP round trips to 1 on the status dashboard.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 40d806a3240c
2026-03-13 13:14:29 +01:00
dd3e5427f4 Add covering indexes for status/dashboard queries (migration 13)
Adds composite covering indexes on (cluster, job_state, <group_col>, ...)
for user, project, and subcluster groupings to enable index-only scans
for status views. Drops subsumed 3-column indexes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 3d8def28e96e
2026-03-13 13:12:54 +01:00
Christoph Kluge
e666980184 fix typo 2026-03-13 12:07:43 +01:00
Christoph Kluge
c238f68af6 reduce unnecessary complexity 2026-03-13 12:05:16 +01:00
Christoph Kluge
58c0c79f72 handle single job state queries as simple stringquery
- this will improve index usage for single state queries
2026-03-13 12:03:06 +01:00
Christoph Kluge
c23d7bd5e5 remove non-required sorting params
- caused expensive DB scans without use or need
2026-03-13 11:27:45 +01:00
Christoph Kluge
41114f7eda reorder frontend coded filters to match db indices 2026-03-13 10:48:38 +01:00
Christoph Kluge
a877937a25 add missing downgrade index drop, add optimize after downgrades 2026-03-13 10:11:11 +01:00
39ab12784c Make checkpointInterval an optional config option again.
Also applies small fixes

Entire-Checkpoint: c11d1a65fae4
2026-03-13 09:07:38 +01:00
b214e1755a Add buffered I/O to WAL writes and fix MemoryCap comment
WAL writes now go through bufio.Writer instead of raw syscalls per record,
reducing I/O overhead. Buffers are flushed on rotate, drain, and shutdown.
Fixed misleading MemoryCap comment ("Max bytes" → "Max memory in GB").

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: b38dc35e5334
2026-03-13 09:05:24 +01:00
a4f9ba6975 Apply correct log level
Entire-Checkpoint: 8288af281b94
2026-03-13 07:58:57 +01:00
8234ad3126 fix: Fix metricstore memory explosion from broken emergency free and batch aborts
- Fix MemoryUsageTracker: remove premature bufferPool.Clear() that prevented
  mem.Alloc from decreasing, replace broken ForceFree loop (100 iterations
  with no GC) with progressive time-based Free at 75%/50%/25% retention,
  add bufferPool.Clear()+GC between steps so memory stats update correctly
- Enable debug.FreeOSMemory() after emergency freeing to return memory to OS
- Add adaptive ticker: 30s checks when memory >80% of cap, normal otherwise
- Reduce default memory check interval from 1h to 5min
- Don't abort entire NATS batch on single write error (out-of-order timestamp),
  log warning and continue processing remaining lines
- Prune empty levels from tree after free() to reduce overhead
- Include buffer struct overhead in sizeInBytes() for more accurate reporting

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 7ce28627fc1d
2026-03-13 07:57:35 +01:00
126f65879a Update release information
Entire-Checkpoint: 9f282c3d9570
2026-03-13 06:23:33 +01:00
3aacc669b6 Remove debug timer 2026-03-13 06:18:21 +01:00
96fc44a649 fix: Optimize project stat query 2026-03-13 06:06:38 +01:00