39ab12784c
Make checkpointInterval an optional config option again.
...
Also applies small fixes
Entire-Checkpoint: c11d1a65fae4
2026-03-13 09:07:38 +01:00
b214e1755a
Add buffered I/O to WAL writes and fix MemoryCap comment
...
WAL writes now go through bufio.Writer instead of raw syscalls per record,
reducing I/O overhead. Buffers are flushed on rotate, drain, and shutdown.
Fixed misleading MemoryCap comment ("Max bytes" → "Max memory in GB").
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: b38dc35e5334
2026-03-13 09:05:24 +01:00
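The buffered WAL write path described in commit b214e1755a can be sketched roughly as follows. This is a minimal illustration, not the cc-backend code: `walWriter`, `countingSink`, `Append`, and `Flush` are hypothetical names. Only the use of `bufio.Writer` and the flush points (rotate, drain, shutdown) come from the commit message.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

// countingSink is a hypothetical stand-in for the WAL file. It counts how
// often the underlying writer is called, which for a real file would roughly
// correspond to write syscalls.
type countingSink struct {
	writes int
	bytes  int
}

func (s *countingSink) Write(p []byte) (int, error) {
	s.writes++
	s.bytes += len(p)
	return len(p), nil
}

// walWriter is a hypothetical wrapper; the real type in cc-backend may differ.
type walWriter struct {
	buf *bufio.Writer
}

func newWALWriter(w io.Writer) *walWriter {
	return &walWriter{buf: bufio.NewWriter(w)}
}

// Append copies one record into the in-memory buffer. The underlying writer
// is only called when the buffer fills up or Flush is called.
func (w *walWriter) Append(record []byte) error {
	_, err := w.buf.Write(record)
	return err
}

// Flush pushes all buffered records to the underlying writer. According to
// the commit message this happens on rotate, drain, and shutdown.
func (w *walWriter) Flush() error {
	return w.buf.Flush()
}

func main() {
	sink := &countingSink{}
	wal := newWALWriter(sink)
	for i := 0; i < 3; i++ {
		if err := wal.Append([]byte("rec\n")); err != nil {
			panic(err)
		}
	}
	fmt.Println("writes before flush:", sink.writes)
	if err := wal.Flush(); err != nil {
		panic(err)
	}
	fmt.Println("writes after flush:", sink.writes, "bytes:", sink.bytes)
}
```

The trade-off is that records still sitting in the buffer are lost if the process dies before a flush, which is presumably why the commit flushes at every rotate, drain, and shutdown.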
8234ad3126
fix: Fix metricstore memory explosion from broken emergency free and batch aborts
...
- Fix MemoryUsageTracker: remove premature bufferPool.Clear() that prevented
  mem.Alloc from decreasing, replace broken ForceFree loop (100 iterations
  with no GC) with progressive time-based Free at 75%/50%/25% retention,
  add bufferPool.Clear()+GC between steps so memory stats update correctly
- Enable debug.FreeOSMemory() after emergency freeing to return memory to OS
- Add adaptive ticker: 30s checks when memory >80% of cap, normal otherwise
- Reduce default memory check interval from 1h to 5min
- Don't abort entire NATS batch on single write error (out-of-order timestamp),
  log warning and continue processing remaining lines
- Prune empty levels from tree after free() to reduce overhead
- Include buffer struct overhead in sizeInBytes() for more accurate reporting
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 7ce28627fc1d
2026-03-13 07:57:35 +01:00
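Two of the ideas in commit 8234ad3126, the adaptive check interval and the progressive emergency free, can be sketched as below. This is an illustration under stated assumptions, not the actual MemoryUsageTracker: `store`, `fakeStore`, `nextCheckInterval`, `emergencyFree`, and `exampleRetention` are hypothetical, and the fake store simply halves its usage on every free. Only the 80% threshold, the 30s and 5min intervals, and the 75%/50%/25% retention steps are taken from the commit message.

```go
package main

import (
	"fmt"
	"time"
)

const (
	normalInterval   = 5 * time.Minute  // default check interval per the commit
	fastInterval     = 30 * time.Second // used when memory is above 80% of cap
	exampleRetention = 48 * time.Hour   // hypothetical configured retention
)

// nextCheckInterval returns the fast interval when usage exceeds 80% of the
// cap. Integer form of used > 0.8*cap, to avoid float rounding at the edge.
func nextCheckInterval(usedBytes, capBytes uint64, normal time.Duration) time.Duration {
	if usedBytes*5 > capBytes*4 {
		return fastInterval
	}
	return normal
}

// store is a hypothetical interface over the metric store.
type store interface {
	Free(keep time.Duration) // drop data older than keep
	Usage() uint64           // current memory usage in bytes
}

// emergencyFree shrinks the retention window to 75%, 50%, then 25% of the
// configured value and stops as soon as usage is at or below the cap. It
// returns the last retention that was applied. In the real code the commit
// says the buffer pool is cleared and a GC is run between steps so that the
// memory statistics reflect the freed data.
func emergencyFree(s store, retention time.Duration, capBytes uint64) time.Duration {
	kept := retention
	for _, pct := range []int{75, 50, 25} {
		if s.Usage() <= capBytes {
			break
		}
		kept = retention * time.Duration(pct) / 100
		s.Free(kept)
	}
	return kept
}

// fakeStore halves its usage on every Free call, purely for demonstration.
type fakeStore struct {
	used  uint64
	freed []time.Duration
}

func (f *fakeStore) Free(keep time.Duration) {
	f.freed = append(f.freed, keep)
	f.used /= 2
}

func (f *fakeStore) Usage() uint64 { return f.used }

func main() {
	fmt.Println("interval at 90% of cap:", nextCheckInterval(90, 100, normalInterval))
	fmt.Println("interval at 50% of cap:", nextCheckInterval(50, 100, normalInterval))

	s := &fakeStore{used: 100}
	kept := emergencyFree(s, exampleRetention, 30)
	fmt.Println("retention after emergency free:", kept, "steps:", len(s.freed))
}
```

Stepping the retention down gradually keeps as much recent data as the cap allows, instead of discarding everything at the first sign of pressure.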
845d0111af
Further consolidate and improve ccms query builder
...
Entire-Checkpoint: d10e6221ee4f
2026-03-04 17:31:36 +01:00
26982088c3
Consolidate code for external and internal ccms buildQueries function
...
Entire-Checkpoint: fc3be444ef4c
2026-03-04 16:43:05 +01:00
67a17b5306
Reduce noise in info log
2026-03-04 15:14:35 +01:00
39635ea123
Cleanup metricstore options
...
Entire-Checkpoint: 2f9a4e1c2e87
2026-03-04 10:37:43 +01:00
Aditya Ujeniya
74ab51f409
Patch bufferPool with no limits to pool size
2026-03-03 09:51:04 +01:00
688ad507a2
Merge branch 'optimize-checkpoint-wal' into dev
2026-03-03 06:58:28 +01:00
Christoph Kluge
718ff60221
clarify ccms logs
2026-03-02 16:24:38 +01:00
Aditya Ujeniya
a243e17499
Update to shutdown worker for WAL checkpointing mode
2026-03-02 15:27:06 +01:00
1ec41d8389
Review and improve buffer pool implementation. Add unit tests.
2026-02-28 19:34:33 +01:00
888d7fb235
Merge branch 'optimize-checkpoint-wal' of github.com:ClusterCockpit/cc-backend into optimize-checkpoint-wal
2026-02-27 17:40:34 +01:00
adebffd251
Replace the old zip archive options for the metricstore node data by parquet files
2026-02-27 17:40:32 +01:00
Aditya Ujeniya
2e5d85c223
Update test case
2026-02-27 15:09:06 +01:00
Aditya Ujeniya
07b989cb81
Add new bufferPool implementation
2026-02-27 14:44:32 +01:00
a418abc7d5
Run go fix
2026-02-27 14:40:26 +01:00
a1db8263d7
Document line protocol. Optimize REST writeMetric path
2026-02-27 12:30:27 +01:00
4c3cd8e66a
Merge branch 'dev' into optimize-checkpoint-wal
2026-02-27 09:30:32 +01:00
6ecb934967
Switch to CC line-protocol package. Update cc-lib.
2026-02-27 08:55:33 +01:00
ca0f9a42c7
Introduce metric store binary checkpoints with write ahead log
2026-02-26 10:08:40 +01:00
cc21e0e62c
Make json the default checkpoint format
2026-02-25 07:38:19 +01:00
Christoph Kluge
bae7ec11b4
migrate changes from cc-backend PR#364
2026-02-20 15:10:02 +01:00
6035b62734
Run go fix
2026-02-17 21:04:17 +01:00
Aditya Ujeniya
1cf2c41bd7
Resize the buffers and put them into the pool
2026-02-16 18:21:45 +01:00
Aditya Ujeniya
2eeefc2720
Add healthCheck support for external CCMS
2026-02-16 16:57:17 +01:00
865cd3db54
Persist faulty nodestate metric lists to db
2026-02-12 08:48:15 +01:00
8d6c6b819b
Update and port to cc-lib
2026-02-11 07:06:06 +01:00
a8194de492
Add diagnostic output for healthcheck
2026-02-07 06:17:34 +01:00
a8d385a1ee
Update HealthCheck again. Still WIP
2026-02-06 16:35:02 +01:00
5579b6f40c
Adapt unit test to new API
2026-02-06 16:11:10 +01:00
7123a8c1cc
Updated HealthCheck implementation WIP
2026-02-06 16:04:01 +01:00
f671d8df90
Add counts in healthcheck for logging output
2026-02-06 09:25:09 +01:00
Aditya Ujeniya
fcb37b0367
Update to count healthy metrics
2026-02-06 08:45:36 +01:00
0984c1d431
Add debug log with degraded and missing metrics for healthcheck
2026-02-06 07:21:04 +01:00
5d7dd62b72
Update unit test for new HealthCheck update
2026-02-04 12:53:24 +01:00
46fb52d67e
Adapt documentation
2026-02-04 12:30:33 +01:00
Aditya Ujeniya
39b8356683
Optimized CCMS healthcheck
2026-02-04 10:24:45 +01:00
42ce598865
Merge branch 'dev' of github.com:ClusterCockpit/cc-backend into dev
2026-02-03 18:35:35 +01:00
0d62a300e7
Intermediate state of node Healthcheck
...
TODOS:
* Remove error handling from routine and simplify API call
* Use map for hardware level metrics
2026-02-03 18:35:17 +01:00
Aditya Ujeniya
3cf88f757c
Update to checkpoint loader in CCMS
2026-02-03 16:25:48 +01:00
248f11f4f8
Change API of Node HealthState
2026-02-03 14:55:12 +01:00
00a41373e8
Add monitoring healthstate support in nodestate API.
2026-02-03 12:23:24 +01:00
Aditya Ujeniya
a71341064e
Update to MetricStore HealthCheck API
2026-01-30 23:24:16 +01:00
Christoph Kluge
32f0664012
add indicator to nodeView state, cap bubble size in roofline
2026-01-30 14:32:41 +01:00
Christoph Kluge
4deec9a170
no append if ErrNoHostOrMetric fired
2026-01-29 15:18:50 +01:00
Aditya Ujeniya
7101d2bb3b
Handle the metric/host not found case differently
2026-01-28 17:47:38 +01:00
0d857b49a2
Disable explicit GC calls
2026-01-28 11:21:27 +01:00
eb5aa9ad02
Disable explicit GC calls
2026-01-28 11:21:02 +01:00
9d15a87c88
Take into account the real allocated heap memory in MemoryUsageTracker
2026-01-27 18:23:09 +01:00