cc-backend

mirror of https://github.com/ClusterCockpit/cc-backend synced 2026-07-16 20:40:36 +02:00

Author	SHA1	Message	Date
moebiusband	66707bbf15	Update metricstore documentation Entire-Checkpoint: 99f20c1edd90	2026-03-29 21:38:04 +02:00
moebiusband	fc47b12fed	fix: Pause WAL writes during binary checkpoint to prevent message drops WAL writes during checkpoint are redundant since the binary snapshot captures all in-memory data. Pausing eliminates channel saturation (1.4M+ dropped messages) caused by disk I/O contention between checkpoint writes and WAL staging. Also removes direct WAL file deletion in checkpoint workers that raced with the staging goroutine. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Entire-Checkpoint: 34d698f40bac	2026-03-29 11:13:39 +02:00
moebiusband	937984d11f	fix: WAL rotation skipped for all nodes due to non-blocking send on small channel RotateWALFiles used a non-blocking send (select/default) on rotation channels buffered at 64. With thousands of nodes and few shards, the channel fills instantly and nearly all hosts are skipped, leaving WAL files unrotated indefinitely. Replace with a blocking send using a shared 2-minute deadline so the checkpoint goroutine waits for the staging goroutine to drain the channel instead of immediately giving up. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Entire-Checkpoint: a1ec897216fa	2026-03-28 06:55:45 +01:00
moebiusband	cc3d03bb5b	fix: Unbounded growth of WAL files in case of checkpointing error Entire-Checkpoint: 95a89a7127c5	2026-03-28 06:26:21 +01:00
moebiusband	ac0a4cc39a	Increase shutdown timeouts and WAL flush interval Entire-Checkpoint: 94ee2fb97830	2026-03-27 09:56:34 +01:00
Aditya Ujeniya	6e97ac8b28	Verbose logs for DataDoesNotAlign error in CCMS	2026-03-26 14:13:12 +01:00
moebiusband	97d65a9e5c	Fix bugs in WAL journal pipeline Entire-Checkpoint: 8fe0de4e6ac2	2026-03-26 07:25:36 +01:00
moebiusband	e759810051	Add shutdown timings. Do not drain WAL buffers on shutdown Entire-Checkpoint: d4b497002f54	2026-03-26 07:02:37 +01:00
moebiusband	6f7dda53ee	Cleanup Entire-Checkpoint: ed68d32218ac	2026-03-24 07:03:46 +01:00
moebiusband	0325d9e866	fix: Increase throughput for WAL writers Entire-Checkpoint: ddd40d290c56	2026-03-24 06:53:12 +01:00
moebiusband	e41d1251ba	fix: Continue on error Entire-Checkpoint: 6000eb5a5bb8	2026-03-23 06:37:24 +01:00
moebiusband	586c902044	Restructure metricstore cleanup archiving to stay within 32k parquet-go limit Entire-Checkpoint: 1660b8cf2571	2026-03-23 06:32:24 +01:00
moebiusband	01ec70baa8	Iterate over subCluster MetricConfig directly so that removed metrics are not included Entire-Checkpoint: efb6f0a96069	2026-03-20 11:39:34 +01:00
moebiusband	09501df3c2	fix: reduce memory usage in parquet checkpoint archiver Stream CheckpointFile trees directly to parquet rows instead of materializing all rows in a giant intermediate slice. This eliminates ~1.9GB per host of redundant allocations (repeated string headers) and removes the expensive sort on millions of 104-byte structs. Key changes: - Replace flattenCheckpointFile + sortParquetRows + WriteHostRows with streaming WriteCheckpointFile that walks the tree with sorted keys - Reduce results channel buffer from len(hostEntries) to 2 for back-pressure (at most NumWorkers+2 results in flight) - Workers send CheckpointFile trees instead of []ParquetMetricRow - Write rows in small 1024-element batches via reusable buffer Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Entire-Checkpoint: f31dc1847539	2026-03-18 17:32:16 +01:00
moebiusband	8b132ed7f8	fix: Blocking ReceiveNats call Entire-Checkpoint: 38a235c86ceb	2026-03-18 06:47:45 +01:00
moebiusband	bf1a8a174e	fix: Shard WAL consumer for higher throughput Entire-Checkpoint: e583b7b11439	2026-03-18 06:32:14 +01:00
moebiusband	50aed595cf	fix: metricstore NATS contention Entire-Checkpoint: 7e68050cab59	2026-03-18 06:14:15 +01:00
moebiusband	d46e6371fc	Add log about checkpoint archiving Entire-Checkpoint: bf29af79b268	2026-03-18 05:22:39 +01:00
moebiusband	02f82c2c0b	fix: Prevent memory spikes in parquet writer for metricstore move policy Entire-Checkpoint: 4a675b8352a2	2026-03-18 05:08:37 +01:00
moebiusband	bab6eb4c3a	Convert Warn message on missing metrics to Debug level	2026-03-16 15:35:24 +01:00
moebiusband	51517f8031	Reduce insert pressure in db. Increase sqlite timeout value Entire-Checkpoint: a1e2931d4deb	2026-03-16 11:17:47 +01:00
moebiusband	39ab12784c	Make checkpointInterval an optional config option again. Also applies small fixes Entire-Checkpoint: c11d1a65fae4	2026-03-13 09:07:38 +01:00
moebiusband	b214e1755a	Add buffered I/O to WAL writes and fix MemoryCap comment WAL writes now go through bufio.Writer instead of raw syscalls per record, reducing I/O overhead. Buffers are flushed on rotate, drain, and shutdown. Fixed misleading MemoryCap comment ("Max bytes" → "Max memory in GB"). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Entire-Checkpoint: b38dc35e5334	2026-03-13 09:05:24 +01:00
moebiusband	8234ad3126	fix: Fix metricstore memory explosion from broken emergency free and batch aborts - Fix MemoryUsageTracker: remove premature bufferPool.Clear() that prevented mem.Alloc from decreasing, replace broken ForceFree loop (100 iterations with no GC) with progressive time-based Free at 75%/50%/25% retention, add bufferPool.Clear()+GC between steps so memory stats update correctly - Enable debug.FreeOSMemory() after emergency freeing to return memory to OS - Add adaptive ticker: 30s checks when memory >80% of cap, normal otherwise - Reduce default memory check interval from 1h to 5min - Don't abort entire NATS batch on single write error (out-of-order timestamp), log warning and continue processing remaining lines - Prune empty levels from tree after free() to reduce overhead - Include buffer struct overhead in sizeInBytes() for more accurate reporting Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Entire-Checkpoint: 7ce28627fc1d	2026-03-13 07:57:35 +01:00
moebiusband	845d0111af	Further consolidate and improve ccms query builder Entire-Checkpoint: d10e6221ee4f	2026-03-04 17:31:36 +01:00
moebiusband	26982088c3	Consolidate code for external and internal ccms buildQueries function Entire-Checkpoint: fc3be444ef4c	2026-03-04 16:43:05 +01:00
Christoph Kluge	9672903d41	fix panic caused by concurrent map writes	2026-03-04 15:54:18 +01:00
moebiusband	67a17b5306	Reduce noise in info log	2026-03-04 15:14:35 +01:00
moebiusband	39635ea123	Cleanup metricstore options Entire-Checkpoint: 2f9a4e1c2e87	2026-03-04 10:37:43 +01:00
Aditya Ujeniya	74ab51f409	Patch bufferPool with no limits to pool size	2026-03-03 09:51:04 +01:00
moebiusband	688ad507a2	Merge branch 'optimize-checkpoint-wal' into dev	2026-03-03 06:58:28 +01:00
Christoph Kluge	718ff60221	clarify ccms logs	2026-03-02 16:24:38 +01:00
Aditya Ujeniya	a243e17499	Update to shutdown worker for WAL checkpointing mode	2026-03-02 15:27:06 +01:00
moebiusband	1ec41d8389	Review and improve buffer pool implementation. Add unit tests.	2026-02-28 19:34:33 +01:00
moebiusband	888d7fb235	Merge branch 'optimize-checkpoint-wal' of github.com:ClusterCockpit/cc-backend into optimize-checkpoint-wal	2026-02-27 17:40:34 +01:00
moebiusband	adebffd251	Replace the old zip archive options for the metricstore node data by parquet files	2026-02-27 17:40:32 +01:00
Aditya Ujeniya	2e5d85c223	Update testcase	2026-02-27 15:09:06 +01:00
Aditya Ujeniya	07b989cb81	Add new bufferPool implementation	2026-02-27 14:44:32 +01:00
moebiusband	a418abc7d5	Run go fix	2026-02-27 14:40:26 +01:00
moebiusband	a1db8263d7	Document line protocol. Optimize REST writeMetric path	2026-02-27 12:30:27 +01:00
moebiusband	4c3cd8e66a	Merge branch 'dev' into optimize-checkpoint-wal	2026-02-27 09:30:32 +01:00
moebiusband	6ecb934967	Switch to CC line-protocol package. Update cc-lib.	2026-02-27 08:55:33 +01:00
moebiusband	ca0f9a42c7	Introduce metric store binary checkpoints with write ahead log	2026-02-26 10:08:40 +01:00
moebiusband	cc21e0e62c	Make json the default checkpoint format	2026-02-25 07:38:19 +01:00
moebiusband	dadcb983e7	Merge branch 'dev' into add_GetMemoryDomainsBySocket_2026	2026-02-23 18:47:03 +01:00
moebiusband	03c65e06f6	Allow finer control for omit tagged jobs in retention policies	2026-02-23 08:46:47 +01:00
Christoph Kluge	bae7ec11b4	migrate changes from cc-backend PR#364	2026-02-20 15:10:02 +01:00
moebiusband	2da35909c1	Optimize sort order in job parquet files	2026-02-18 08:13:00 +01:00
moebiusband	2e24fde430	Optimize sort order in nodestate parquet files	2026-02-18 08:06:00 +01:00
moebiusband	757be60b22	Switch to zstd compression for parquet writers	2026-02-18 07:55:28 +01:00

354 Commits