WAL writes during checkpoint are redundant since the binary snapshot
captures all in-memory data. Pausing eliminates channel saturation
(1.4M+ dropped messages) caused by disk I/O contention between
checkpoint writes and WAL staging. Also removes direct WAL file
deletion in checkpoint workers that raced with the staging goroutine.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 34d698f40bac
RotateWALFiles used a non-blocking send (select/default) on rotation
channels buffered at 64. With thousands of nodes and few shards, the
channel fills instantly and nearly all hosts are skipped, leaving WAL
files unrotated indefinitely.
Replace with a blocking send using a shared 2-minute deadline so the
checkpoint goroutine waits for the staging goroutine to drain the
channel instead of immediately giving up.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: a1ec897216fa
Stream CheckpointFile trees directly to parquet rows instead of
materializing all rows in a giant intermediate slice. This eliminates
~1.9GB per host of redundant allocations (repeated string headers)
and removes the expensive sort on millions of 104-byte structs.
Key changes:
- Replace flattenCheckpointFile + sortParquetRows + WriteHostRows with
streaming WriteCheckpointFile that walks the tree with sorted keys
- Reduce results channel buffer from len(hostEntries) to 2 for
back-pressure (at most NumWorkers+2 results in flight)
- Workers send CheckpointFile trees instead of []ParquetMetricRow
- Write rows in small 1024-element batches via reusable buffer
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: f31dc1847539
WAL writes now go through bufio.Writer instead of raw syscalls per record,
reducing I/O overhead. Buffers are flushed on rotate, drain, and shutdown.
Fixed misleading MemoryCap comment ("Max bytes" → "Max memory in GB").
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: b38dc35e5334
- Fix MemoryUsageTracker: remove premature bufferPool.Clear() that prevented
mem.Alloc from decreasing, replace broken ForceFree loop (100 iterations
with no GC) with progressive time-based Free at 75%/50%/25% retention,
add bufferPool.Clear()+GC between steps so memory stats update correctly
- Enable debug.FreeOSMemory() after emergency freeing to return memory to OS
- Add adaptive ticker: 30s checks when memory >80% of cap, normal otherwise
- Reduce default memory check interval from 1h to 5min
- Don't abort entire NATS batch on single write error (out-of-order timestamp),
log warning and continue processing remaining lines
- Prune empty levels from tree after free() to reduce overhead
- Include buffer struct overhead in sizeInBytes() for more accurate reporting
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Entire-Checkpoint: 7ce28627fc1d