mirror of https://github.com/ClusterCockpit/cc-backend, synced 2026-03-24 00:27:29 +01:00
Checkpoint: e368e6d8abf3
Entire-Session: 520afa6a-6a70-437b-96c1-35c40ed3ec48 Entire-Strategy: manual-commit Entire-Agent: Claude Code Ephemeral-branch: entire/301e590-e3b0c4
1
e3/68e6d8abf3/1/content_hash.txt
Normal file
@@ -0,0 +1 @@
sha256:f187013ac2acf6db7e6f13db2bfe1ab2c10050fe3d8ffd3d41122449dcf54b3c
24
e3/68e6d8abf3/1/context.md
Normal file
@@ -0,0 +1,24 @@
# Session Context

## User Prompts

### Prompt 1

Implement the following plan:

# Fix SQLite Memory Not Released After Query Timeout

## Context

On the production 20M-row database, when a query runs into a timeout (due to a full-table scan caused by a wrong index), the memory allocated by SQLite is **not released afterwards**. The process stays bloated until restarted. This is caused by several compounding issues in the current SQLite configuration.

## Root Cause Analysis

### 1. `_cache_size=1000000000` is effectively unlimited (~4TB)

**File:** `i...

### Prompt 2

Our server has 512GB of main memory. Does it make sense to make `cache_size` and `soft_heap_limit` configurable so we can make use of that capacity?
213
e3/68e6d8abf3/1/full.jsonl
Normal file
File diff suppressed because one or more lines are too long
34
e3/68e6d8abf3/1/metadata.json
Normal file
@@ -0,0 +1,34 @@
{
  "cli_version": "0.4.8",
  "checkpoint_id": "e368e6d8abf3",
  "session_id": "520afa6a-6a70-437b-96c1-35c40ed3ec48",
  "strategy": "manual-commit",
  "created_at": "2026-03-11T05:14:07.476561Z",
  "branch": "optimize-db-indices",
  "checkpoints_count": 2,
  "files_touched": [
    "internal/repository/config.go",
    "internal/repository/dbConnection.go",
    "internal/repository/jobFind.go",
    "internal/repository/jobQuery.go",
    "internal/repository/stats.go"
  ],
  "agent": "Claude Code",
  "turn_id": "443ef781634b",
  "token_usage": {
    "input_tokens": 5108,
    "cache_creation_tokens": 68335,
    "cache_read_tokens": 1908216,
    "output_tokens": 13846,
    "api_call_count": 34
  },
  "initial_attribution": {
    "calculated_at": "2026-03-11T05:14:07.210716Z",
    "agent_lines": 37,
    "human_added": 98,
    "human_modified": 0,
    "human_removed": 0,
    "total_committed": 135,
    "agent_percentage": 27.40740740740741
  }
}
136
e3/68e6d8abf3/1/prompt.txt
Normal file
@@ -0,0 +1,136 @@
Implement the following plan:

# Fix SQLite Memory Not Released After Query Timeout

## Context

On the production 20M-row database, when a query runs into a timeout (due to a full-table scan caused by a wrong index), the memory allocated by SQLite is **not released afterwards**. The process stays bloated until restarted. This is caused by several compounding issues in the current SQLite configuration.

## Root Cause Analysis

### 1. `_cache_size=1000000000` is effectively unlimited (~4TB)

**File:** `internal/repository/dbConnection.go:82`

```go
connectionURLParams.Add("_cache_size", "1000000000")
```

SQLite's `cache_size` PRAGMA interprets **positive values as a page count** (default page size = 4KB). So 1,000,000,000 pages × 4KB ≈ 4TB. In practice, this means "never evict cached pages." After a full-table scan of 20M rows, every page touched stays in SQLite's page cache. With 4 connections (`MaxOpenConns=4`), each can independently cache gigabytes.

For comparison, the SQLite archive backend in `pkg/archive/sqliteBackend.go` uses `PRAGMA cache_size=-64000` (64MB; negative values count KiB).
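The sign convention above can be sanity-checked with a little arithmetic. This is a standalone sketch; `cacheBytes` is a hypothetical helper for illustration, not part of the codebase:

```go
package main

import "fmt"

// cacheBytes converts a SQLite cache_size PRAGMA value to bytes:
// positive values count pages, negative values count KiB.
func cacheBytes(cacheSize, pageSize int64) int64 {
	if cacheSize >= 0 {
		return cacheSize * pageSize
	}
	return -cacheSize * 1024
}

func main() {
	// Current setting: 1,000,000,000 pages at the 4 KiB default page size.
	fmt.Println(cacheBytes(1000000000, 4096)) // 4096000000000 (~4 TB)
	// Archive backend's setting: -64000, i.e. 64,000 KiB (~64 MB).
	fmt.Println(cacheBytes(-64000, 4096)) // 65536000
}
```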
### 2. No query context or timeout: queries run indefinitely

**File:** `internal/repository/jobQuery.go:87`

```go
rows, err := query.RunWith(r.stmtCache).Query() // No context!
```

The `ctx` parameter is available but never passed to the database layer. Squirrel supports `.QueryContext(ctx)`, but it is not used. If the HTTP request times out or the client disconnects, the query keeps running and keeps scanning pages into the cache.

### 3. No SQLite memory limit: neither `soft_heap_limit` nor `shrink_memory` is set

SQLite has built-in memory-management PRAGMAs that are not configured:

- **`soft_heap_limit`**: asks SQLite to keep heap usage below N bytes (best-effort; releases cache pages to stay under the limit)
- **`hard_heap_limit`**: a hard cap; queries fail with SQLITE_NOMEM if it is exceeded
- **`shrink_memory`**: immediately releases all unused memory back to the OS

None of these are set, so SQLite allocates freely and never releases.

### 4. `temp_store = memory` amplifies the problem

**File:** `internal/repository/dbConnection.go:41`

Temporary B-tree sorts (exactly what happens during an ORDER BY on a full-table scan) are stored in RAM. With 20M rows and no sort optimization, this can add gigabytes of temporary memory on top of the page cache.

### 5. Connections live for 1 hour after use

`ConnMaxIdleTime = 1 hour` means a connection that just did a massive full-table scan sits idle in the pool for up to an hour, holding all its cached pages.

## Proposed Changes

### Fix 1: Set a reasonable `cache_size` (high impact, low risk)

**File:** `internal/repository/dbConnection.go:82`

Change from `1000000000` (1B pages ≈ 4TB) to `-200000` (200MB in KiB notation, per connection):

```go
connectionURLParams.Add("_cache_size", "-200000") // 200MB per connection
```

With 4 max connections: 4 × 200MB = 800MB of page cache at most. This is generous enough for normal queries but prevents runaway memory after full-table scans.
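As a self-contained sketch of how this parameter ends up in the DSN (the `buildDSN` helper and the database file name are invented for illustration; only the `_cache_size` parameter comes from the plan):

```go
package main

import (
	"fmt"
	"net/url"
)

// buildDSN appends the SQLite connection parameters to the database path,
// mirroring the connectionURLParams pattern in dbConnection.go.
func buildDSN(dbFile string) string {
	connectionURLParams := url.Values{}
	// Negative value = KiB: -200000 KiB ≈ 200 MB page cache per connection,
	// instead of the old 1,000,000,000 pages (~4 TB).
	connectionURLParams.Add("_cache_size", "-200000")
	return dbFile + "?" + connectionURLParams.Encode()
}

func main() {
	fmt.Println(buildDSN("cc-backend.db")) // cc-backend.db?_cache_size=-200000
}
```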
### Fix 2: Add `soft_heap_limit` (high impact, low risk)

**File:** `internal/repository/dbConnection.go`, in `setupSqlite()`:

```go
"soft_heap_limit = 1073741824", // 1GB soft limit across all connections
```

This is a **process-wide** limit (not per-connection). SQLite will try to release cache pages to stay under 1GB total. It is a soft limit: it won't abort queries, it just evicts cache more aggressively.
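A minimal sketch of how such an entry becomes an executable statement, assuming `setupSqlite()` holds its pragmas in a string slice as the snippet above suggests (the `pragmaStmt` helper is hypothetical):

```go
package main

import "fmt"

// pragmaStmt wraps a pragma entry into a statement that can be
// executed once per new connection; soft_heap_limit is process-wide,
// so re-applying it on every connection is redundant but harmless.
func pragmaStmt(p string) string {
	return fmt.Sprintf("PRAGMA %s;", p)
}

func main() {
	pragmas := []string{
		"soft_heap_limit = 1073741824", // 1 GiB = 1 << 30 bytes, best-effort
	}
	for _, p := range pragmas {
		fmt.Println(pragmaStmt(p))
	}
}
```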
### Fix 3: Pass context to database queries (medium impact, medium effort)

Change `.Query()` to `.QueryContext(ctx)`, `.QueryRow()` to `.QueryRowContext(ctx)`, and `.Scan()` to `.ScanContext(ctx)` in all query methods that already receive a `ctx` parameter. This allows HTTP request cancellation to stop the SQLite query.

**Note:** The `stmtCache` from squirrel supports `QueryContext`/`QueryRowContext`. For the methods that already take `ctx`, no signature changes are needed; only the two `job.go` entries below additionally need `ctx` added to their signatures.

**Call sites to update:**

| File | Method | Line | Change |
|------|--------|------|--------|
| `jobQuery.go` | `QueryJobs` | 87 | `.Query()` → `.QueryContext(ctx)` |
| `jobQuery.go` | `CountJobs` | 129 | `.Scan()` → `.ScanContext(ctx)` |
| `stats.go` | `JobsStatsGrouped` | 233 | `.Query()` → `.QueryContext(ctx)` |
| `stats.go` | `JobsStats` | 358 | `.QueryRow()` → `.QueryRowContext(ctx)` |
| `stats.go` | `JobCountGrouped` | 443 | `.Query()` → `.QueryContext(ctx)` |
| `stats.go` | `AddJobCountGrouped` | 504 | `.Query()` → `.QueryContext(ctx)` |
| `stats.go` | `AddJobCount` | 569 | `.Scan()` → `.ScanContext(ctx)` |
| `stats.go` | `jobsStatisticsHistogram` | 758 | `.Query()` → `.QueryContext(ctx)` |
| `stats.go` | `jobsDurationStatisticsHistogram` | 832 | `.Query()` → `.QueryContext(ctx)` |
| `stats.go` | `jobsMetricStatisticsHistogram` | 962 | `.Query()` → `.QueryContext(ctx)` |
| `jobFind.go` | `FindByID` | 174 | `.QueryRow()` → `.QueryRowContext(ctx)` |
| `jobFind.go` | `FindByJobID` | 220 | `.QueryRow()` → `.QueryRowContext(ctx)` |
| `job.go` | `CountGroupedJobs` | 410 | `.Scan()` → `.ScanContext(ctx)` (needs `ctx` added to signature) |
| `job.go` | `GetJobList` | 751 | `.Query()` → `.QueryContext(ctx)` (needs `ctx` added to signature) |

Methods without `ctx` in their signature (e.g., `GetJobList`, `CountGroupedJobs`) can either have `ctx` added now or be left for a follow-up. The priority is the methods that already accept `ctx`.
### Fix 4: Reduce `ConnMaxIdleTime`

**File:** `internal/repository/config.go`

Reduce `ConnectionMaxIdleTime` from 1 hour to 10 minutes, so idle connections holding cached pages are released sooner.

## Scope: All Four Fixes

All fixes are complementary, independent, and low-risk. There is no disadvantage to applying them all together.

## Files to Modify

- `internal/repository/dbConnection.go`: cache_size, soft_heap_limit (Fixes 1 & 2)
- `internal/repository/jobQuery.go`: QueryContext (Fix 3)
- `internal/repository/job.go`: QueryContext for other query methods (Fix 3)
- `internal/repository/stats.go`: QueryContext for stats queries (Fix 3)
- `internal/repository/config.go`: ConnMaxIdleTime (Fix 4, optional)

## Verification

1. `go build ./...` compiles
2. `go test ./internal/repository/...` passes
3. Monitor RSS memory before/after on production:
   - Trigger a heavy query and observe the memory spike
   - Verify memory drops back down within seconds (not hours)
4. `PRAGMA cache_size;` issued on one of the application's connections should show `-200000` (the setting is per-connection, so a separate sqlite3 CLI session will show its own default)
5. `PRAGMA soft_heap_limit;` inside the application process should show `1073741824` (likewise not visible from a separate process)

If you need specific details from before exiting plan mode (like exact code snippets, error messages, or content you generated), read the full transcript at: /Users/jan/.claude/projects/-Users-jan-prg-CC-cc-backend/c31c699a-f492-48f7-bcf0-35d3ceeac243.jsonl

---

Our server has 512GB of main memory. Does it make sense to make `cache_size` and `soft_heap_limit` configurable so we can make use of that capacity?
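One way that question could be explored: expose both knobs in the JSON config with bounded defaults, so a large host can raise them without a rebuild. This is a sketch only; the `DBTuning` struct, the JSON keys, and the example values are invented and are not existing cc-backend config options:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// DBTuning is a hypothetical config section for the two SQLite knobs.
type DBTuning struct {
	// Per-connection page cache; negative values count KiB (SQLite convention).
	CacheSize int64 `json:"cache-size"`
	// Process-wide best-effort heap limit in bytes; 0 disables it.
	SoftHeapLimit int64 `json:"soft-heap-limit"`
}

// parseTuning overlays user config on top of the plan's bounded defaults.
func parseTuning(raw []byte) (DBTuning, error) {
	cfg := DBTuning{CacheSize: -200000, SoftHeapLimit: 1 << 30}
	err := json.Unmarshal(raw, &cfg)
	return cfg, err
}

func main() {
	// On a 512 GB host one might allow ~4 GB of cache per connection
	// and a 16 GiB process-wide soft limit.
	cfg, err := parseTuning([]byte(`{"cache-size": -4000000, "soft-heap-limit": 17179869184}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.CacheSize, cfg.SoftHeapLimit)
}
```

Whether large values actually help depends on the workload; the point of making them configurable is that the limits stay explicit and bounded rather than effectively infinite.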
@@ -3,7 +3,7 @@
   "checkpoint_id": "e368e6d8abf3",
   "strategy": "manual-commit",
   "branch": "optimize-db-indices",
-  "checkpoints_count": 1,
+  "checkpoints_count": 3,
   "files_touched": [
     "cmd/cc-backend/main.go",
     "internal/config/config.go",
@@ -21,13 +21,20 @@
       "context": "/e3/68e6d8abf3/0/context.md",
       "content_hash": "/e3/68e6d8abf3/0/content_hash.txt",
       "prompt": "/e3/68e6d8abf3/0/prompt.txt"
+    },
+    {
+      "metadata": "/e3/68e6d8abf3/1/metadata.json",
+      "transcript": "/e3/68e6d8abf3/1/full.jsonl",
+      "context": "/e3/68e6d8abf3/1/context.md",
+      "content_hash": "/e3/68e6d8abf3/1/content_hash.txt",
+      "prompt": "/e3/68e6d8abf3/1/prompt.txt"
     }
   ],
   "token_usage": {
-    "input_tokens": 15,
-    "cache_creation_tokens": 27408,
-    "cache_read_tokens": 402383,
-    "output_tokens": 3673,
-    "api_call_count": 13
+    "input_tokens": 5123,
+    "cache_creation_tokens": 95743,
+    "cache_read_tokens": 2310599,
+    "output_tokens": 17519,
+    "api_call_count": 47
   }
 }