Checkpoint: af7afc9a29ff

Entire-Session: c31c699a-f492-48f7-bcf0-35d3ceeac243
Entire-Strategy: manual-commit
Entire-Agent: Claude Code
Ephemeral-branch: entire/eba3995-e3b0c4
2026-05-17 10:27:30 +02:00 · 2026-03-11 05:46:04 +01:00
parent 41a089efcf
commit a7a96333ad
6 changed files with 256 additions and 0 deletions
--- a/af/7afc9a29ff/0/prompt.txt
+++ b/af/7afc9a29ff/0/prompt.txt
@@ -0,0 +1,139 @@
+Implement the following plan:
+
+# Optimize Job Table Indexes for 20M Row Production Database
+
+## Context
+
+The `job` table has **79 indexes** (created in migrations 08/09), causing:
+1. **Wrong index selection**: without `ANALYZE` statistics, SQLite picks the wrong index (e.g., `jobs_jobstate_energy` instead of `jobs_starttime` for ORDER BY queries), forcing temp B-tree sorts over all 20M rows → timeouts
+2. **Excessive disk/memory overhead**: each index costs ~200-400MB at 20M rows; 79 indexes = ~16-32GB of index storage, most of it avoidable
+3. **Slower writes**: every INSERT touches all 79 indexes, and every UPDATE touches each index that contains a changed column
+4. **Planner confusion**: too many similar indexes make the query planner's cost estimation unreliable
+
+The `ANALYZE` fix (already added to `setupSqlite` in `dbConnection.go`) resolves the planner issue with current indexes, but the index count must be reduced for disk/write performance.
+
+## Approach: Reduce to 20 indexes
+
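+The key insight from query plan analysis: with `ANALYZE` statistics and a `LIMIT`, a `(filter_col, sort_col)` index is often better than `(filter_col1, filter_col2, sort_col)`. SQLite can walk the two-column index in sort order, discard rows that fail the remaining filters as it goes, and stop once `LIMIT` rows are found. The three-column index keeps rows in sort order only when both filters are equality constraints; a multi-value `IN` on `filter_col2` generally forces a temp sort.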
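+
+A minimal sketch of the two index shapes being compared. The index names `idx_two_col` and `idx_three_col` and the literal values are hypothetical and only for illustration; the columns are those of the `job` table:
+
+```sql
+-- Two-column shape: equality on cluster, rows are returned in start_time order.
+CREATE INDEX idx_two_col ON job (cluster, start_time);
+
+-- Three-column shape: start_time order holds only within a single job_state value.
+CREATE INDEX idx_three_col ON job (cluster, job_state, start_time);
+
+-- With idx_two_col, job_state can be checked row by row while walking the index,
+-- and the scan can stop after 50 matching rows.
+EXPLAIN QUERY PLAN
+SELECT * FROM job
+WHERE cluster = 'example' AND job_state IN ('completed', 'failed')
+ORDER BY start_time DESC
+LIMIT 50;
+```
+
+Which index the planner actually picks depends on the `ANALYZE` statistics, so the output is worth comparing with each index present on its own.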
+
+### Verified query plans (with ANALYZE, after this change)
+
+| # | Pattern | Index Used | Plan |
+|---|---------|-----------|------|
+| 1 | Multi-state IN + ORDER BY start_time LIMIT | `jobs_starttime` | SCAN (index order, no sort) |
+| 2 | cluster + state + sort start_time | `jobs_cluster_starttime_duration` | SEARCH |
+| 3 | hpc_user + sort start_time | `jobs_user_starttime_duration` | SEARCH |
+| 4 | cluster + state aggregation | `jobs_cluster_jobstate_duration_starttime` | COVERING SEARCH |
+| 5 | Unique lookup (job_id,cluster,start_time) | `sqlite_autoindex_job_1` | SEARCH |
+| 6 | Running jobs for cluster + duration > | `jobs_cluster_jobstate_duration_starttime` | SEARCH |
+| 7 | start_time BETWEEN range | `jobs_starttime` | SEARCH |
+| 8 | GROUP BY user with cluster | `jobs_cluster_user` | COVERING SEARCH |
+| 9 | Concurrent jobs (cluster + start_time <) | `jobs_cluster_starttime_duration` | SEARCH |
+| 10 | project IN + state IN + sort | `jobs_jobstate_project` | SEARCH + temp sort |
+| 11 | user + multi-state + sort start_time | `jobs_user_starttime_duration` | SEARCH |
+| 12 | cluster + state + sort duration | `jobs_cluster_jobstate_duration_starttime` | SEARCH |
+| 13 | cluster + state + sort num_nodes | `jobs_cluster_numnodes` | SEARCH (state filtered per-row) |
+| 14 | Tag join | `tags_tagid` + PK | SEARCH |
+| 15 | Delete before timestamp | `jobs_starttime` | COVERING SEARCH |
+| 16 | Non-running jobs (GetJobList) | `jobs_jobstate_duration_starttime` | COVERING SCAN |
+
+## Changes Required
+
+### File: `internal/repository/migrations/sqlite3/11_optimize-indexes.up.sql` (new)
+
+```sql
+-- Drop all 77 job indexes from migration 09 (sqlite_autoindex_job_1 is UNIQUE, kept)
+-- Then create optimized set of 20
+
+-- GROUP 1: Global (1 index)
+-- #1 jobs_starttime (start_time)
+--    Default sort for unfiltered/multi-state queries, time range, delete-before
+
+-- GROUP 2: Cluster-prefixed (8 indexes)
+-- #2 jobs_cluster_starttime_duration (cluster, start_time, duration)
+--    Cluster + default sort, concurrent jobs, time range within cluster
+-- #3 jobs_cluster_duration_starttime (cluster, duration, start_time)
+--    Cluster + sort by duration
+-- #4 jobs_cluster_jobstate_duration_starttime (cluster, job_state, duration, start_time)
+--    COVERING for cluster+state aggregation; running jobs (cluster, state, duration>?)
+-- #5 jobs_cluster_jobstate_starttime_duration (cluster, job_state, start_time, duration)
+--    Cluster+state+sort start_time (single state equality)
+-- #6 jobs_cluster_user (cluster, hpc_user)
+--    COVERING for GROUP BY user with cluster filter
+-- #7 jobs_cluster_project (cluster, project)
+--    GROUP BY project with cluster filter
+-- #8 jobs_cluster_subcluster (cluster, subcluster)
+--    GROUP BY subcluster with cluster filter
+-- #9 jobs_cluster_numnodes (cluster, num_nodes)
+--    Cluster + sort by num_nodes (state filtered per-row, fast with LIMIT)
+
+-- GROUP 3: User-prefixed (1 index)
+-- #10 jobs_user_starttime_duration (hpc_user, start_time, duration)
+--     Security filter (user role) + default sort
+
+-- GROUP 4: Project-prefixed (1 index)
+-- #11 jobs_project_starttime_duration (project, start_time, duration)
+--     Security filter (manager role) + default sort
+
+-- GROUP 5: JobState-prefixed (3 indexes)
+-- #12 jobs_jobstate_project (job_state, project)
+--     State + project filter (for manager security within state query)
+-- #13 jobs_jobstate_user (job_state, hpc_user)
+--     State + user filter/aggregation
+-- #14 jobs_jobstate_duration_starttime (job_state, duration, start_time)
+--     COVERING for non-running jobs scan, state + sort duration
+
+-- GROUP 6: Rare filters (1 index)
+-- #15 jobs_arrayjobid (array_job_id)
+--     Array job lookup (rare but must be indexed)
+
+-- GROUP 7: Secondary sort columns (5 indexes)
+-- #16 jobs_cluster_numhwthreads (cluster, num_hwthreads)
+-- #17 jobs_cluster_numacc (cluster, num_acc)
+-- #18 jobs_cluster_energy (cluster, energy)
+-- #19 jobs_cluster_partition_starttime (cluster, cluster_partition, start_time)
+--     Cluster+partition + sort start_time
+-- #20 jobs_cluster_partition_jobstate (cluster, cluster_partition, job_state)
+--     Cluster+partition+state filter
+```
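+
+The comment outline above maps onto plain DDL. A partial sketch, assuming the migration guards each statement with `IF EXISTS` / `IF NOT EXISTS` (an assumption, not something this plan specifies). Only one of the old index names appears in this plan, so only that one is shown in the drop phase:
+
+```sql
+-- Drop phase: one statement per old index (only one name shown here).
+DROP INDEX IF EXISTS jobs_jobstate_energy;
+
+-- Create phase: names and column lists copied from the outline above.
+CREATE INDEX IF NOT EXISTS jobs_starttime ON job (start_time);
+CREATE INDEX IF NOT EXISTS jobs_cluster_starttime_duration ON job (cluster, start_time, duration);
+CREATE INDEX IF NOT EXISTS jobs_user_starttime_duration ON job (hpc_user, start_time, duration);
+CREATE INDEX IF NOT EXISTS jobs_arrayjobid ON job (array_job_id);
+
+-- Refresh planner statistics for the new index set
+-- (optional here if ANALYZE already runs at connection setup).
+ANALYZE;
+```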
+
+### What's dropped and why (59 indexes removed)
+
+| Category | Count | Why redundant |
+|----------|-------|---------------|
+| cluster+partition sort/filter variants | 8 | Kept only 2 partition indexes (#19, #20); rest use cluster indexes + row filter |
+| cluster+shared (all) | 8 | Filtering on `shared` is rare; cluster index + row filter is fast |
+| shared-prefixed (all) | 8 | `shared` alone is never a leading filter |
+| cluster+jobstate sort variants (numnodes, hwthreads, acc, energy) | 4 | Replaced by `(cluster, sort_col)` indexes which work for any state combo with LIMIT |
+| user sort variants (numnodes, hwthreads, acc, energy, duration) | 5 | User result sets are small; temp sort is fast |
+| project sort variants + project_user | 6 | Same reasoning as user |
+| jobstate sort variants (numnodes, hwthreads, acc, energy) | 4 | State has low cardinality; cluster+sort indexes handle these |
+| single-filter+starttime (5) + single-filter+duration (5) | 10 | Queries always have cluster/user/project filter; standalone rare |
+| standalone duration | 1 | Covered by cluster_duration_starttime |
+| duplicate arrayjob variants | 1 | Simplified to single-column (array_job_id) |
+| redundant cluster_starttime variants | 2 | Consolidated into 2 cluster+time indexes |
+| cluster_jobstate_user, cluster_jobstate_project | 2 | Covered by cluster_user/cluster_project + state row filter |
+
+### File: `internal/repository/migrations/sqlite3/11_optimize-indexes.down.sql` (new)
+
+Drop the 20 optimized indexes, then recreate all 77 indexes from migration 09 for safe rollback.
+
+### File: `internal/repository/migration.go`
+
+Increment `Version` from `10` to `11`.
+
+## Verification
+
+1. `go build ./...` — compiles
+2. `go test ./internal/repository/...` — tests pass
+3. `cc-backend -migrate-db` on a test copy of production DB
+4. After migration, run `ANALYZE;` then verify all 16 query plans match the table above using:
+   ```sql
+   EXPLAIN QUERY PLAN SELECT * FROM job WHERE job.job_state IN ('completed','running','failed') ORDER BY job.start_time DESC LIMIT 50;
+   -- Should show: SCAN job USING INDEX jobs_starttime
+   ```
+5. Verify index count: `SELECT COUNT(*) FROM sqlite_master WHERE type='index' AND tbl_name='job';` → should be 21 (20 + autoindex)
+6. Compare DB file size before/after (expect ~70% reduction in index overhead)
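+
+Step 4 shows only pattern #1. Checks for two more rows of the plan table might look like the sketch below. The literal values (`'alice'` and the timestamps, which assume `start_time` holds Unix epoch seconds) are placeholders, and the expected plans in the comments are the ones listed in the table, not fresh measurements:
+
+```sql
+-- Pattern #3: hpc_user + sort start_time
+EXPLAIN QUERY PLAN SELECT * FROM job WHERE job.hpc_user = 'alice' ORDER BY job.start_time DESC LIMIT 50;
+-- Expected per the table: SEARCH using jobs_user_starttime_duration
+
+-- Pattern #7: start_time BETWEEN range
+EXPLAIN QUERY PLAN SELECT * FROM job WHERE job.start_time BETWEEN 1700000000 AND 1700086400;
+-- Expected per the table: SEARCH using jobs_starttime
+
+-- List the remaining index names to compare against the set of 20
+SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'job' ORDER BY name;
+```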
+
+
+If you need specific details from before exiting plan mode (like exact code snippets, error messages, or content you generated), read the full transcript at: /Users/jan/.claude/projects/-Users-jan-prg-CC-cc-backend/42401d2e-7d1c-4c0e-abe6-356cb2d48747.jsonl