Implement the following plan:

# Optimize Job Table Indexes for 20M Row Production Database

## Context

The `job` table has **79 indexes** (created in migrations 08/09), causing:

1. **Wrong index selection**: without `ANALYZE` statistics, SQLite picks wrong indexes (e.g., `jobs_jobstate_energy` instead of `jobs_starttime` for ORDER BY queries), causing full-table temp B-tree sorts on 20M rows → timeouts
2. **Excessive disk/memory overhead**: each index costs ~200-400MB at 20M rows; 79 indexes = ~16-32GB wasted
3. **Slower writes**: every INSERT/UPDATE touches all 79 indexes
4. **Planner confusion**: too many similar indexes make the query planner's cost estimation unreliable

The `ANALYZE` fix (already added to `setupSqlite` in `dbConnection.go`) resolves the planner issue with current indexes, but the index count must be reduced for disk/write performance.

## Approach: Reduce to 20 indexes

The key insight from query plan analysis: with `ANALYZE` and `LIMIT`, a `(filter_col, sort_col)` index is often better than `(filter_col1, filter_col2, sort_col)` because SQLite can scan the index in sort order and cheaply filter non-matching rows, stopping at LIMIT.
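The scan-in-sort-order behaviour described above can be reproduced on a toy database. The following is a minimal sketch using Python's standard `sqlite3` module, not project code: the reduced `job` schema, the sample data, and the `explain` helper are assumptions made for illustration. Only the index name `jobs_starttime` and the query shape come from the plan.

```python
import sqlite3


def explain(conn, sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN for the given SQL."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")

# Hypothetical, heavily reduced stand-in for the real job table.
conn.execute("""
    CREATE TABLE job (
        id         INTEGER PRIMARY KEY,
        cluster    TEXT    NOT NULL,
        job_state  TEXT    NOT NULL,
        start_time INTEGER NOT NULL,
        duration   INTEGER NOT NULL
    )
""")
states = ("completed", "running", "failed", "cancelled")
conn.executemany(
    "INSERT INTO job (cluster, job_state, start_time, duration) VALUES (?, ?, ?, ?)",
    [("alpha", states[i % 4], 1_700_000_000 + i, 60 + i % 3600) for i in range(10_000)],
)

# The single global index from GROUP 1, followed by fresh planner statistics.
conn.execute("CREATE INDEX jobs_starttime ON job (start_time)")
conn.execute("ANALYZE")

# Multi-state filter + default sort + LIMIT (pattern #1 in the plan table).
query = """
    SELECT * FROM job
    WHERE job_state IN ('completed', 'running', 'failed')
    ORDER BY start_time DESC
    LIMIT 50
"""
plan = explain(conn, query)
rows = conn.execute(query).fetchall()

for line in plan:
    print(line)
```

If the plan mentions `jobs_starttime` and contains no `USE TEMP B-TREE FOR ORDER BY` step, SQLite is walking the index in sort order and filtering `job_state` per row, which is the behaviour the reduced index set relies on. The exact wording of the plan lines varies between SQLite versions.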
### Verified query plans (with ANALYZE, after this change)

| # | Pattern | Index Used | Plan |
|---|---------|-----------|------|
| 1 | Multi-state IN + ORDER BY start_time LIMIT | `jobs_starttime` | SCAN (index order, no sort) |
| 2 | cluster + state + sort start_time | `jobs_cluster_starttime_duration` | SEARCH |
| 3 | hpc_user + sort start_time | `jobs_user_starttime_duration` | SEARCH |
| 4 | cluster + state aggregation | `jobs_cluster_jobstate_duration_starttime` | COVERING SEARCH |
| 5 | Unique lookup (job_id,cluster,start_time) | `sqlite_autoindex_job_1` | SEARCH |
| 6 | Running jobs for cluster + duration > | `jobs_cluster_jobstate_duration_starttime` | SEARCH |
| 7 | start_time BETWEEN range | `jobs_starttime` | SEARCH |
| 8 | GROUP BY user with cluster | `jobs_cluster_user` | COVERING SEARCH |
| 9 | Concurrent jobs (cluster + start_time <) | `jobs_cluster_starttime_duration` | SEARCH |
| 10 | project IN + state IN + sort | `jobs_jobstate_project` | SEARCH + temp sort |
| 11 | user + multi-state + sort start_time | `jobs_user_starttime_duration` | SEARCH |
| 12 | cluster + state + sort duration | `jobs_cluster_jobstate_duration_starttime` | SEARCH |
| 13 | cluster + state + sort num_nodes | `jobs_cluster_numnodes` | SEARCH (state filtered per-row) |
| 14 | Tag join | `tags_tagid` + PK | SEARCH |
| 15 | Delete before timestamp | `jobs_starttime` | COVERING SEARCH |
| 16 | Non-running jobs (GetJobList) | `jobs_jobstate_duration_starttime` | COVERING SCAN |

## Changes Required

### File: `internal/repository/migrations/sqlite3/11_optimize-indexes.up.sql` (new)

```sql
-- Drop all 77 job indexes from migration 09 (sqlite_autoindex_job_1 is UNIQUE, kept)
-- Then create optimized set of 20

-- GROUP 1: Global (1 index)
-- #1  jobs_starttime (start_time)
--     Default sort for unfiltered/multi-state queries, time range, delete-before

-- GROUP 2: Cluster-prefixed (8 indexes)
-- #2  jobs_cluster_starttime_duration (cluster, start_time, duration)
--     Cluster + default sort, concurrent jobs, time range within cluster
-- #3  jobs_cluster_duration_starttime (cluster, duration, start_time)
--     Cluster + sort by duration
-- #4  jobs_cluster_jobstate_duration_starttime (cluster, job_state, duration, start_time)
--     COVERING for cluster+state aggregation; running jobs (cluster, state, duration>?)
-- #5  jobs_cluster_jobstate_starttime_duration (cluster, job_state, start_time, duration)
--     Cluster+state+sort start_time (single state equality)
-- #6  jobs_cluster_user (cluster, hpc_user)
--     COVERING for GROUP BY user with cluster filter
-- #7  jobs_cluster_project (cluster, project)
--     GROUP BY project with cluster filter
-- #8  jobs_cluster_subcluster (cluster, subcluster)
--     GROUP BY subcluster with cluster filter
-- #9  jobs_cluster_numnodes (cluster, num_nodes)
--     Cluster + sort by num_nodes (state filtered per-row, fast with LIMIT)

-- GROUP 3: User-prefixed (1 index)
-- #10 jobs_user_starttime_duration (hpc_user, start_time, duration)
--     Security filter (user role) + default sort

-- GROUP 4: Project-prefixed (1 index)
-- #11 jobs_project_starttime_duration (project, start_time, duration)
--     Security filter (manager role) + default sort

-- GROUP 5: JobState-prefixed (3 indexes)
-- #12 jobs_jobstate_project (job_state, project)
--     State + project filter (for manager security within state query)
-- #13 jobs_jobstate_user (job_state, hpc_user)
--     State + user filter/aggregation
-- #14 jobs_jobstate_duration_starttime (job_state, duration, start_time)
--     COVERING for non-running jobs scan, state + sort duration

-- GROUP 6: Rare filters (1 index)
-- #15 jobs_arrayjobid (array_job_id)
--     Array job lookup (rare but must be indexed)

-- GROUP 7: Secondary sort columns (5 indexes)
-- #16 jobs_cluster_numhwthreads (cluster, num_hwthreads)
-- #17 jobs_cluster_numacc (cluster, num_acc)
-- #18 jobs_cluster_energy (cluster, energy)
-- #19 jobs_cluster_partition_starttime (cluster, cluster_partition, start_time)
--     Cluster+partition + sort start_time
-- #20 jobs_cluster_partition_jobstate (cluster, cluster_partition, job_state)
--     Cluster+partition+state filter
```

### What's dropped and why (59 indexes removed)

| Category | Count | Why redundant |
|----------|-------|---------------|
| cluster+partition sort/filter variants | 8 | Kept only 2 partition indexes (#19, #20); rest use cluster indexes + row filter |
| cluster+shared (all) | 8 | `shared` is rare; cluster index + row filter is fast |
| shared-prefixed (all) | 8 | `shared` alone is never a leading filter |
| cluster+jobstate sort variants (numnodes, hwthreads, acc, energy) | 4 | Replaced by `(cluster, sort_col)` indexes which work for any state combo with LIMIT |
| user sort variants (numnodes, hwthreads, acc, energy, duration) | 5 | User result sets are small; temp sort is fast |
| project sort variants + project_user | 6 | Same reasoning as user |
| jobstate sort variants (numnodes, hwthreads, acc, energy) | 4 | State has low cardinality; cluster+sort indexes handle these |
| single-filter+starttime (5) + single-filter+duration (5) | 10 | Queries always have cluster/user/project filter; standalone rare |
| standalone duration | 1 | Covered by cluster_duration_starttime |
| duplicate arrayjob variants | 1 | Simplified to single-column (array_job_id) |
| redundant cluster_starttime variants | 2 | Consolidated into 2 cluster+time indexes |
| cluster_jobstate_user, cluster_jobstate_project | 2 | Covered by cluster_user/cluster_project + state row filter |

### File: `internal/repository/migrations/sqlite3/11_optimize-indexes.down.sql` (new)

Recreate all 77 indexes from migration 09 for safe rollback.

### File: `internal/repository/migration.go`

Increment `Version` from `10` to `11`.

## Verification

1. `go build ./...`: compiles
2. `go test ./internal/repository/...`: tests pass
3. `cc-backend -migrate-db` on a test copy of production DB
4. After migration, run `ANALYZE;` then verify all 16 query plans match the table above using:

   ```sql
   EXPLAIN QUERY PLAN SELECT * FROM job WHERE job.job_state IN ('completed','running','failed') ORDER BY job.start_time DESC LIMIT 50;
   -- Should show: SCAN job USING INDEX jobs_starttime
   ```

5. Verify index count: `SELECT COUNT(*) FROM sqlite_master WHERE type='index' AND tbl_name='job';` → should be 21 (20 + autoindex)
6. Compare DB file size before/after (expect ~70% reduction in index overhead)

If you need specific details from before exiting plan mode (like exact code snippets, error messages, or content you generated), read the full transcript at: /Users/jan/.claude/projects/-Users-jan-prg-CC-cc-backend/42401d2e-7d1c-4c0e-abe6-356cb2d48747.jsonl
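Verification steps 4 and 5 could be scripted roughly as follows. This is a hedged sketch using Python's standard `sqlite3` module rather than anything from the repository. The helper names `job_index_names` and `uses_index` and the miniature schema are hypothetical. Against a real test copy, the in-memory connection would be replaced by a connection to that database file.

```python
import sqlite3


def job_index_names(conn):
    """List every index on the job table, including auto-created ones."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'job' ORDER BY name"
    )
    return [name for (name,) in rows]


def uses_index(conn, sql, index_name):
    """True if EXPLAIN QUERY PLAN for sql mentions index_name."""
    details = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]
    return any(index_name in detail for detail in details)


# Hypothetical miniature schema: the UNIQUE constraint is what makes SQLite
# create sqlite_autoindex_job_1, the "+ autoindex" in the expected count.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE job (
        id         INTEGER PRIMARY KEY,
        job_id     INTEGER NOT NULL,
        cluster    TEXT    NOT NULL,
        start_time INTEGER NOT NULL,
        UNIQUE (job_id, cluster, start_time)
    )
""")
conn.execute("CREATE INDEX jobs_starttime ON job (start_time)")
conn.execute("ANALYZE")

names = job_index_names(conn)
print(len(names), names)

# Pattern #5 from the plan table: unique lookup by (job_id, cluster, start_time).
unique_lookup = (
    "SELECT id FROM job "
    "WHERE job_id = 1 AND cluster = 'alpha' AND start_time = 2"
)
print(uses_index(conn, unique_lookup, "sqlite_autoindex_job_1"))
```

On the migrated database the same `job_index_names` call should return 21 names. Each of the 16 patterns can then be checked by pairing a representative query with the index named in the table.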