mirror of
https://github.com/ClusterCockpit/cc-backend
synced 2026-03-24 00:27:29 +01:00
Checkpoint: af7afc9a29ff
Entire-Session: c31c699a-f492-48f7-bcf0-35d3ceeac243
Entire-Strategy: manual-commit
Entire-Agent: Claude Code
Ephemeral-branch: entire/eba3995-e3b0c4
139
af/7afc9a29ff/0/prompt.txt
Normal file
@@ -0,0 +1,139 @@
Implement the following plan:

# Optimize Job Table Indexes for 20M Row Production Database

## Context

The `job` table has **79 indexes** (created in migrations 08/09), causing:

1. **Wrong index selection**: without `ANALYZE` statistics, SQLite picks the wrong indexes (e.g., `jobs_jobstate_energy` instead of `jobs_starttime` for ORDER BY queries), causing full-table temp B-tree sorts on 20M rows → timeouts
2. **Excessive disk/memory overhead**: each index costs ~200-400MB at 20M rows; 79 indexes = ~16-32GB of index data
3. **Slower writes**: every INSERT touches all 79 indexes, and every UPDATE touches each index that contains a changed column
4. **Planner confusion**: too many similar indexes make the query planner's cost estimation unreliable

The `ANALYZE` fix (already added to `setupSqlite` in `dbConnection.go`) resolves the planner issue with current indexes, but the index count must be reduced for disk/write performance.
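To make the role of `ANALYZE` concrete, here is a minimal sketch in Python using the standard `sqlite3` module. It assumes a heavily reduced, hypothetical `job` schema (the real table has many more columns) and made-up row data; only the two index names are taken from the plan. It shows that `ANALYZE` fills the `sqlite_stat1` table, which is where the planner reads its per-index row estimates from.

```python
import sqlite3


def stats_after_analyze(rows=1000):
    """Build a toy `job` table, run ANALYZE, and return the sqlite_stat1 rows."""
    con = sqlite3.connect(":memory:")
    # Hypothetical reduced schema: only the columns the two indexes need.
    con.execute(
        "CREATE TABLE job ("
        " id INTEGER PRIMARY KEY,"
        " job_state TEXT,"
        " start_time INTEGER,"
        " energy REAL)"
    )
    con.execute("CREATE INDEX jobs_starttime ON job (start_time)")
    con.execute("CREATE INDEX jobs_jobstate_energy ON job (job_state, energy)")

    # Made-up data: three states, strictly increasing start times.
    states = ("completed", "running", "failed")
    con.executemany(
        "INSERT INTO job (job_state, start_time, energy) VALUES (?, ?, ?)",
        ((states[i % 3], 1_700_000_000 + i, float(i % 7)) for i in range(rows)),
    )

    # ANALYZE writes one row per index into sqlite_stat1.
    con.execute("ANALYZE")
    result = con.execute(
        "SELECT tbl, idx, stat FROM sqlite_stat1 ORDER BY idx"
    ).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for tbl, idx, stat in stats_after_analyze():
        print(tbl, idx, stat)
```

The first number in each `stat` string is the row count of the index; the following numbers are average rows per distinct prefix. Without these rows the planner falls back to built-in default estimates, which is the situation described in item 1 above.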

## Approach: Reduce to 20 indexes

The key insight from query plan analysis: with `ANALYZE` and `LIMIT`, a `(filter_col, sort_col)` index is often better than `(filter_col1, filter_col2, sort_col)` because SQLite can scan the index in sort order and cheaply filter non-matching rows, stopping at LIMIT.
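The following sketch illustrates that idea on a toy database. The schema, the cluster names `alpha`/`beta`, and the row data are assumptions made for the example; only the index name and its column order come from the plan. The query filters on `cluster` (equality) and `job_state` (IN list) while the index covers only `(cluster, start_time, duration)`, so the state filter has to be applied per row.

```python
import sqlite3

# Filter on cluster + several states, sorted by start_time, limited to 5 rows.
QUERY = """
SELECT id, job_state, start_time
FROM job
WHERE cluster = ? AND job_state IN ('completed', 'failed')
ORDER BY start_time DESC
LIMIT 5
"""


def build_demo_db(rows=5000):
    """Create an in-memory toy `job` table with one (filter_col, sort_col) index."""
    con = sqlite3.connect(":memory:")
    # Hypothetical reduced schema for illustration only.
    con.execute(
        "CREATE TABLE job ("
        " id INTEGER PRIMARY KEY,"
        " cluster TEXT,"
        " job_state TEXT,"
        " start_time INTEGER,"
        " duration INTEGER)"
    )
    con.execute(
        "CREATE INDEX jobs_cluster_starttime_duration"
        " ON job (cluster, start_time, duration)"
    )
    clusters = ("alpha", "beta")  # made-up cluster names
    states = ("completed", "running", "failed")
    con.executemany(
        "INSERT INTO job (cluster, job_state, start_time, duration)"
        " VALUES (?, ?, ?, ?)",
        (
            (clusters[i % 2], states[i % 3], 1_700_000_000 + i, 60 * (i % 11))
            for i in range(rows)
        ),
    )
    con.execute("ANALYZE")
    return con


def explain(con, sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql, params)]


if __name__ == "__main__":
    con = build_demo_db()
    for line in explain(con, QUERY, ("alpha",)):
        print(line)
    print(con.execute(QUERY, ("alpha",)).fetchall())
```

If the planner behaves as described, the printed plan should mention `jobs_cluster_starttime_duration` and contain no `USE TEMP B-TREE FOR ORDER BY` step, but the exact wording depends on the SQLite version, so treat the output as something to inspect rather than a guaranteed result.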

### Verified query plans (with ANALYZE, after this change)

| # | Pattern | Index Used | Plan |
|---|---------|-----------|------|
| 1 | Multi-state IN + ORDER BY start_time LIMIT | `jobs_starttime` | SCAN (index order, no sort) |
| 2 | cluster + state + sort start_time | `jobs_cluster_starttime_duration` | SEARCH |
| 3 | hpc_user + sort start_time | `jobs_user_starttime_duration` | SEARCH |
| 4 | cluster + state aggregation | `jobs_cluster_jobstate_duration_starttime` | COVERING SEARCH |
| 5 | Unique lookup (job_id,cluster,start_time) | `sqlite_autoindex_job_1` | SEARCH |
| 6 | Running jobs for cluster + duration > | `jobs_cluster_jobstate_duration_starttime` | SEARCH |
| 7 | start_time BETWEEN range | `jobs_starttime` | SEARCH |
| 8 | GROUP BY user with cluster | `jobs_cluster_user` | COVERING SEARCH |
| 9 | Concurrent jobs (cluster + start_time <) | `jobs_cluster_starttime_duration` | SEARCH |
| 10 | project IN + state IN + sort | `jobs_jobstate_project` | SEARCH + temp sort |
| 11 | user + multi-state + sort start_time | `jobs_user_starttime_duration` | SEARCH |
| 12 | cluster + state + sort duration | `jobs_cluster_jobstate_duration_starttime` | SEARCH |
| 13 | cluster + state + sort num_nodes | `jobs_cluster_numnodes` | SEARCH (state filtered per-row) |
| 14 | Tag join | `tags_tagid` + PK | SEARCH |
| 15 | Delete before timestamp | `jobs_starttime` | COVERING SEARCH |
| 16 | Non-running jobs (GetJobList) | `jobs_jobstate_duration_starttime` | COVERING SCAN |

## Changes Required

### File: `internal/repository/migrations/sqlite3/11_optimize-indexes.up.sql` (new)

```sql
-- Drop all 77 job indexes from migration 09 (sqlite_autoindex_job_1 is UNIQUE, kept)
-- Then create optimized set of 20

-- GROUP 1: Global (1 index)
-- #1 jobs_starttime (start_time)
--    Default sort for unfiltered/multi-state queries, time range, delete-before

-- GROUP 2: Cluster-prefixed (8 indexes)
-- #2 jobs_cluster_starttime_duration (cluster, start_time, duration)
--    Cluster + default sort, concurrent jobs, time range within cluster
-- #3 jobs_cluster_duration_starttime (cluster, duration, start_time)
--    Cluster + sort by duration
-- #4 jobs_cluster_jobstate_duration_starttime (cluster, job_state, duration, start_time)
--    COVERING for cluster+state aggregation; running jobs (cluster, state, duration>?)
-- #5 jobs_cluster_jobstate_starttime_duration (cluster, job_state, start_time, duration)
--    Cluster+state+sort start_time (single state equality)
-- #6 jobs_cluster_user (cluster, hpc_user)
--    COVERING for GROUP BY user with cluster filter
-- #7 jobs_cluster_project (cluster, project)
--    GROUP BY project with cluster filter
-- #8 jobs_cluster_subcluster (cluster, subcluster)
--    GROUP BY subcluster with cluster filter
-- #9 jobs_cluster_numnodes (cluster, num_nodes)
--    Cluster + sort by num_nodes (state filtered per-row, fast with LIMIT)

-- GROUP 3: User-prefixed (1 index)
-- #10 jobs_user_starttime_duration (hpc_user, start_time, duration)
--     Security filter (user role) + default sort

-- GROUP 4: Project-prefixed (1 index)
-- #11 jobs_project_starttime_duration (project, start_time, duration)
--     Security filter (manager role) + default sort

-- GROUP 5: JobState-prefixed (3 indexes)
-- #12 jobs_jobstate_project (job_state, project)
--     State + project filter (for manager security within state query)
-- #13 jobs_jobstate_user (job_state, hpc_user)
--     State + user filter/aggregation
-- #14 jobs_jobstate_duration_starttime (job_state, duration, start_time)
--     COVERING for non-running jobs scan, state + sort duration

-- GROUP 6: Rare filters (1 index)
-- #15 jobs_arrayjobid (array_job_id)
--     Array job lookup (rare but must be indexed)

-- GROUP 7: Secondary sort columns (5 indexes)
-- #16 jobs_cluster_numhwthreads (cluster, num_hwthreads)
-- #17 jobs_cluster_numacc (cluster, num_acc)
-- #18 jobs_cluster_energy (cluster, energy)
-- #19 jobs_cluster_partition_starttime (cluster, cluster_partition, start_time)
--     Cluster+partition + sort start_time
-- #20 jobs_cluster_partition_jobstate (cluster, cluster_partition, job_state)
--     Cluster+partition+state filter
```
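The outline above names the indexes and their columns but does not spell out the `CREATE INDEX` statements. As a rough sketch, the list can be written down as data and turned into statements, then applied to a scratch database to check that all 20 definitions are syntactically valid. The index names and column lists are copied from the outline; the `job` table schema below is a hypothetical reduction that only contains the referenced columns, and the use of `IF NOT EXISTS` is an assumption about how the migration might be written.

```python
import sqlite3

# Index name -> column tuple, copied from the outline above.
INDEXES = {
    "jobs_starttime": ("start_time",),
    "jobs_cluster_starttime_duration": ("cluster", "start_time", "duration"),
    "jobs_cluster_duration_starttime": ("cluster", "duration", "start_time"),
    "jobs_cluster_jobstate_duration_starttime": ("cluster", "job_state", "duration", "start_time"),
    "jobs_cluster_jobstate_starttime_duration": ("cluster", "job_state", "start_time", "duration"),
    "jobs_cluster_user": ("cluster", "hpc_user"),
    "jobs_cluster_project": ("cluster", "project"),
    "jobs_cluster_subcluster": ("cluster", "subcluster"),
    "jobs_cluster_numnodes": ("cluster", "num_nodes"),
    "jobs_user_starttime_duration": ("hpc_user", "start_time", "duration"),
    "jobs_project_starttime_duration": ("project", "start_time", "duration"),
    "jobs_jobstate_project": ("job_state", "project"),
    "jobs_jobstate_user": ("job_state", "hpc_user"),
    "jobs_jobstate_duration_starttime": ("job_state", "duration", "start_time"),
    "jobs_arrayjobid": ("array_job_id",),
    "jobs_cluster_numhwthreads": ("cluster", "num_hwthreads"),
    "jobs_cluster_numacc": ("cluster", "num_acc"),
    "jobs_cluster_energy": ("cluster", "energy"),
    "jobs_cluster_partition_starttime": ("cluster", "cluster_partition", "start_time"),
    "jobs_cluster_partition_jobstate": ("cluster", "cluster_partition", "job_state"),
}

# Hypothetical reduced schema: only columns referenced by the indexes.
SCRATCH_SCHEMA = """
CREATE TABLE job (
    id INTEGER PRIMARY KEY,
    job_id INTEGER, cluster TEXT, subcluster TEXT, cluster_partition TEXT,
    hpc_user TEXT, project TEXT, job_state TEXT,
    start_time INTEGER, duration INTEGER,
    num_nodes INTEGER, num_hwthreads INTEGER, num_acc INTEGER,
    energy REAL, array_job_id INTEGER,
    UNIQUE (job_id, cluster, start_time)
)
"""


def create_index_sql(name, cols):
    """Render one CREATE INDEX statement for the job table."""
    return f"CREATE INDEX IF NOT EXISTS {name} ON job ({', '.join(cols)});"


def apply_to_scratch_db():
    """Create all indexes on an in-memory table and return the index names."""
    con = sqlite3.connect(":memory:")
    con.execute(SCRATCH_SCHEMA)
    for name, cols in INDEXES.items():
        con.execute(create_index_sql(name, cols))
    names = [
        row[0]
        for row in con.execute(
            "SELECT name FROM sqlite_master"
            " WHERE type='index' AND tbl_name='job' ORDER BY name"
        )
    ]
    con.close()
    return names


if __name__ == "__main__":
    for name, cols in INDEXES.items():
        print(create_index_sql(name, cols))
    print(len(apply_to_scratch_db()), "indexes on job (including the autoindex)")
```

Keeping the definitions in one table like this also makes it easy to diff the kept set against the indexes created in migration 09 when writing the `DROP INDEX` half of the migration, which this sketch does not attempt.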

### What's dropped and why (59 indexes removed)

| Category | Count | Why redundant |
|----------|-------|---------------|
| cluster+partition sort/filter variants | 8 | Kept only 2 partition indexes (#19, #20); rest use cluster indexes + row filter |
| cluster+shared (all) | 8 | `shared` is rare; cluster index + row filter is fast |
| shared-prefixed (all) | 8 | `shared` alone is never a leading filter |
| cluster+jobstate sort variants (numnodes, hwthreads, acc, energy) | 4 | Replaced by `(cluster, sort_col)` indexes which work for any state combo with LIMIT |
| user sort variants (numnodes, hwthreads, acc, energy, duration) | 5 | User result sets are small; temp sort is fast |
| project sort variants + project_user | 6 | Same reasoning as user |
| jobstate sort variants (numnodes, hwthreads, acc, energy) | 4 | State has low cardinality; cluster+sort indexes handle these |
| single-filter+starttime (5) + single-filter+duration (5) | 10 | Queries always have cluster/user/project filter; standalone rare |
| standalone duration | 1 | Covered by cluster_duration_starttime |
| duplicate arrayjob variants | 1 | Simplified to single-column (array_job_id) |
| redundant cluster_starttime variants | 2 | Consolidated into 2 cluster+time indexes |
| cluster_jobstate_user, cluster_jobstate_project | 2 | Covered by cluster_user/cluster_project + state row filter |

### File: `internal/repository/migrations/sqlite3/11_optimize-indexes.down.sql` (new)

Recreate all 77 indexes from migration 09 for safe rollback.

### File: `internal/repository/migration.go`

Increment `Version` from `10` to `11`.

## Verification

1. `go build ./...`: compiles
2. `go test ./internal/repository/...`: tests pass
3. `cc-backend -migrate-db` on a test copy of production DB
4. After migration, run `ANALYZE;` then verify all 16 query plans match the table above using:
```sql
EXPLAIN QUERY PLAN SELECT * FROM job WHERE job.job_state IN ('completed','running','failed') ORDER BY job.start_time DESC LIMIT 50;
-- Should show: SCAN job USING INDEX jobs_starttime
```
5. Verify index count: `SELECT COUNT(*) FROM sqlite_master WHERE type='index' AND tbl_name='job';` → should be 21 (20 + autoindex)
6. Compare DB file size before/after (expect ~70% reduction in index overhead)

If you need specific details from before exiting plan mode (like exact code snippets, error messages, or content you generated), read the full transcript at: /Users/jan/.claude/projects/-Users-jan-prg-CC-cc-backend/42401d2e-7d1c-4c0e-abe6-356cb2d48747.jsonl