feat: add step-by-step migration strategy for multi-dimensional embeddings

Adds alternative migration approach for users on Supabase free tier who encounter timeouts when running the full upgrade_database.sql script. Breaks migration into 4 manageable steps to avoid memory/timeout issues when creating vector indexes on large datasets.
This commit is contained in:
Rasmus Widing
2025-09-17 11:41:06 +03:00
parent b2ec7df666
commit 1a0a850e24
5 changed files with 472 additions and 0 deletions

View File

@@ -0,0 +1,128 @@
# Archon Database Migration Guide
## Problem: Supabase SQL Editor Timeout
The full migration script times out in Supabase SQL editor due to memory-intensive vector index creation.
## Solution: Run Migration in Steps
### Method 1: Use Step-by-Step Scripts (Recommended)
Run these scripts in order in the Supabase SQL editor:
1. **Step 1**: `step1_add_columns.sql` - Adds new columns (fast, ~5 seconds)
2. **Step 2**: `step2_migrate_data.sql` - Migrates existing data (fast, ~10 seconds)
3. **Step 3**: `step3_create_functions.sql` - Creates search functions (fast, ~5 seconds)
4. **Step 4**: `step4_create_indexes_optional.sql` - Creates indexes (may timeout - OPTIONAL)
**Note**: If Step 4 times out, the system will still work using brute-force search. You can create indexes later.
### Method 2: Direct Database Connection
Connect directly to your Supabase database using psql or a database client:
#### Get Connection String
1. Go to Supabase Dashboard → Settings → Database
2. Copy the connection string (use "Session pooler" for migrations)
3. It looks like: `postgresql://postgres.[project-ref]:[password]@aws-0-[region].pooler.supabase.com:6543/postgres`
#### Using psql
```bash
# Connect to database
psql "postgresql://postgres.[project-ref]:[password]@aws-0-[region].pooler.supabase.com:6543/postgres"
# Run the full migration
\i migration/upgrade_database_with_memory_fix.sql
# Or run individual steps
\i migration/step1_add_columns.sql
\i migration/step2_migrate_data.sql
\i migration/step3_create_functions.sql
\i migration/step4_create_indexes_optional.sql
```
#### Using TablePlus/DBeaver/pgAdmin
1. Create new connection with your connection string
2. Open and run each SQL file in order
3. Monitor execution time and memory usage
### Method 3: Use Supabase CLI
```bash
# Install Supabase CLI
npm install -g supabase
# Login and link project
supabase login
supabase link --project-ref [your-project-ref]
# Run migration
supabase db push migration/upgrade_database_with_memory_fix.sql
```
### Method 4: Skip Vector Indexes Entirely
If you have a small dataset (<10,000 documents), you can skip Step 4 entirely. The system will use brute-force search which is fast enough for small datasets.
## Verification
After migration, run this query to verify:
```sql
SELECT
EXISTS(SELECT 1 FROM information_schema.columns
WHERE table_name = 'archon_crawled_pages'
AND column_name = 'embedding_1536') as has_1536_column,
EXISTS(SELECT 1 FROM information_schema.routines
WHERE routine_name = 'match_archon_crawled_pages_multi') as has_multi_function,
COUNT(*) as index_count
FROM pg_indexes
WHERE tablename IN ('archon_crawled_pages', 'archon_code_examples')
AND indexname LIKE '%embedding%';
```
Expected result:
- `has_1536_column`: true
- `has_multi_function`: true
- `index_count`: 8+ (or 0 if you skipped Step 4)
## Troubleshooting
### "Memory required" error
- Increase `maintenance_work_mem` in the script
- Use direct database connection instead of SQL editor
- Create indexes one at a time
### "Statement timeout" error
- Run scripts in smaller steps
- Use direct database connection
- Increase `statement_timeout` setting
### "Permission denied" error
- Ensure you're using the service role key
- Check database permissions in Supabase dashboard
## Post-Migration
After successful migration:
1. **Restart services**:
```bash
docker compose restart
```
2. **Test the system**:
- Check if RAG search works in the UI
- Try crawling a new website
- Verify embeddings are being created
3. **Monitor performance**:
- If searches are slow without indexes, create them via direct connection
- Consider using smaller embedding dimensions (384 or 768) for faster performance
## Need Help?
If you encounter issues:
1. Check Supabase logs: Dashboard → Logs → Postgres
2. Verify your Supabase plan has sufficient resources
3. Contact Supabase support for memory limit increases (paid plans only)

View File

@@ -0,0 +1,35 @@
-- ======================================================================
-- STEP 1: ADD COLUMNS ONLY (No indexes yet)
-- Run this first - should complete quickly
-- ======================================================================
-- Increase memory for this session
SET maintenance_work_mem = '256MB';
BEGIN;
-- Add multi-dimensional embedding columns to archon_crawled_pages
ALTER TABLE archon_crawled_pages
ADD COLUMN IF NOT EXISTS embedding_384 VECTOR(384),
ADD COLUMN IF NOT EXISTS embedding_768 VECTOR(768),
ADD COLUMN IF NOT EXISTS embedding_1024 VECTOR(1024),
ADD COLUMN IF NOT EXISTS embedding_1536 VECTOR(1536),
ADD COLUMN IF NOT EXISTS embedding_3072 VECTOR(3072),
ADD COLUMN IF NOT EXISTS llm_chat_model TEXT,
ADD COLUMN IF NOT EXISTS embedding_model TEXT,
ADD COLUMN IF NOT EXISTS embedding_dimension INTEGER;
-- Add multi-dimensional embedding columns to archon_code_examples
ALTER TABLE archon_code_examples
ADD COLUMN IF NOT EXISTS embedding_384 VECTOR(384),
ADD COLUMN IF NOT EXISTS embedding_768 VECTOR(768),
ADD COLUMN IF NOT EXISTS embedding_1024 VECTOR(1024),
ADD COLUMN IF NOT EXISTS embedding_1536 VECTOR(1536),
ADD COLUMN IF NOT EXISTS embedding_3072 VECTOR(3072),
ADD COLUMN IF NOT EXISTS llm_chat_model TEXT,
ADD COLUMN IF NOT EXISTS embedding_model TEXT,
ADD COLUMN IF NOT EXISTS embedding_dimension INTEGER;
COMMIT;
SELECT 'Step 1 Complete: Columns added successfully' AS status;

View File

@@ -0,0 +1,70 @@
-- ======================================================================
-- STEP 2: MIGRATE EXISTING DATA (if any)
-- Run after Step 1 completes
-- ======================================================================
BEGIN;
-- Migrate existing embedding data from old column (if exists)
DO $$
DECLARE
crawled_pages_count INTEGER;
code_examples_count INTEGER;
dimension_detected INTEGER;
BEGIN
-- Check if old embedding column exists
SELECT COUNT(*) INTO crawled_pages_count
FROM information_schema.columns
WHERE table_name = 'archon_crawled_pages'
AND column_name = 'embedding';
IF crawled_pages_count > 0 THEN
-- Detect dimension
SELECT vector_dims(embedding) INTO dimension_detected
FROM archon_crawled_pages
WHERE embedding IS NOT NULL
LIMIT 1;
IF dimension_detected = 1536 THEN
UPDATE archon_crawled_pages
SET embedding_1536 = embedding,
embedding_dimension = 1536,
embedding_model = COALESCE(embedding_model, 'text-embedding-3-small')
WHERE embedding IS NOT NULL AND embedding_1536 IS NULL;
END IF;
-- Drop old column
ALTER TABLE archon_crawled_pages DROP COLUMN IF EXISTS embedding;
END IF;
-- Same for code_examples
SELECT COUNT(*) INTO code_examples_count
FROM information_schema.columns
WHERE table_name = 'archon_code_examples'
AND column_name = 'embedding';
IF code_examples_count > 0 THEN
SELECT vector_dims(embedding) INTO dimension_detected
FROM archon_code_examples
WHERE embedding IS NOT NULL
LIMIT 1;
IF dimension_detected = 1536 THEN
UPDATE archon_code_examples
SET embedding_1536 = embedding,
embedding_dimension = 1536,
embedding_model = COALESCE(embedding_model, 'text-embedding-3-small')
WHERE embedding IS NOT NULL AND embedding_1536 IS NULL;
END IF;
ALTER TABLE archon_code_examples DROP COLUMN IF EXISTS embedding;
END IF;
END $$;
-- Drop old indexes if they exist
DROP INDEX IF EXISTS idx_archon_crawled_pages_embedding;
DROP INDEX IF EXISTS idx_archon_code_examples_embedding;
COMMIT;
SELECT 'Step 2 Complete: Data migrated successfully' AS status;

View File

@@ -0,0 +1,172 @@
-- ======================================================================
-- STEP 3: CREATE SEARCH FUNCTIONS
-- Run after Step 2 completes
-- ======================================================================
BEGIN;
-- Helper function to detect embedding dimension
CREATE OR REPLACE FUNCTION detect_embedding_dimension(embedding_vector vector)
RETURNS INTEGER AS $$
BEGIN
RETURN vector_dims(embedding_vector);
END;
$$ LANGUAGE plpgsql IMMUTABLE;
-- Helper function to get column name for dimension
CREATE OR REPLACE FUNCTION get_embedding_column_name(dimension INTEGER)
RETURNS TEXT AS $$
BEGIN
CASE dimension
WHEN 384 THEN RETURN 'embedding_384';
WHEN 768 THEN RETURN 'embedding_768';
WHEN 1024 THEN RETURN 'embedding_1024';
WHEN 1536 THEN RETURN 'embedding_1536';
WHEN 3072 THEN RETURN 'embedding_3072';
ELSE RAISE EXCEPTION 'Unsupported embedding dimension: %', dimension;
END CASE;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
-- Multi-dimensional search for crawled pages
CREATE OR REPLACE FUNCTION match_archon_crawled_pages_multi (
query_embedding VECTOR,
embedding_dimension INTEGER,
match_count INT DEFAULT 10,
filter JSONB DEFAULT '{}'::jsonb,
source_filter TEXT DEFAULT NULL
) RETURNS TABLE (
id BIGINT,
url VARCHAR,
chunk_number INTEGER,
content TEXT,
metadata JSONB,
source_id TEXT,
similarity FLOAT
)
LANGUAGE plpgsql
AS $$
#variable_conflict use_column
DECLARE
sql_query TEXT;
embedding_column TEXT;
BEGIN
CASE embedding_dimension
WHEN 384 THEN embedding_column := 'embedding_384';
WHEN 768 THEN embedding_column := 'embedding_768';
WHEN 1024 THEN embedding_column := 'embedding_1024';
WHEN 1536 THEN embedding_column := 'embedding_1536';
WHEN 3072 THEN embedding_column := 'embedding_3072';
ELSE RAISE EXCEPTION 'Unsupported embedding dimension: %', embedding_dimension;
END CASE;
sql_query := format('
SELECT id, url, chunk_number, content, metadata, source_id,
1 - (%I <=> $1) AS similarity
FROM archon_crawled_pages
WHERE (%I IS NOT NULL)
AND metadata @> $3
AND ($4 IS NULL OR source_id = $4)
ORDER BY %I <=> $1
LIMIT $2',
embedding_column, embedding_column, embedding_column);
RETURN QUERY EXECUTE sql_query USING query_embedding, match_count, filter, source_filter;
END;
$$;
-- Multi-dimensional search for code examples
CREATE OR REPLACE FUNCTION match_archon_code_examples_multi (
query_embedding VECTOR,
embedding_dimension INTEGER,
match_count INT DEFAULT 10,
filter JSONB DEFAULT '{}'::jsonb,
source_filter TEXT DEFAULT NULL
) RETURNS TABLE (
id BIGINT,
url VARCHAR,
chunk_number INTEGER,
content TEXT,
summary TEXT,
metadata JSONB,
source_id TEXT,
similarity FLOAT
)
LANGUAGE plpgsql
AS $$
#variable_conflict use_column
DECLARE
sql_query TEXT;
embedding_column TEXT;
BEGIN
CASE embedding_dimension
WHEN 384 THEN embedding_column := 'embedding_384';
WHEN 768 THEN embedding_column := 'embedding_768';
WHEN 1024 THEN embedding_column := 'embedding_1024';
WHEN 1536 THEN embedding_column := 'embedding_1536';
WHEN 3072 THEN embedding_column := 'embedding_3072';
ELSE RAISE EXCEPTION 'Unsupported embedding dimension: %', embedding_dimension;
END CASE;
sql_query := format('
SELECT id, url, chunk_number, content, summary, metadata, source_id,
1 - (%I <=> $1) AS similarity
FROM archon_code_examples
WHERE (%I IS NOT NULL)
AND metadata @> $3
AND ($4 IS NULL OR source_id = $4)
ORDER BY %I <=> $1
LIMIT $2',
embedding_column, embedding_column, embedding_column);
RETURN QUERY EXECUTE sql_query USING query_embedding, match_count, filter, source_filter;
END;
$$;
-- Legacy compatibility (defaults to 1536D)
CREATE OR REPLACE FUNCTION match_archon_crawled_pages (
query_embedding VECTOR(1536),
match_count INT DEFAULT 10,
filter JSONB DEFAULT '{}'::jsonb,
source_filter TEXT DEFAULT NULL
) RETURNS TABLE (
id BIGINT,
url VARCHAR,
chunk_number INTEGER,
content TEXT,
metadata JSONB,
source_id TEXT,
similarity FLOAT
)
LANGUAGE plpgsql
AS $$
BEGIN
RETURN QUERY SELECT * FROM match_archon_crawled_pages_multi(query_embedding, 1536, match_count, filter, source_filter);
END;
$$;
CREATE OR REPLACE FUNCTION match_archon_code_examples (
query_embedding VECTOR(1536),
match_count INT DEFAULT 10,
filter JSONB DEFAULT '{}'::jsonb,
source_filter TEXT DEFAULT NULL
) RETURNS TABLE (
id BIGINT,
url VARCHAR,
chunk_number INTEGER,
content TEXT,
summary TEXT,
metadata JSONB,
source_id TEXT,
similarity FLOAT
)
LANGUAGE plpgsql
AS $$
BEGIN
RETURN QUERY SELECT * FROM match_archon_code_examples_multi(query_embedding, 1536, match_count, filter, source_filter);
END;
$$;
COMMIT;
SELECT 'Step 3 Complete: Functions created successfully' AS status;

View File

@@ -0,0 +1,67 @@
-- ======================================================================
-- STEP 4: CREATE INDEXES (OPTIONAL - May timeout on large datasets)
-- Run this LAST, or skip and run via direct connection
-- ======================================================================
-- IMPORTANT: This step creates vector indexes which are memory-intensive
-- If this fails, you can skip it and the system will use brute-force search
-- You can create these indexes later via direct database connection
SET maintenance_work_mem = '512MB';
SET statement_timeout = '10min';
-- Create ONE index at a time to avoid memory issues
-- Comment out any that fail and continue with the next
-- Index 1 of 8
CREATE INDEX IF NOT EXISTS idx_archon_crawled_pages_embedding_1536
ON archon_crawled_pages USING ivfflat (embedding_1536 vector_cosine_ops)
WITH (lists = 100);
-- Index 2 of 8
CREATE INDEX IF NOT EXISTS idx_archon_code_examples_embedding_1536
ON archon_code_examples USING ivfflat (embedding_1536 vector_cosine_ops)
WITH (lists = 100);
-- Index 3 of 8
CREATE INDEX IF NOT EXISTS idx_archon_crawled_pages_embedding_768
ON archon_crawled_pages USING ivfflat (embedding_768 vector_cosine_ops)
WITH (lists = 100);
-- Index 4 of 8
CREATE INDEX IF NOT EXISTS idx_archon_code_examples_embedding_768
ON archon_code_examples USING ivfflat (embedding_768 vector_cosine_ops)
WITH (lists = 100);
-- Index 5 of 8
CREATE INDEX IF NOT EXISTS idx_archon_crawled_pages_embedding_384
ON archon_crawled_pages USING ivfflat (embedding_384 vector_cosine_ops)
WITH (lists = 100);
-- Index 6 of 8
CREATE INDEX IF NOT EXISTS idx_archon_code_examples_embedding_384
ON archon_code_examples USING ivfflat (embedding_384 vector_cosine_ops)
WITH (lists = 100);
-- Index 7 of 8
CREATE INDEX IF NOT EXISTS idx_archon_crawled_pages_embedding_1024
ON archon_crawled_pages USING ivfflat (embedding_1024 vector_cosine_ops)
WITH (lists = 100);
-- Index 8 of 8
CREATE INDEX IF NOT EXISTS idx_archon_code_examples_embedding_1024
ON archon_code_examples USING ivfflat (embedding_1024 vector_cosine_ops)
WITH (lists = 100);
-- Simple B-tree indexes (these are fast)
CREATE INDEX IF NOT EXISTS idx_archon_crawled_pages_embedding_model ON archon_crawled_pages (embedding_model);
CREATE INDEX IF NOT EXISTS idx_archon_crawled_pages_embedding_dimension ON archon_crawled_pages (embedding_dimension);
CREATE INDEX IF NOT EXISTS idx_archon_crawled_pages_llm_chat_model ON archon_crawled_pages (llm_chat_model);
CREATE INDEX IF NOT EXISTS idx_archon_code_examples_embedding_model ON archon_code_examples (embedding_model);
CREATE INDEX IF NOT EXISTS idx_archon_code_examples_embedding_dimension ON archon_code_examples (embedding_dimension);
CREATE INDEX IF NOT EXISTS idx_archon_code_examples_llm_chat_model ON archon_code_examples (llm_chat_model);
RESET maintenance_work_mem;
RESET statement_timeout;
SELECT 'Step 4 Complete: Indexes created (or skipped if timed out)' AS status;