Mirror of https://github.com/coleam00/Archon.git (synced 2025-12-30 21:49:30 -05:00)
Fix race condition in concurrent crawling with unique source IDs (#472)
* Fix race condition in concurrent crawling with unique source IDs
  - Add unique hash-based source_id generation to prevent conflicts
  - Separate source identification from display with three fields:
    - source_id: 16-char SHA256 hash for unique identification
    - source_url: original URL for tracking
    - source_display_name: human-friendly name for the UI
  - Add a comprehensive test suite validating the fix
  - Migrate existing data with backward compatibility

* Fix title generation to use source_display_name for better AI context
  - Pass source_display_name to the title generation function
  - Use the display name in the AI prompt instead of the hash-based source_id
  - Results in more specific, meaningful titles for each source

* Skip AI title generation when a display name is available
  - Use source_display_name directly as the title to avoid unnecessary AI calls
  - More efficient and predictable than AI-generated titles
  - Keep AI generation only as a fallback for backward compatibility

* Fix critical issues from code review
  - Add the missing os import to prevent a NameError crash
  - Remove unused imports (pytest, Mock, patch, hashlib, urlparse, etc.)
  - Fix GitHub API capitalization consistency
  - Reuse the existing DocumentStorageService instance
  - Update test expectations to match the corrected capitalization
  Addresses CodeRabbit review feedback on PR #472.

* Add safety improvements from code review
  - Truncate display names to 100 characters when used as titles
  - Document hash collision probability (negligible for <1M sources)
  Simple, pragmatic fixes per the KISS principle.

* Fix code extraction to use hash-based source_ids and improve display names
  - Fix a critical bug where code extraction still used the old domain-based source_ids
  - Update the code extraction service to accept source_id as a parameter instead of extracting it from the URL
  - Add special handling for llms.txt and sitemap.xml files in display names
  - Add comprehensive tests for source_id handling in code extraction
  - Remove the unused urlparse import from code_extraction_service.py
  This fixes the foreign key constraint errors that were preventing code examples from being stored after the source_id architecture refactor.
  Co-Authored-By: Claude <noreply@anthropic.com>

* Fix critical variable shadowing and source_type determination issues
  - Fix variable shadowing in document_storage_operations.py where the source_url parameter was overwritten by document URLs, causing an incorrect source_url in the database
  - Fix source_type determination to use actual URLs instead of the hash-based source_id
  - Add comprehensive tests for source URL preservation
  - Ensure source_type is correctly set to "file" for file uploads and "url" for web crawls
  The variable shadowing bug gave sitemap sources the wrong source_url (the last crawled page instead of the sitemap URL). The source_type bug marked all sources as "url", even for file uploads, because hash-based IDs do not start with "file_".
  Co-Authored-By: Claude <noreply@anthropic.com>

* Fix URL canonicalization and document metrics calculation
  - Implement proper URL canonicalization to prevent duplicate sources:
    - Remove trailing slashes (except root)
    - Remove URL fragments
    - Remove tracking parameters (utm_*, gclid, fbclid, etc.)
    - Sort query parameters for consistency
    - Remove default ports (80 for HTTP, 443 for HTTPS)
    - Normalize the scheme and domain to lowercase
  - Fix the avg_chunks_per_doc calculation to avoid division by zero:
    - Track the processed_docs count separately from the total crawl_results
    - Handle all-empty document sets gracefully
    - Show processed/total in logs for better visibility
  - Add comprehensive tests for both fixes:
    - 10 test cases for URL canonicalization edge cases
    - 4 test cases for document metrics calculation
  This prevents database constraint violations when crawling the same content with URL variations and provides accurate metrics in logs.

* Fix synchronous extract_source_summary blocking the async event loop
  - Run extract_source_summary in a thread pool using asyncio.to_thread
  - Prevents blocking the async event loop during AI summary generation
  - Preserves the exact error handling and fallback behavior
  - Variables (source_id, combined_content) are properly passed to the thread
  Added comprehensive tests verifying:
  - The function runs in a thread without blocking
  - Error handling works correctly with the fallback
  - Multiple sources can be processed
  - Thread safety with variable passing

* Fix synchronous update_source_info blocking the async event loop
  - Run update_source_info in a thread pool using asyncio.to_thread
  - Prevents blocking the async event loop during database operations
  - Preserves the exact error handling and fallback behavior
  - All kwargs are properly passed to the thread execution
  Added comprehensive tests verifying:
  - The function runs in a thread without blocking
  - Error handling triggers the fallback correctly
  - All kwargs are preserved when passed to the thread
  - Existing extract_source_summary tests still pass

* Fix race condition in source creation using upsert
  - Replace INSERT with UPSERT for new sources to prevent PRIMARY KEY violations
  - Handles concurrent crawls attempting to create the same source
  - Maintains the existing UPDATE behavior for sources that already exist
  Added comprehensive tests verifying:
  - Concurrent source creation doesn't fail
  - Upsert is used for new sources (not insert)
  - Update is still used for existing sources
  - Async concurrent operations work correctly
  - Race conditions with delays are handled
  This prevents database constraint errors when multiple crawls target the same URL simultaneously.

* Add migration detection UI components
  Add a MigrationBanner component with clear user instructions for database schema updates. Add a useMigrationStatus hook for periodic health check monitoring with graceful error handling.

* Integrate the migration banner into the main app
  Add migration status monitoring and banner display to App.tsx. Shows the migration banner when database schema updates are required.

* Enhance backend startup error instructions
  Add detailed Docker restart instructions and migration script guidance. Improves the user experience when encountering startup failures.

* Add database schema caching to the health endpoint
  Implement smart caching for schema validation to prevent repeated database queries. Cache successful validations permanently and throttle failures to 30-second intervals. Replace debug prints with proper logging.

* Clean up knowledge API imports and logging
  Remove duplicate import statements and redundant logging. Improves code clarity and reduces log noise.

* Remove the unused instructions prop from MigrationBanner
  Clean up the component API by removing an instructions prop that was accepted but never rendered. Simplifies the interface and eliminates dead code while keeping the functional hardcoded migration steps.
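The migration-detection flow described in the sections above amounts to the frontend polling the backend health route and branching on the migration_required / schema_valid flags. A minimal sketch of such a poller is shown below; it is illustrative only, and the base URL and port are assumptions, not values taken from this commit.

import time

import requests

HEALTH_URL = "http://localhost:8181/health"  # assumed base URL/port; adjust to your deployment


def wait_for_healthy_backend(poll_seconds: float = 5.0, timeout: float = 120.0) -> dict:
    """Poll the health endpoint until the backend reports ready or migration is required."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            status = requests.get(HEALTH_URL, timeout=5).json()
        except requests.RequestException:
            time.sleep(poll_seconds)
            continue

        if status.get("migration_required") or status.get("schema_valid") is False:
            # Mirrors what the MigrationBanner surfaces in the UI.
            raise RuntimeError(status.get("migration_instructions", "Database migration required"))
        if status.get("ready"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("Backend did not become healthy in time")

The fields consumed here (status, ready, migration_required, schema_valid, migration_instructions) match the response payloads added in this commit.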
* Add schema_valid flag to the migration_required health response
  Add a schema_valid: false flag to the health endpoint response when a database schema migration is required. Improves API consistency without changing existing behavior.

---------

Co-authored-by: Claude <noreply@anthropic.com>
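Several of the fixes above share one pattern: a synchronous helper (extract_source_summary, update_source_info) is pushed onto a worker thread with asyncio.to_thread so the event loop keeps serving progress updates. A minimal, self-contained sketch of that pattern with the same fallback-on-error shape follows; the function names here are placeholders, not the project's actual helpers.

import asyncio
import time


def slow_summary(source_id: str, content: str) -> str:
    """Stand-in for a blocking call such as an AI summary or a database write."""
    time.sleep(2)  # simulates network / model latency
    return f"Summary of {source_id} ({len(content)} chars)"


async def summarize_without_blocking(source_id: str, content: str) -> str:
    try:
        # Runs the blocking function in the default thread pool; the event loop stays free.
        return await asyncio.to_thread(slow_summary, source_id, content)
    except Exception:
        # Same spirit as the commit: fall back to a simple summary instead of failing the crawl.
        return f"Documentation from {source_id}"


async def main() -> None:
    # Both summaries run concurrently; called directly, they would run back to back.
    results = await asyncio.gather(
        summarize_without_blocking("source-a", "alpha " * 100),
        summarize_without_blocking("source-b", "beta " * 100),
    )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())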
@@ -27,10 +27,6 @@ from ..services.crawler_manager import get_crawler

# Import unified logging
from ..config.logfire_config import get_logger, safe_logfire_error, safe_logfire_info
from ..services.crawler_manager import get_crawler
from ..services.search.rag_service import RAGService
from ..services.storage import DocumentStorageService
from ..utils import get_supabase_client
from ..utils.document_processing import extract_text_from_document

# Get logger for this module
@@ -513,11 +509,6 @@ async def upload_document(
):
"""Upload and process a document with progress tracking."""
try:
# DETAILED LOGGING: Track knowledge_type parameter flow
safe_logfire_info(
f"📋 UPLOAD: Starting document upload | filename={file.filename} | content_type={file.content_type} | knowledge_type={knowledge_type}"
)

safe_logfire_info(
f"Starting document upload | filename={file.filename} | content_type={file.content_type} | knowledge_type={knowledge_type}"
)
@@ -871,7 +862,22 @@ async def get_database_metrics():

@router.get("/health")
async def knowledge_health():
"""Knowledge API health check."""
"""Knowledge API health check with migration detection."""
# Check for database migration needs
from ..main import _check_database_schema

schema_status = await _check_database_schema()
if not schema_status["valid"]:
return {
"status": "migration_required",
"service": "knowledge-api",
"timestamp": datetime.now().isoformat(),
"ready": False,
"migration_required": True,
"message": schema_status["message"],
"migration_instructions": "Open Supabase Dashboard → SQL Editor → Run: migration/add_source_url_display_name.sql"
}

# Removed health check logging to reduce console noise
result = {
"status": "healthy",

@@ -246,12 +246,27 @@ async def health_check():
"ready": False,
}

# Check for required database schema
schema_status = await _check_database_schema()
if not schema_status["valid"]:
return {
"status": "migration_required",
"service": "archon-backend",
"timestamp": datetime.now().isoformat(),
"ready": False,
"migration_required": True,
"message": schema_status["message"],
"migration_instructions": "Open Supabase Dashboard → SQL Editor → Run: migration/add_source_url_display_name.sql",
"schema_valid": False
}

return {
"status": "healthy",
"service": "archon-backend",
"timestamp": datetime.now().isoformat(),
"ready": True,
"credentials_loaded": True,
"schema_valid": True,
}

@@ -262,6 +277,78 @@ async def api_health_check():
return await health_check()

# Cache schema check result to avoid repeated database queries
_schema_check_cache = {"valid": None, "checked_at": 0}

async def _check_database_schema():
"""Check if required database schema exists - only for existing users who need migration."""
import time

# If we've already confirmed schema is valid, don't check again
if _schema_check_cache["valid"] is True:
return {"valid": True, "message": "Schema is up to date (cached)"}

# If we recently failed, don't spam the database (wait at least 30 seconds)
current_time = time.time()
if (_schema_check_cache["valid"] is False and
current_time - _schema_check_cache["checked_at"] < 30):
return _schema_check_cache["result"]

try:
from .services.client_manager import get_supabase_client

client = get_supabase_client()

# Try to query the new columns directly - if they exist, schema is up to date
test_query = client.table('archon_sources').select('source_url, source_display_name').limit(1).execute()

# Cache successful result permanently
_schema_check_cache["valid"] = True
_schema_check_cache["checked_at"] = current_time

return {"valid": True, "message": "Schema is up to date"}

except Exception as e:
error_msg = str(e).lower()

# Log schema check error for debugging
api_logger.debug(f"Schema check error: {type(e).__name__}: {str(e)}")

# Check for specific error types based on PostgreSQL error codes and messages

# Check for missing columns first (more specific than table check)
missing_source_url = 'source_url' in error_msg and ('column' in error_msg or 'does not exist' in error_msg)
missing_source_display = 'source_display_name' in error_msg and ('column' in error_msg or 'does not exist' in error_msg)

# Also check for PostgreSQL error code 42703 (undefined column)
is_column_error = '42703' in error_msg or 'column' in error_msg

if (missing_source_url or missing_source_display) and is_column_error:
result = {
"valid": False,
"message": "Database schema outdated - missing required columns from recent updates"
}
# Cache failed result with timestamp
_schema_check_cache["valid"] = False
_schema_check_cache["checked_at"] = current_time
_schema_check_cache["result"] = result
return result

# Check for table doesn't exist (less specific, only if column check didn't match)
# Look for relation/table errors specifically
if ('relation' in error_msg and 'does not exist' in error_msg) or ('table' in error_msg and 'does not exist' in error_msg):
# Table doesn't exist - not a migration issue, it's a setup issue
return {"valid": True, "message": "Table doesn't exist - handled by startup error"}

# Other errors don't necessarily mean migration needed
result = {"valid": True, "message": f"Schema check inconclusive: {str(e)}"}
# Don't cache inconclusive results - allow retry
return result

# Export for Socket.IO

# Create Socket.IO app wrapper
# This wraps the FastAPI app with Socket.IO functionality
socket_app = create_socketio_app(app)

@@ -7,7 +7,6 @@ Handles extraction, processing, and storage of code examples from documents.
import re
from collections.abc import Callable
from typing import Any
from urllib.parse import urlparse

from ...config.logfire_config import safe_logfire_error, safe_logfire_info
from ...services.credential_service import credential_service
@@ -136,6 +135,7 @@ class CodeExtractionService:
self,
crawl_results: list[dict[str, Any]],
url_to_full_document: dict[str, str],
source_id: str,
progress_callback: Callable | None = None,
start_progress: int = 0,
end_progress: int = 100,
@@ -146,6 +146,7 @@ class CodeExtractionService:
Args:
crawl_results: List of crawled documents with url and markdown content
url_to_full_document: Mapping of URLs to full document content
source_id: The unique source_id for all documents
progress_callback: Optional async callback for progress updates
start_progress: Starting progress percentage (default: 0)
end_progress: Ending progress percentage (default: 100)
@@ -163,7 +164,7 @@ class CodeExtractionService:

# Extract code blocks from all documents
all_code_blocks = await self._extract_code_blocks_from_documents(
crawl_results, progress_callback, start_progress, extract_end
crawl_results, source_id, progress_callback, start_progress, extract_end
)

if not all_code_blocks:
@@ -201,6 +202,7 @@ class CodeExtractionService:
async def _extract_code_blocks_from_documents(
self,
crawl_results: list[dict[str, Any]],
source_id: str,
progress_callback: Callable | None = None,
start_progress: int = 0,
end_progress: int = 100,
@@ -208,6 +210,10 @@ class CodeExtractionService:
"""
Extract code blocks from all documents.

Args:
crawl_results: List of crawled documents
source_id: The unique source_id for all documents

Returns:
List of code blocks with metadata
"""
@@ -306,10 +312,7 @@ class CodeExtractionService:
)

if code_blocks:
# Always extract source_id from URL
parsed_url = urlparse(source_url)
source_id = parsed_url.netloc or parsed_url.path

# Use the provided source_id for all code blocks
for block in code_blocks:
all_code_blocks.append({
"block": block,

@@ -304,10 +304,12 @@ class CrawlingService:
url = str(request.get("url", ""))
safe_logfire_info(f"Starting async crawl orchestration | url={url} | task_id={task_id}")

# Extract source_id from the original URL
parsed_original_url = urlparse(url)
original_source_id = parsed_original_url.netloc or parsed_original_url.path
safe_logfire_info(f"Using source_id '{original_source_id}' from original URL '{url}'")
# Generate unique source_id and display name from the original URL
original_source_id = self.url_handler.generate_unique_source_id(url)
source_display_name = self.url_handler.extract_display_name(url)
safe_logfire_info(
f"Generated unique source_id '{original_source_id}' and display name '{source_display_name}' from URL '{url}'"
)

# Helper to update progress with mapper
async def update_mapped_progress(
@@ -386,6 +388,8 @@ class CrawlingService:
original_source_id,
doc_storage_callback,
self._check_cancellation,
source_url=url,
source_display_name=source_display_name,
)

# Check for cancellation after document storage
@@ -410,6 +414,7 @@ class CrawlingService:
code_examples_count = await self.doc_storage_ops.extract_and_store_code_examples(
crawl_results,
storage_results["url_to_full_document"],
storage_results["source_id"],
code_progress_callback,
85,
95,
@@ -558,7 +563,7 @@ class CrawlingService:
max_depth = request.get("max_depth", 1)
# Let the strategy handle concurrency from settings
# This will use CRAWL_MAX_CONCURRENT from database (default: 10)

crawl_results = await self.crawl_recursive_with_progress(
[url],
max_depth=max_depth,

@@ -4,17 +4,13 @@ Document Storage Operations
Handles the storage and processing of crawled documents.
Extracted from crawl_orchestration_service.py for better modularity.
"""

import asyncio
from typing import Dict, Any, List, Optional, Callable
from urllib.parse import urlparse

from ...config.logfire_config import safe_logfire_info, safe_logfire_error
from ..storage.storage_services import DocumentStorageService
from ..storage.document_storage_service import add_documents_to_supabase
from ..storage.code_storage_service import (
generate_code_summaries_batch,
add_code_examples_to_supabase
)
from ..source_management_service import update_source_info, extract_source_summary
from .code_extraction_service import CodeExtractionService

@@ -23,18 +19,18 @@ class DocumentStorageOperations:
"""
Handles document storage operations for crawled content.
"""

def __init__(self, supabase_client):
"""
Initialize document storage operations.

Args:
supabase_client: The Supabase client for database operations
"""
self.supabase_client = supabase_client
self.doc_storage_service = DocumentStorageService(supabase_client)
self.code_extraction_service = CodeExtractionService(supabase_client)

async def process_and_store_documents(
self,
crawl_results: List[Dict],
@@ -42,11 +38,13 @@ class DocumentStorageOperations:
crawl_type: str,
original_source_id: str,
progress_callback: Optional[Callable] = None,
cancellation_check: Optional[Callable] = None
cancellation_check: Optional[Callable] = None,
source_url: Optional[str] = None,
source_display_name: Optional[str] = None,
) -> Dict[str, Any]:
"""
Process crawled documents and store them in the database.

Args:
crawl_results: List of crawled documents
request: The original crawl request
@@ -54,13 +52,15 @@ class DocumentStorageOperations:
original_source_id: The source ID for all documents
progress_callback: Optional callback for progress updates
cancellation_check: Optional function to check for cancellation

source_url: Optional original URL that was crawled
source_display_name: Optional human-readable name for the source

Returns:
Dict containing storage statistics and document mappings
"""
# Initialize storage service for chunking
storage_service = DocumentStorageService(self.supabase_client)

# Reuse initialized storage service for chunking
storage_service = self.doc_storage_service

# Prepare data for chunked storage
all_urls = []
all_chunk_numbers = []
@@ -68,77 +68,85 @@ class DocumentStorageOperations:
all_metadatas = []
source_word_counts = {}
url_to_full_document = {}

processed_docs = 0

# Process and chunk each document
for doc_index, doc in enumerate(crawl_results):
# Check for cancellation during document processing
if cancellation_check:
cancellation_check()

source_url = doc.get('url', '')
markdown_content = doc.get('markdown', '')

doc_url = doc.get("url", "")
markdown_content = doc.get("markdown", "")

if not markdown_content:
continue

# Increment processed document count
processed_docs += 1

# Store full document for code extraction context
url_to_full_document[source_url] = markdown_content

url_to_full_document[doc_url] = markdown_content

# CHUNK THE CONTENT
chunks = storage_service.smart_chunk_text(markdown_content, chunk_size=5000)

# Use the original source_id for all documents
source_id = original_source_id
safe_logfire_info(f"Using original source_id '{source_id}' for URL '{source_url}'")

safe_logfire_info(f"Using original source_id '{source_id}' for URL '{doc_url}'")

# Process each chunk
for i, chunk in enumerate(chunks):
# Check for cancellation during chunk processing
if cancellation_check and i % 10 == 0:  # Check every 10 chunks
cancellation_check()

all_urls.append(source_url)

all_urls.append(doc_url)
all_chunk_numbers.append(i)
all_contents.append(chunk)

# Create metadata for each chunk
word_count = len(chunk.split())
metadata = {
'url': source_url,
'title': doc.get('title', ''),
'description': doc.get('description', ''),
'source_id': source_id,
'knowledge_type': request.get('knowledge_type', 'documentation'),
'crawl_type': crawl_type,
'word_count': word_count,
'char_count': len(chunk),
'chunk_index': i,
'tags': request.get('tags', [])
"url": doc_url,
"title": doc.get("title", ""),
"description": doc.get("description", ""),
"source_id": source_id,
"knowledge_type": request.get("knowledge_type", "documentation"),
"crawl_type": crawl_type,
"word_count": word_count,
"char_count": len(chunk),
"chunk_index": i,
"tags": request.get("tags", []),
}
all_metadatas.append(metadata)

# Accumulate word count
source_word_counts[source_id] = source_word_counts.get(source_id, 0) + word_count

# Yield control every 10 chunks to prevent event loop blocking
if i > 0 and i % 10 == 0:
await asyncio.sleep(0)

# Yield control after processing each document
if doc_index > 0 and doc_index % 5 == 0:
await asyncio.sleep(0)

# Create/update source record FIRST before storing documents
if all_contents and all_metadatas:
await self._create_source_records(
all_metadatas, all_contents, source_word_counts, request
all_metadatas, all_contents, source_word_counts, request,
source_url, source_display_name
)

safe_logfire_info(f"url_to_full_document keys: {list(url_to_full_document.keys())[:5]}")

# Log chunking results
safe_logfire_info(f"Document storage | documents={len(crawl_results)} | chunks={len(all_contents)} | avg_chunks_per_doc={len(all_contents)/len(crawl_results):.1f}")

avg_chunks = (len(all_contents) / processed_docs) if processed_docs > 0 else 0.0
safe_logfire_info(
f"Document storage | processed={processed_docs}/{len(crawl_results)} | chunks={len(all_contents)} | avg_chunks_per_doc={avg_chunks:.1f}"
)

# Call add_documents_to_supabase with the correct parameters
await add_documents_to_supabase(
client=self.supabase_client,
@@ -151,29 +159,31 @@ class DocumentStorageOperations:
progress_callback=progress_callback,  # Pass the callback for progress updates
enable_parallel_batches=True,  # Enable parallel processing
provider=None,  # Use configured provider
cancellation_check=cancellation_check  # Pass cancellation check
cancellation_check=cancellation_check,  # Pass cancellation check
)

# Calculate actual chunk count
chunk_count = len(all_contents)

return {
'chunk_count': chunk_count,
'total_word_count': sum(source_word_counts.values()),
'url_to_full_document': url_to_full_document,
'source_id': original_source_id
"chunk_count": chunk_count,
"total_word_count": sum(source_word_counts.values()),
"url_to_full_document": url_to_full_document,
"source_id": original_source_id,
}

async def _create_source_records(
self,
all_metadatas: List[Dict],
all_contents: List[str],
source_word_counts: Dict[str, int],
request: Dict[str, Any]
request: Dict[str, Any],
source_url: Optional[str] = None,
source_display_name: Optional[str] = None,
):
"""
Create or update source records in the database.

Args:
all_metadatas: List of metadata for all chunks
all_contents: List of all chunk contents
@@ -184,121 +194,155 @@ class DocumentStorageOperations:
unique_source_ids = set()
source_id_contents = {}
source_id_word_counts = {}

for i, metadata in enumerate(all_metadatas):
source_id = metadata['source_id']
source_id = metadata["source_id"]
unique_source_ids.add(source_id)

# Group content by source_id for better summaries
if source_id not in source_id_contents:
source_id_contents[source_id] = []
source_id_contents[source_id].append(all_contents[i])

# Track word counts per source_id
if source_id not in source_id_word_counts:
source_id_word_counts[source_id] = 0
source_id_word_counts[source_id] += metadata.get('word_count', 0)

safe_logfire_info(f"Found {len(unique_source_ids)} unique source_ids: {list(unique_source_ids)}")

source_id_word_counts[source_id] += metadata.get("word_count", 0)

safe_logfire_info(
f"Found {len(unique_source_ids)} unique source_ids: {list(unique_source_ids)}"
)

# Create source records for ALL unique source_ids
for source_id in unique_source_ids:
# Get combined content for this specific source_id
source_contents = source_id_contents[source_id]
combined_content = ''
combined_content = ""
for chunk in source_contents[:3]:  # First 3 chunks for this source
if len(combined_content) + len(chunk) < 15000:
combined_content += ' ' + chunk
combined_content += " " + chunk
else:
break

# Generate summary with fallback

# Generate summary with fallback (run in thread to avoid blocking async loop)
try:
summary = extract_source_summary(source_id, combined_content)
# Run synchronous extract_source_summary in a thread pool
summary = await asyncio.to_thread(
extract_source_summary, source_id, combined_content
)
except Exception as e:
safe_logfire_error(f"Failed to generate AI summary for '{source_id}': {str(e)}, using fallback")
safe_logfire_error(
f"Failed to generate AI summary for '{source_id}': {str(e)}, using fallback"
)
# Fallback to simple summary
summary = f"Documentation from {source_id} - {len(source_contents)} pages crawled"

# Update source info in database BEFORE storing documents
safe_logfire_info(f"About to create/update source record for '{source_id}' (word count: {source_id_word_counts[source_id]})")
safe_logfire_info(
f"About to create/update source record for '{source_id}' (word count: {source_id_word_counts[source_id]})"
)
try:
update_source_info(
# Run synchronous update_source_info in a thread pool
await asyncio.to_thread(
update_source_info,
client=self.supabase_client,
source_id=source_id,
summary=summary,
word_count=source_id_word_counts[source_id],
content=combined_content,
knowledge_type=request.get('knowledge_type', 'technical'),
tags=request.get('tags', []),
knowledge_type=request.get("knowledge_type", "technical"),
tags=request.get("tags", []),
update_frequency=0,  # Set to 0 since we're using manual refresh
original_url=request.get('url')  # Store the original crawl URL
original_url=request.get("url"),  # Store the original crawl URL
source_url=source_url,
source_display_name=source_display_name,
)
safe_logfire_info(f"Successfully created/updated source record for '{source_id}'")
except Exception as e:
safe_logfire_error(f"Failed to create/update source record for '{source_id}': {str(e)}")
safe_logfire_error(
f"Failed to create/update source record for '{source_id}': {str(e)}"
)
# Try a simpler approach with minimal data
try:
safe_logfire_info(f"Attempting fallback source creation for '{source_id}'")
self.supabase_client.table('archon_sources').upsert({
'source_id': source_id,
'title': source_id,  # Use source_id as title fallback
'summary': summary,
'total_word_count': source_id_word_counts[source_id],
'metadata': {
'knowledge_type': request.get('knowledge_type', 'technical'),
'tags': request.get('tags', []),
'auto_generated': True,
'fallback_creation': True,
'original_url': request.get('url')
}
}).execute()
fallback_data = {
"source_id": source_id,
"title": source_id,  # Use source_id as title fallback
"summary": summary,
"total_word_count": source_id_word_counts[source_id],
"metadata": {
"knowledge_type": request.get("knowledge_type", "technical"),
"tags": request.get("tags", []),
"auto_generated": True,
"fallback_creation": True,
"original_url": request.get("url"),
},
}

# Add new fields if provided
if source_url:
fallback_data["source_url"] = source_url
if source_display_name:
fallback_data["source_display_name"] = source_display_name

self.supabase_client.table("archon_sources").upsert(fallback_data).execute()
safe_logfire_info(f"Fallback source creation succeeded for '{source_id}'")
except Exception as fallback_error:
safe_logfire_error(f"Both source creation attempts failed for '{source_id}': {str(fallback_error)}")
raise Exception(f"Unable to create source record for '{source_id}'. This will cause foreign key violations. Error: {str(fallback_error)}")

safe_logfire_error(
f"Both source creation attempts failed for '{source_id}': {str(fallback_error)}"
)
raise Exception(
f"Unable to create source record for '{source_id}'. This will cause foreign key violations. Error: {str(fallback_error)}"
)

# Verify ALL source records exist before proceeding with document storage
if unique_source_ids:
for source_id in unique_source_ids:
try:
source_check = self.supabase_client.table('archon_sources').select('source_id').eq('source_id', source_id).execute()
source_check = (
self.supabase_client.table("archon_sources")
.select("source_id")
.eq("source_id", source_id)
.execute()
)
if not source_check.data:
raise Exception(f"Source record verification failed - '{source_id}' does not exist in sources table")
raise Exception(
f"Source record verification failed - '{source_id}' does not exist in sources table"
)
safe_logfire_info(f"Source record verified for '{source_id}'")
except Exception as e:
safe_logfire_error(f"Source verification failed for '{source_id}': {str(e)}")
raise

safe_logfire_info(f"All {len(unique_source_ids)} source records verified - proceeding with document storage")

safe_logfire_info(
f"All {len(unique_source_ids)} source records verified - proceeding with document storage"
)

async def extract_and_store_code_examples(
self,
crawl_results: List[Dict],
url_to_full_document: Dict[str, str],
source_id: str,
progress_callback: Optional[Callable] = None,
start_progress: int = 85,
end_progress: int = 95
end_progress: int = 95,
) -> int:
"""
Extract code examples from crawled documents and store them.

Args:
crawl_results: List of crawled documents
url_to_full_document: Mapping of URLs to full document content
source_id: The unique source_id for all documents
progress_callback: Optional callback for progress updates
start_progress: Starting progress percentage
end_progress: Ending progress percentage

Returns:
Number of code examples stored
"""
result = await self.code_extraction_service.extract_and_store_code_examples(
crawl_results,
url_to_full_document,
progress_callback,
start_progress,
end_progress
crawl_results, url_to_full_document, source_id, progress_callback, start_progress, end_progress
)

return result

return result

@@ -3,6 +3,8 @@ URL Handler Helper

Handles URL transformations and validations.
"""

import hashlib
import re
from urllib.parse import urlparse

@@ -13,49 +15,49 @@ logger = get_logger(__name__)

class URLHandler:
"""Helper class for URL operations."""

@staticmethod
def is_sitemap(url: str) -> bool:
"""
Check if a URL is a sitemap with error handling.

Args:
url: URL to check

Returns:
True if URL is a sitemap, False otherwise
"""
try:
return url.endswith('sitemap.xml') or 'sitemap' in urlparse(url).path
return url.endswith("sitemap.xml") or "sitemap" in urlparse(url).path
except Exception as e:
logger.warning(f"Error checking if URL is sitemap: {e}")
return False

@staticmethod
def is_txt(url: str) -> bool:
"""
Check if a URL is a text file with error handling.

Args:
url: URL to check

Returns:
True if URL is a text file, False otherwise
"""
try:
return url.endswith('.txt')
return url.endswith(".txt")
except Exception as e:
logger.warning(f"Error checking if URL is text file: {e}")
return False

@staticmethod
def is_binary_file(url: str) -> bool:
"""
Check if a URL points to a binary file that shouldn't be crawled.

Args:
url: URL to check

Returns:
True if URL is a binary file, False otherwise
"""
@@ -63,65 +65,338 @@ class URLHandler:
# Remove query parameters and fragments for cleaner extension checking
parsed = urlparse(url)
path = parsed.path.lower()

# Comprehensive list of binary and non-HTML file extensions
binary_extensions = {
# Archives
'.zip', '.tar', '.gz', '.rar', '.7z', '.bz2', '.xz', '.tgz',
".zip",
".tar",
".gz",
".rar",
".7z",
".bz2",
".xz",
".tgz",
# Executables and installers
'.exe', '.dmg', '.pkg', '.deb', '.rpm', '.msi', '.app', '.appimage',
".exe",
".dmg",
".pkg",
".deb",
".rpm",
".msi",
".app",
".appimage",
# Documents (non-HTML)
'.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx', '.odt', '.ods',
".pdf",
".doc",
".docx",
".xls",
".xlsx",
".ppt",
".pptx",
".odt",
".ods",
# Images
'.jpg', '.jpeg', '.png', '.gif', '.svg', '.webp', '.ico', '.bmp', '.tiff',
".jpg",
".jpeg",
".png",
".gif",
".svg",
".webp",
".ico",
".bmp",
".tiff",
# Audio/Video
'.mp3', '.mp4', '.avi', '.mov', '.wmv', '.flv', '.webm', '.mkv', '.wav', '.flac',
".mp3",
".mp4",
".avi",
".mov",
".wmv",
".flv",
".webm",
".mkv",
".wav",
".flac",
# Data files
'.csv', '.sql', '.db', '.sqlite',
".csv",
".sql",
".db",
".sqlite",
# Binary data
'.iso', '.img', '.bin', '.dat',
".iso",
".img",
".bin",
".dat",
# Development files (usually not meant to be crawled as pages)
'.wasm', '.pyc', '.jar', '.war', '.class', '.dll', '.so', '.dylib'
".wasm",
".pyc",
".jar",
".war",
".class",
".dll",
".so",
".dylib",
}

# Check if the path ends with any binary extension
for ext in binary_extensions:
if path.endswith(ext):
logger.debug(f"Skipping binary file: {url} (matched extension: {ext})")
return True

return False
except Exception as e:
logger.warning(f"Error checking if URL is binary file: {e}")
# In case of error, don't skip the URL (safer to attempt crawl than miss content)
return False

@staticmethod
def transform_github_url(url: str) -> str:
"""
Transform GitHub URLs to raw content URLs for better content extraction.

Args:
url: URL to transform

Returns:
Transformed URL (or original if not a GitHub file URL)
"""
# Pattern for GitHub file URLs
github_file_pattern = r'https://github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)'
github_file_pattern = r"https://github\.com/([^/]+)/([^/]+)/blob/([^/]+)/(.+)"
match = re.match(github_file_pattern, url)
if match:
owner, repo, branch, path = match.groups()
raw_url = f'https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}'
raw_url = f"https://raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}"
logger.info(f"Transformed GitHub file URL to raw: {url} -> {raw_url}")
return raw_url

# Pattern for GitHub directory URLs
github_dir_pattern = r'https://github\.com/([^/]+)/([^/]+)/tree/([^/]+)/(.+)'
github_dir_pattern = r"https://github\.com/([^/]+)/([^/]+)/tree/([^/]+)/(.+)"
match = re.match(github_dir_pattern, url)
if match:
# For directories, we can't directly get raw content
# Return original URL but log a warning
logger.warning(f"GitHub directory URL detected: {url} - consider using specific file URLs or GitHub API")
logger.warning(
f"GitHub directory URL detected: {url} - consider using specific file URLs or GitHub API"
)

return url

@staticmethod
def generate_unique_source_id(url: str) -> str:
"""
Generate a unique source ID from URL using hash.

This creates a 16-character hash that is extremely unlikely to collide
for distinct canonical URLs, solving race condition issues when multiple crawls
target the same domain.

return url
Uses 16-char SHA256 prefix (64 bits) which provides
~18 quintillion unique values. Collision probability
is negligible for realistic usage (<1M sources).

Args:
url: The URL to generate an ID for

Returns:
A 16-character hexadecimal hash string
"""
try:
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Canonicalize URL for consistent hashing
parsed = urlparse(url.strip())

# Normalize scheme and netloc to lowercase
scheme = (parsed.scheme or "").lower()
netloc = (parsed.netloc or "").lower()

# Remove default ports
if netloc.endswith(":80") and scheme == "http":
netloc = netloc[:-3]
if netloc.endswith(":443") and scheme == "https":
netloc = netloc[:-4]

# Normalize path (remove trailing slash except for root)
path = parsed.path or "/"
if path.endswith("/") and len(path) > 1:
path = path.rstrip("/")

# Remove common tracking parameters and sort remaining
tracking_params = {
"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
"gclid", "fbclid", "ref", "source"
}
query_items = [
(k, v) for k, v in parse_qsl(parsed.query, keep_blank_values=True)
if k not in tracking_params
]
query = urlencode(sorted(query_items))

# Reconstruct canonical URL (fragment is dropped)
canonical = urlunparse((scheme, netloc, path, "", query, ""))

# Generate SHA256 hash and take first 16 characters
return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

except Exception as e:
# Redact sensitive query params from error logs
try:
redacted = url.split("?", 1)[0] if "?" in url else url
except Exception:
redacted = "<unparseable-url>"

logger.error(f"Error generating unique source ID for {redacted}: {e}", exc_info=True)

# Fallback: use a hash of the error message + url to still get something unique
fallback = f"error_{redacted}_{str(e)}"
return hashlib.sha256(fallback.encode("utf-8")).hexdigest()[:16]

@staticmethod
def extract_display_name(url: str) -> str:
"""
Extract a human-readable display name from URL.

This creates user-friendly names for common source patterns
while falling back to the domain for unknown patterns.

Args:
url: The URL to extract a display name from

Returns:
A human-readable string suitable for UI display
"""
try:
parsed = urlparse(url)
domain = parsed.netloc.lower()

# Remove www prefix for cleaner display
if domain.startswith("www."):
domain = domain[4:]

# Handle empty domain (might be a file path or malformed URL)
if not domain:
if url.startswith("/"):
return f"Local: {url.split('/')[-1] if '/' in url else url}"
return url[:50] + "..." if len(url) > 50 else url

path = parsed.path.strip("/")

# Special handling for GitHub repositories and API
if "github.com" in domain:
# Check if it's an API endpoint
if domain.startswith("api."):
return "GitHub API"

parts = path.split("/")
if len(parts) >= 2:
owner = parts[0]
repo = parts[1].replace(".git", "")  # Remove .git extension if present
return f"GitHub - {owner}/{repo}"
elif len(parts) == 1 and parts[0]:
return f"GitHub - {parts[0]}"
return "GitHub"

# Special handling for documentation sites
if domain.startswith("docs."):
# Extract the service name from docs.X.com/org
service_name = domain.replace("docs.", "").split(".")[0]
base_name = f"{service_name.title()}" if service_name else "Documentation"

# Special handling for special files - preserve the filename
if path:
# Check for llms.txt files
if "llms" in path.lower() and path.endswith(".txt"):
return f"{base_name} - Llms.Txt"
# Check for sitemap files
elif "sitemap" in path.lower() and path.endswith(".xml"):
return f"{base_name} - Sitemap.Xml"
# Check for any other special .txt files
elif path.endswith(".txt"):
filename = path.split("/")[-1] if "/" in path else path
return f"{base_name} - {filename.title()}"

return f"{base_name} Documentation" if service_name else "Documentation"

# Handle readthedocs.io subdomains
if domain.endswith(".readthedocs.io"):
project = domain.replace(".readthedocs.io", "")
return f"{project.title()} Docs"

# Handle common documentation patterns
doc_patterns = [
("fastapi.tiangolo.com", "FastAPI Documentation"),
("pydantic.dev", "Pydantic Documentation"),
("python.org", "Python Documentation"),
("djangoproject.com", "Django Documentation"),
("flask.palletsprojects.com", "Flask Documentation"),
("numpy.org", "NumPy Documentation"),
("pandas.pydata.org", "Pandas Documentation"),
]

for pattern, name in doc_patterns:
if pattern in domain:
# Add path context if available
if path and len(path) > 1:
# Get first meaningful path segment
path_segment = path.split("/")[0] if "/" in path else path
if path_segment and path_segment not in [
"docs",
"doc",  # Added "doc" to filter list
"documentation",
"api",
"en",
]:
return f"{name} - {path_segment.title()}"
return name

# For API endpoints
if "api." in domain or "/api" in path:
service = domain.replace("api.", "").split(".")[0]
return f"{service.title()} API"

# Special handling for sitemap.xml and llms.txt on any site
if path:
if "sitemap" in path.lower() and path.endswith(".xml"):
# Get base domain name
display = domain
for tld in [".com", ".org", ".io", ".dev", ".net", ".ai", ".app"]:
if display.endswith(tld):
display = display[:-len(tld)]
break
display_parts = display.replace("-", " ").replace("_", " ").split(".")
formatted = " ".join(part.title() for part in display_parts)
return f"{formatted} - Sitemap.Xml"
elif "llms" in path.lower() and path.endswith(".txt"):
# Get base domain name
display = domain
for tld in [".com", ".org", ".io", ".dev", ".net", ".ai", ".app"]:
if display.endswith(tld):
display = display[:-len(tld)]
break
display_parts = display.replace("-", " ").replace("_", " ").split(".")
formatted = " ".join(part.title() for part in display_parts)
return f"{formatted} - Llms.Txt"

# Default: Use domain with nice formatting
# Remove common TLDs for cleaner display
display = domain
for tld in [".com", ".org", ".io", ".dev", ".net", ".ai", ".app"]:
if display.endswith(tld):
display = display[: -len(tld)]
break

# Capitalize first letter of each word
display_parts = display.replace("-", " ").replace("_", " ").split(".")
formatted = " ".join(part.title() for part in display_parts)

# Add path context if it's meaningful
if path and len(path) > 1 and "/" not in path:
formatted += f" - {path.title()}"

return formatted

except Exception as e:
logger.warning(f"Error extracting display name for {url}: {e}, using URL")
# Fallback: return truncated URL
return url[:50] + "..." if len(url) > 50 else url
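A usage sketch for the two helpers above: URL variants that differ only by case, default port, trailing slash, fragment, or tracking parameters canonicalize to the same 16-character ID, while the display name stays human readable. The URLs are illustrative, and the import path is an assumption since the file path is not shown in this diff.

# from url_handler import URLHandler  # adjust the import to wherever URLHandler lives
variants = [
    "https://Docs.Example.com:443/guide/?utm_source=newsletter#intro",
    "https://docs.example.com/guide",
]

ids = {URLHandler.generate_unique_source_id(u) for u in variants}
assert len(ids) == 1  # both variants map to one source_id, so concurrent crawls share a row

print(URLHandler.extract_display_name("https://docs.example.com/llms.txt"))
# -> "Example - Llms.Txt"

On the collision question noted in the docstring: with 64-bit IDs the birthday bound gives a collision probability of roughly n^2 / 2^65, which is about 2.7e-8 for one million distinct canonical URLs.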
@@ -5,6 +5,7 @@ Handles source metadata, summaries, and management.
Consolidates both utility functions and class-based service.
"""

import os
from typing import Any

from supabase import Client
@@ -145,6 +146,7 @@ def generate_source_title_and_metadata(
knowledge_type: str = "technical",
tags: list[str] | None = None,
provider: str = None,
source_display_name: str | None = None,
) -> tuple[str, dict[str, Any]]:
"""
Generate a user-friendly title and metadata for a source based on its content.
@@ -203,8 +205,11 @@ def generate_source_title_and_metadata(

# Limit content for prompt
sample_content = content[:3000] if len(content) > 3000 else content

# Use display name if available for better context
source_context = source_display_name if source_display_name else source_id

prompt = f"""Based on this content from {source_id}, generate a concise, descriptive title (3-6 words) that captures what this source is about:
prompt = f"""Based on this content from {source_context}, generate a concise, descriptive title (3-6 words) that captures what this source is about:

{sample_content}

@@ -230,12 +235,12 @@ Provide only the title, nothing else."""
except Exception as e:
search_logger.error(f"Error generating title for {source_id}: {e}")

# Build metadata - determine source_type from source_id pattern
source_type = "file" if source_id.startswith("file_") else "url"
# Build metadata - source_type will be determined by caller based on actual URL
# Default to "url" but this should be overridden by the caller
metadata = {
"knowledge_type": knowledge_type,
"tags": tags or [],
"source_type": source_type,
"source_type": "url",  # Default, should be overridden by caller based on actual URL
"auto_generated": True
}

@@ -252,6 +257,8 @@ def update_source_info(
tags: list[str] | None = None,
update_frequency: int = 7,
original_url: str | None = None,
source_url: str | None = None,
source_display_name: str | None = None,
):
"""
Update or insert source information in the sources table.
@@ -279,7 +286,14 @@ def update_source_info(
search_logger.info(f"Preserving existing title for {source_id}: {existing_title}")

# Update metadata while preserving title
source_type = "file" if source_id.startswith("file_") else "url"
# Determine source_type based on source_url or original_url
if source_url and source_url.startswith("file://"):
source_type = "file"
elif original_url and original_url.startswith("file://"):
source_type = "file"
else:
source_type = "url"

metadata = {
"knowledge_type": knowledge_type,
"tags": tags or [],
@@ -292,14 +306,22 @@ def update_source_info(
metadata["original_url"] = original_url

# Update existing source (preserving title)
update_data = {
"summary": summary,
"total_word_count": word_count,
"metadata": metadata,
"updated_at": "now()",
}

# Add new fields if provided
if source_url:
update_data["source_url"] = source_url
if source_display_name:
update_data["source_display_name"] = source_display_name

result = (
client.table("archon_sources")
.update({
"summary": summary,
"total_word_count": word_count,
"metadata": metadata,
"updated_at": "now()",
})
.update(update_data)
.eq("source_id", source_id)
.execute()
)
@@ -308,10 +330,38 @@ def update_source_info(
f"Updated source {source_id} while preserving title: {existing_title}"
)
else:
# New source - generate title and metadata
title, metadata = generate_source_title_and_metadata(
source_id, content, knowledge_type, tags
)
# New source - use display name as title if available, otherwise generate
if source_display_name:
# Use the display name directly as the title (truncated to prevent DB issues)
title = source_display_name[:100].strip()

# Determine source_type based on source_url or original_url
if source_url and source_url.startswith("file://"):
source_type = "file"
elif original_url and original_url.startswith("file://"):
source_type = "file"
else:
source_type = "url"

metadata = {
"knowledge_type": knowledge_type,
"tags": tags or [],
"source_type": source_type,
"auto_generated": False,
}
else:
# Fallback to AI generation only if no display name
title, metadata = generate_source_title_and_metadata(
source_id, content, knowledge_type, tags, None, source_display_name
)

# Override the source_type from AI with actual URL-based determination
if source_url and source_url.startswith("file://"):
metadata["source_type"] = "file"
elif original_url and original_url.startswith("file://"):
metadata["source_type"] = "file"
else:
metadata["source_type"] = "url"

# Add update_frequency and original_url to metadata
metadata["update_frequency"] = update_frequency
@@ -319,15 +369,23 @@ def update_source_info(
metadata["original_url"] = original_url

search_logger.info(f"Creating new source {source_id} with knowledge_type={knowledge_type}")
# Insert new source
client.table("archon_sources").insert({
# Use upsert to avoid race conditions with concurrent crawls
upsert_data = {
"source_id": source_id,
"title": title,
"summary": summary,
"total_word_count": word_count,
"metadata": metadata,
}).execute()
search_logger.info(f"Created new source {source_id} with title: {title}")
}

# Add new fields if provided
if source_url:
upsert_data["source_url"] = source_url
if source_display_name:
upsert_data["source_display_name"] = source_display_name

client.table("archon_sources").upsert(upsert_data).execute()
search_logger.info(f"Created/updated source {source_id} with title: {title}")

except Exception as e:
search_logger.error(f"Error updating source {source_id}: {e}")