Mirror of https://github.com/coleam00/Archon.git (synced 2026-01-02 12:48:54 -05:00)

Commit: The New Archon (Beta) - The Operating System for AI Coding Assistants!

New file: docs/docs/background-tasks.mdx (320 lines)

---
sidebar_position: 25
sidebar_label: Background Tasks
---

# Background Tasks Architecture

## Overview

This document describes the architecture for handling long-running operations in Archon without blocking the FastAPI/Socket.IO event loop. The key insight is that browser-based operations (crawling) must remain in the main event loop, while only CPU-intensive operations should be offloaded to threads.

## Architecture Principles

1. **Keep async I/O operations in the main event loop** - Browser automation, database operations, and network requests must stay async
2. **Only offload CPU-intensive work to threads** - Text processing, chunking, and synchronous API calls can run in a ThreadPoolExecutor
3. **Use asyncio.create_task for background async work** - Don't block the event loop, but keep async operations async (a minimal sketch follows this list)
4. **Maintain a single event loop** - Never create new event loops in threads

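The following sketch distills these principles into one self-contained example. It is illustrative only: the function names, URL, and timings are placeholders, not Archon code.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

async def fetch_page(url: str) -> str:
    """Async I/O stays in the main event loop (placeholder for real crawling)."""
    await asyncio.sleep(0.1)  # simulate network latency
    return f"<html>content of {url}</html>"

def chunk_text(html: str) -> list[str]:
    """CPU-bound work: safe to run in a worker thread."""
    return [html[i:i + 50] for i in range(0, len(html), 50)]

async def background_job(url: str) -> int:
    html = await fetch_page(url)  # principle 1: await I/O in the main loop
    loop = asyncio.get_running_loop()
    # principle 2: only the CPU-bound step goes to the executor
    chunks = await loop.run_in_executor(executor, chunk_text, html)
    return len(chunks)

async def main():
    # principle 3: start background work without blocking the caller
    task = asyncio.create_task(background_job("https://example.com"))
    print("request returns immediately; job runs in the background")
    print("chunks produced:", await task)

asyncio.run(main())
```
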
## Architecture Diagram

```mermaid
graph TB
    subgraph "Main Event Loop"
        API[FastAPI Endpoint]
        SIO[Socket.IO Handler]
        BGTask[Background Async Task]
        Crawler[AsyncWebCrawler]
        DB[Database Operations]
        Progress[Progress Updates]
    end

    subgraph "ThreadPoolExecutor"
        Chunk[Text Chunking]
        Embed[Embedding Generation]
        Summary[Summary Extraction]
        CodeExt[Code Extraction]
    end

    API -->|asyncio.create_task| BGTask
    BGTask -->|await| Crawler
    BGTask -->|await| DB
    BGTask -->|run_in_executor| Chunk
    BGTask -->|run_in_executor| Embed
    BGTask -->|run_in_executor| Summary
    BGTask -->|run_in_executor| CodeExt
    BGTask -->|emit| Progress
    Progress -->|websocket| SIO

    classDef async fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef sync fill:#fff3e0,stroke:#e65100,stroke-width:2px

    class API,SIO,BGTask,Crawler,DB,Progress async
    class Chunk,Embed,Summary,CodeExt sync
```

## Core Components

### CrawlOrchestrationService

The orchestration service manages the entire crawl workflow while keeping the main event loop responsive:

```python
import asyncio
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict

# Progress emitter defined in socketio_handlers.py (see "Socket.IO Integration" below);
# the exact import path may differ in the codebase.
from socketio_handlers import update_crawl_progress

logger = logging.getLogger(__name__)


class CrawlOrchestrationService:
    def __init__(self, crawler, supabase_client, progress_id=None):
        self.crawler = crawler
        self.supabase_client = supabase_client
        self.progress_id = progress_id
        self.active_tasks = {}
        # Thread pool for CPU-intensive operations only
        self.executor = ThreadPoolExecutor(max_workers=4)

    async def orchestrate_crawl(self, request: Dict[str, Any]) -> Dict[str, Any]:
        """Start crawl operation as a background task"""
        url = str(request.get('url', ''))

        # Create background task in the SAME event loop
        task = asyncio.create_task(
            self._async_orchestrate_crawl(request)
        )

        # Store task reference
        self.active_tasks[self.progress_id] = task

        # Return immediately
        return {
            "task_id": self.progress_id,
            "status": "started",
            "message": f"Crawl operation started for {url}"
        }

    async def _async_orchestrate_crawl(self, request: Dict[str, Any]):
        """Background async task - runs in the main event loop"""
        try:
            url = request.get('url', '')

            # Emit initial progress
            await self._emit_progress({
                'status': 'analyzing',
                'percentage': 0,
                'currentUrl': url,
                'log': f'Analyzing URL type for {url}'
            })

            # Step 1: Crawl URLs (MUST stay async in main loop)
            crawl_results = await self._crawl_urls_async(url, request)

            # Step 2: Process documents (CPU-intensive, can go to a thread)
            loop = asyncio.get_running_loop()
            doc_results = await loop.run_in_executor(
                self.executor,
                self._process_documents_sync,  # Sync version
                crawl_results, request
            )

            # Step 3: Store in database (MUST stay async in main loop)
            await self._store_documents_async(doc_results)

            # Step 4: Generate embeddings (blocking sync API calls, safe in a thread)
            await loop.run_in_executor(
                self.executor,
                self._generate_embeddings_sync,
                doc_results
            )

            # Step 5: Extract code (CPU-intensive, can go to a thread)
            code_count = await loop.run_in_executor(
                self.executor,
                self._extract_code_sync,
                crawl_results
            )

            # Complete
            await self._emit_progress({
                'status': 'complete',
                'percentage': 100,
                'log': 'Crawl operation completed successfully'
            })

        except Exception as e:
            logger.error(f"Crawl orchestration error: {e}")
            await self._emit_progress({
                'status': 'error',
                'percentage': -1,
                'error': str(e)
            })

    async def _emit_progress(self, update: Dict[str, Any]):
        """Emit progress via Socket.IO"""
        if self.progress_id:
            await update_crawl_progress(self.progress_id, update)
```

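Because each background task is stored in `active_tasks`, a caller that holds the orchestrator can also stop a crawl cleanly. The helper below is a brief sketch of that idea, not code from the repository; the `stop_crawl` name is hypothetical.

```python
import asyncio

async def stop_crawl(orchestrator, progress_id: str) -> bool:
    """Hypothetical helper: cancel a stored background crawl task."""
    task = orchestrator.active_tasks.get(progress_id)
    if task is None or task.done():
        return False

    task.cancel()
    try:
        await task  # let the task observe the cancellation
    except asyncio.CancelledError:
        pass
    finally:
        orchestrator.active_tasks.pop(progress_id, None)
    return True
```
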
### Sync Functions for Thread Execution

Only CPU-intensive operations and blocking synchronous API calls should have sync versions for thread execution:

```python
# Methods of CrawlOrchestrationService, shown here without the class wrapper
import openai

def _process_documents_sync(self, crawl_results, request):
    """Sync version for thread execution - CPU-intensive text processing"""
    all_chunks = []
    for doc in crawl_results:
        # Text chunking is CPU-intensive
        chunks = self.chunk_text(doc['markdown'])
        all_chunks.extend(chunks)
    return {
        'chunks': all_chunks,
        'chunk_count': len(all_chunks)
    }

def _generate_embeddings_sync(self, doc_results):
    """Sync version - uses the synchronous OpenAI client"""
    client = openai.Client()  # Sync client; blocking calls are acceptable inside the worker thread
    embeddings = []

    for chunk in doc_results['chunks']:
        # Blocking call to the embeddings API - keep it off the event loop
        response = client.embeddings.create(
            input=chunk,
            model="text-embedding-3-small"
        )
        embeddings.append(response.data[0].embedding)

    return embeddings

def _extract_code_sync(self, crawl_results):
    """Sync version - CPU-intensive regex and parsing"""
    code_examples = []
    for doc in crawl_results:
        # Extract code blocks with regex
        code_blocks = self.extract_code_blocks(doc['markdown'])
        code_examples.extend(code_blocks)
    return len(code_examples)
```

### Socket.IO Integration

Socket.IO handlers remain in the main event loop:

```python
# socketio_handlers.py
# `sio` is the module-level python-socketio AsyncServer instance used by the API service

async def update_crawl_progress(progress_id: str, data: dict):
    """Emit progress updates to connected clients"""
    # Check if the room has subscribers
    room_sids = []
    if hasattr(sio.manager, 'rooms'):
        namespace_rooms = sio.manager.rooms.get('/', {})
        room_sids = list(namespace_rooms.get(progress_id, []))

    if not room_sids:
        logger.warning(f"No subscribers in room {progress_id}")
        return

    # Emit progress
    data['progressId'] = progress_id
    await sio.emit('crawl_progress', data, room=progress_id)


@sio.event
async def subscribe_to_progress(sid, data):
    """Client subscribes to progress updates"""
    progress_id = data.get('progressId')
    if progress_id:
        sio.enter_room(sid, progress_id)
        # Send current status if the task is running
        orchestrator = get_orchestrator_for_progress(progress_id)
        if orchestrator and progress_id in orchestrator.active_tasks:
            await sio.emit('crawl_progress', {
                'progressId': progress_id,
                'status': 'running',
                'message': 'Reconnected to running task'
            }, to=sid)
```

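For completeness, a Python client could consume these events with `python-socketio`'s `AsyncClient` as sketched below. The server address is an assumption about a local deployment; Archon's web UI does the equivalent from the browser.

```python
import asyncio
import socketio

async def watch_progress(progress_id: str):
    sio = socketio.AsyncClient()

    @sio.on('crawl_progress')
    async def on_progress(data):
        print(f"{data.get('status')}: {data.get('percentage')}% - {data.get('log', '')}")

    # Assumed server address; adjust host/port to your deployment
    await sio.connect('http://localhost:8181')
    await sio.emit('subscribe_to_progress', {'progressId': progress_id})
    await sio.wait()  # keep listening until disconnected

# asyncio.run(watch_progress("your-progress-id"))
```
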
### API Endpoint Pattern

FastAPI endpoints start background tasks and return immediately:

```python
# knowledge_api.py
import uuid

from fastapi import HTTPException

# `router`, `crawler_manager`, `supabase_client`, `KnowledgeAddRequest`, and
# `get_orchestrator_for_task` are defined elsewhere in the module.

@router.post("/knowledge/add")
async def add_knowledge_item(request: KnowledgeAddRequest):
    """Start crawl operation - returns immediately"""
    # Generate progress ID
    progress_id = str(uuid.uuid4())

    # Create orchestrator
    orchestrator = CrawlOrchestrationService(
        crawler=await crawler_manager.get_crawler(),
        supabase_client=supabase_client,
        progress_id=progress_id
    )

    # Start background task
    result = await orchestrator.orchestrate_crawl(request.dict())

    # Return task info immediately
    return {
        "success": True,
        "task_id": result["task_id"],
        "progress_id": progress_id,
        "message": "Crawl started in background"
    }


@router.get("/knowledge/status/{task_id}")
async def get_task_status(task_id: str):
    """Check status of a background task"""
    orchestrator = get_orchestrator_for_task(task_id)
    if not orchestrator:
        raise HTTPException(404, "Task not found")

    task = orchestrator.active_tasks.get(task_id)
    if not task:
        raise HTTPException(404, "Task not found")

    return {
        "task_id": task_id,
        "done": task.done(),
        "cancelled": task.cancelled()
    }
```

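A caller POSTs once and then polls the status route (or subscribes over Socket.IO). The snippet below is an illustrative client using `httpx`; the base URL and polling interval are assumptions about a local deployment.

```python
import asyncio
import httpx

BASE = "http://localhost:8181/api"  # assumed base URL for the API

async def crawl_and_wait(url: str) -> dict:
    async with httpx.AsyncClient() as client:
        resp = await client.post(f"{BASE}/knowledge/add", json={"url": url})
        task_id = resp.json()["task_id"]

        # Poll until the background task reports completion
        while True:
            status = (await client.get(f"{BASE}/knowledge/status/{task_id}")).json()
            if status["done"]:
                return status
            await asyncio.sleep(2)

# asyncio.run(crawl_and_wait("https://example.com/docs"))
```
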
## Key Patterns

### What Stays Async (Main Loop)
- Browser automation (crawling)
- Database operations
- Network requests
- Socket.IO communications
- Task coordination

### What Goes to Threads
- Text chunking
- Markdown parsing
- Code extraction
- Embedding preparation
- CPU-intensive calculations

### Progress Updates
- Use asyncio.Queue for async tasks
- Use a regular Python queue.Queue for work running in threads
- Always emit from the main event loop (a sketch of this pattern follows the list)
- Include detailed status information

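One way to honor "always emit from the main event loop" is to have thread workers push updates onto a plain queue.Queue while an async drainer task in the main loop forwards them over Socket.IO. The sketch below illustrates that pattern under those assumptions; it is not Archon's exact implementation, and `emit` stands in for the async emitter (e.g. `update_crawl_progress`).

```python
import asyncio
import queue

progress_q: "queue.Queue[dict]" = queue.Queue()

def heavy_work(progress_id: str) -> None:
    """Runs in a worker thread: never touches the event loop directly."""
    for pct in (25, 50, 75, 100):
        # ... CPU-intensive step ...
        progress_q.put({"progressId": progress_id, "percentage": pct, "status": "processing"})

async def drain_progress(emit) -> None:
    """Runs in the main event loop: forwards queued updates via the async emitter."""
    while True:
        try:
            update = progress_q.get_nowait()
        except queue.Empty:
            await asyncio.sleep(0.1)
            continue
        await emit(update)  # e.g. update_crawl_progress(update["progressId"], update)
```
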
## Common Pitfalls to Avoid

1. **Don't create new event loops in threads** - Database connections won't work
2. **Don't run browser automation in threads** - It needs the main event loop
3. **Don't block the main loop** - Use asyncio.create_task for background work
4. **Don't mix async and sync incorrectly** - Keep clear boundaries
5. **Don't forget progress updates** - Users need feedback

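Pitfall #1 in concrete terms: starting a fresh event loop inside a thread detaches the work from the running loop and from anything bound to it, such as database connections. The contrast below is illustrative only; `do_async_io` stands in for any coroutine you might be tempted to run this way.

```python
import asyncio

async def do_async_io():
    await asyncio.sleep(0.1)  # placeholder for DB/network work
    return "ok"

def worker_antipattern():
    # BAD: creates a second event loop inside the thread, isolated from
    # the main loop and from connections created there.
    return asyncio.run(do_async_io())

async def correct_pattern():
    # GOOD: keep the coroutine in the main event loop; only offload
    # sync, CPU-bound callables to the executor.
    return await do_async_io()
```
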
## Testing Guidelines

1. Test with long-running crawls (100+ pages)
2. Verify Socket.IO doesn't disconnect
3. Check database operations work correctly
4. Monitor memory usage in threads
5. Test task cancellation
6. Verify progress accuracy