---
sidebar_position: 25
sidebar_label: Background Tasks
---

# Background Tasks Architecture

## Overview

This document describes the architecture for handling long-running operations in Archon without blocking the FastAPI/Socket.IO event loop. The key insight is that browser-based operations (crawling) must remain in the main event loop, while only CPU-intensive operations should be offloaded to threads.

## Architecture Principles

1. **Keep async I/O operations in the main event loop** - Browser automation, database operations, and network requests must stay async
2. **Only offload CPU-intensive work to threads** - Text processing, chunking, and blocking synchronous API calls can run in a ThreadPoolExecutor
3. **Use asyncio.create_task for background async work** - Don't block the event loop, but keep async operations async
4. **Maintain a single event loop** - Never create new event loops in threads (see the sketch below)
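
The compressed sketch below shows how these four principles fit together; `crawl`, `chunk_text_sync`, and `store` are illustrative placeholders, not Archon functions:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=4)

async def handle_request(request):
    # Principle 3: schedule background async work without blocking the caller
    asyncio.create_task(background_pipeline(request))
    return {"status": "started"}

async def background_pipeline(request):
    # Principle 1: async I/O (crawling, DB, network) stays on the main loop
    pages = await crawl(request["url"])
    # Principle 4: reuse the single running loop instead of creating a new one
    loop = asyncio.get_running_loop()
    # Principle 2: only CPU-bound work is handed to the thread pool
    chunks = await loop.run_in_executor(executor, chunk_text_sync, pages)
    # Back to async I/O on the main loop
    await store(chunks)
```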

## Architecture Diagram

```mermaid
graph TB
    subgraph "Main Event Loop"
        API[FastAPI Endpoint]
        SIO[Socket.IO Handler]
        BGTask[Background Async Task]
        Crawler[AsyncWebCrawler]
        DB[Database Operations]
        Progress[Progress Updates]
    end

    subgraph "ThreadPoolExecutor"
        Chunk[Text Chunking]
        Embed[Embedding Generation]
        Summary[Summary Extraction]
        CodeExt[Code Extraction]
    end

    API -->|asyncio.create_task| BGTask
    BGTask -->|await| Crawler
    BGTask -->|await| DB
    BGTask -->|run_in_executor| Chunk
    BGTask -->|run_in_executor| Embed
    BGTask -->|run_in_executor| Summary
    BGTask -->|run_in_executor| CodeExt
    BGTask -->|emit| Progress
    Progress -->|websocket| SIO

    classDef async fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef sync fill:#fff3e0,stroke:#e65100,stroke-width:2px

    class API,SIO,BGTask,Crawler,DB,Progress async
    class Chunk,Embed,Summary,CodeExt sync
```

## Core Components

### CrawlOrchestrationService

The orchestration service manages the entire crawl workflow while keeping the main event loop responsive:

```python
import asyncio
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict

# update_crawl_progress is provided by the Socket.IO handlers module (shown below)

logger = logging.getLogger(__name__)


class CrawlOrchestrationService:
    def __init__(self, crawler, supabase_client, progress_id=None):
        self.crawler = crawler
        self.supabase_client = supabase_client
        self.progress_id = progress_id
        self.active_tasks = {}
        # Thread pool for CPU-intensive operations only
        self.executor = ThreadPoolExecutor(max_workers=4)

    async def orchestrate_crawl(self, request: Dict[str, Any]) -> Dict[str, Any]:
        """Start crawl operation as a background task."""
        url = str(request.get('url', ''))

        # Create background task in the SAME event loop
        task = asyncio.create_task(
            self._async_orchestrate_crawl(request)
        )

        # Store task reference
        self.active_tasks[self.progress_id] = task

        # Return immediately
        return {
            "task_id": self.progress_id,
            "status": "started",
            "message": f"Crawl operation started for {url}"
        }

    async def _async_orchestrate_crawl(self, request: Dict[str, Any]):
        """Background async task - runs in the main event loop."""
        try:
            url = request.get('url', '')

            # Emit initial progress
            await self._emit_progress({
                'status': 'analyzing',
                'percentage': 0,
                'currentUrl': url,
                'log': f'Analyzing URL type for {url}'
            })

            # Step 1: Crawl URLs (MUST stay async in the main loop)
            crawl_results = await self._crawl_urls_async(url, request)

            # Step 2: Process documents (CPU-intensive, can go to a thread)
            loop = asyncio.get_running_loop()
            doc_results = await loop.run_in_executor(
                self.executor,
                self._process_documents_sync,  # Sync version
                crawl_results, request
            )

            # Step 3: Store in database (MUST stay async in the main loop)
            await self._store_documents_async(doc_results)

            # Step 4: Generate embeddings (blocking sync client, can go to a thread)
            await loop.run_in_executor(
                self.executor,
                self._generate_embeddings_sync,
                doc_results
            )

            # Step 5: Extract code (CPU-intensive, can go to a thread)
            code_count = await loop.run_in_executor(
                self.executor,
                self._extract_code_sync,
                crawl_results
            )

            # Complete
            await self._emit_progress({
                'status': 'complete',
                'percentage': 100,
                'log': 'Crawl operation completed successfully'
            })

        except Exception as e:
            logger.error(f"Crawl orchestration error: {e}")
            await self._emit_progress({
                'status': 'error',
                'percentage': -1,
                'error': str(e)
            })

    async def _emit_progress(self, update: Dict[str, Any]):
        """Emit progress via Socket.IO."""
        if self.progress_id:
            await update_crawl_progress(self.progress_id, update)
```
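
One detail the snippet above leaves open is cleanup of `active_tasks`: finished tasks otherwise accumulate for the lifetime of the service. A minimal approach (an illustrative addition, not existing Archon code) is to attach a done-callback when the task is created:

```python
# Hypothetical cleanup: drop the entry from active_tasks once the task finishes,
# so status lookups only see tasks that are still running or recently completed.
task = asyncio.create_task(self._async_orchestrate_crawl(request))
self.active_tasks[self.progress_id] = task
task.add_done_callback(
    lambda t, pid=self.progress_id: self.active_tasks.pop(pid, None)
)
```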

### Sync Functions for Thread Execution

Only CPU-intensive operations (and blocking synchronous calls) should have sync versions for thread execution:

```python
import openai


def _process_documents_sync(self, crawl_results, request):
    """Sync version for thread execution - CPU-intensive text processing."""
    all_chunks = []
    for doc in crawl_results:
        # Text chunking is CPU-intensive
        chunks = self.chunk_text(doc['markdown'])
        all_chunks.extend(chunks)
    return {
        'chunks': all_chunks,
        'chunk_count': len(all_chunks)
    }


def _generate_embeddings_sync(self, doc_results):
    """Sync version - uses the synchronous OpenAI client."""
    client = openai.Client()  # Sync client
    embeddings = []

    for chunk in doc_results['chunks']:
        # Blocking network call via the sync client - safe inside a worker thread
        response = client.embeddings.create(
            input=chunk,
            model="text-embedding-3-small"
        )
        embeddings.append(response.data[0].embedding)

    return embeddings


def _extract_code_sync(self, crawl_results):
    """Sync version - CPU-intensive regex and parsing."""
    code_examples = []
    for doc in crawl_results:
        # Extract code blocks with regex
        code_blocks = self.extract_code_blocks(doc['markdown'])
        code_examples.extend(code_blocks)
    return len(code_examples)
```

### Socket.IO Integration

Socket.IO handlers remain in the main event loop:

```python
# socketio_handlers.py
import logging

logger = logging.getLogger(__name__)
# `sio` is the shared socketio.AsyncServer instance created at app startup


async def update_crawl_progress(progress_id: str, data: dict):
    """Emit progress updates to connected clients."""
    # Check if the room has subscribers
    room_sids = []
    if hasattr(sio.manager, 'rooms'):
        namespace_rooms = sio.manager.rooms.get('/', {})
        room_sids = list(namespace_rooms.get(progress_id, []))

    if not room_sids:
        logger.warning(f"No subscribers in room {progress_id}")
        return

    # Emit progress
    data['progressId'] = progress_id
    await sio.emit('crawl_progress', data, room=progress_id)


@sio.event
async def subscribe_to_progress(sid, data):
    """Client subscribes to progress updates."""
    progress_id = data.get('progressId')
    if progress_id:
        sio.enter_room(sid, progress_id)
        # Send current status if the task is running
        orchestrator = get_orchestrator_for_progress(progress_id)
        if orchestrator and progress_id in orchestrator.active_tasks:
            await sio.emit('crawl_progress', {
                'progressId': progress_id,
                'status': 'running',
                'message': 'Reconnected to running task'
            }, to=sid)
```
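
For reference, a subscriber can be sketched with python-socketio's `AsyncClient`. The server address is an assumption for illustration; the event names match the handlers above:

```python
# Minimal subscriber sketch - not Archon's frontend, just an illustration.
import asyncio
import socketio

async def watch_progress(progress_id: str):
    sio = socketio.AsyncClient()

    @sio.on('crawl_progress')
    async def on_progress(data):
        print(f"{data.get('status')}: {data.get('percentage')}% - {data.get('log', '')}")

    await sio.connect('http://localhost:8080')  # assumed server address
    await sio.emit('subscribe_to_progress', {'progressId': progress_id})
    await sio.wait()

# asyncio.run(watch_progress("some-progress-id"))
```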

### API Endpoint Pattern

FastAPI endpoints start background tasks and return immediately:

```python
# knowledge_api.py
import uuid

from fastapi import APIRouter, HTTPException

router = APIRouter()


@router.post("/knowledge/add")
async def add_knowledge_item(request: KnowledgeAddRequest):
    """Start crawl operation - returns immediately."""
    # Generate progress ID
    progress_id = str(uuid.uuid4())

    # Create orchestrator
    orchestrator = CrawlOrchestrationService(
        crawler=await crawler_manager.get_crawler(),
        supabase_client=supabase_client,
        progress_id=progress_id
    )

    # Start background task
    result = await orchestrator.orchestrate_crawl(request.dict())

    # Return task info immediately
    return {
        "success": True,
        "task_id": result["task_id"],
        "progress_id": progress_id,
        "message": "Crawl started in background"
    }


@router.get("/knowledge/status/{task_id}")
async def get_task_status(task_id: str):
    """Check status of a background task."""
    orchestrator = get_orchestrator_for_task(task_id)
    if not orchestrator:
        raise HTTPException(404, "Task not found")

    task = orchestrator.active_tasks.get(task_id)
    if not task:
        raise HTTPException(404, "Task not found")

    return {
        "task_id": task_id,
        "done": task.done(),
        "cancelled": task.cancelled()
    }
```
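
A hypothetical end-to-end call against these endpoints, sketched with `httpx`; the base URL and `/api` prefix are assumptions, not Archon's actual routing:

```python
import asyncio
import httpx

async def start_and_poll(url_to_crawl: str):
    async with httpx.AsyncClient(base_url="http://localhost:8080/api") as client:
        # Kick off the crawl; the endpoint returns immediately with a task_id
        resp = await client.post("/knowledge/add", json={"url": url_to_crawl})
        task_id = resp.json()["task_id"]

        # Poll the status endpoint until the background task reports completion
        while True:
            status = (await client.get(f"/knowledge/status/{task_id}")).json()
            if status["done"]:
                return status
            await asyncio.sleep(2)

# asyncio.run(start_and_poll("https://docs.example.com"))
```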

## Key Patterns

### What Stays Async (Main Loop)

- Browser automation (crawling)
- Database operations
- Network requests
- Socket.IO communications
- Task coordination

### What Goes to Threads

- Text chunking
- Markdown parsing
- Code extraction
- Embedding preparation
- CPU-intensive calculations

### Progress Updates

- Use `asyncio.Queue` for async tasks
- Use a regular `queue.Queue` for thread tasks (see the bridging sketch below)
- Always emit from the main event loop
- Include detailed status information
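
A minimal sketch of that thread-to-event-loop bridge, assuming a worker that reports per-document progress; the function names and payload fields are illustrative:

```python
import asyncio
import queue
from concurrent.futures import ThreadPoolExecutor

progress_queue = queue.Queue()  # thread-safe queue shared with the worker

def _chunk_documents_sync(docs):
    """Runs in a worker thread: pushes progress into the thread-safe queue."""
    for i, doc in enumerate(docs, start=1):
        # ... CPU-intensive chunking here ...
        progress_queue.put({'status': 'chunking', 'percentage': int(100 * i / len(docs))})
    return len(docs)

async def run_with_progress(docs, emit):
    """Runs in the main event loop: drains the queue and emits via Socket.IO."""
    loop = asyncio.get_running_loop()
    executor = ThreadPoolExecutor(max_workers=1)
    future = loop.run_in_executor(executor, _chunk_documents_sync, docs)

    while not future.done():
        try:
            update = progress_queue.get_nowait()
            await emit(update)           # e.g. update_crawl_progress(progress_id, update)
        except queue.Empty:
            await asyncio.sleep(0.2)     # yield to the event loop between polls

    # Drain any updates posted just before the worker finished
    while not progress_queue.empty():
        await emit(progress_queue.get_nowait())
    return await future
```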

## Common Pitfalls to Avoid

1. **Don't create new event loops in threads** - Async clients and database connections are bound to the loop they were created on and will fail elsewhere (see the sketch below)
2. **Don't run browser automation in threads** - It needs the main event loop
3. **Don't block the main loop** - Use `asyncio.create_task` for background work
4. **Don't mix async and sync incorrectly** - Keep clear boundaries between coroutines and thread-pool functions
5. **Don't forget progress updates** - Users need feedback during long-running operations
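
The contrast below illustrates pitfalls 1 and 3, assuming `crawler` is the shared `AsyncWebCrawler` and `arun` is its crawl coroutine; it is a sketch, not Archon code:

```python
import asyncio

# Anti-pattern: a private event loop inside a worker thread. Objects created on
# the main loop (database pools, the crawler) cannot be used from this loop.
def crawl_in_thread_wrong(crawler, url):
    loop = asyncio.new_event_loop()  # second loop - don't do this
    try:
        return loop.run_until_complete(crawler.arun(url))
    finally:
        loop.close()

# Preferred: keep browser automation on the main loop as a background task.
async def crawl_in_background_right(crawler, url):
    return asyncio.create_task(crawler.arun(url))  # same loop, non-blocking
```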

## Testing Guidelines

1. Test with long-running crawls (100+ pages)
2. Verify Socket.IO connections don't drop during long operations
3. Check that database operations complete correctly
4. Monitor memory usage in worker threads
5. Test task cancellation (see the example below)
6. Verify progress accuracy
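
A cancellation test could look like the sketch below, using pytest-asyncio; the `orchestrator` fixture and request payload are assumptions for illustration:

```python
import asyncio
import pytest

@pytest.mark.asyncio
async def test_crawl_task_can_be_cancelled(orchestrator):
    # Start a crawl, which registers an asyncio.Task in active_tasks
    result = await orchestrator.orchestrate_crawl({"url": "https://docs.example.com"})
    task = orchestrator.active_tasks[result["task_id"]]

    # Cancel it and confirm the task actually stops
    task.cancel()
    with pytest.raises(asyncio.CancelledError):
        await task

    assert task.cancelled()
```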