DIY Smart Code 6abb8831f7 fix: enable code examples extraction for manual file uploads (#626)
* fix: enable code examples extraction for manual file uploads

- Add extract_code_examples parameter to upload API endpoint (default: true)
- Integrate CodeExtractionService into DocumentStorageService.upload_document()
- Add code extraction after document storage with progress tracking
- Map code extraction progress to 85-95% range in upload progress
- Include code_examples_stored in upload results and logging
- Support extract_code_examples in batch document upload via store_documents()
- Handle code extraction errors gracefully without failing upload

Fixes issue where code examples were only extracted for URL crawls
but not for manual file uploads, despite using the same underlying
CodeExtractionService that supports both HTML and text formats.
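
A minimal sketch of the 85-95% mapping described above (the function name and defaults are illustrative, not the actual DocumentStorageService code):

```python
# Illustrative sketch only; not the actual DocumentStorageService implementation.
def map_subtask_progress(sub_percent: float, start: float = 85.0, end: float = 95.0) -> float:
    """Map a sub-task's 0-100% progress into a slice of the overall upload progress."""
    sub_percent = max(0.0, min(100.0, sub_percent))
    return start + (end - start) * (sub_percent / 100.0)

# Example: the extraction service reporting 50% shows as 90% overall upload progress.
assert map_subtask_progress(50.0) == 90.0
```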

* fix: Fix code extraction for uploaded markdown files

- Provide file content in both html and markdown fields for crawl_results
- This ensures markdown files (.md) use the correct text file extraction path
- The CodeExtractionService checks html_content first for text files
- Fixes issue where uploaded .md files didn't extract code examples properly

* debug: Add comprehensive logging to trace code extraction issue

- Add detailed debug logging to upload code extraction flow
- Log extract_code_examples parameter value
- Log crawl_results structure and content length
- Log progress callbacks from extraction service
- Log final extraction count with more context
- Enhanced error logging with full stack traces

This will help identify exactly where the extraction is failing for uploaded files.

* fix: Remove invalid start_progress/end_progress parameters

The extract_and_store_code_examples method doesn't accept start_progress
and end_progress parameters, so passing them caused a TypeError during file uploads.

This was the root cause preventing code extraction from working - the
method was failing with a signature mismatch before any extraction logic
could run.
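
The fix is simply to stop passing the two unsupported arguments. As a general defensive pattern (a sketch, not what this commit does), a caller can also filter keyword arguments against the callee's signature before invoking it:

```python
import inspect

def call_with_supported_kwargs(func, *args, **kwargs):
    """Drop keyword arguments the callee does not declare, avoiding TypeError on mismatch."""
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return func(*args, **kwargs)  # callee accepts **kwargs, pass everything through
    supported = {name: value for name, value in kwargs.items() if name in params}
    return func(*args, **supported)
```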

* fix: Preserve code blocks across PDF page boundaries

PDF extraction was breaking markdown code blocks by inserting page separators:

```python
def hello():
--- Page 2 ---
    return "world"
```

This made code blocks unrecognizable to extraction patterns.

Solution:
- Add _preserve_code_blocks_across_pages() function
- Detect split code blocks using regex pattern matching
- Remove page separators that appear within code blocks
- Apply to both pdfplumber and PyPDF2 extraction paths

Now PDF uploads should properly extract code examples just like markdown files.
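
A minimal sketch of the idea, assuming page breaks are emitted as "--- Page N ---" lines as in the example above (the real _preserve_code_blocks_across_pages() is not reproduced here):

```python
import re

PAGE_SEPARATOR = re.compile(r"^-{2,}\s*Page\s+\d+\s*-{2,}$")
FENCE = "`" * 3  # markdown code fence marker

def preserve_code_blocks_across_pages(text: str) -> str:
    """Drop page-separator lines that fall inside a fenced markdown code block."""
    cleaned, inside_fence = [], False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            inside_fence = not inside_fence
        if inside_fence and PAGE_SEPARATOR.match(line.strip()):
            continue  # the separator would split the code block, so drop it
        cleaned.append(line)
    return "\n".join(cleaned)
```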

* fix: Add PDF-specific code extraction for files without markdown delimiters

Root cause: PDFs lose markdown code block delimiters (``` ) during text extraction,
making standard markdown patterns fail to detect code.

Solution:
1. Add _extract_pdf_code_blocks() method with plain-text code detection patterns:
   - Python import blocks and function definitions
   - YAML configuration blocks
   - Shell command sequences
   - Multi-line indented code blocks

2. Add PDF detection logic in _extract_code_blocks_from_documents()
3. Set content_type properly for PDF files in storage service
4. Add debug logging to PDF text extraction process

This allows extraction of code from PDFs that contain technical documentation
with code examples, even when markdown formatting is lost during PDF->text conversion.
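
For instance, the "multi-line indented code blocks" case can be approximated by scanning for runs of consecutive indented lines; a hedged sketch, not the actual _extract_pdf_code_blocks() patterns:

```python
import re

# Runs of 3+ consecutive lines indented by 4 spaces or a tab are treated as candidate code.
INDENTED_RUN = re.compile(r"(?:^(?:[ ]{4}|\t).*\n?){3,}", re.MULTILINE)

def find_indented_code_blocks(text: str) -> list[str]:
    """Return plain-text blocks that look like indented code (no markdown fences required)."""
    return [match.group(0).rstrip() for match in INDENTED_RUN.finditer(text)]
```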

* fix: Enhanced PDF code extraction to match markdown extraction results

Problem: PDF extraction found only 1 code example vs. 9 from the same content in markdown
Root cause: the PDF extraction patterns were too restrictive and specific

Enhanced solution:
1. **Multi-line code block detection**: Scans for consecutive "code-like" lines
   - Variable assignments, imports, function calls, method calls
   - Includes comments, control flow, YAML keys, shell commands
   - Handles indented continuation lines and empty lines within blocks

2. **Smarter block boundary detection**:
   - Excludes prose lines with narrative indicators
   - Allows natural code block boundaries
   - Preserves context around extracted blocks

3. **Comprehensive pattern coverage**:
   - Python scripts and functions
   - YAML configuration blocks
   - Shell command sequences
   - JavaScript functions

This approach should extract the same ~9 code examples from PDFs as from
markdown files, since it detects code patterns without relying on markdown delimiters.

* fix: Simplify PDF extraction to section-based approach

Changed from complex line-by-line analysis to a simpler section-based approach:

1. Split PDF content by natural boundaries (paragraphs, page breaks)
2. Score each section for code vs prose indicators
3. Extract sections that score high on code indicators
4. Add comprehensive logging to debug section classification

Code indicators include:
- Python imports, functions, classes (high weight)
- Variable assignments, method calls (medium weight)
- Package management commands, lambda functions

This should better match the 9 code examples found in markdown version
by treating each logical code segment as a separate extractable block.
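
A minimal sketch of this section-based approach; the split pattern, indicator weights, and threshold here are illustrative, not the actual implementation:

```python
import re

# Weighted "code indicators"; higher weight means a stronger signal that a section is code.
CODE_INDICATORS = [
    (re.compile(r"^\s*(?:import|from)\s+\w", re.MULTILINE), 3),       # Python imports
    (re.compile(r"^\s*(?:def|class)\s+\w+", re.MULTILINE), 3),        # functions / classes
    (re.compile(r"^\s*\w[\w.]*\s*=\s*\S", re.MULTILINE), 1),          # variable assignments
    (re.compile(r"^\s*(?:pip|npm|docker)\s+\w", re.MULTILINE), 2),    # package management commands
]
PROSE_INDICATORS = [re.compile(r"\b(?:however|therefore|for example|in other words)\b", re.IGNORECASE)]

def extract_code_sections(text: str, threshold: int = 3) -> list[str]:
    """Split content on natural boundaries and keep sections that score as code, not prose."""
    sections = re.split(r"\n\s*\n|^-{2,}\s*Page\s+\d+\s*-{2,}$", text, flags=re.MULTILINE)
    code_sections = []
    for section in sections:
        score = sum(weight for pattern, weight in CODE_INDICATORS if pattern.search(section))
        score -= sum(1 for pattern in PROSE_INDICATORS if pattern.search(section))
        if score >= threshold:
            code_sections.append(section.strip())
    return code_sections
```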

* fix: Add explicit HTML file detection and extraction path

Problem: HTML files (0 code examples extracted) weren't being routed to HTML extraction

Root cause: HTML files (.html, .htm) weren't explicitly detected, so they fell through
to generic extraction logic instead of using the robust HTML code block patterns.

Solution:
1. Add HTML file detection: is_html_file = source_url.endswith(('.html', '.htm'))
2. Add explicit HTML extraction path before fallback logic
3. Set proper content_type: "text/html" for HTML files in storage service
4. Ensure HTML content is passed to _extract_html_code_blocks method

The existing HTML extraction already has comprehensive patterns for:
- <pre><code class="lang-python"> (syntax highlighted)
- <pre><code> (standard)
- Various code highlighting libraries (Prism, highlight.js, etc.)

This should now extract all code blocks from HTML files just like URL crawls do.
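
A sketch of the routing check plus a basic <pre><code> pattern (the real _extract_html_code_blocks patterns for Prism, highlight.js, and other libraries are not reproduced here):

```python
import html
import re

def is_html_file(source_url: str) -> bool:
    return source_url.lower().endswith((".html", ".htm"))

# Basic <pre><code class="lang-..."> / <pre><code> pattern; the service covers more variants.
PRE_CODE = re.compile(
    r'<pre[^>]*>\s*<code(?:\s+class="[^"]*lang(?:uage)?-(\w+)[^"]*")?[^>]*>(.*?)</code>\s*</pre>',
    re.DOTALL | re.IGNORECASE,
)

def extract_html_code_blocks(html_content: str) -> list[tuple[str, str]]:
    """Return (language, code) pairs from <pre><code> blocks, with HTML entities decoded."""
    blocks = []
    for match in PRE_CODE.finditer(html_content):
        language = match.group(1) or ""
        code = html.unescape(match.group(2)).strip()
        blocks.append((language, code))
    return blocks
```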

* fix: Add HTML tag cleanup and proper code extraction for HTML files

Problem: HTML uploads produced 0 code examples, and the stored RAG chunks still contained HTML tags

Solution:
1. **HTML Tag Cleanup**: Added _clean_html_to_text() function that:
   - Preserves code blocks by temporarily replacing them with placeholders
   - Removes all HTML tags, scripts, styles from prose content
   - Converts HTML structure (headers, paragraphs, lists) to clean text
   - Restores code blocks as markdown format (```language)
   - Cleans HTML entities (&lt;, &gt;, etc.)

2. **Unified Text Processing**: HTML files are now processed as text files since they:
   - Have clean text for RAG chunking (no HTML tags)
   - Have markdown-style code blocks for extraction
   - Use existing text file extraction path

3. **Content Type Mapping**: Set text/markdown for cleaned HTML files

Result: HTML files now extract code examples like markdown files while providing
clean text for RAG without HTML markup pollution.
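
A simplified sketch of the placeholder approach described above; the real _clean_html_to_text() also converts headers, paragraphs, and lists to clean text, so names and patterns here are illustrative:

```python
import html
import re

CODE_BLOCK = re.compile(r"<pre[^>]*>.*?</pre>", re.DOTALL | re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")
FENCE = "`" * 3  # markdown code fence marker

def clean_html_to_text(raw_html: str) -> str:
    """Strip HTML markup for RAG chunking while restoring code blocks as fenced markdown."""
    stashed: list[str] = []

    def stash(match: re.Match) -> str:
        code = html.unescape(TAG.sub("", match.group(0))).strip()
        stashed.append(code)
        return f"__CODE_BLOCK_{len(stashed) - 1}__"  # placeholder protects the code from cleanup

    text = CODE_BLOCK.sub(stash, raw_html)
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", text, flags=re.DOTALL | re.IGNORECASE)
    text = html.unescape(TAG.sub(" ", text))   # drop remaining tags, decode &lt; &gt; etc.
    text = re.sub(r"[ \t]+", " ", text)        # collapse prose whitespace (code is still stashed)
    for index, code in enumerate(stashed):
        text = text.replace(f"__CODE_BLOCK_{index}__", f"\n{FENCE}\n{code}\n{FENCE}\n")
    return text.strip()
```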

* feat: Add HTML file support to upload dialog

- Add .html and .htm to accepted file types in AddKnowledgeDialog
- Users can now see and select HTML files in the file picker by default
- HTML files will be processed with tag cleanup and code extraction

Previously, HTML files had to be typed in or dragged manually; now they appear
in the standard file picker alongside other supported formats.

* fix: Prevent HTML extraction path confusion in crawl_results payload

Problem: Setting both 'markdown' and 'html' fields to the same content could trigger
HTML extraction regexes when we want text/markdown extraction.

Solution:
- markdown: Contains cleaned plaintext/markdown content
- html: Empty string to prevent HTML extraction path
- content_type: Proper type (application/pdf, text/markdown, text/plain)

This ensures HTML files (now cleaned to markdown format) use the text file
extraction path with backtick patterns, not HTML regex patterns.
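
A sketch of the resulting payload shape; the markdown, html, and content_type fields follow the description above, while the function name and url key are illustrative:

```python
def build_crawl_result(source_url: str, filename: str, cleaned_text: str) -> dict:
    """Assemble the per-file crawl_results entry handed to CodeExtractionService (sketch)."""
    name = filename.lower()
    if name.endswith(".pdf"):
        content_type = "application/pdf"
    elif name.endswith((".md", ".markdown", ".html", ".htm")):
        content_type = "text/markdown"   # HTML has already been cleaned to markdown by this point
    else:
        content_type = "text/plain"
    return {
        "url": source_url,               # illustrative key
        "markdown": cleaned_text,        # cleaned plaintext/markdown content
        "html": "",                      # empty so the HTML regex extraction path is never taken
        "content_type": content_type,
    }
```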

---------

Co-authored-by: Claude <noreply@anthropic.com>
2025-09-18 20:06:48 +03:00

Archon Main Graphic

Power up your AI coding assistants with your own custom knowledge base and task management as an MCP server

Quick Start · Upgrading · What's Included · Architecture · Troubleshooting


🎯 What is Archon?

Archon is currently in beta! Expect some things not to work 100%, and please feel free to share feedback and contribute fixes or new features. Thank you to everyone for the excitement around Archon already, as well as the bug reports, PRs, and discussions. It's a lot for our small team to get through, but we're committed to addressing everything and making Archon the best tool it can possibly be!

Archon is the command center for AI coding assistants. For you, it's a sleek interface to manage knowledge, context, and tasks for your projects. For the AI coding assistant(s), it's a Model Context Protocol (MCP) server to collaborate on and leverage the same knowledge, context, and tasks. Connect Claude Code, Kiro, Cursor, Windsurf, etc. to give your AI agents access to:

  • Your documentation (crawled websites, uploaded PDFs/docs)
  • Smart search capabilities with advanced RAG strategies
  • Task management integrated with your knowledge base
  • Real-time updates as you add new content and collaborate with your coding assistant on tasks
  • Much more coming soon to build Archon into an integrated environment for all context engineering

This new vision for Archon replaces the old one (the agenteer). Archon used to be the AI agent that builds other agents, and now you can use Archon to do that and more.

It doesn't matter what you're building or whether it's a new or existing codebase - Archon's knowledge and task management capabilities will improve the output of any AI-driven coding.

Quick Start

Prerequisites

Setup Instructions

  1. Clone Repository:

    git clone -b stable https://github.com/coleam00/archon.git
    cd archon

    Note: The stable branch is recommended for using Archon. If you want to contribute or try the latest features, use the main branch with git clone https://github.com/coleam00/archon.git

  2. Environment Configuration:

    cp .env.example .env
    # Edit .env and add your Supabase credentials:
    # SUPABASE_URL=https://your-project.supabase.co
    # SUPABASE_SERVICE_KEY=your-service-key-here
    

    IMPORTANT NOTES:

    • For cloud Supabase: Supabase recently introduced a new type of service role key; use the legacy service role key (the longer one).
    • For local Supabase: set SUPABASE_URL to http://host.docker.internal:8000 (unless you have an IP address set up).
  3. Database Setup: In your Supabase project SQL Editor, copy, paste, and execute the contents of migration/complete_setup.sql

  4. Start Services (choose one):

    Full Docker Mode (Recommended for Normal Archon Usage)

    docker compose up --build -d
    

    This starts all core microservices in Docker:

    • Server: Core API and business logic (Port: 8181)
    • MCP Server: Protocol interface for AI clients (Port: 8051)
    • UI: Web interface (Port: 3737)

    Ports are configurable in your .env as well!

  5. Configure API Keys:

    • Open http://localhost:3737
    • You'll automatically be brought through an onboarding flow to set your API key (OpenAI is default)

Quick Test

Once everything is running:

  1. Test Web Crawling: Go to http://localhost:3737 → Knowledge Base → "Crawl Website" → Enter a doc URL (such as https://ai.pydantic.dev/llms-full.txt)
  2. Test Document Upload: Knowledge Base → Upload a PDF
  3. Test Projects: Projects → Create a new project and add tasks
  4. Integrate with your AI coding assistant: MCP Dashboard → Copy connection config for your AI coding assistant

🛠️ Installing Make (OPTIONAL - For Dev Workflows)

Windows

# Option 1: Using Chocolatey
choco install make

# Option 2: Using Scoop
scoop install make

# Option 3: Using WSL2
wsl --install
# Then in WSL: sudo apt-get install make

macOS

# Make comes pre-installed on macOS
# If needed: brew install make

Linux

# Debian/Ubuntu
sudo apt-get install make

# RHEL/CentOS/Fedora
sudo yum install make

🚀 Quick Command Reference for Make

Command          Description
make dev         Start hybrid dev (backend in Docker, frontend local)
make dev-docker  Everything in Docker
make stop        Stop all services
make test        Run all tests
make lint        Run linters
make install     Install dependencies
make check       Check environment setup
make clean       Remove containers and volumes (with confirmation)

🔄 Database Reset (Start Fresh if Needed)

If you need to completely reset your database and start fresh:

⚠️ Reset Database - This will delete ALL data for Archon!
  1. Run Reset Script: In your Supabase SQL Editor, run the contents of migration/RESET_DB.sql

    ⚠️ WARNING: This will delete all Archon-specific tables and data! Nothing else in your database will be touched.

  2. Rebuild Database: After reset, run migration/complete_setup.sql to create all the tables again.

  3. Restart Services:

    docker compose --profile full up -d
    
  4. Reconfigure:

    • Select your LLM/embedding provider and set the API key again
    • Re-upload any documents or re-crawl websites

The reset script safely removes all tables, functions, triggers, and policies with proper dependency handling.

📚 Documentation

Core Services

Service         Container Name   Default URL              Purpose
Web Interface   archon-ui        http://localhost:3737    Main dashboard and controls
API Service     archon-server    http://localhost:8181    Web crawling, document processing
MCP Server      archon-mcp       http://localhost:8051    Model Context Protocol interface
Agents Service  archon-agents    http://localhost:8052    AI/ML operations, reranking

Upgrading

To upgrade Archon to the latest version:

  1. Pull latest changes:

    git pull
    
  2. Check for migrations: Look in the migration/ folder for any SQL files newer than your last update, and check the file creation dates to determine whether you need to run them. You can run them in the SQL Editor just like you did when you first set up Archon. We are also working on a way to handle these migrations automatically!

  3. Rebuild and restart:

    docker compose up -d --build
    

This is the same command used for initial setup - it rebuilds containers with the latest code and restarts services.

What's Included

🧠 Knowledge Management

  • Smart Web Crawling: Automatically detects and crawls entire documentation sites, sitemaps, and individual pages
  • Document Processing: Upload and process PDFs, Word docs, markdown files, and text documents with intelligent chunking
  • Code Example Extraction: Automatically identifies and indexes code examples from documentation for enhanced search
  • Vector Search: Advanced semantic search with contextual embeddings for precise knowledge retrieval
  • Source Management: Organize knowledge by source, type, and tags for easy filtering

🤖 AI Integration

  • Model Context Protocol (MCP): Connect any MCP-compatible client (Claude Code, Cursor, even non-AI coding assistants like Claude Desktop)
  • MCP Tools: Comprehensive yet simple set of tools for RAG queries, task management, and project operations
  • Multi-LLM Support: Works with OpenAI, Ollama, and Google Gemini models
  • RAG Strategies: Hybrid search, contextual embeddings, and result reranking for optimal AI responses
  • Real-time Streaming: Live responses from AI agents with progress tracking

📋 Project & Task Management

  • Hierarchical Projects: Organize work with projects, features, and tasks in a structured workflow
  • AI-Assisted Creation: Generate project requirements and tasks using integrated AI agents
  • Document Management: Version-controlled documents with collaborative editing capabilities
  • Progress Tracking: Real-time updates and status management across all project activities

🔄 Real-time Collaboration

  • WebSocket Updates: Live progress tracking for crawling, processing, and AI operations
  • Multi-user Support: Collaborative knowledge building and project management
  • Background Processing: Asynchronous operations that don't block the user interface
  • Health Monitoring: Built-in service health checks and automatic reconnection

Architecture

Microservices Structure

Archon uses a true microservices architecture with a clear separation of concerns:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Frontend UI   │    │  Server (API)   │    │   MCP Server    │    │ Agents Service  │
│                 │    │                 │    │                 │    │                 │
│  React + Vite   │◄──►│    FastAPI +    │◄──►│    Lightweight  │◄──►│   PydanticAI    │
│  Port 3737      │    │    SocketIO     │    │    HTTP Wrapper │    │   Port 8052     │
│                 │    │    Port 8181    │    │    Port 8051    │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘    └─────────────────┘
         │                        │                        │                        │
         └────────────────────────┼────────────────────────┼────────────────────────┘
                                  │                        │
                         ┌─────────────────┐               │
                         │    Database     │               │
                         │                 │               │
                         │    Supabase     │◄──────────────┘
                         │    PostgreSQL   │
                         │    PGVector     │
                         └─────────────────┘

Service Responsibilities

Service     Location             Purpose                        Key Features
Frontend    archon-ui-main/      Web interface and dashboard    React, TypeScript, TailwindCSS, Socket.IO client
Server      python/src/server/   Core business logic and APIs   FastAPI, service layer, Socket.IO broadcasts, all ML/AI operations
MCP Server  python/src/mcp/      MCP protocol interface         Lightweight HTTP wrapper, MCP tools, session management
Agents      python/src/agents/   PydanticAI agent hosting       Document and RAG agents, streaming responses

Communication Patterns

  • HTTP-based: All inter-service communication uses HTTP APIs (see the health-check sketch after this list)
  • Socket.IO: Real-time updates from Server to Frontend
  • MCP Protocol: AI clients connect to MCP Server via SSE or stdio
  • No Direct Imports: Services are truly independent with no shared code dependencies
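
Because everything is plain HTTP, you can sanity-check the stack from the host with a few requests. A hedged sketch: only the server's /health endpoint on port 8181 is shown explicitly in Troubleshooting below; the MCP and Agents routes are assumed to match:

```python
import json
import urllib.request

# Default ports from the table above; adjust if you changed them in .env.
SERVICES = {
    "archon-server": 8181,
    "archon-mcp": 8051,     # assumed to expose /health like the server
    "archon-agents": 8052,  # assumed to expose /health like the server
}

def check_health(host: str = "localhost", timeout: float = 2.0) -> dict[str, bool]:
    """Return {service: reachable} by calling each service's /health endpoint."""
    status = {}
    for name, port in SERVICES.items():
        url = f"http://{host}:{port}/health"
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                status[name] = response.status == 200
        except OSError:
            status[name] = False
    return status

if __name__ == "__main__":
    print(json.dumps(check_health(), indent=2))
```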

Key Architectural Benefits

  • Lightweight Containers: Each service contains only required dependencies
  • Independent Scaling: Services can be scaled independently based on load
  • Development Flexibility: Teams can work on different services without conflicts
  • Technology Diversity: Each service uses the best tools for its specific purpose

🔧 Configuring Custom Ports & Hostname

By default, Archon services run on the following ports:

  • archon-ui: 3737
  • archon-server: 8181
  • archon-mcp: 8051
  • archon-agents: 8052
  • archon-docs: 3838 (optional)

Changing Ports

To use custom ports, add these variables to your .env file:

# Service Ports Configuration
ARCHON_UI_PORT=3737
ARCHON_SERVER_PORT=8181
ARCHON_MCP_PORT=8051
ARCHON_AGENTS_PORT=8052
ARCHON_DOCS_PORT=3838

Example: Running on different ports:

ARCHON_SERVER_PORT=8282
ARCHON_MCP_PORT=8151

Configuring Hostname

By default, Archon uses localhost as the hostname. You can configure a custom hostname or IP address by setting the HOST variable in your .env file:

# Hostname Configuration
HOST=localhost  # Default

# Examples of custom hostnames:
HOST=192.168.1.100     # Use specific IP address
HOST=archon.local      # Use custom domain
HOST=myserver.com      # Use public domain

This is useful when:

  • Running Archon on a different machine and accessing it remotely
  • Using a custom domain name for your installation
  • Deploying in a network environment where localhost isn't accessible

After changing hostname or ports:

  1. Restart Docker containers: docker compose down && docker compose --profile full up -d
  2. Access the UI at: http://${HOST}:${ARCHON_UI_PORT}
  3. Update your AI client configuration with the new hostname and MCP port

🔧 Development

Quick Start

# Install dependencies
make install

# Start development (recommended)
make dev        # Backend in Docker, frontend local with hot reload

# Alternative: Everything in Docker
make dev-docker # All services in Docker

# Stop everything (local FE needs to be stopped manually)
make stop

Development Modes

Hybrid Mode (Recommended) - make dev

Best for active development with instant frontend updates:

  • Backend services run in Docker (isolated, consistent)
  • Frontend runs locally with hot module replacement
  • Instant UI updates without Docker rebuilds

Full Docker Mode - make dev-docker

For all services in Docker environment:

  • All services run in Docker containers
  • Better for integration testing
  • Slower frontend updates

Testing & Code Quality

# Run tests
make test       # Run all tests
make test-fe    # Run frontend tests
make test-be    # Run backend tests

# Run linters
make lint       # Lint all code
make lint-fe    # Lint frontend code
make lint-be    # Lint backend code

# Check environment
make check      # Verify environment setup

# Clean up
make clean      # Remove containers and volumes (asks for confirmation)

Viewing Logs

# View logs using Docker Compose directly
docker compose logs -f              # All services
docker compose logs -f archon-server # API server
docker compose logs -f archon-mcp    # MCP server
docker compose logs -f archon-ui     # Frontend

Note: The backend services are configured with --reload flag in their uvicorn commands and have source code mounted as volumes for automatic hot reloading when you make changes.

Troubleshooting

Common Issues and Solutions

Port Conflicts

If you see "Port already in use" errors:

# Check what's using a port (e.g., 3737)
lsof -i :3737

# Stop all containers and local services
make stop

# Change the port in .env

Docker Permission Issues (Linux)

If you encounter permission errors with Docker:

# Add your user to the docker group
sudo usermod -aG docker $USER

# Log out and back in, or run
newgrp docker

Windows-Specific Issues

  • Make not found: Install Make via Chocolatey, Scoop, or WSL2 (see Installing Make)
  • Line ending issues: Configure Git to use LF endings:
    git config --global core.autocrlf false
    

Frontend Can't Connect to Backend

  • Check backend is running: curl http://localhost:8181/health
  • Verify port configuration in .env
  • For custom ports, ensure both ARCHON_SERVER_PORT and VITE_ARCHON_SERVER_PORT are set, as in the example below
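
For example, moving the backend to port 8282 means setting both values in .env (the port number is just an example):

ARCHON_SERVER_PORT=8282
VITE_ARCHON_SERVER_PORT=8282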

Docker Compose Hangs

If docker compose commands hang:

# Reset Docker Compose
docker compose down --remove-orphans
docker system prune -f

# Restart Docker Desktop (if applicable)

Hot Reload Not Working

  • Frontend: Ensure you're running in hybrid mode (make dev) for best HMR experience
  • Backend: Check that volumes are mounted correctly in docker-compose.yml
  • File permissions: On some systems, mounted volumes may have permission issues

📈 Progress

Star History Chart

📄 License

Archon Community License (ACL) v1.2 - see LICENSE file for details.

TL;DR: Archon is free, open, and hackable. Run it, fork it, share it - just don't sell it as-a-service without permission.

Description: Beta release of Archon OS - the knowledge and task management backbone for AI coding assistants.