mirror of
https://github.com/coleam00/Archon.git
synced 2025-12-24 10:49:27 -05:00
* fix: enable code examples extraction for manual file uploads

- Add extract_code_examples parameter to upload API endpoint (default: true)
- Integrate CodeExtractionService into DocumentStorageService.upload_document()
- Add code extraction after document storage with progress tracking
- Map code extraction progress to the 85-95% range in upload progress
- Include code_examples_stored in upload results and logging
- Support extract_code_examples in batch document upload via store_documents()
- Handle code extraction errors gracefully without failing the upload

Fixes issue where code examples were only extracted for URL crawls but not for manual file uploads, despite both using the same underlying CodeExtractionService, which supports both HTML and text formats.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Fix code extraction for uploaded markdown files

- Provide file content in both the html and markdown fields of crawl_results
- This ensures markdown files (.md) use the correct text file extraction path
- The CodeExtractionService checks html_content first for text files
- Fixes issue where uploaded .md files didn't extract code examples properly

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* debug: Add comprehensive logging to trace code extraction issue

- Add detailed debug logging to the upload code extraction flow
- Log the extract_code_examples parameter value
- Log crawl_results structure and content length
- Log progress callbacks from the extraction service
- Log the final extraction count with more context
- Enhanced error logging with full stack traces

This will help identify exactly where the extraction is failing for uploaded files.
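The 85-95% progress mapping mentioned in the first commit can be sketched as a simple linear rescaling. The function name and parameters below are illustrative, not Archon's actual API:

```python
def map_subtask_progress(subtask_pct: float, start: float = 85.0, end: float = 95.0) -> float:
    """Linearly map a sub-task's 0-100% progress into a parent progress range.

    Illustrates the idea of reporting code-extraction progress inside the
    85-95% window of the overall upload progress; not Archon's real signature.
    """
    subtask_pct = max(0.0, min(100.0, subtask_pct))  # clamp out-of-range input
    return start + (end - start) * (subtask_pct / 100.0)
```

With the default window, 0% of extraction reports as 85% overall and 100% as 95%, so the upload bar never jumps backwards when the sub-task starts.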
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Remove invalid start_progress/end_progress parameters

The extract_and_store_code_examples method doesn't accept start_progress and end_progress parameters, causing a TypeError during file uploads.

This was the root cause preventing code extraction from working - the method was failing with a signature mismatch before any extraction logic could run.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Preserve code blocks across PDF page boundaries

PDF extraction was breaking markdown code blocks by inserting page separators:

```python
def hello():
--- Page 2 ---
    return "world"
```

This made code blocks unrecognizable to extraction patterns.

Solution:
- Add _preserve_code_blocks_across_pages() function
- Detect split code blocks using regex pattern matching
- Remove page separators that appear within code blocks
- Apply to both pdfplumber and PyPDF2 extraction paths

Now PDF uploads should properly extract code examples just like markdown files.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Add PDF-specific code extraction for files without markdown delimiters

Root cause: PDFs lose markdown code block delimiters (```) during text extraction, making standard markdown patterns fail to detect code.

Solution:
1. Add _extract_pdf_code_blocks() method with plain-text code detection patterns:
   - Python import blocks and function definitions
   - YAML configuration blocks
   - Shell command sequences
   - Multi-line indented code blocks
2. Add PDF detection logic in _extract_code_blocks_from_documents()
3. Set content_type properly for PDF files in storage service
4.
Add debug logging to PDF text extraction process

This allows extraction of code from PDFs that contain technical documentation with code examples, even when markdown formatting is lost during PDF->text conversion.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Enhanced PDF code extraction to match markdown extraction results

Problem: PDF extraction only found 1 code example vs 9 from the same markdown content

Root cause: PDF extraction patterns were too restrictive and specific

Enhanced solution:
1. **Multi-line code block detection**: Scans for consecutive "code-like" lines
   - Variable assignments, imports, function calls, method calls
   - Includes comments, control flow, YAML keys, shell commands
   - Handles indented continuation lines and empty lines within blocks
2. **Smarter block boundary detection**:
   - Excludes prose lines with narrative indicators
   - Allows natural code block boundaries
   - Preserves context around extracted blocks
3. **Comprehensive pattern coverage**:
   - Python scripts and functions
   - YAML configuration blocks
   - Shell command sequences
   - JavaScript functions

This approach should extract the same ~9 code examples from PDFs as from markdown files, since it detects code patterns without relying on markdown delimiters.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Simplify PDF extraction to section-based approach

Changed from complex line-by-line analysis to a simpler section-based approach:
1. Split PDF content by natural boundaries (paragraphs, page breaks)
2. Score each section for code vs prose indicators
3. Extract sections that score high on code indicators
4.
Add comprehensive logging to debug section classification

Code indicators include:
- Python imports, functions, classes (high weight)
- Variable assignments, method calls (medium weight)
- Package management commands, lambda functions

This should better match the 9 code examples found in the markdown version by treating each logical code segment as a separate extractable block.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Add explicit HTML file detection and extraction path

Problem: HTML files (0 code examples extracted) weren't being routed to HTML extraction

Root cause: HTML files (.html, .htm) weren't explicitly detected, so they fell through to generic extraction logic instead of using the robust HTML code block patterns.

Solution:
1. Add HTML file detection: is_html_file = source_url.endswith(('.html', '.htm'))
2. Add explicit HTML extraction path before fallback logic
3. Set proper content_type: "text/html" for HTML files in storage service
4. Ensure HTML content is passed to _extract_html_code_blocks method

The existing HTML extraction already has comprehensive patterns for:
- <pre><code class="lang-python"> (syntax highlighted)
- <pre><code> (standard)
- Various code highlighting libraries (Prism, highlight.js, etc.)

This should now extract all code blocks from HTML files just like URL crawls do.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Add HTML tag cleanup and proper code extraction for HTML files

Problem: HTML uploads had 0 code examples and contained HTML tags in RAG chunks

Solution:
1.
**HTML Tag Cleanup**: Added a _clean_html_to_text() function that:
   - Preserves code blocks by temporarily replacing them with placeholders
   - Removes all HTML tags, scripts, and styles from prose content
   - Converts HTML structure (headers, paragraphs, lists) to clean text
   - Restores code blocks in markdown format (```language)
   - Cleans HTML entities (<, >, etc.)
2. **Unified Text Processing**: HTML files are now processed as text files since they:
   - Have clean text for RAG chunking (no HTML tags)
   - Have markdown-style code blocks for extraction
   - Use the existing text file extraction path
3. **Content Type Mapping**: Set text/markdown for cleaned HTML files

Result: HTML files now extract code examples like markdown files while providing clean text for RAG without HTML markup pollution.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* feat: Add HTML file support to upload dialog

- Add .html and .htm to accepted file types in AddKnowledgeDialog
- Users can now see and select HTML files in the file picker by default
- HTML files will be processed with tag cleanup and code extraction

Previously HTML files had to be manually typed or dragged; now they appear in the standard file picker alongside other supported formats.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Prevent HTML extraction path confusion in crawl_results payload

Problem: Setting both 'markdown' and 'html' fields to the same content could trigger HTML extraction regexes when we want text/markdown extraction.

Solution:
- markdown: Contains cleaned plaintext/markdown content
- html: Empty string to prevent the HTML extraction path
- content_type: Proper type (application/pdf, text/markdown, text/plain)

This ensures HTML files (now cleaned to markdown format) use the text file extraction path with backtick patterns, not HTML regex patterns.
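The placeholder-based HTML cleanup described in the "HTML Tag Cleanup" commit could be sketched roughly as follows. This is a simplified illustration that only treats `<pre><code>` blocks as code; the real _clean_html_to_text() in Archon may handle more cases:

```python
import re
from html import unescape

def clean_html_to_text(html_src: str) -> str:
    """Sketch of the placeholder approach: stash code blocks, strip tags
    from prose, then restore the code as markdown fences.

    Assumes only <pre><code ...> wraps code; not Archon's actual implementation.
    """
    code_blocks: list[str] = []

    def stash(match: re.Match) -> str:
        # Save the code body and leave a placeholder so tag stripping skips it.
        code_blocks.append(unescape(match.group(1)))
        return f"\x00CODE{len(code_blocks) - 1}\x00"

    text = re.sub(
        r"<pre[^>]*>\s*<code[^>]*>(.*?)</code>\s*</pre>",
        stash, html_src, flags=re.DOTALL | re.IGNORECASE,
    )
    # Drop scripts and styles entirely, then strip remaining tags from prose.
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", text, flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)
    text = unescape(text)  # decode entities like &amp; in the prose
    # Restore stashed code blocks as markdown fences.
    for i, block in enumerate(code_blocks):
        text = text.replace(f"\x00CODE{i}\x00", f"```\n{block}\n```")
    return text
```

Stashing before stripping is what keeps `<` and `>` inside code intact while the surrounding prose loses its markup.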
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
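The section-based PDF extraction described above (split on natural boundaries, score each section for code vs prose indicators, keep high scorers) could be sketched like this. The patterns, weights, and threshold are invented for illustration and are not Archon's actual heuristics:

```python
import re

# Illustrative indicator patterns and weights, not Archon's real rules.
CODE_INDICATORS = [
    (re.compile(r"^\s*(import |from \w+ import |def |class )", re.M), 3),  # high weight
    (re.compile(r"^\s*\w+\s*=\s*\S", re.M), 2),        # variable assignments
    (re.compile(r"\w+\.\w+\(", re.M), 2),              # method calls
    (re.compile(r"^\s*(pip install|npm install)\b", re.M), 2),
]
PROSE_INDICATORS = [
    (re.compile(r"\b(the|this|however|therefore)\b", re.I), 1),  # narrative words
]

def extract_code_sections(text: str, threshold: int = 4) -> list[str]:
    """Split text on blank-line boundaries and keep sections whose
    code score exceeds their prose score by at least `threshold`."""
    sections = re.split(r"\n\s*\n", text)
    kept = []
    for section in sections:
        code_score = sum(w * len(p.findall(section)) for p, w in CODE_INDICATORS)
        prose_score = sum(w * len(p.findall(section)) for p, w in PROSE_INDICATORS)
        if code_score - prose_score >= threshold:
            kept.append(section.strip())
    return kept
```

Treating each blank-line-delimited section as a candidate block is what lets one document yield several separate code examples, matching the "~9 examples" behavior of the markdown path.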