mirror of
https://github.com/coleam00/Archon.git
synced 2025-12-24 10:49:27 -05:00
* fix: enable code examples extraction for manual file uploads

- Add extract_code_examples parameter to upload API endpoint (default: true)
- Integrate CodeExtractionService into DocumentStorageService.upload_document()
- Add code extraction after document storage with progress tracking
- Map code extraction progress to the 85-95% range in upload progress
- Include code_examples_stored in upload results and logging
- Support extract_code_examples in batch document upload via store_documents()
- Handle code extraction errors gracefully without failing the upload

Fixes issue where code examples were only extracted for URL crawls but not for manual file uploads, despite both using the same underlying CodeExtractionService, which supports both HTML and text formats.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Fix code extraction for uploaded markdown files

- Provide file content in both the html and markdown fields of crawl_results
- This ensures markdown files (.md) use the correct text file extraction path
- The CodeExtractionService checks html_content first for text files
- Fixes issue where uploaded .md files didn't extract code examples properly

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* debug: Add comprehensive logging to trace code extraction issue

- Add detailed debug logging to the upload code extraction flow
- Log the extract_code_examples parameter value
- Log crawl_results structure and content length
- Log progress callbacks from the extraction service
- Log the final extraction count with more context
- Enhanced error logging with full stack traces

This will help identify exactly where the extraction is failing for uploaded files.
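The 85-95% progress mapping mentioned in the first commit can be sketched as a simple linear rescaling. The function name and parameters below are illustrative, not Archon's actual API:

```python
def map_subtask_progress(subtask_pct: float, start: float = 85.0, end: float = 95.0) -> float:
    """Linearly map a sub-task's 0-100% progress into a parent progress range.

    Illustrates the idea of reporting code-extraction progress inside the
    85-95% window of the overall upload progress; not Archon's real signature.
    """
    subtask_pct = max(0.0, min(100.0, subtask_pct))  # clamp out-of-range input
    return start + (end - start) * (subtask_pct / 100.0)
```

With the default window, 0% of extraction reports as 85% overall and 100% as 95%, so the upload bar never jumps backwards when the sub-task starts.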
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Remove invalid start_progress/end_progress parameters

The extract_and_store_code_examples method doesn't accept start_progress and end_progress parameters, causing a TypeError during file uploads.

This was the root cause preventing code extraction from working - the method was failing with a signature mismatch before any extraction logic could run.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Preserve code blocks across PDF page boundaries

PDF extraction was breaking markdown code blocks by inserting page separators:

```python
def hello():
--- Page 2 ---
    return "world"
```

This made code blocks unrecognizable to extraction patterns.

Solution:
- Add _preserve_code_blocks_across_pages() function
- Detect split code blocks using regex pattern matching
- Remove page separators that appear within code blocks
- Apply to both pdfplumber and PyPDF2 extraction paths

Now PDF uploads should properly extract code examples just like markdown files.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Add PDF-specific code extraction for files without markdown delimiters

Root cause: PDFs lose markdown code block delimiters (```) during text extraction, making standard markdown patterns fail to detect code.

Solution:
1. Add _extract_pdf_code_blocks() method with plain-text code detection patterns:
   - Python import blocks and function definitions
   - YAML configuration blocks
   - Shell command sequences
   - Multi-line indented code blocks
2. Add PDF detection logic in _extract_code_blocks_from_documents()
3. Set content_type properly for PDF files in storage service
4.
Add debug logging to PDF text extraction process

This allows extraction of code from PDFs that contain technical documentation with code examples, even when markdown formatting is lost during PDF->text conversion.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Enhanced PDF code extraction to match markdown extraction results

Problem: PDF extraction only found 1 code example vs 9 from the same markdown content

Root cause: PDF extraction patterns were too restrictive and specific

Enhanced solution:
1. **Multi-line code block detection**: Scans for consecutive "code-like" lines
   - Variable assignments, imports, function calls, method calls
   - Includes comments, control flow, YAML keys, shell commands
   - Handles indented continuation lines and empty lines within blocks
2. **Smarter block boundary detection**:
   - Excludes prose lines with narrative indicators
   - Allows natural code block boundaries
   - Preserves context around extracted blocks
3. **Comprehensive pattern coverage**:
   - Python scripts and functions
   - YAML configuration blocks
   - Shell command sequences
   - JavaScript functions

This approach should extract the same ~9 code examples from PDFs as from markdown files, since it detects code patterns without relying on markdown delimiters.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Simplify PDF extraction to section-based approach

Changed from complex line-by-line analysis to a simpler section-based approach:
1. Split PDF content by natural boundaries (paragraphs, page breaks)
2. Score each section for code vs prose indicators
3. Extract sections that score high on code indicators
4.
Add comprehensive logging to debug section classification

Code indicators include:
- Python imports, functions, classes (high weight)
- Variable assignments, method calls (medium weight)
- Package management commands, lambda functions

This should better match the 9 code examples found in the markdown version by treating each logical code segment as a separate extractable block.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Add explicit HTML file detection and extraction path

Problem: HTML files (0 code examples extracted) weren't being routed to HTML extraction

Root cause: HTML files (.html, .htm) weren't explicitly detected, so they fell through to generic extraction logic instead of using the robust HTML code block patterns.

Solution:
1. Add HTML file detection: is_html_file = source_url.endswith(('.html', '.htm'))
2. Add explicit HTML extraction path before fallback logic
3. Set proper content_type: "text/html" for HTML files in storage service
4. Ensure HTML content is passed to _extract_html_code_blocks method

The existing HTML extraction already has comprehensive patterns for:
- <pre><code class="lang-python"> (syntax highlighted)
- <pre><code> (standard)
- Various code highlighting libraries (Prism, highlight.js, etc.)

This should now extract all code blocks from HTML files just like URL crawls do.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Add HTML tag cleanup and proper code extraction for HTML files

Problem: HTML uploads had 0 code examples and contained HTML tags in RAG chunks

Solution:
1.
**HTML Tag Cleanup**: Added a _clean_html_to_text() function that:
   - Preserves code blocks by temporarily replacing them with placeholders
   - Removes all HTML tags, scripts, and styles from prose content
   - Converts HTML structure (headers, paragraphs, lists) to clean text
   - Restores code blocks in markdown format (```language)
   - Cleans HTML entities (<, >, etc.)
2. **Unified Text Processing**: HTML files are now processed as text files since they:
   - Have clean text for RAG chunking (no HTML tags)
   - Have markdown-style code blocks for extraction
   - Use the existing text file extraction path
3. **Content Type Mapping**: Set text/markdown for cleaned HTML files

Result: HTML files now extract code examples like markdown files while providing clean text for RAG without HTML markup pollution.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* feat: Add HTML file support to upload dialog

- Add .html and .htm to accepted file types in AddKnowledgeDialog
- Users can now see and select HTML files in the file picker by default
- HTML files will be processed with tag cleanup and code extraction

Previously HTML files had to be manually typed or dragged; now they appear in the standard file picker alongside other supported formats.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Prevent HTML extraction path confusion in crawl_results payload

Problem: Setting both 'markdown' and 'html' fields to the same content could trigger HTML extraction regexes when we want text/markdown extraction.

Solution:
- markdown: Contains cleaned plaintext/markdown content
- html: Empty string to prevent the HTML extraction path
- content_type: Proper type (application/pdf, text/markdown, text/plain)

This ensures HTML files (now cleaned to markdown format) use the text file extraction path with backtick patterns, not HTML regex patterns.
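The placeholder-based HTML cleanup described in the "HTML Tag Cleanup" commit could be sketched roughly as follows. This is a simplified illustration that only treats `<pre><code>` blocks as code; the real _clean_html_to_text() in Archon may handle more cases:

```python
import re
from html import unescape

def clean_html_to_text(html_src: str) -> str:
    """Sketch of the placeholder approach: stash code blocks, strip tags
    from prose, then restore the code as markdown fences.

    Assumes only <pre><code ...> wraps code; not Archon's actual implementation.
    """
    code_blocks: list[str] = []

    def stash(match: re.Match) -> str:
        # Save the code body and leave a placeholder so tag stripping skips it.
        code_blocks.append(unescape(match.group(1)))
        return f"\x00CODE{len(code_blocks) - 1}\x00"

    text = re.sub(
        r"<pre[^>]*>\s*<code[^>]*>(.*?)</code>\s*</pre>",
        stash, html_src, flags=re.DOTALL | re.IGNORECASE,
    )
    # Drop scripts and styles entirely, then strip remaining tags from prose.
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", text, flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)
    text = unescape(text)  # decode entities like &amp; in the prose
    # Restore stashed code blocks as markdown fences.
    for i, block in enumerate(code_blocks):
        text = text.replace(f"\x00CODE{i}\x00", f"```\n{block}\n```")
    return text
```

Stashing before stripping is what keeps `<` and `>` inside code intact while the surrounding prose loses its markup.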
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
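The section-based PDF extraction described above (split on natural boundaries, score each section for code vs prose indicators, keep high scorers) could be sketched like this. The patterns, weights, and threshold are invented for illustration and are not Archon's actual heuristics:

```python
import re

# Illustrative indicator patterns and weights, not Archon's real rules.
CODE_INDICATORS = [
    (re.compile(r"^\s*(import |from \w+ import |def |class )", re.M), 3),  # high weight
    (re.compile(r"^\s*\w+\s*=\s*\S", re.M), 2),        # variable assignments
    (re.compile(r"\w+\.\w+\(", re.M), 2),              # method calls
    (re.compile(r"^\s*(pip install|npm install)\b", re.M), 2),
]
PROSE_INDICATORS = [
    (re.compile(r"\b(the|this|however|therefore)\b", re.I), 1),  # narrative words
]

def extract_code_sections(text: str, threshold: int = 4) -> list[str]:
    """Split text on blank-line boundaries and keep sections whose
    code score exceeds their prose score by at least `threshold`."""
    sections = re.split(r"\n\s*\n", text)
    kept = []
    for section in sections:
        code_score = sum(w * len(p.findall(section)) for p, w in CODE_INDICATORS)
        prose_score = sum(w * len(p.findall(section)) for p, w in PROSE_INDICATORS)
        if code_score - prose_score >= threshold:
            kept.append(section.strip())
    return kept
```

Treating each blank-line-delimited section as a candidate block is what lets one document yield several separate code examples, matching the "~9 examples" behavior of the markdown path.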