Mirror of https://github.com/coleam00/Archon.git (synced 2025-12-23 18:29:18 -05:00)
fix: enable code examples extraction for manual file uploads (#626)
* fix: enable code examples extraction for manual file uploads

  - Add extract_code_examples parameter to upload API endpoint (default: true)
  - Integrate CodeExtractionService into DocumentStorageService.upload_document()
  - Add code extraction after document storage with progress tracking
  - Map code extraction progress to 85-95% range in upload progress
  - Include code_examples_stored in upload results and logging
  - Support extract_code_examples in batch document upload via store_documents()
  - Handle code extraction errors gracefully without failing the upload

  Fixes the issue where code examples were only extracted for URL crawls but not
  for manual file uploads, despite using the same underlying CodeExtractionService
  that supports both HTML and text formats.

* fix: Fix code extraction for uploaded markdown files

  - Provide file content in both html and markdown fields for crawl_results
  - This ensures markdown files (.md) use the correct text file extraction path
  - The CodeExtractionService checks html_content first for text files
  - Fixes issue where uploaded .md files didn't extract code examples properly

* debug: Add comprehensive logging to trace code extraction issue

  - Add detailed debug logging to the upload code extraction flow
  - Log extract_code_examples parameter value
  - Log crawl_results structure and content length
  - Log progress callbacks from extraction service
  - Log final extraction count with more context
  - Enhanced error logging with full stack traces

  This will help identify exactly where the extraction is failing for uploaded files.

* fix: Remove invalid start_progress/end_progress parameters

  The extract_and_store_code_examples method doesn't accept start_progress and
  end_progress parameters, causing a TypeError during file uploads. This was the
  root cause preventing code extraction from working - the method was failing with
  a signature mismatch before any extraction logic could run.

* fix: Preserve code blocks across PDF page boundaries

  PDF extraction was breaking markdown code blocks by inserting page separators:

  ```python
  def hello():
  --- Page 2 ---
      return "world"
  ```

  This made code blocks unrecognizable to extraction patterns. Solution:

  - Add _preserve_code_blocks_across_pages() function
  - Detect split code blocks using regex pattern matching
  - Remove page separators that appear within code blocks
  - Apply to both pdfplumber and PyPDF2 extraction paths

  Now PDF uploads should properly extract code examples just like markdown files.

* fix: Add PDF-specific code extraction for files without markdown delimiters

  Root cause: PDFs lose markdown code block delimiters (```) during text
  extraction, making standard markdown patterns fail to detect code. Solution:

  1. Add _extract_pdf_code_blocks() method with plain-text code detection patterns:
     - Python import blocks and function definitions
     - YAML configuration blocks
     - Shell command sequences
     - Multi-line indented code blocks
  2. Add PDF detection logic in _extract_code_blocks_from_documents()
  3. Set content_type properly for PDF files in storage service
  4. Add debug logging to PDF text extraction process

  This allows extraction of code from PDFs that contain technical documentation
  with code examples, even when markdown formatting is lost during PDF->text
  conversion.

* fix: Enhanced PDF code extraction to match markdown extraction results

  Problem: PDF extraction only found 1 code example vs 9 from the same markdown
  content. Root cause: PDF extraction patterns were too restrictive and specific.
  Enhanced solution:

  1. **Multi-line code block detection**: scans for consecutive "code-like" lines
     - Variable assignments, imports, function calls, method calls
     - Includes comments, control flow, YAML keys, shell commands
     - Handles indented continuation lines and empty lines within blocks
  2. **Smarter block boundary detection**:
     - Excludes prose lines with narrative indicators
     - Allows natural code block boundaries
     - Preserves context around extracted blocks
  3. **Comprehensive pattern coverage**:
     - Python scripts and functions
     - YAML configuration blocks
     - Shell command sequences
     - JavaScript functions

  This approach should extract the same ~9 code examples from PDFs as from
  markdown files, since it detects code patterns without relying on markdown
  delimiters.

* fix: Simplify PDF extraction to section-based approach

  Changed from complex line-by-line analysis to a simpler section-based approach:

  1. Split PDF content by natural boundaries (paragraphs, page breaks)
  2. Score each section for code vs prose indicators
  3. Extract sections that score high on code indicators
  4. Add comprehensive logging to debug section classification

  Code indicators include:

  - Python imports, functions, classes (high weight)
  - Variable assignments, method calls (medium weight)
  - Package management commands, lambda functions

  This should better match the 9 code examples found in the markdown version by
  treating each logical code segment as a separate extractable block.

* fix: Add explicit HTML file detection and extraction path

  Problem: HTML files (0 code examples extracted) weren't being routed to HTML
  extraction. Root cause: HTML files (.html, .htm) weren't explicitly detected,
  so they fell through to generic extraction logic instead of using the robust
  HTML code block patterns. Solution:

  1. Add HTML file detection: is_html_file = source_url.endswith(('.html', '.htm'))
  2. Add explicit HTML extraction path before fallback logic
  3. Set proper content_type: "text/html" for HTML files in storage service
  4. Ensure HTML content is passed to _extract_html_code_blocks method

  The existing HTML extraction already has comprehensive patterns for:

  - <pre><code class="lang-python"> (syntax highlighted)
  - <pre><code> (standard)
  - Various code highlighting libraries (Prism, highlight.js, etc.)

  This should now extract all code blocks from HTML files just like URL crawls do.

* fix: Add HTML tag cleanup and proper code extraction for HTML files

  Problem: HTML uploads had 0 code examples and contained HTML tags in RAG chunks.
  Solution:

  1. **HTML Tag Cleanup**: added _clean_html_to_text() function that:
     - Preserves code blocks by temporarily replacing them with placeholders
     - Removes all HTML tags, scripts, and styles from prose content
     - Converts HTML structure (headers, paragraphs, lists) to clean text
     - Restores code blocks as markdown format (```language)
     - Cleans HTML entities (&lt;, &gt;, etc.)
  2. **Unified Text Processing**: HTML files are now processed as text files since they:
     - Have clean text for RAG chunking (no HTML tags)
     - Have markdown-style code blocks for extraction
     - Use the existing text file extraction path
  3. **Content Type Mapping**: set text/markdown for cleaned HTML files

  Result: HTML files now extract code examples like markdown files while
  providing clean text for RAG without HTML markup pollution.

* feat: Add HTML file support to upload dialog

  - Add .html and .htm to accepted file types in AddKnowledgeDialog
  - Users can now see and select HTML files in the file picker by default
  - HTML files will be processed with tag cleanup and code extraction

  Previously HTML files had to be manually typed or dragged; now they appear in
  the standard file picker alongside other supported formats.

* fix: Prevent HTML extraction path confusion in crawl_results payload

  Problem: setting both 'markdown' and 'html' fields to the same content could
  trigger HTML extraction regexes when we want text/markdown extraction. Solution:

  - markdown: contains cleaned plaintext/markdown content
  - html: empty string to prevent the HTML extraction path
  - content_type: proper type (application/pdf, text/markdown, text/plain)

  This ensures HTML files (now cleaned to markdown format) use the text file
  extraction path with backtick patterns, not HTML regex patterns.

🤖 Generated with [Claude Code](https://claude.ai/code)

---------

Co-authored-by: Claude <noreply@anthropic.com>
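For reference, a minimal client-side sketch of the new upload flag. The host, port, and route path here are assumptions for illustration; only the form fields (file, tags, knowledge_type, extract_code_examples) are taken from the upload_document signature in the diff below.

```python
# Hedged sketch: upload a file with code-example extraction enabled.
# URL and route path are hypothetical; form fields match the diff below.
import httpx

with open("guide.md", "rb") as fh:
    response = httpx.post(
        "http://localhost:8181/api/documents/upload",  # hypothetical base URL/path
        files={"file": ("guide.md", fh, "text/markdown")},
        data={
            "knowledge_type": "technical",
            "tags": '["docs"]',
            "extract_code_examples": "true",  # Form(True) default; send "false" to skip
        },
        timeout=60.0,
    )
response.raise_for_status()
print(response.json())
```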
@@ -260,7 +260,7 @@ export const AddKnowledgeDialog: React.FC<AddKnowledgeDialogProps> = ({
        <input
          id={fileId}
          type="file"
          accept=".txt,.md,.pdf,.doc,.docx"
          accept=".txt,.md,.pdf,.doc,.docx,.html,.htm"
          onChange={(e) => setSelectedFile(e.target.files?.[0] || null)}
          disabled={isProcessing}
          className="absolute inset-0 w-full h-full opacity-0 cursor-pointer disabled:cursor-not-allowed z-10"

@@ -144,6 +144,48 @@ class RagQueryRequest(BaseModel):
    match_count: int = 5


@router.get("/crawl-progress/{progress_id}")
async def get_crawl_progress(progress_id: str):
    """Get crawl progress for polling.

    Returns the current state of a crawl operation.
    Frontend should poll this endpoint to track crawl progress.
    """
    try:
        from ..models.progress_models import create_progress_response
        from ..utils.progress.progress_tracker import ProgressTracker

        # Get progress from the tracker's in-memory storage
        progress_data = ProgressTracker.get_progress(progress_id)
        safe_logfire_info(f"Crawl progress requested | progress_id={progress_id} | found={progress_data is not None}")

        if not progress_data:
            # Return 404 if no progress exists - this is correct behavior
            raise HTTPException(status_code=404, detail={"error": f"No progress found for ID: {progress_id}"})

        # Ensure we have the progress_id in the data
        progress_data["progress_id"] = progress_id

        # Get operation type for proper model selection
        operation_type = progress_data.get("type", "crawl")

        # Create standardized response using Pydantic model
        progress_response = create_progress_response(operation_type, progress_data)

        # Convert to dict with camelCase fields for API response
        response_data = progress_response.model_dump(by_alias=True, exclude_none=True)

        safe_logfire_info(
            f"Progress retrieved | operation_id={progress_id} | status={response_data.get('status')} | "
            f"progress={response_data.get('progress')} | totalPages={response_data.get('totalPages')} | "
            f"processedPages={response_data.get('processedPages')}"
        )

        return response_data
    except Exception as e:
        safe_logfire_error(f"Failed to get crawl progress | error={str(e)} | progress_id={progress_id}")
        raise HTTPException(status_code=500, detail={"error": str(e)})


@router.get("/knowledge-items/sources")
async def get_knowledge_sources():

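A hedged polling sketch for the endpoint above. The base URL, route prefix, and terminal status names are assumptions; the camelCase response fields follow the model_dump(by_alias=True) call in the handler.

```python
# Sketch of a client polling GET /crawl-progress/{progress_id}.
# Base URL/prefix and the terminal status names are assumptions.
import time
import httpx

def poll_progress(progress_id: str, base_url: str = "http://localhost:8181/api") -> dict:
    while True:
        resp = httpx.get(f"{base_url}/crawl-progress/{progress_id}")
        if resp.status_code == 404:
            raise RuntimeError(f"No progress found for ID: {progress_id}")
        resp.raise_for_status()
        data = resp.json()
        # Fields are camelCase because the handler dumps the Pydantic model by alias.
        if data.get("status") in {"completed", "failed", "cancelled"}:  # assumed terminal states
            return data
        time.sleep(1)
```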
@@ -818,6 +860,7 @@ async def upload_document(
    file: UploadFile = File(...),
    tags: str | None = Form(None),
    knowledge_type: str = Form("technical"),
    extract_code_examples: bool = Form(True),
):
    """Upload and process a document with progress tracking."""

@@ -871,7 +914,7 @@ async def upload_document(
        # Upload tasks can be tracked directly since they don't spawn sub-tasks
        upload_task = asyncio.create_task(
            _perform_upload_with_progress(
                progress_id, file_content, file_metadata, tag_list, knowledge_type, tracker
                progress_id, file_content, file_metadata, tag_list, knowledge_type, extract_code_examples, tracker
            )
        )
        # Track the task for cancellation support

@@ -899,7 +942,8 @@ async def _perform_upload_with_progress(
    file_metadata: dict,
    tag_list: list[str],
    knowledge_type: str,
    tracker,
    extract_code_examples: bool,
    tracker: "ProgressTracker",
):
    """Perform document upload with progress tracking using service layer."""
    # Create cancellation check function for document uploads

@@ -978,6 +1022,7 @@ async def _perform_upload_with_progress(
            source_id=source_id,
            knowledge_type=knowledge_type,
            tags=tag_list,
            extract_code_examples=extract_code_examples,
            progress_callback=document_progress_callback,
            cancellation_check=check_upload_cancellation,
        )

@@ -987,10 +1032,11 @@ async def _perform_upload_with_progress(
            await tracker.complete({
                "log": "Document uploaded successfully!",
                "chunks_stored": result.get("chunks_stored"),
                "code_examples_stored": result.get("code_examples_stored", 0),
                "sourceId": result.get("source_id"),
            })
            safe_logfire_info(
                f"Document uploaded successfully | progress_id={progress_id} | source_id={result.get('source_id')} | chunks_stored={result.get('chunks_stored')}"
                f"Document uploaded successfully | progress_id={progress_id} | source_id={result.get('source_id')} | chunks_stored={result.get('chunks_stored')} | code_examples_stored={result.get('code_examples_stored', 0)}"
            )
        else:
            error_msg = result.get("error", "Unknown error")

@@ -291,12 +291,16 @@ class CodeExtractionService:
            # Improved extraction logic - check for text files first, then HTML, then markdown
            code_blocks = []

            # Check if this is a text file (e.g., .txt, .md)
            # Check if this is a text file (e.g., .txt, .md, .html after cleaning) or PDF
            is_text_file = source_url.endswith((
                ".txt",
                ".text",
                ".md",
            )) or "text/plain" in doc.get("content_type", "")
                ".html",
                ".htm",
            )) or "text/plain" in doc.get("content_type", "") or "text/markdown" in doc.get("content_type", "")

            is_pdf_file = source_url.endswith(".pdf") or "application/pdf" in doc.get("content_type", "")

            if is_text_file:
                # For text files, use specialized text extraction

@@ -322,7 +326,19 @@ class CodeExtractionService:
                else:
                    safe_logfire_info(f"⚠️ NO CONTENT for text file | url={source_url}")

            # If not a text file or no code blocks found, try HTML extraction first
            # If this is a PDF file, use specialized PDF extraction
            elif is_pdf_file:
                safe_logfire_info(f"📄 PDF FILE DETECTED | url={source_url}")
                # For PDFs, use the content that should be PDF-extracted text
                pdf_content = html_content if html_content else md
                if pdf_content:
                    safe_logfire_info(f"📝 Using {'HTML' if html_content else 'MARKDOWN'} content for PDF extraction")
                    code_blocks = await self._extract_pdf_code_blocks(pdf_content, source_url)
                    safe_logfire_info(f"📦 PDF extraction complete | found={len(code_blocks)} blocks | url={source_url}")
                else:
                    safe_logfire_info(f"⚠️ NO CONTENT for PDF file | url={source_url}")

            # If not a text file or PDF, or no code blocks found, try HTML extraction as fallback
            if len(code_blocks) == 0 and html_content and not is_text_file:
                safe_logfire_info(
                    f"Trying HTML extraction first | url={source_url} | html_length={len(html_content)}"

@@ -912,6 +928,135 @@ class CodeExtractionService:
        )
        return code_blocks

    async def _extract_pdf_code_blocks(
        self, content: str, url: str
    ) -> list[dict[str, Any]]:
        """
        Extract code blocks from PDF-extracted text that lacks markdown formatting.
        PDFs lose markdown delimiters, so we need to detect code patterns in plain text.

        This uses a much simpler approach - look for distinct code segments separated by prose.
        """
        import re

        safe_logfire_info(f"🔍 PDF CODE EXTRACTION START | url={url} | content_length={len(content)}")

        code_blocks = []
        min_length = await self._get_min_code_length()

        # Split content into paragraphs/sections
        # Use double newlines and page breaks as natural boundaries
        sections = re.split(r'\n\n+|--- Page \d+ ---', content)

        safe_logfire_info(f"📄 Split PDF into {len(sections)} sections")

        for i, section in enumerate(sections):
            section = section.strip()
            if not section or len(section) < 50:  # Skip very short sections
                continue

            # Check if this section looks like code
            if self._is_pdf_section_code_like(section):
                safe_logfire_info(f"🔍 Analyzing section {i} as potential code (length: {len(section)})")

                # Try to detect language
                language = self._detect_language_from_content(section)

                # Clean the content
                cleaned_code = self._clean_code_content(section, language)

                # Check length after cleaning
                if len(cleaned_code) >= min_length:
                    # Validate quality
                    if await self._validate_code_quality(cleaned_code, language):
                        # Get context from adjacent sections
                        context_before = sections[i-1].strip() if i > 0 else ""
                        context_after = sections[i+1].strip() if i < len(sections)-1 else ""

                        safe_logfire_info(f"✅ PDF code section | language={language} | length={len(cleaned_code)}")
                        code_blocks.append({
                            "code": cleaned_code,
                            "language": language,
                            "context_before": context_before,
                            "context_after": context_after,
                            "full_context": f"{context_before}\n\n{cleaned_code}\n\n{context_after}",
                            "source_type": "pdf_section",
                        })
                    else:
                        safe_logfire_info(f"❌ PDF section failed validation | language={language}")
                else:
                    safe_logfire_info(f"❌ PDF section too short after cleaning: {len(cleaned_code)} < {min_length}")
            else:
                safe_logfire_info(f"📝 Section {i} identified as prose/documentation")

        safe_logfire_info(f"🔍 PDF CODE EXTRACTION COMPLETE | total_blocks={len(code_blocks)} | url={url}")
        return code_blocks

    def _is_pdf_section_code_like(self, section: str) -> bool:
        """
        Determine if a PDF section contains code rather than prose.
        """
        import re

        # Count code indicators vs prose indicators
        code_score = 0
        prose_score = 0

        # Code indicators (higher weight for stronger indicators)
        code_patterns = [
            (r'\bfrom \w+(?:\.\w+)* import\b', 3),  # Python imports (strong)
            (r'\bdef \w+\s*\(', 3),  # Function definitions (strong)
            (r'\bclass \w+\s*[\(:]', 3),  # Class definitions (strong)
            (r'\w+\s*=\s*\w+\(', 2),  # Function calls assigned (medium)
            (r'\w+\s*=\s*\[.*\]', 2),  # List assignments (medium)
            (r'\w+\.\w+\(', 2),  # Method calls (medium)
            (r'^\s*#[^#]', 1),  # Single-line comments (weak)
            (r'\bpip install\b', 2),  # Package management (medium)
            (r'\bpytest\b', 2),  # Testing commands (medium)
            (r'\bgit clone\b', 2),  # Git commands (medium)
            (r':\s*\n\s+\w+:', 2),  # YAML structure (medium)
            (r'\blambda\s+\w+:', 2),  # Lambda functions (medium)
        ]

        # Prose indicators
        prose_patterns = [
            (r'\b(the|this|that|these|those|are|is|was|were|will|would|should|could|have|has|had)\b', 1),
            (r'[.!?]\s+[A-Z]', 2),  # Sentence endings
            (r'\b(however|therefore|furthermore|moreover|additionally|specifically)\b', 2),
            (r'\bTable of Contents\b', 3),
            (r'\bAPI Reference\b', 2),
        ]

        # Count patterns
        for pattern, weight in code_patterns:
            matches = len(re.findall(pattern, section, re.IGNORECASE | re.MULTILINE))
            code_score += matches * weight

        for pattern, weight in prose_patterns:
            matches = len(re.findall(pattern, section, re.IGNORECASE | re.MULTILINE))
            prose_score += matches * weight

        # Additional checks
        lines = section.split('\n')
        non_empty_lines = [line.strip() for line in lines if line.strip()]

        if not non_empty_lines:
            return False

        # If section is mostly single words or very short lines, probably not code
        short_lines = sum(1 for line in non_empty_lines if len(line.split()) < 3)
        if len(non_empty_lines) > 0 and short_lines / len(non_empty_lines) > 0.7:
            prose_score += 3

        # If section has common code structure indicators
        if any('(' in line and ')' in line for line in non_empty_lines[:5]):
            code_score += 2

        safe_logfire_info(f"📊 Section scoring: code_score={code_score}, prose_score={prose_score}")

        # Code-like if code score significantly higher than prose score
        return code_score > prose_score and code_score > 2

    def _detect_language_from_content(self, code: str) -> str:
        """
        Try to detect programming language from code content.

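To make the section-scoring heuristic above concrete, here is a trimmed, self-contained illustration. The patterns and weights are a small subset of those in _is_pdf_section_code_like, and the sample text and module name are made up; this is a sketch of the idea, not the service's implementation.

```python
# Toy illustration of the section-based classification: split on blank lines /
# page markers, then compare weighted code indicators against prose indicators.
import re

def looks_like_code(section: str) -> bool:
    code_score = 3 * len(re.findall(r'\bfrom \w+ import\b|\bdef \w+\s*\(', section))
    code_score += 2 * len(re.findall(r'\w+\.\w+\(|\w+\s*=\s*\w+\(', section))
    prose_score = len(re.findall(r'\b(the|this|is|are|was)\b', section, re.IGNORECASE))
    prose_score += 2 * len(re.findall(r'[.!?]\s+[A-Z]', section))
    return code_score > prose_score and code_score > 2

text = (
    "This page explains how the client is configured.\n\n"
    "from archon import Client\nclient = Client()\nclient.sync()"  # hypothetical snippet
)
sections = [s.strip() for s in re.split(r'\n\n+|--- Page \d+ ---', text) if s.strip()]
print([looks_like_code(s) for s in sections])  # [False, True]
```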
@@ -24,6 +24,7 @@ class DocumentStorageService(BaseStorageService):
        source_id: str,
        knowledge_type: str = "documentation",
        tags: list[str] | None = None,
        extract_code_examples: bool = True,
        progress_callback: Any | None = None,
        cancellation_check: Any | None = None,
    ) -> tuple[bool, dict[str, Any]]:

@@ -36,7 +37,9 @@ class DocumentStorageService(BaseStorageService):
            source_id: Source identifier
            knowledge_type: Type of knowledge
            tags: Optional list of tags
            extract_code_examples: Whether to extract code examples from the document
            progress_callback: Optional callback for progress
            cancellation_check: Optional function to check for cancellation

        Returns:
            Tuple of (success, result_dict)

@@ -145,10 +148,65 @@ class DocumentStorageService(BaseStorageService):
                cancellation_check=cancellation_check,
            )

            # Extract code examples if requested
            code_examples_count = 0
            if extract_code_examples and len(chunks) > 0:
                try:
                    await report_progress("Extracting code examples...", 85)

                    logger.info(f"🔍 DEBUG: Starting code extraction for {filename} | extract_code_examples={extract_code_examples}")

                    # Import code extraction service
                    from ..crawling.code_extraction_service import CodeExtractionService

                    code_service = CodeExtractionService(self.supabase_client)

                    # Create crawl_results format expected by code extraction service
                    # markdown: cleaned plaintext (HTML->markdown for HTML files, raw content otherwise)
                    # html: empty string to prevent HTML extraction path confusion
                    # content_type: proper type to guide extraction method selection
                    crawl_results = [{
                        "url": doc_url,
                        "markdown": file_content,  # Cleaned plaintext/markdown content
                        "html": "",  # Empty to prevent HTML extraction path
                        "content_type": "application/pdf" if filename.lower().endswith('.pdf') else (
                            "text/markdown" if filename.lower().endswith(('.html', '.htm', '.md')) else "text/plain"
                        )
                    }]

                    logger.info(f"🔍 DEBUG: Created crawl_results with url={doc_url}, content_length={len(file_content)}")

                    # Create progress callback for code extraction
                    async def code_progress_callback(data: dict):
                        logger.info(f"🔍 DEBUG: Code extraction progress: {data}")
                        if progress_callback:
                            # Map code extraction progress (0-100) to our remaining range (85-95)
                            raw_progress = data.get("progress", data.get("percentage", 0))
                            mapped_progress = 85 + (raw_progress / 100.0) * 10  # 85% to 95%
                            message = data.get("log", "Extracting code examples...")
                            await progress_callback(message, int(mapped_progress))

                    logger.info(f"🔍 DEBUG: About to call extract_and_store_code_examples...")
                    code_examples_count = await code_service.extract_and_store_code_examples(
                        crawl_results=crawl_results,
                        url_to_full_document=url_to_full_document,
                        source_id=source_id,
                        progress_callback=code_progress_callback,
                        cancellation_check=cancellation_check,
                    )

                    logger.info(f"🔍 DEBUG: Code extraction completed: {code_examples_count} code examples found for {filename}")

                except Exception as e:
                    # Log error with full traceback but don't fail the entire upload
                    logger.error(f"Code extraction failed for {filename}: {e}", exc_info=True)
                    code_examples_count = 0

            await report_progress("Document upload completed!", 100)

            result = {
                "chunks_stored": len(chunks),
                "code_examples_stored": code_examples_count,
                "total_word_count": total_word_count,
                "source_id": source_id,
                "filename": filename,

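The 85-95% band used by code_progress_callback above is a straight linear rescale of the extraction service's 0-100 progress. A standalone restatement of that arithmetic, for clarity:

```python
# Restatement of the progress mapping used above: the extraction service
# reports 0-100, which the upload flow squeezes into its 85-95 band.
def map_code_extraction_progress(raw_progress: float) -> int:
    return int(85 + (raw_progress / 100.0) * 10)

assert map_code_extraction_progress(0) == 85
assert map_code_extraction_progress(50) == 90
assert map_code_extraction_progress(100) == 95
```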
@@ -156,10 +214,11 @@ class DocumentStorageService(BaseStorageService):

            span.set_attribute("success", True)
            span.set_attribute("chunks_stored", len(chunks))
            span.set_attribute("code_examples_stored", code_examples_count)
            span.set_attribute("total_word_count", total_word_count)

            logger.info(
                f"Document upload completed successfully: filename={filename}, chunks_stored={len(chunks)}, total_word_count={total_word_count}"
                f"Document upload completed successfully: filename={filename}, chunks_stored={len(chunks)}, code_examples_stored={code_examples_count}, total_word_count={total_word_count}"
            )

            return True, result

@@ -192,6 +251,7 @@ class DocumentStorageService(BaseStorageService):
                source_id=doc.get("source_id", "upload"),
                knowledge_type=doc.get("knowledge_type", "documentation"),
                tags=doc.get("tags"),
                extract_code_examples=doc.get("extract_code_examples", True),
                progress_callback=kwargs.get("progress_callback"),
                cancellation_check=kwargs.get("cancellation_check"),
            )

@@ -36,6 +36,125 @@ from ..config.logfire_config import get_logger, logfire
logger = get_logger(__name__)


def _preserve_code_blocks_across_pages(text: str) -> str:
    """
    Fix code blocks that were split across PDF page boundaries.

    PDFs often break markdown code blocks with page headers like:
    ```python
    def hello():
    --- Page 2 ---
        return "world"
    ```

    This function rejoins split code blocks by removing page separators
    that appear within code blocks.
    """
    import re

    # Pattern to match page separators that split code blocks
    # Look for: ``` [content] --- Page N --- [content] ```
    page_break_in_code_pattern = r'(```\w*[^\n]*\n(?:[^`]|`(?!``))*)(\n--- Page \d+ ---\n)((?:[^`]|`(?!``))*)```'

    # Keep merging until no more splits are found
    while True:
        matches = list(re.finditer(page_break_in_code_pattern, text, re.DOTALL))
        if not matches:
            break

        # Replace each match by removing the page separator
        for match in reversed(matches):  # Reverse to maintain positions
            before_page_break = match.group(1)
            page_separator = match.group(2)
            after_page_break = match.group(3)

            # Rejoin the code block without the page separator
            rejoined = f"{before_page_break}\n{after_page_break}```"
            text = text[:match.start()] + rejoined + text[match.end():]

    return text


def _clean_html_to_text(html_content: str) -> str:
    """
    Clean HTML tags and convert to plain text suitable for RAG.
    Preserves code blocks and important structure while removing markup.
    """
    import re

    # First preserve code blocks with their content before general cleaning
    # This ensures code blocks remain intact for extraction
    code_blocks = []

    # Find and temporarily replace code blocks to preserve them
    code_patterns = [
        r'<pre><code[^>]*>(.*?)</code></pre>',
        r'<code[^>]*>(.*?)</code>',
        r'<pre[^>]*>(.*?)</pre>',
    ]

    processed_html = html_content
    placeholder_map = {}

    for pattern in code_patterns:
        matches = list(re.finditer(pattern, processed_html, re.DOTALL | re.IGNORECASE))
        for i, match in enumerate(reversed(matches)):  # Reverse to maintain positions
            # Extract code content and clean HTML entities
            code_content = match.group(1)
            # Clean HTML entities and span tags from code
            code_content = re.sub(r'<span[^>]*>', '', code_content)
            code_content = re.sub(r'</span>', '', code_content)
            code_content = re.sub(r'&lt;', '<', code_content)
            code_content = re.sub(r'&gt;', '>', code_content)
            code_content = re.sub(r'&amp;', '&', code_content)
            code_content = re.sub(r'&quot;', '"', code_content)
            code_content = re.sub(r'&#39;', "'", code_content)

            # Create placeholder
            placeholder = f"__CODE_BLOCK_{len(placeholder_map)}__"
            placeholder_map[placeholder] = code_content.strip()

            # Replace in HTML
            processed_html = processed_html[:match.start()] + placeholder + processed_html[match.end():]

    # Now clean all remaining HTML tags
    # Remove script and style content entirely
    processed_html = re.sub(r'<script[^>]*>.*?</script>', '', processed_html, flags=re.DOTALL | re.IGNORECASE)
    processed_html = re.sub(r'<style[^>]*>.*?</style>', '', processed_html, flags=re.DOTALL | re.IGNORECASE)

    # Convert common HTML elements to readable text
    # Headers
    processed_html = re.sub(r'<h[1-6][^>]*>(.*?)</h[1-6]>', r'\n\n\1\n\n', processed_html, flags=re.DOTALL | re.IGNORECASE)
    # Paragraphs
    processed_html = re.sub(r'<p[^>]*>(.*?)</p>', r'\1\n\n', processed_html, flags=re.DOTALL | re.IGNORECASE)
    # Line breaks
    processed_html = re.sub(r'<br\s*/?>', '\n', processed_html, flags=re.IGNORECASE)
    # List items
    processed_html = re.sub(r'<li[^>]*>(.*?)</li>', r'• \1\n', processed_html, flags=re.DOTALL | re.IGNORECASE)

    # Remove all remaining HTML tags
    processed_html = re.sub(r'<[^>]+>', '', processed_html)

    # Clean up HTML entities
    processed_html = re.sub(r'&nbsp;', ' ', processed_html)
    processed_html = re.sub(r'&lt;', '<', processed_html)
    processed_html = re.sub(r'&gt;', '>', processed_html)
    processed_html = re.sub(r'&amp;', '&', processed_html)
    processed_html = re.sub(r'&quot;', '"', processed_html)
    processed_html = re.sub(r'&#39;', "'", processed_html)
    processed_html = re.sub(r'&#x27;', "'", processed_html)

    # Restore code blocks
    for placeholder, code_content in placeholder_map.items():
        processed_html = processed_html.replace(placeholder, f"\n\n```\n{code_content}\n```\n\n")

    # Clean up excessive whitespace
    processed_html = re.sub(r'\n\s*\n\s*\n', '\n\n', processed_html)  # Max 2 consecutive newlines
    processed_html = re.sub(r'[ \t]+', ' ', processed_html)  # Multiple spaces to single space

    return processed_html.strip()


def extract_text_from_document(file_content: bytes, filename: str, content_type: str) -> str:
    """
    Extract text from various document formats.

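As a quick sanity check of the page-separator pattern defined in _preserve_code_blocks_across_pages above, the docstring example can be run through the same regex. This sketch copies the pattern verbatim and applies a single rejoin; it does not import the project module.

```python
# Self-contained check of the page-break pattern (copied from the function above).
import re

pattern = r'(```\w*[^\n]*\n(?:[^`]|`(?!``))*)(\n--- Page \d+ ---\n)((?:[^`]|`(?!``))*)```'
raw = '```python\ndef hello():\n--- Page 2 ---\n    return "world"\n```'

match = re.search(pattern, raw, re.DOTALL)
fixed = raw[:match.start()] + f"{match.group(1)}\n{match.group(3)}```" + raw[match.end():]
assert "--- Page 2 ---" not in fixed  # separator removed from inside the block
print(fixed)
# ```python
# def hello():
#     return "world"
# ```
```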
@@ -64,6 +183,14 @@ def extract_text_from_document(file_content: bytes, filename: str, content_type:
    ] or filename.lower().endswith((".docx", ".doc")):
        return extract_text_from_docx(file_content)

    # HTML files - clean tags and extract text
    elif content_type == "text/html" or filename.lower().endswith((".html", ".htm")):
        # Decode HTML and clean tags for RAG
        html_text = file_content.decode("utf-8", errors="ignore").strip()
        if not html_text:
            raise ValueError(f"The file {filename} appears to be empty.")
        return _clean_html_to_text(html_text)

    # Text files (markdown, txt, etc.)
    elif content_type.startswith("text/") or filename.lower().endswith((
        ".txt",

@@ -126,7 +253,22 @@ def extract_text_from_pdf(file_content: bytes) -> str:

        # If pdfplumber got good results, use them
        if text_content and len("\n".join(text_content).strip()) > 100:
            return "\n\n".join(text_content)
            combined_text = "\n\n".join(text_content)
            logger.info(f"🔍 PDF DEBUG: Extracted {len(text_content)} pages, total length: {len(combined_text)}")
            logger.info(f"🔍 PDF DEBUG: First 500 chars: {repr(combined_text[:500])}")

            # Check for backticks before and after processing
            backtick_count_before = combined_text.count("```")
            logger.info(f"🔍 PDF DEBUG: Backticks found before processing: {backtick_count_before}")

            processed_text = _preserve_code_blocks_across_pages(combined_text)
            backtick_count_after = processed_text.count("```")
            logger.info(f"🔍 PDF DEBUG: Backticks found after processing: {backtick_count_after}")

            if backtick_count_after > 0:
                logger.info(f"🔍 PDF DEBUG: Sample after processing: {repr(processed_text[:1000])}")

            return processed_text

    except Exception as e:
        logfire.warning(f"pdfplumber extraction failed: {e}, trying PyPDF2")

@@ -147,7 +289,8 @@ def extract_text_from_pdf(file_content: bytes) -> str:
                continue

        if text_content:
            return "\n\n".join(text_content)
            combined_text = "\n\n".join(text_content)
            return _preserve_code_blocks_across_pages(combined_text)
        else:
            raise ValueError(
                "No text extracted from PDF: file may be empty, images-only, "