fix: enable code example extraction for manual file uploads (#626)

* fix: enable code example extraction for manual file uploads

- Add extract_code_examples parameter to upload API endpoint (default: true)
- Integrate CodeExtractionService into DocumentStorageService.upload_document()
- Add code extraction after document storage with progress tracking
- Map code extraction progress to 85-95% range in upload progress
- Include code_examples_stored in upload results and logging
- Support extract_code_examples in batch document upload via store_documents()
- Handle code extraction errors gracefully without failing upload

Fixes an issue where code examples were extracted only for URL crawls
and not for manual file uploads, even though both paths use the same
underlying CodeExtractionService, which supports both HTML and text formats.
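
For illustration, a minimal client-side sketch of the new form field (the endpoint path, host, and response shape are assumptions; only the form parameter names come from this change):

```python
import requests

# Hypothetical URL - the actual route prefix is not shown in this diff.
UPLOAD_URL = "http://localhost:8181/api/documents/upload"

with open("guide.md", "rb") as f:
    resp = requests.post(
        UPLOAD_URL,
        files={"file": ("guide.md", f, "text/markdown")},
        data={
            "knowledge_type": "technical",
            "extract_code_examples": "true",  # new field; defaults to true when omitted
        },
    )
resp.raise_for_status()
print(resp.status_code, resp.json())
```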

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Fix code extraction for uploaded markdown files

- Provide file content in both html and markdown fields for crawl_results
- This ensures markdown files (.md) use the correct text file extraction path
- The CodeExtractionService checks html_content first for text files
- Fixes an issue where code examples weren't extracted properly from uploaded .md files

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* debug: Add comprehensive logging to trace code extraction issue

- Add detailed debug logging to upload code extraction flow
- Log extract_code_examples parameter value
- Log crawl_results structure and content length
- Log progress callbacks from extraction service
- Log final extraction count with more context
- Enhanced error logging with full stack traces

This will help identify exactly where the extraction is failing for uploaded files.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Remove invalid start_progress/end_progress parameters

The extract_and_store_code_examples method doesn't accept start_progress
and end_progress parameters, causing TypeError during file uploads.

This was the root cause preventing code extraction from working - the
method was failing with a signature mismatch before any extraction logic
could run.
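
For clarity, a sketch of the corrected call (it mirrors the call added in DocumentStorageService later in this diff); the removed keyword arguments, with illustrative values, are shown commented out:

```python
# Before (raised TypeError - the method does not accept these parameters):
# code_examples_count = await code_service.extract_and_store_code_examples(
#     ..., start_progress=85, end_progress=95,
# )

# After - the 85-95% range is applied inside the upload-side progress callback instead:
code_examples_count = await code_service.extract_and_store_code_examples(
    crawl_results=crawl_results,
    url_to_full_document=url_to_full_document,
    source_id=source_id,
    progress_callback=code_progress_callback,  # maps extraction progress into 85-95%
    cancellation_check=cancellation_check,
)
```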

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Preserve code blocks across PDF page boundaries

PDF extraction was breaking markdown code blocks by inserting page separators:

```python
def hello():
--- Page 2 ---
    return "world"
```

This made code blocks unrecognizable to extraction patterns.

Solution:
- Add _preserve_code_blocks_across_pages() function
- Detect split code blocks using regex pattern matching
- Remove page separators that appear within code blocks
- Apply to both pdfplumber and PyPDF2 extraction paths

Now PDF uploads should properly extract code examples just like markdown files.
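
A quick sketch of the intended behavior (assuming the helper added in the document-extraction module below is importable here):

```python
broken = (
    "```python\n"
    "def hello():\n"
    "--- Page 2 ---\n"
    '    return "world"\n'
    "```"
)

fixed = _preserve_code_blocks_across_pages(broken)
# The page separator inside the fenced block is removed, so the block becomes
# recognizable to the markdown extraction patterns again:
# ```python
# def hello():
#     return "world"
# ```
```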

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Add PDF-specific code extraction for files without markdown delimiters

Root cause: PDFs lose markdown code block delimiters (```) during text extraction,
making standard markdown patterns fail to detect code.

Solution:
1. Add _extract_pdf_code_blocks() method with plain-text code detection patterns:
   - Python import blocks and function definitions
   - YAML configuration blocks
   - Shell command sequences
   - Multi-line indented code blocks

2. Add PDF detection logic in _extract_code_blocks_from_documents()
3. Set content_type properly for PDF files in storage service
4. Add debug logging to PDF text extraction process

This allows extraction of code from PDFs that contain technical documentation
with code examples, even when markdown formatting is lost during PDF->text conversion.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Enhanced PDF code extraction to match markdown extraction results

Problem: PDF extraction found only 1 code example vs. 9 from the same markdown content
Root cause: PDF extraction patterns were too restrictive and specific

Enhanced solution:
1. **Multi-line code block detection**: Scans for consecutive "code-like" lines
   - Variable assignments, imports, function calls, method calls
   - Includes comments, control flow, YAML keys, shell commands
   - Handles indented continuation lines and empty lines within blocks

2. **Smarter block boundary detection**:
   - Excludes prose lines with narrative indicators
   - Allows natural code block boundaries
   - Preserves context around extracted blocks

3. **Comprehensive pattern coverage**:
   - Python scripts and functions
   - YAML configuration blocks
   - Shell command sequences
   - JavaScript functions

This approach should extract the same ~9 code examples from PDFs as from
markdown files, since it detects code patterns without relying on markdown delimiters.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Simplify PDF extraction to section-based approach

Changed from complex line-by-line analysis to a simpler section-based approach:

1. Split PDF content by natural boundaries (paragraphs, page breaks)
2. Score each section for code vs prose indicators
3. Extract sections that score high on code indicators
4. Add comprehensive logging to debug section classification

Code indicators include:
- Python imports, functions, classes (high weight)
- Variable assignments, method calls (medium weight)
- Package management commands, lambda functions

This should better match the 9 code examples found in the markdown version
by treating each logical code segment as a separate extractable block.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Add explicit HTML file detection and extraction path

Problem: HTML files (0 code examples extracted) weren't being routed to HTML extraction

Root cause: HTML files (.html, .htm) weren't explicitly detected, so they fell through
to generic extraction logic instead of using the robust HTML code block patterns.

Solution:
1. Add HTML file detection: is_html_file = source_url.endswith(('.html', '.htm'))
2. Add explicit HTML extraction path before fallback logic
3. Set proper content_type: "text/html" for HTML files in storage service
4. Ensure HTML content is passed to _extract_html_code_blocks method

The existing HTML extraction already has comprehensive patterns for:
- <pre><code class="lang-python"> (syntax highlighted)
- <pre><code> (standard)
- Various code highlighting libraries (Prism, highlight.js, etc.)

This should now extract all code blocks from HTML files just like URL crawls do.
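
For context, a simplified sketch of the kind of pattern this path relies on (illustrative only; the real _extract_html_code_blocks covers many more variants and highlighting libraries):

```python
import re

# Matches <pre><code class="lang-python"> / "language-python" style blocks and
# captures the language hint plus the code body. Illustrative only.
PRE_CODE_LANG = re.compile(
    r'<pre[^>]*>\s*<code[^>]*class="[^"]*lang(?:uage)?-(\w+)[^"]*"[^>]*>(.*?)</code>\s*</pre>',
    re.DOTALL | re.IGNORECASE,
)

sample = '<pre><code class="language-python">print("hello")</code></pre>'
for language, body in PRE_CODE_LANG.findall(sample):
    print(language, body)  # -> python print("hello")
```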

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Add HTML tag cleanup and proper code extraction for HTML files

Problem: HTML uploads produced 0 code examples and their RAG chunks contained raw HTML tags

Solution:
1. **HTML Tag Cleanup**: Added _clean_html_to_text() function that:
   - Preserves code blocks by temporarily replacing them with placeholders
   - Removes all HTML tags, scripts, styles from prose content
   - Converts HTML structure (headers, paragraphs, lists) to clean text
   - Restores code blocks as markdown format (```language)
   - Cleans HTML entities (&lt;, &gt;, etc.)

3. **Unified Text Processing**: HTML files are now processed as text files since they:
   - Have clean text for RAG chunking (no HTML tags)
   - Have markdown-style code blocks for extraction
   - Use existing text file extraction path

3. **Content Type Mapping**: Set text/markdown for cleaned HTML files

Result: HTML files now extract code examples like markdown files while providing
clean text for RAG without HTML markup pollution.
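
A rough before/after sketch of the cleanup (output shape is illustrative; _clean_html_to_text is the helper added in this diff):

```python
html = '<h1>Setup</h1><p>Run this first:</p><pre><code>pip install requests</code></pre>'

text = _clean_html_to_text(html)
# Roughly:
# Setup
#
# Run this first:
#
# ```
# pip install requests
# ```
```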

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* feat: Add HTML file support to upload dialog

- Add .html and .htm to accepted file types in AddKnowledgeDialog
- Users can now see and select HTML files in the file picker by default
- HTML files will be processed with tag cleanup and code extraction

Previously, HTML files had to be manually typed or dragged; now they appear
in the standard file picker alongside other supported formats.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Prevent HTML extraction path confusion in crawl_results payload

Problem: Setting both 'markdown' and 'html' fields to the same content could trigger
HTML extraction regexes when we want text/markdown extraction.

Solution:
- markdown: Contains cleaned plaintext/markdown content
- html: Empty string to prevent HTML extraction path
- content_type: Proper type (application/pdf, text/markdown, text/plain)

This ensures HTML files (now cleaned to markdown format) use the text file
extraction path with backtick patterns, not HTML regex patterns.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

---------

Co-authored-by: Claude <noreply@anthropic.com>
Committed by DIY Smart Code on 2025-09-18 19:06:48 +02:00 (committed by GitHub)
commit 6abb8831f7, parent 85bd6bc012
5 changed files with 404 additions and 10 deletions

View File

@@ -260,7 +260,7 @@ export const AddKnowledgeDialog: React.FC<AddKnowledgeDialogProps> = ({
<input
id={fileId}
type="file"
accept=".txt,.md,.pdf,.doc,.docx"
accept=".txt,.md,.pdf,.doc,.docx,.html,.htm"
onChange={(e) => setSelectedFile(e.target.files?.[0] || null)}
disabled={isProcessing}
className="absolute inset-0 w-full h-full opacity-0 cursor-pointer disabled:cursor-not-allowed z-10"

View File

@@ -144,6 +144,48 @@ class RagQueryRequest(BaseModel):
match_count: int = 5
@router.get("/crawl-progress/{progress_id}")
async def get_crawl_progress(progress_id: str):
"""Get crawl progress for polling.
Returns the current state of a crawl operation.
Frontend should poll this endpoint to track crawl progress.
"""
try:
from ..models.progress_models import create_progress_response
from ..utils.progress.progress_tracker import ProgressTracker
# Get progress from the tracker's in-memory storage
progress_data = ProgressTracker.get_progress(progress_id)
safe_logfire_info(f"Crawl progress requested | progress_id={progress_id} | found={progress_data is not None}")
if not progress_data:
# Return 404 if no progress exists - this is correct behavior
raise HTTPException(status_code=404, detail={"error": f"No progress found for ID: {progress_id}"})
# Ensure we have the progress_id in the data
progress_data["progress_id"] = progress_id
# Get operation type for proper model selection
operation_type = progress_data.get("type", "crawl")
# Create standardized response using Pydantic model
progress_response = create_progress_response(operation_type, progress_data)
# Convert to dict with camelCase fields for API response
response_data = progress_response.model_dump(by_alias=True, exclude_none=True)
safe_logfire_info(
f"Progress retrieved | operation_id={progress_id} | status={response_data.get('status')} | "
f"progress={response_data.get('progress')} | totalPages={response_data.get('totalPages')} | "
f"processedPages={response_data.get('processedPages')}"
)
return response_data
except Exception as e:
safe_logfire_error(f"Failed to get crawl progress | error={str(e)} | progress_id={progress_id}")
raise HTTPException(status_code=500, detail={"error": str(e)})
@router.get("/knowledge-items/sources")
async def get_knowledge_sources():
@@ -818,6 +860,7 @@ async def upload_document(
file: UploadFile = File(...),
tags: str | None = Form(None),
knowledge_type: str = Form("technical"),
extract_code_examples: bool = Form(True),
):
"""Upload and process a document with progress tracking."""
@@ -871,7 +914,7 @@ async def upload_document(
# Upload tasks can be tracked directly since they don't spawn sub-tasks
upload_task = asyncio.create_task(
_perform_upload_with_progress(
progress_id, file_content, file_metadata, tag_list, knowledge_type, tracker
progress_id, file_content, file_metadata, tag_list, knowledge_type, extract_code_examples, tracker
)
)
# Track the task for cancellation support
@@ -899,7 +942,8 @@ async def _perform_upload_with_progress(
file_metadata: dict,
tag_list: list[str],
knowledge_type: str,
tracker,
extract_code_examples: bool,
tracker: "ProgressTracker",
):
"""Perform document upload with progress tracking using service layer."""
# Create cancellation check function for document uploads
@@ -978,6 +1022,7 @@ async def _perform_upload_with_progress(
source_id=source_id,
knowledge_type=knowledge_type,
tags=tag_list,
extract_code_examples=extract_code_examples,
progress_callback=document_progress_callback,
cancellation_check=check_upload_cancellation,
)
@@ -987,10 +1032,11 @@ async def _perform_upload_with_progress(
await tracker.complete({
"log": "Document uploaded successfully!",
"chunks_stored": result.get("chunks_stored"),
"code_examples_stored": result.get("code_examples_stored", 0),
"sourceId": result.get("source_id"),
})
safe_logfire_info(
f"Document uploaded successfully | progress_id={progress_id} | source_id={result.get('source_id')} | chunks_stored={result.get('chunks_stored')}"
f"Document uploaded successfully | progress_id={progress_id} | source_id={result.get('source_id')} | chunks_stored={result.get('chunks_stored')} | code_examples_stored={result.get('code_examples_stored', 0)}"
)
else:
error_msg = result.get("error", "Unknown error")

View File

@@ -291,12 +291,16 @@ class CodeExtractionService:
# Improved extraction logic - check for text files first, then HTML, then markdown
code_blocks = []
# Check if this is a text file (e.g., .txt, .md)
# Check if this is a text file (e.g., .txt, .md, .html after cleaning) or PDF
is_text_file = source_url.endswith((
".txt",
".text",
".md",
)) or "text/plain" in doc.get("content_type", "")
".html",
".htm",
)) or "text/plain" in doc.get("content_type", "") or "text/markdown" in doc.get("content_type", "")
is_pdf_file = source_url.endswith(".pdf") or "application/pdf" in doc.get("content_type", "")
if is_text_file:
# For text files, use specialized text extraction
@@ -322,7 +326,19 @@ class CodeExtractionService:
else:
safe_logfire_info(f"⚠️ NO CONTENT for text file | url={source_url}")
# If not a text file or no code blocks found, try HTML extraction first
# If this is a PDF file, use specialized PDF extraction
elif is_pdf_file:
safe_logfire_info(f"📄 PDF FILE DETECTED | url={source_url}")
# For PDFs, use the content that should be PDF-extracted text
pdf_content = html_content if html_content else md
if pdf_content:
safe_logfire_info(f"📝 Using {'HTML' if html_content else 'MARKDOWN'} content for PDF extraction")
code_blocks = await self._extract_pdf_code_blocks(pdf_content, source_url)
safe_logfire_info(f"📦 PDF extraction complete | found={len(code_blocks)} blocks | url={source_url}")
else:
safe_logfire_info(f"⚠️ NO CONTENT for PDF file | url={source_url}")
# If not a text file or PDF, or no code blocks found, try HTML extraction as fallback
if len(code_blocks) == 0 and html_content and not is_text_file:
safe_logfire_info(
f"Trying HTML extraction first | url={source_url} | html_length={len(html_content)}"
@@ -912,6 +928,135 @@ class CodeExtractionService:
)
return code_blocks
async def _extract_pdf_code_blocks(
self, content: str, url: str
) -> list[dict[str, Any]]:
"""
Extract code blocks from PDF-extracted text that lacks markdown formatting.
PDFs lose markdown delimiters, so we need to detect code patterns in plain text.
This uses a much simpler approach - look for distinct code segments separated by prose.
"""
import re
safe_logfire_info(f"🔍 PDF CODE EXTRACTION START | url={url} | content_length={len(content)}")
code_blocks = []
min_length = await self._get_min_code_length()
# Split content into paragraphs/sections
# Use double newlines and page breaks as natural boundaries
sections = re.split(r'\n\n+|--- Page \d+ ---', content)
safe_logfire_info(f"📄 Split PDF into {len(sections)} sections")
for i, section in enumerate(sections):
section = section.strip()
if not section or len(section) < 50: # Skip very short sections
continue
# Check if this section looks like code
if self._is_pdf_section_code_like(section):
safe_logfire_info(f"🔍 Analyzing section {i} as potential code (length: {len(section)})")
# Try to detect language
language = self._detect_language_from_content(section)
# Clean the content
cleaned_code = self._clean_code_content(section, language)
# Check length after cleaning
if len(cleaned_code) >= min_length:
# Validate quality
if await self._validate_code_quality(cleaned_code, language):
# Get context from adjacent sections
context_before = sections[i-1].strip() if i > 0 else ""
context_after = sections[i+1].strip() if i < len(sections)-1 else ""
safe_logfire_info(f"✅ PDF code section | language={language} | length={len(cleaned_code)}")
code_blocks.append({
"code": cleaned_code,
"language": language,
"context_before": context_before,
"context_after": context_after,
"full_context": f"{context_before}\n\n{cleaned_code}\n\n{context_after}",
"source_type": "pdf_section",
})
else:
safe_logfire_info(f"❌ PDF section failed validation | language={language}")
else:
safe_logfire_info(f"❌ PDF section too short after cleaning: {len(cleaned_code)} < {min_length}")
else:
safe_logfire_info(f"📝 Section {i} identified as prose/documentation")
safe_logfire_info(f"🔍 PDF CODE EXTRACTION COMPLETE | total_blocks={len(code_blocks)} | url={url}")
return code_blocks
def _is_pdf_section_code_like(self, section: str) -> bool:
"""
Determine if a PDF section contains code rather than prose.
"""
import re
# Count code indicators vs prose indicators
code_score = 0
prose_score = 0
# Code indicators (higher weight for stronger indicators)
code_patterns = [
(r'\bfrom \w+(?:\.\w+)* import\b', 3), # Python imports (strong)
(r'\bdef \w+\s*\(', 3), # Function definitions (strong)
(r'\bclass \w+\s*[\(:]', 3), # Class definitions (strong)
(r'\w+\s*=\s*\w+\(', 2), # Function calls assigned (medium)
(r'\w+\s*=\s*\[.*\]', 2), # List assignments (medium)
(r'\w+\.\w+\(', 2), # Method calls (medium)
(r'^\s*#[^#]', 1), # Single-line comments (weak)
(r'\bpip install\b', 2), # Package management (medium)
(r'\bpytest\b', 2), # Testing commands (medium)
(r'\bgit clone\b', 2), # Git commands (medium)
(r':\s*\n\s+\w+:', 2), # YAML structure (medium)
(r'\blambda\s+\w+:', 2), # Lambda functions (medium)
]
# Prose indicators
prose_patterns = [
(r'\b(the|this|that|these|those|are|is|was|were|will|would|should|could|have|has|had)\b', 1),
(r'[.!?]\s+[A-Z]', 2), # Sentence endings
(r'\b(however|therefore|furthermore|moreover|additionally|specifically)\b', 2),
(r'\bTable of Contents\b', 3),
(r'\bAPI Reference\b', 2),
]
# Count patterns
for pattern, weight in code_patterns:
matches = len(re.findall(pattern, section, re.IGNORECASE | re.MULTILINE))
code_score += matches * weight
for pattern, weight in prose_patterns:
matches = len(re.findall(pattern, section, re.IGNORECASE | re.MULTILINE))
prose_score += matches * weight
# Additional checks
lines = section.split('\n')
non_empty_lines = [line.strip() for line in lines if line.strip()]
if not non_empty_lines:
return False
# If section is mostly single words or very short lines, probably not code
short_lines = sum(1 for line in non_empty_lines if len(line.split()) < 3)
if len(non_empty_lines) > 0 and short_lines / len(non_empty_lines) > 0.7:
prose_score += 3
# If section has common code structure indicators
if any('(' in line and ')' in line for line in non_empty_lines[:5]):
code_score += 2
safe_logfire_info(f"📊 Section scoring: code_score={code_score}, prose_score={prose_score}")
# Code-like if code score significantly higher than prose score
return code_score > prose_score and code_score > 2
def _detect_language_from_content(self, code: str) -> str:
"""
Try to detect programming language from code content.

View File

@@ -24,6 +24,7 @@ class DocumentStorageService(BaseStorageService):
source_id: str,
knowledge_type: str = "documentation",
tags: list[str] | None = None,
extract_code_examples: bool = True,
progress_callback: Any | None = None,
cancellation_check: Any | None = None,
) -> tuple[bool, dict[str, Any]]:
@@ -36,7 +37,9 @@ class DocumentStorageService(BaseStorageService):
source_id: Source identifier
knowledge_type: Type of knowledge
tags: Optional list of tags
extract_code_examples: Whether to extract code examples from the document
progress_callback: Optional callback for progress
cancellation_check: Optional function to check for cancellation
Returns:
Tuple of (success, result_dict)
@@ -145,10 +148,65 @@ class DocumentStorageService(BaseStorageService):
cancellation_check=cancellation_check,
)
# Extract code examples if requested
code_examples_count = 0
if extract_code_examples and len(chunks) > 0:
try:
await report_progress("Extracting code examples...", 85)
logger.info(f"🔍 DEBUG: Starting code extraction for {filename} | extract_code_examples={extract_code_examples}")
# Import code extraction service
from ..crawling.code_extraction_service import CodeExtractionService
code_service = CodeExtractionService(self.supabase_client)
# Create crawl_results format expected by code extraction service
# markdown: cleaned plaintext (HTML->markdown for HTML files, raw content otherwise)
# html: empty string to prevent HTML extraction path confusion
# content_type: proper type to guide extraction method selection
crawl_results = [{
"url": doc_url,
"markdown": file_content, # Cleaned plaintext/markdown content
"html": "", # Empty to prevent HTML extraction path
"content_type": "application/pdf" if filename.lower().endswith('.pdf') else (
"text/markdown" if filename.lower().endswith(('.html', '.htm', '.md')) else "text/plain"
)
}]
logger.info(f"🔍 DEBUG: Created crawl_results with url={doc_url}, content_length={len(file_content)}")
# Create progress callback for code extraction
async def code_progress_callback(data: dict):
logger.info(f"🔍 DEBUG: Code extraction progress: {data}")
if progress_callback:
# Map code extraction progress (0-100) to our remaining range (85-95)
raw_progress = data.get("progress", data.get("percentage", 0))
mapped_progress = 85 + (raw_progress / 100.0) * 10 # 85% to 95%
message = data.get("log", "Extracting code examples...")
await progress_callback(message, int(mapped_progress))
logger.info(f"🔍 DEBUG: About to call extract_and_store_code_examples...")
code_examples_count = await code_service.extract_and_store_code_examples(
crawl_results=crawl_results,
url_to_full_document=url_to_full_document,
source_id=source_id,
progress_callback=code_progress_callback,
cancellation_check=cancellation_check,
)
logger.info(f"🔍 DEBUG: Code extraction completed: {code_examples_count} code examples found for {filename}")
except Exception as e:
# Log error with full traceback but don't fail the entire upload
logger.error(f"Code extraction failed for {filename}: {e}", exc_info=True)
code_examples_count = 0
await report_progress("Document upload completed!", 100)
result = {
"chunks_stored": len(chunks),
"code_examples_stored": code_examples_count,
"total_word_count": total_word_count,
"source_id": source_id,
"filename": filename,
@@ -156,10 +214,11 @@ class DocumentStorageService(BaseStorageService):
span.set_attribute("success", True)
span.set_attribute("chunks_stored", len(chunks))
span.set_attribute("code_examples_stored", code_examples_count)
span.set_attribute("total_word_count", total_word_count)
logger.info(
f"Document upload completed successfully: filename={filename}, chunks_stored={len(chunks)}, total_word_count={total_word_count}"
f"Document upload completed successfully: filename={filename}, chunks_stored={len(chunks)}, code_examples_stored={code_examples_count}, total_word_count={total_word_count}"
)
return True, result
@@ -192,6 +251,7 @@ class DocumentStorageService(BaseStorageService):
source_id=doc.get("source_id", "upload"),
knowledge_type=doc.get("knowledge_type", "documentation"),
tags=doc.get("tags"),
extract_code_examples=doc.get("extract_code_examples", True),
progress_callback=kwargs.get("progress_callback"),
cancellation_check=kwargs.get("cancellation_check"),
)

View File

@@ -36,6 +36,125 @@ from ..config.logfire_config import get_logger, logfire
logger = get_logger(__name__)
def _preserve_code_blocks_across_pages(text: str) -> str:
"""
Fix code blocks that were split across PDF page boundaries.
PDFs often break markdown code blocks with page headers like:
```python
def hello():
--- Page 2 ---
return "world"
```
This function rejoins split code blocks by removing page separators
that appear within code blocks.
"""
import re
# Pattern to match page separators that split code blocks
# Look for: ``` [content] --- Page N --- [content] ```
page_break_in_code_pattern = r'(```\w*[^\n]*\n(?:[^`]|`(?!``))*)(\n--- Page \d+ ---\n)((?:[^`]|`(?!``))*)```'
# Keep merging until no more splits are found
while True:
matches = list(re.finditer(page_break_in_code_pattern, text, re.DOTALL))
if not matches:
break
# Replace each match by removing the page separator
for match in reversed(matches): # Reverse to maintain positions
before_page_break = match.group(1)
page_separator = match.group(2)
after_page_break = match.group(3)
# Rejoin the code block without the page separator
rejoined = f"{before_page_break}\n{after_page_break}```"
text = text[:match.start()] + rejoined + text[match.end():]
return text
def _clean_html_to_text(html_content: str) -> str:
"""
Clean HTML tags and convert to plain text suitable for RAG.
Preserves code blocks and important structure while removing markup.
"""
import re
# First preserve code blocks with their content before general cleaning
# This ensures code blocks remain intact for extraction
code_blocks = []
# Find and temporarily replace code blocks to preserve them
code_patterns = [
r'<pre><code[^>]*>(.*?)</code></pre>',
r'<code[^>]*>(.*?)</code>',
r'<pre[^>]*>(.*?)</pre>',
]
processed_html = html_content
placeholder_map = {}
for pattern in code_patterns:
matches = list(re.finditer(pattern, processed_html, re.DOTALL | re.IGNORECASE))
for i, match in enumerate(reversed(matches)): # Reverse to maintain positions
# Extract code content and clean HTML entities
code_content = match.group(1)
# Clean HTML entities and span tags from code
code_content = re.sub(r'<span[^>]*>', '', code_content)
code_content = re.sub(r'</span>', '', code_content)
code_content = re.sub(r'&lt;', '<', code_content)
code_content = re.sub(r'&gt;', '>', code_content)
code_content = re.sub(r'&amp;', '&', code_content)
code_content = re.sub(r'&quot;', '"', code_content)
code_content = re.sub(r'&#39;', "'", code_content)
# Create placeholder
placeholder = f"__CODE_BLOCK_{len(placeholder_map)}__"
placeholder_map[placeholder] = code_content.strip()
# Replace in HTML
processed_html = processed_html[:match.start()] + placeholder + processed_html[match.end():]
# Now clean all remaining HTML tags
# Remove script and style content entirely
processed_html = re.sub(r'<script[^>]*>.*?</script>', '', processed_html, flags=re.DOTALL | re.IGNORECASE)
processed_html = re.sub(r'<style[^>]*>.*?</style>', '', processed_html, flags=re.DOTALL | re.IGNORECASE)
# Convert common HTML elements to readable text
# Headers
processed_html = re.sub(r'<h[1-6][^>]*>(.*?)</h[1-6]>', r'\n\n\1\n\n', processed_html, flags=re.DOTALL | re.IGNORECASE)
# Paragraphs
processed_html = re.sub(r'<p[^>]*>(.*?)</p>', r'\1\n\n', processed_html, flags=re.DOTALL | re.IGNORECASE)
# Line breaks
processed_html = re.sub(r'<br\s*/?>', '\n', processed_html, flags=re.IGNORECASE)
# List items
processed_html = re.sub(r'<li[^>]*>(.*?)</li>', r'\1\n', processed_html, flags=re.DOTALL | re.IGNORECASE)
# Remove all remaining HTML tags
processed_html = re.sub(r'<[^>]+>', '', processed_html)
# Clean up HTML entities
processed_html = re.sub(r'&nbsp;', ' ', processed_html)
processed_html = re.sub(r'&lt;', '<', processed_html)
processed_html = re.sub(r'&gt;', '>', processed_html)
processed_html = re.sub(r'&amp;', '&', processed_html)
processed_html = re.sub(r'&quot;', '"', processed_html)
processed_html = re.sub(r'&#39;', "'", processed_html)
processed_html = re.sub(r'&#x27;', "'", processed_html)
# Restore code blocks
for placeholder, code_content in placeholder_map.items():
processed_html = processed_html.replace(placeholder, f"\n\n```\n{code_content}\n```\n\n")
# Clean up excessive whitespace
processed_html = re.sub(r'\n\s*\n\s*\n', '\n\n', processed_html) # Max 2 consecutive newlines
processed_html = re.sub(r'[ \t]+', ' ', processed_html) # Multiple spaces to single space
return processed_html.strip()
def extract_text_from_document(file_content: bytes, filename: str, content_type: str) -> str:
"""
Extract text from various document formats.
@@ -64,6 +183,14 @@ def extract_text_from_document(file_content: bytes, filename: str, content_type:
] or filename.lower().endswith((".docx", ".doc")):
return extract_text_from_docx(file_content)
# HTML files - clean tags and extract text
elif content_type == "text/html" or filename.lower().endswith((".html", ".htm")):
# Decode HTML and clean tags for RAG
html_text = file_content.decode("utf-8", errors="ignore").strip()
if not html_text:
raise ValueError(f"The file {filename} appears to be empty.")
return _clean_html_to_text(html_text)
# Text files (markdown, txt, etc.)
elif content_type.startswith("text/") or filename.lower().endswith((
".txt",
@@ -126,7 +253,22 @@ def extract_text_from_pdf(file_content: bytes) -> str:
# If pdfplumber got good results, use them
if text_content and len("\n".join(text_content).strip()) > 100:
return "\n\n".join(text_content)
combined_text = "\n\n".join(text_content)
logger.info(f"🔍 PDF DEBUG: Extracted {len(text_content)} pages, total length: {len(combined_text)}")
logger.info(f"🔍 PDF DEBUG: First 500 chars: {repr(combined_text[:500])}")
# Check for backticks before and after processing
backtick_count_before = combined_text.count("```")
logger.info(f"🔍 PDF DEBUG: Backticks found before processing: {backtick_count_before}")
processed_text = _preserve_code_blocks_across_pages(combined_text)
backtick_count_after = processed_text.count("```")
logger.info(f"🔍 PDF DEBUG: Backticks found after processing: {backtick_count_after}")
if backtick_count_after > 0:
logger.info(f"🔍 PDF DEBUG: Sample after processing: {repr(processed_text[:1000])}")
return processed_text
except Exception as e:
logfire.warning(f"pdfplumber extraction failed: {e}, trying PyPDF2")
@@ -147,7 +289,8 @@ def extract_text_from_pdf(file_content: bytes) -> str:
continue
if text_content:
return "\n\n".join(text_content)
combined_text = "\n\n".join(text_content)
return _preserve_code_blocks_across_pages(combined_text)
else:
raise ValueError(
"No text extracted from PDF: file may be empty, images-only, "