Remove the special case that gave robots.txt sitemap declarations highest
priority, which incorrectly overrode the global priority order. Now properly
respects the intended priority: llms-full.txt > llms.txt > llms.md > llms.mdx >
sitemap.xml > robots.txt.
This fixes the issue where supabase.com/docs would return sitemap.xml instead
of llms.txt even though both files exist at /docs/ and llms.txt should have
higher priority.
Changes:
- Removed robots.txt early return that bypassed priority order
- Updated test to verify llms files take precedence over robots.txt sitemaps
- All discovery now follows consistent DISCOVERY_PRIORITY order
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Improve discovery logic to check the same directory as the base URL first before
falling back to root-level and subdirectories. This ensures files like
https://supabase.com/docs/llms.txt are found when crawling
https://supabase.com/docs.
Changes:
- Check same directory as base_url first (e.g., /docs/llms.txt for /docs URL)
- Fall back to root-level urljoin behavior
- Include base directory name in subdirectory checks (e.g., /docs subdirectory)
- Maintain priority order: same-dir > root > subdirectories
- Log discovery location for better debugging
This addresses cases where documentation directories contain their own llms.txt
or sitemap files that should take precedence over root-level files.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Preserve URL case in robots.txt parsing by only lowercasing the sitemap: prefix check
- Add support for relative sitemap paths in robots.txt using urljoin()
- Fix HTML meta tag parsing to use case-insensitive regex instead of lowercasing content
- Add URL scheme validation for discovered sitemaps (http/https only)
- Fix discovery target domain filtering to use discovered URL's domain instead of input URL
- Clean up whitespace and improve dict comprehension usage
These changes improve discovery reliability and prevent URL corruption while maintaining
backward compatibility with existing discovery behavior.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Enable SSL certificate verification (verify=True) for all HTTP requests
- Implement streaming with size limits (10MB default) to prevent memory exhaustion
- Add _read_response_with_limit() helper for secure response reading
- Update all test mocks to support streaming API with iter_content()
- Fix test assertions to expect new security parameters
- Enforce deterministic rounding in progress mapper tests
Security improvements:
- Prevents MITM attacks through SSL verification
- Guards against DoS via oversized responses
- Ensures proper resource cleanup with response.close()
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Resolved merge conflicts by integrating features from both branches:
- Added page_storage_ops service initialization from main
- Merged link text extraction with discovery mode features
- Preserved discovery single-file mode and domain filtering
- Maintained link text fallbacks for title extraction
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: Set explicit PLAYWRIGHT_BROWSERS_PATH to fix browser installation
Fixes Playwright browser not found error during web crawling.
The issue was introduced in the uv migration (9f22659) where the
browser installation path was not explicitly set as a persistent
environment variable.
Changes:
- Add ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
- Add --with-deps flag to playwright install command
- Add comprehensive root cause analysis document
Without this fix, Playwright installed browsers to a default location
at build time but couldn't find them at runtime, causing crawling
operations to fail with "Executable doesn't exist" errors.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* fix: Remove --with-deps flag to prevent build conflicts
The --with-deps flag was causing build failures on some systems because:
- We already manually install all Playwright dependencies (lines 26-49)
- --with-deps attempts to reinstall these packages
- This causes package conflicts and build failures on Windows/WSL
The core fix (ENV PLAYWRIGHT_BROWSERS_PATH) remains the same.
* Delete PLAYWRIGHT_FIX_ANALYSIS.md
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Cole Medin <cole@dynamous.ai>
* Initial commit for RAG by document
* Phase 2
* Adding migrations
* Fixing page IDs for chunk metadata
* Fixing unit tests, adding tool to list pages for source
* Fixing page storage upsert issues
* Max file length for retrieval
* Fixing title issue
* Fixing tests
* fix: implement CASCADE DELETE for source deletion timeout issue
- Add migration 009 to add CASCADE DELETE constraints to foreign keys
- Simplify delete_source() to only delete parent record
- Database now handles cascading deletes efficiently
- Fixes timeout issues when deleting sources with thousands of pages
* chore: update complete_setup.sql to include CASCADE DELETE constraints
- Add ON DELETE CASCADE to foreign keys in initial setup
- Include migration 009 in the migrations tracking
- Ensures new installations have CASCADE DELETE from the start
Updates crawl4ai dependency to latest stable version with performance
and stability improvements.
Key improvements in 0.7.4:
- LLM-powered table extraction with intelligent chunking
- Fixed dispatcher bug for better concurrent processing
- Resolved browser manager race conditions
- Enhanced URL processing and proxy support
All existing tests pass (18/18). No breaking changes identified.
API remains backward compatible.
⚠️ IMPORTANT: URL Resolution Bug Status
A critical bug in v0.6.2 where ../../ paths only go up ONE directory
instead of TWO has been documented (see crawler-test branch). Status
in v0.7.4 is UNKNOWN - testing required before production deployment.
Test script provided: python/test_url_resolution_fix.py
Related issues fixed in v0.7.x:
- #570: General relative URL handling
- #1268: URLs after redirects
- #1323: Trailing slash base URL handling
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>