- Fix test mocks to use requests.Session for _check_url_exists
- Add url parameter to create_mock_response to prevent MagicMock issues
- Update all test scenarios to mock both requests.get and session.get
- Remove redundant UNSAFE_PROTOCOLS check in URL validation
- Fix test assertions to match new priority order (llms.txt > llms-full.txt)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
## Backend Improvements
### Discovery Service
- Fix SSRF protection: Use requests.Session() for max_redirects parameter
- Add comprehensive IP validation (_is_safe_ip, _resolve_and_validate_hostname)
- Add hostname DNS resolution validation before requests
- Fix llms.txt link following to crawl ALL same-domain pages (not just llms.txt files)
- Remove unused file variants: llms.md, llms.markdown, sitemap_index.xml, sitemap-index.xml
- Optimize DISCOVERY_PRIORITY based on real-world usage research
- Update priority: llms.txt > llms-full.txt > sitemap.xml > robots.txt
### URL Handler
- Fix .well-known path to be case-sensitive per RFC 8615
- Remove llms.md, llms.markdown, llms.mdx from variant detection
- Simplify link collection patterns to only .txt files (most common)
- Update llms_variants list to only include spec-compliant files
### Crawling Service
- Add tldextract for proper root domain extraction (handles .co.uk, .com.au, etc.)
- Replace naive domain extraction with robust get_root_domain() function
- Add tldextract>=5.0.0 to dependencies
## Frontend Improvements
### Type Safety
- Extend ActiveOperation type with discovery fields (discovered_file, discovered_file_type, linked_files)
- Remove all type casting (operation as any) from CrawlingProgress component
- Add proper TypeScript types for discovery information
### Security
- Create URL validation utility (urlValidation.ts)
- Only render clickable links for validated HTTP/HTTPS URLs
- Reject unsafe protocols (javascript:, data:, vbscript:, file:)
- Display invalid URLs as plain text instead of links
## Testing
- Update test mocks to include history and url attributes for redirect checking
- Fix .well-known case sensitivity tests (must be lowercase per RFC 8615)
- Update discovery priority tests to match new order
- Remove tests for deprecated file variants
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Remove the special case that gave robots.txt sitemap declarations highest
priority, which incorrectly overrode the global priority order. Now properly
respects the intended priority: llms-full.txt > llms.txt > llms.md > llms.mdx >
sitemap.xml > robots.txt.
This fixes the issue where supabase.com/docs would return sitemap.xml instead
of llms.txt even though both files exist at /docs/ and llms.txt should have
higher priority.
Changes:
- Removed robots.txt early return that bypassed priority order
- Updated test to verify llms files take precedence over robots.txt sitemaps
- All discovery now follows consistent DISCOVERY_PRIORITY order
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Enable SSL certificate verification (verify=True) for all HTTP requests
- Implement streaming with size limits (10MB default) to prevent memory exhaustion
- Add _read_response_with_limit() helper for secure response reading
- Update all test mocks to support streaming API with iter_content()
- Fix test assertions to expect new security parameters
- Enforce deterministic rounding in progress mapper tests
Security improvements:
- Prevents MITM attacks through SSL verification
- Guards against DoS via oversized responses
- Ensures proper resource cleanup with response.close()
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Fix discovery service tests to match new single-file return format
- Remove obsolete tests for removed discovery methods
- Update progress mapper tests for new discovery stage ranges
- Fix stage range expectations after adding discovery stage (2,3)
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add DiscoveryService with single-file priority selection
- Priority: llms-full.txt > llms.txt > llms.md > llms.mdx > sitemap.xml > robots.txt
- All files contain similar AI/crawling guidance, so only best one is needed
- Robots.txt sitemap declarations have highest priority
- Fallback to subdirectories for llms files
- Enhance URLHandler with discovery helper methods
- Add is_robots_txt, is_llms_variant, is_well_known_file, get_base_url methods
- Follow existing patterns with proper error handling
- Integrate discovery into CrawlingService orchestration
- When discovery finds file: crawl ONLY discovered file (not main URL)
- When no discovery: crawl main URL normally
- Fixes issue where both main URL + discovered file were crawled
- Add discovery stage to progress mapping
- New "discovery" stage in progress flow
- Clear progress messages for discovered files
- Comprehensive test coverage
- Tests for priority-based selection logic
- Tests for robots.txt priority and fallback behavior
- Updated existing tests for new return formats
Resolves efficient crawling by selecting single best guidance file instead
of crawling redundant content from multiple similar files.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>