Files
archon/python/tests
leex279 13796abbe8 feat: Improve discovery system with SSRF protection and optimize file detection
## Backend Improvements

### Discovery Service
- Fix SSRF protection: Use requests.Session() for max_redirects parameter
- Add comprehensive IP validation (_is_safe_ip, _resolve_and_validate_hostname)
- Add hostname DNS resolution validation before requests
- Fix llms.txt link following to crawl ALL same-domain pages (not just llms.txt files)
- Remove unused file variants: llms.md, llms.markdown, sitemap_index.xml, sitemap-index.xml
- Optimize DISCOVERY_PRIORITY based on real-world usage research
- Update priority: llms.txt > llms-full.txt > sitemap.xml > robots.txt

### URL Handler
- Fix .well-known path to be case-sensitive per RFC 8615
- Remove llms.md, llms.markdown, llms.mdx from variant detection
- Simplify link collection patterns to only .txt files (most common)
- Update llms_variants list to only include spec-compliant files

### Crawling Service
- Add tldextract for proper root domain extraction (handles .co.uk, .com.au, etc.)
- Replace naive domain extraction with robust get_root_domain() function
- Add tldextract>=5.0.0 to dependencies

## Frontend Improvements

### Type Safety
- Extend ActiveOperation type with discovery fields (discovered_file, discovered_file_type, linked_files)
- Remove all type casting (operation as any) from CrawlingProgress component
- Add proper TypeScript types for discovery information

### Security
- Create URL validation utility (urlValidation.ts)
- Only render clickable links for validated HTTP/HTTPS URLs
- Reject unsafe protocols (javascript:, data:, vbscript:, file:)
- Display invalid URLs as plain text instead of links

## Testing
- Update test mocks to include history and url attributes for redirect checking
- Fix .well-known case sensitivity tests (must be lowercase per RFC 8615)
- Update discovery priority tests to match new order
- Remove tests for deprecated file variants

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-19 15:31:08 +02:00
..