Adds robots.txt validation so the crawler respects website crawling policies.
Uses the Protego library for parsing and follows the RFC 9309 standard.
Changes:
- RobotsChecker service with manual TTL caching and a shared httpx client (sketched after this list)
- User-Agent: "Archon-Crawler/0.1.0 (+repo_url)"
- URL validation at 3 critical integration points
- Proper resource cleanup in API route finally blocks
- Removed robots.txt from discovery file list (used for validation, not content)
- Clean INFO-level logging: one line per domain showing compliance
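
A minimal sketch of how such a service fits together. `RobotsChecker`, `close()`, and the User-Agent string come from this commit; the `is_allowed` method name, timeout, and other details are assumptions, not Archon's actual code:

```python
from urllib.parse import urlparse

import httpx
from protego import Protego

USER_AGENT = "Archon-Crawler/0.1.0 (+repo_url)"


class RobotsChecker:
    """Fetches and evaluates robots.txt for a domain before crawling."""

    def __init__(self) -> None:
        # Shared AsyncClient singleton: every lookup reuses one connection
        # pool instead of opening (and potentially leaking) new clients.
        self._client = httpx.AsyncClient(timeout=10.0, follow_redirects=True)

    async def is_allowed(self, url: str) -> bool:
        # TTL caching elided here; see the cache sketch under Dependencies.
        parts = urlparse(url)
        resp = await self._client.get(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rules = Protego.parse(resp.text)
        return rules.can_fetch(url, USER_AGENT)

    async def close(self) -> None:
        # Invoked from the API routes' finally blocks.
        await self._client.aclose()
```

Inside an async API route, the cleanup pattern looks like:

```python
checker = RobotsChecker()
try:
    if await checker.is_allowed("https://example.com/docs/page"):
        ...  # proceed with the crawl
finally:
    await checker.close()  # always release the shared client
```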
Dependencies:
- Added protego>=0.3.1 (fast, RFC 9309-compliant parser with wildcard support)
- Updated crawl4ai 0.7.4 -> 0.7.6 (latest bug fixes, unrelated to robots.txt)
- Manual async caching instead of asyncache (unmaintained, with cachetools compatibility risks); see the cache sketch below
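
For illustration, a hand-rolled async TTL cache of the kind this choice implies. `AsyncTTLCache` and its size limit are assumptions; only the 24-hour TTL, LRU eviction, and lock-guarded access are described by this commit:

```python
import asyncio
import time
from collections import OrderedDict
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


class AsyncTTLCache(Generic[T]):
    """Bounded per-key cache with TTL expiry and LRU eviction."""

    def __init__(self, maxsize: int = 512, ttl: float = 24 * 3600) -> None:
        self._data: OrderedDict[str, tuple] = OrderedDict()
        self._maxsize = maxsize
        self._ttl = ttl
        self._lock = asyncio.Lock()  # guards the OrderedDict across tasks

    async def get(self, key: str) -> Optional[T]:
        async with self._lock:
            entry = self._data.get(key)
            if entry is None or time.monotonic() - entry[0] >= self._ttl:
                self._data.pop(key, None)  # drop expired entry
                return None
            self._data.move_to_end(key)  # mark as most recently used
            return entry[1]

    async def set(self, key: str, value: T) -> None:
        async with self._lock:
            self._data[key] = (time.monotonic(), value)
            self._data.move_to_end(key)
            while len(self._data) > self._maxsize:
                self._data.popitem(last=False)  # evict least recently used
```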
Key Features:
- 24-hour TTL cache per domain with LRU eviction
- Proper error handling (404 = allow, 5xx = disallow, per RFC 9309; see the sketch after this list)
- Thread-safe with separate locks for cache and delay tracking
- Shared httpx.AsyncClient singleton prevents connection leaks
- close() called in finally blocks for proper cleanup
- Minimal logging: "Respecting robots.txt for {domain} (cached for 24h)"
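
A sketch of that status-code policy, using a hypothetical `fetch_rules` helper rather than Archon's actual function. RFC 9309 treats 4xx responses as "unavailable" (crawling allowed) and 5xx as "unreachable" (assume complete disallow):

```python
import logging

import httpx
from protego import Protego

logger = logging.getLogger(__name__)

ALLOW_ALL = Protego.parse("")  # empty ruleset permits everything
DISALLOW_ALL = Protego.parse("User-agent: *\nDisallow: /")


async def fetch_rules(client: httpx.AsyncClient, domain: str) -> Protego:
    try:
        resp = await client.get(f"https://{domain}/robots.txt")
    except httpx.HTTPError:
        return DISALLOW_ALL  # network failure treated like a server error
    if resp.status_code == 404:
        return ALLOW_ALL     # "unavailable" per RFC 9309: crawling allowed
    if resp.status_code >= 500:
        return DISALLOW_ALL  # "unreachable" per RFC 9309: assume disallow
    logger.info("Respecting robots.txt for %s (cached for 24h)", domain)
    return Protego.parse(resp.text)
```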
Closes #275
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>