archon/python
leex279 7ea4d99a27 fix: Improve domain filter robustness in crawling service
Backend fixes for crawling stability:

- Add comment clarifying DomainFilter doesn't need init params
- Improve base URL selection in recursive strategy:
  - Check start_urls length before indexing
  - Use appropriate base URL for domain checks
  - Fallback to original_url when start_urls is empty
- Add error handling for domain filter:
  - Wrap is_url_allowed in try/except block
  - Log exceptions and conservatively skip URLs on error
  - Prevent domain filter exceptions from crashing the crawler
- Improve handling of relative URL resolution
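
A minimal sketch of the base URL selection and guarded filter check listed above. Only `start_urls`, `original_url`, `DomainFilter`, and `is_url_allowed` come from this commit message; the helper names `pick_base_url` and `safe_is_allowed`, the `is_url_allowed(url, base_url)` signature, and the same-host rule are assumptions made for illustration, not the actual Archon code.

```python
import logging
from urllib.parse import urlparse

logger = logging.getLogger(__name__)


class DomainFilter:
    """Hypothetical stand-in for the real filter; takes no init params."""

    def is_url_allowed(self, url, base_url):
        # Assumed rule for illustration only: allow same-host URLs.
        host = urlparse(url).hostname
        if host is None:
            raise ValueError(f"cannot parse host from {url!r}")
        return host == urlparse(base_url).hostname


def pick_base_url(start_urls, original_url):
    # Check the length before indexing; fall back when the list is empty.
    return start_urls[0] if len(start_urls) > 0 else original_url


def safe_is_allowed(domain_filter, url, base_url):
    # A filter exception is logged and treated as "skip this URL",
    # so one unexpected link cannot crash the crawl.
    try:
        return domain_filter.is_url_allowed(url, base_url)
    except Exception:
        logger.exception("Domain filter failed for %s; skipping", url)
        return False
```

Returning `False` on error is the conservative choice the commit describes: a URL the filter cannot evaluate is skipped rather than crawled.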

These changes make crawling more robust, especially when:
- The start_urls array is empty
- The domain filter encounters unexpected URLs
- Relative links need proper base URL resolution
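
For the relative-link case, one way to resolve an href against a base URL is the standard library's `urljoin`. This is a sketch under assumptions: `resolve_link` is a hypothetical helper, and the crawler's actual resolution logic may differ.

```python
from urllib.parse import urljoin, urlparse


def resolve_link(href, base_url):
    # Resolve a possibly relative href against the chosen base URL.
    absolute = urljoin(base_url, href)
    # Keep only http(s) results; other schemes are not crawlable pages.
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute
```

With a base of `https://example.com/docs/guide/`, the href `../api/` resolves to `https://example.com/docs/api/`, while a `mailto:` link yields `None`.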
2025-09-22 13:44:10 +02:00