Files
archon/python/tests
leex279 1a55d93a4e Implement priority-based automatic discovery of llms.txt and sitemap.xml files
- Add DiscoveryService with single-file priority selection
  - Priority: llms-full.txt > llms.txt > llms.md > llms.mdx > sitemap.xml > robots.txt
  - All files contain similar AI/crawling guidance, so only best one is needed
  - Robots.txt sitemap declarations have highest priority
  - Fallback to subdirectories for llms files

- Enhance URLHandler with discovery helper methods
  - Add is_robots_txt, is_llms_variant, is_well_known_file, get_base_url methods
  - Follow existing patterns with proper error handling

- Integrate discovery into CrawlingService orchestration
  - When discovery finds file: crawl ONLY discovered file (not main URL)
  - When no discovery: crawl main URL normally
  - Fixes issue where both main URL + discovered file were crawled

- Add discovery stage to progress mapping
  - New "discovery" stage in progress flow
  - Clear progress messages for discovered files

- Comprehensive test coverage
  - Tests for priority-based selection logic
  - Tests for robots.txt priority and fallback behavior
  - Updated existing tests for new return formats

Resolves efficient crawling by selecting single best guidance file instead
of crawling redundant content from multiple similar files.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-09-08 09:03:15 +02:00
..