archon

mirrors/archon

Fork 0

mirror of https://github.com/coleam00/Archon.git synced 2025-12-30 21:49:30 -05:00

Commit Graph

Author	SHA1	Message	Date
leex279	43af7b747c	fix: Update tests for single-file discovery and discovery stage integration - Fix discovery service tests to match new single-file return format - Remove obsolete tests for removed discovery methods - Update progress mapper tests for new discovery stage ranges - Fix stage range expectations after adding discovery stage (2,3) 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-09-08 10:27:50 +02:00
leex279	1a55d93a4e	Implement priority-based automatic discovery of llms.txt and sitemap.xml files - Add DiscoveryService with single-file priority selection - Priority: llms-full.txt > llms.txt > llms.md > llms.mdx > sitemap.xml > robots.txt - All files contain similar AI/crawling guidance, so only best one is needed - Robots.txt sitemap declarations have highest priority - Fallback to subdirectories for llms files - Enhance URLHandler with discovery helper methods - Add is_robots_txt, is_llms_variant, is_well_known_file, get_base_url methods - Follow existing patterns with proper error handling - Integrate discovery into CrawlingService orchestration - When discovery finds file: crawl ONLY discovered file (not main URL) - When no discovery: crawl main URL normally - Fixes issue where both main URL + discovered file were crawled - Add discovery stage to progress mapping - New "discovery" stage in progress flow - Clear progress messages for discovered files - Comprehensive test coverage - Tests for priority-based selection logic - Tests for robots.txt priority and fallback behavior - Updated existing tests for new return formats Resolves efficient crawling by selecting single best guidance file instead of crawling redundant content from multiple similar files. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-09-08 09:03:15 +02:00

Author

SHA1

Message

Date

leex279

43af7b747c

fix: Update tests for single-file discovery and discovery stage integration

- Fix discovery service tests to match new single-file return format
- Remove obsolete tests for removed discovery methods
- Update progress mapper tests for new discovery stage ranges
- Fix stage range expectations after adding discovery stage (2,3)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-09-08 10:27:50 +02:00

leex279

1a55d93a4e

Implement priority-based automatic discovery of llms.txt and sitemap.xml files

- Add DiscoveryService with single-file priority selection
  - Priority: llms-full.txt > llms.txt > llms.md > llms.mdx > sitemap.xml > robots.txt
  - All files contain similar AI/crawling guidance, so only best one is needed
  - Robots.txt sitemap declarations have highest priority
  - Fallback to subdirectories for llms files

- Enhance URLHandler with discovery helper methods
  - Add is_robots_txt, is_llms_variant, is_well_known_file, get_base_url methods
  - Follow existing patterns with proper error handling

- Integrate discovery into CrawlingService orchestration
  - When discovery finds file: crawl ONLY discovered file (not main URL)
  - When no discovery: crawl main URL normally
  - Fixes issue where both main URL + discovered file were crawled

- Add discovery stage to progress mapping
  - New "discovery" stage in progress flow
  - Clear progress messages for discovered files

- Comprehensive test coverage
  - Tests for priority-based selection logic
  - Tests for robots.txt priority and fallback behavior
  - Updated existing tests for new return formats

Resolves efficient crawling by selecting single best guidance file instead
of crawling redundant content from multiple similar files.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-09-08 09:03:15 +02:00

2 Commits