- Fix progress regression: map crawl callback progress through ProgressMapper (see the progress-mapping sketch below)
- Prevents UI progress bars from jumping backwards
- Ensures consistent progress reporting across all stages
- Add same-domain filtering for discovered file link following (see the domain-check sketch below)
- Discovery files (e.g. llms.txt) may follow links, but only within the same domain
- Prevents external crawling while preserving related AI guidance
- Add _is_same_domain() method for domain comparison
- Fix filename filtering false positives with token-aware regex matching (see the filename-pattern sketch below)
- Replace substring 'full' check with token-aware regex pattern
- Prevents excluding files like "helpful.md" or "meaningful.txt"
- Only excludes actual "full" variants like "llms-full.txt"
- Add llms-full.txt to URLHandler detection patterns (covered in the filename-pattern sketch below)
- Adds support for the highest-priority discovery file format
- Ensures proper file type detection for link following logic
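
A minimal sketch of the progress mapping, assuming each stage owns a slice of the overall 0-100 range; the stage names, ranges, and method signature here are assumptions, not the project's actual values:

```python
# Illustrative only: the real ProgressMapper lives in the project; names and
# ranges here are assumptions for this sketch.
class ProgressMapper:
    # Each stage owns a slice of the overall 0-100 range.
    STAGE_RANGES = {
        "discovery": (0, 10),
        "crawling": (10, 90),
        "storage": (90, 100),
    }

    def __init__(self) -> None:
        self._last_overall = 0

    def map_progress(self, stage: str, stage_progress: float) -> int:
        """Map stage-local progress (0-100) into the overall range, never going backwards."""
        start, end = self.STAGE_RANGES[stage]
        overall = start + (end - start) * min(max(stage_progress, 0), 100) / 100
        # Clamp so the reported value is monotonically non-decreasing.
        self._last_overall = max(self._last_overall, int(overall))
        return self._last_overall


# Crawl callbacks report stage-local values; routing them through the mapper keeps the UI bar monotonic.
mapper = ProgressMapper()
assert mapper.map_progress("crawling", 50) == 50
assert mapper.map_progress("crawling", 25) == 50  # a stale callback cannot move the bar backwards
```
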
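
A minimal sketch of the domain check; the real _is_same_domain() may handle subdomains or ports differently:

```python
# Illustrative sketch of the same-domain check used when following discovered links.
from urllib.parse import urlparse

def _is_same_domain(url: str, base_url: str) -> bool:
    """Return True when both URLs share the same host."""
    host = urlparse(url).hostname or ""
    base_host = urlparse(base_url).hostname or ""
    return host == base_host

assert _is_same_domain("https://example.com/docs/llms.txt", "https://example.com/")
assert not _is_same_domain("https://other.org/llms.txt", "https://example.com/")
```
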
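
A filename-pattern sketch covering both changes: the token-aware "full" matching plus an illustrative detection pattern that includes llms-full.txt; the actual regexes in URLHandler may differ:

```python
# Illustrative patterns; the project's actual regexes may differ.
import re

# Token-aware check: "full" must stand alone (start/end of name or delimited by - _ .),
# so names that merely contain the letters are not excluded.
FULL_VARIANT_RE = re.compile(r"(?:^|[-_.])full(?:[-_.]|$)", re.IGNORECASE)

# Detection pattern for llms discovery files, now including the llms-full.txt variant.
LLMS_VARIANT_RE = re.compile(r"(?:^|/)llms(?:-full)?\.(?:txt|md|mdx)$", re.IGNORECASE)

def is_full_variant(filename: str) -> bool:
    return bool(FULL_VARIANT_RE.search(filename))

assert is_full_variant("llms-full.txt")
assert not is_full_variant("helpful.md")
assert not is_full_variant("meaningful.txt")
assert not is_full_variant("fullerene.md")  # a plain substring check would wrongly flag this
assert LLMS_VARIANT_RE.search("https://example.com/llms-full.txt")
```
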
- Add DiscoveryService with single-file priority selection (see the discovery sketch below)
- Priority: llms-full.txt > llms.txt > llms.md > llms.mdx > sitemap.xml > robots.txt
- These files all contain similar AI/crawling guidance, so only the best one is needed
- Sitemaps declared in robots.txt take priority over the default sitemap.xml location
- Falls back to checking subdirectories for llms files
- Enhance URLHandler with discovery helper methods (see the helper sketch below)
- Add is_robots_txt, is_llms_variant, is_well_known_file, get_base_url methods
- Follow existing patterns with proper error handling
- Integrate discovery into CrawlingService orchestration (see the orchestration sketch below)
- When discovery finds a file: crawl ONLY the discovered file (not the main URL)
- When nothing is discovered: crawl the main URL normally
- Fixes the issue where both the main URL and the discovered file were crawled
- Add discovery stage to progress mapping
- New "discovery" stage in progress flow
- Clear progress messages for discovered files
- Comprehensive test coverage (see the test sketch below)
- Tests for priority-based selection logic
- Tests for robots.txt priority and fallback behavior
- Updated existing tests for new return formats
Enables efficient crawling by selecting the single best guidance file instead of crawling redundant content from multiple similar files.
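
A sketch of the discovery selection logic; discover_best_file and its exists probe are hypothetical stand-ins for the real DiscoveryService internals:

```python
# Illustrative sketch of priority-based single-file discovery; helper names are hypothetical.
from typing import Callable, Optional
from urllib.parse import urljoin

# Highest priority first, mirroring the order above.
DISCOVERY_CANDIDATES = [
    "llms-full.txt",
    "llms.txt",
    "llms.md",
    "llms.mdx",
    "sitemap.xml",
    "robots.txt",
]

def discover_best_file(base_url: str, exists: Callable[[str], bool]) -> Optional[str]:
    """Return the single highest-priority guidance file that exists, or None."""
    for name in DISCOVERY_CANDIDATES:
        candidate = urljoin(base_url, name)
        if exists(candidate):
            return candidate  # only the best file is crawled; the rest are redundant
    return None

# Usage: `exists` would normally issue an HTTP probe; a set lookup stands in here.
available = {"https://example.com/llms.txt", "https://example.com/robots.txt"}
assert discover_best_file("https://example.com/", available.__contains__) == "https://example.com/llms.txt"
```
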
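
A helper sketch showing what the new URLHandler methods might look like; the method names come from the list above, but these bodies are assumptions rather than the project's implementation:

```python
# Illustrative helper implementations; the real URLHandler may differ.
from urllib.parse import urlparse

class URLHandler:
    def is_robots_txt(self, url: str) -> bool:
        return urlparse(url).path.lower().endswith("/robots.txt")

    def is_llms_variant(self, url: str) -> bool:
        name = urlparse(url).path.lower().rsplit("/", 1)[-1]
        return name in {"llms-full.txt", "llms.txt", "llms.md", "llms.mdx"}

    def is_well_known_file(self, url: str) -> bool:
        return "/.well-known/" in urlparse(url).path.lower()

    def get_base_url(self, url: str) -> str:
        parts = urlparse(url)
        return f"{parts.scheme}://{parts.netloc}"

handler = URLHandler()
assert handler.is_llms_variant("https://example.com/docs/llms-full.txt")
assert handler.is_robots_txt("https://example.com/robots.txt")
assert handler.get_base_url("https://example.com/docs/page?id=1") == "https://example.com"
```
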
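
An orchestration sketch of the decision, with a hypothetical select_crawl_targets helper standing in for the real CrawlingService logic:

```python
# Illustrative sketch; the real CrawlingService orchestration is more involved.
from typing import List, Optional

def select_crawl_targets(main_url: str, discovered_url: Optional[str]) -> List[str]:
    """Crawl only the discovered guidance file when there is one; otherwise crawl the main URL."""
    if discovered_url:
        # Previously both the main URL and the discovered file were crawled, duplicating content.
        return [discovered_url]
    return [main_url]

assert select_crawl_targets("https://example.com/", "https://example.com/llms.txt") == ["https://example.com/llms.txt"]
assert select_crawl_targets("https://example.com/", None) == ["https://example.com/"]
```
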
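
A test sketch in pytest style for the selection behavior, reusing the hypothetical discover_best_file defined above; the real tests use the project's own fixtures and interfaces:

```python
# Hypothetical tests built on the discover_best_file sketch above.
def test_llms_full_wins_over_llms_txt():
    available = {"https://example.com/llms-full.txt", "https://example.com/llms.txt"}
    assert discover_best_file("https://example.com/", available.__contains__) == "https://example.com/llms-full.txt"

def test_falls_back_to_robots_txt_when_no_llms_files():
    available = {"https://example.com/robots.txt"}
    assert discover_best_file("https://example.com/", available.__contains__) == "https://example.com/robots.txt"
```
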
- Add is_binary_file() method to URLHandler to detect 40+ binary extensions (see the sketch below)
- Update RecursiveCrawlStrategy to filter binary URLs before crawl queue
- Add comprehensive unit tests for binary file detection
- Prevents net::ERR_ABORTED errors when the crawler encounters ZIP, PDF, and other binary files
This fixes the issue where the crawler was treating binary file URLs
(like .zip downloads) as navigable web pages, causing errors in crawl4ai.
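
A minimal sketch of the extension check, assuming a small subset of the 40+ extensions; the real is_binary_file() list and logic may differ:

```python
# Illustrative sketch; the project's extension list is larger than this subset.
from urllib.parse import urlparse

BINARY_EXTENSIONS = {
    ".zip", ".tar", ".gz", ".7z", ".rar",
    ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx",
    ".png", ".jpg", ".jpeg", ".gif", ".webp", ".ico",
    ".mp3", ".mp4", ".avi", ".mov", ".exe", ".dmg", ".iso",
}

def is_binary_file(url: str) -> bool:
    """Return True when the URL path ends in a known binary extension."""
    path = urlparse(url).path.lower()
    return any(path.endswith(ext) for ext in BINARY_EXTENSIONS)

# The recursive strategy skips these before queueing, avoiding net::ERR_ABORTED in crawl4ai.
assert is_binary_file("https://example.com/downloads/release-1.2.zip")
assert not is_binary_file("https://example.com/docs/getting-started")
```
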