- Fix progress regression: map crawl callback progress through ProgressMapper (see the progress-mapping sketch below)
- Prevents UI progress bars from jumping backwards
- Ensures consistent progress reporting across all stages
- Add same-domain filtering for discovered file link following (see the domain-check sketch below)
- Discovery files (e.g. llms.txt) may follow links, but only within the same domain
- Prevents external crawling while preserving related AI guidance
- Add _is_same_domain() method for domain comparison
- Fix filename filtering false positives with token-aware regex matching (see the filename-pattern sketch below)
- Replace substring 'full' check with token-aware regex pattern
- Prevents excluding files like "helpful.md" or "meaningful.txt"
- Only excludes actual "full" variants like "llms-full.txt"
- Add llms-full.txt to URLHandler detection patterns (covered in the filename-pattern sketch below)
- Adds support for the highest-priority discovery file format
- Ensures proper file type detection for link following logic
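
A minimal sketch of the progress mapping, assuming each stage owns a slice of the overall 0-100 range; the stage names, ranges, and method signature here are assumptions, not the project's actual values:

```python
# Illustrative only: the real ProgressMapper lives in the project; names and
# ranges here are assumptions for this sketch.
class ProgressMapper:
    # Each stage owns a slice of the overall 0-100 range.
    STAGE_RANGES = {
        "discovery": (0, 10),
        "crawling": (10, 90),
        "storage": (90, 100),
    }

    def __init__(self) -> None:
        self._last_overall = 0

    def map_progress(self, stage: str, stage_progress: float) -> int:
        """Map stage-local progress (0-100) into the overall range, never going backwards."""
        start, end = self.STAGE_RANGES[stage]
        overall = start + (end - start) * min(max(stage_progress, 0), 100) / 100
        # Clamp so the reported value is monotonically non-decreasing.
        self._last_overall = max(self._last_overall, int(overall))
        return self._last_overall


# Crawl callbacks report stage-local values; routing them through the mapper keeps the UI bar monotonic.
mapper = ProgressMapper()
assert mapper.map_progress("crawling", 50) == 50
assert mapper.map_progress("crawling", 25) == 50  # a stale callback cannot move the bar backwards
```
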
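
A minimal sketch of the domain check; the real _is_same_domain() may handle subdomains or ports differently:

```python
# Illustrative sketch of the same-domain check used when following discovered links.
from urllib.parse import urlparse

def _is_same_domain(url: str, base_url: str) -> bool:
    """Return True when both URLs share the same host."""
    host = urlparse(url).hostname or ""
    base_host = urlparse(base_url).hostname or ""
    return host == base_host

assert _is_same_domain("https://example.com/docs/llms.txt", "https://example.com/")
assert not _is_same_domain("https://other.org/llms.txt", "https://example.com/")
```
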
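
A filename-pattern sketch covering both changes: the token-aware "full" matching plus an illustrative detection pattern that includes llms-full.txt; the actual regexes in URLHandler may differ:

```python
# Illustrative patterns; the project's actual regexes may differ.
import re

# Token-aware check: "full" must stand alone (start/end of name or delimited by - _ .),
# so names that merely contain the letters are not excluded.
FULL_VARIANT_RE = re.compile(r"(?:^|[-_.])full(?:[-_.]|$)", re.IGNORECASE)

# Detection pattern for llms discovery files, now including the llms-full.txt variant.
LLMS_VARIANT_RE = re.compile(r"(?:^|/)llms(?:-full)?\.(?:txt|md|mdx)$", re.IGNORECASE)

def is_full_variant(filename: str) -> bool:
    return bool(FULL_VARIANT_RE.search(filename))

assert is_full_variant("llms-full.txt")
assert not is_full_variant("helpful.md")
assert not is_full_variant("meaningful.txt")
assert not is_full_variant("fullerene.md")  # a plain substring check would wrongly flag this
assert LLMS_VARIANT_RE.search("https://example.com/llms-full.txt")
```
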
- Add DiscoveryService with single-file priority selection (see the discovery sketch below)
- Priority: llms-full.txt > llms.txt > llms.md > llms.mdx > sitemap.xml > robots.txt
- These files all contain similar AI/crawling guidance, so only the best one is needed
- Sitemaps declared in robots.txt take priority over the default sitemap.xml location
- Falls back to checking subdirectories for llms files
- Enhance URLHandler with discovery helper methods (see the helper sketch below)
- Add is_robots_txt, is_llms_variant, is_well_known_file, get_base_url methods
- Follow existing patterns with proper error handling
- Integrate discovery into CrawlingService orchestration (see the orchestration sketch below)
- When discovery finds a file: crawl ONLY the discovered file (not the main URL)
- When nothing is discovered: crawl the main URL normally
- Fixes the issue where both the main URL and the discovered file were crawled
- Add discovery stage to progress mapping
- New "discovery" stage in progress flow
- Clear progress messages for discovered files
- Comprehensive test coverage (see the test sketch below)
- Tests for priority-based selection logic
- Tests for robots.txt priority and fallback behavior
- Updated existing tests for new return formats
Enables efficient crawling by selecting the single best guidance file instead of crawling redundant content from multiple similar files.
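
A sketch of the discovery selection logic; discover_best_file and its exists probe are hypothetical stand-ins for the real DiscoveryService internals:

```python
# Illustrative sketch of priority-based single-file discovery; helper names are hypothetical.
from typing import Callable, Optional
from urllib.parse import urljoin

# Highest priority first, mirroring the order above.
DISCOVERY_CANDIDATES = [
    "llms-full.txt",
    "llms.txt",
    "llms.md",
    "llms.mdx",
    "sitemap.xml",
    "robots.txt",
]

def discover_best_file(base_url: str, exists: Callable[[str], bool]) -> Optional[str]:
    """Return the single highest-priority guidance file that exists, or None."""
    for name in DISCOVERY_CANDIDATES:
        candidate = urljoin(base_url, name)
        if exists(candidate):
            return candidate  # only the best file is crawled; the rest are redundant
    return None

# Usage: `exists` would normally issue an HTTP probe; a set lookup stands in here.
available = {"https://example.com/llms.txt", "https://example.com/robots.txt"}
assert discover_best_file("https://example.com/", available.__contains__) == "https://example.com/llms.txt"
```
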
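
A helper sketch showing what the new URLHandler methods might look like; the method names come from the list above, but these bodies are assumptions rather than the project's implementation:

```python
# Illustrative helper implementations; the real URLHandler may differ.
from urllib.parse import urlparse

class URLHandler:
    def is_robots_txt(self, url: str) -> bool:
        return urlparse(url).path.lower().endswith("/robots.txt")

    def is_llms_variant(self, url: str) -> bool:
        name = urlparse(url).path.lower().rsplit("/", 1)[-1]
        return name in {"llms-full.txt", "llms.txt", "llms.md", "llms.mdx"}

    def is_well_known_file(self, url: str) -> bool:
        return "/.well-known/" in urlparse(url).path.lower()

    def get_base_url(self, url: str) -> str:
        parts = urlparse(url)
        return f"{parts.scheme}://{parts.netloc}"

handler = URLHandler()
assert handler.is_llms_variant("https://example.com/docs/llms-full.txt")
assert handler.is_robots_txt("https://example.com/robots.txt")
assert handler.get_base_url("https://example.com/docs/page?id=1") == "https://example.com"
```
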
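
An orchestration sketch of the decision, with a hypothetical select_crawl_targets helper standing in for the real CrawlingService logic:

```python
# Illustrative sketch; the real CrawlingService orchestration is more involved.
from typing import List, Optional

def select_crawl_targets(main_url: str, discovered_url: Optional[str]) -> List[str]:
    """Crawl only the discovered guidance file when there is one; otherwise crawl the main URL."""
    if discovered_url:
        # Previously both the main URL and the discovered file were crawled, duplicating content.
        return [discovered_url]
    return [main_url]

assert select_crawl_targets("https://example.com/", "https://example.com/llms.txt") == ["https://example.com/llms.txt"]
assert select_crawl_targets("https://example.com/", None) == ["https://example.com/"]
```
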
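
A test sketch in pytest style for the selection behavior, reusing the hypothetical discover_best_file defined above; the real tests use the project's own fixtures and interfaces:

```python
# Hypothetical tests built on the discover_best_file sketch above.
def test_llms_full_wins_over_llms_txt():
    available = {"https://example.com/llms-full.txt", "https://example.com/llms.txt"}
    assert discover_best_file("https://example.com/", available.__contains__) == "https://example.com/llms-full.txt"

def test_falls_back_to_robots_txt_when_no_llms_files():
    available = {"https://example.com/robots.txt"}
    assert discover_best_file("https://example.com/", available.__contains__) == "https://example.com/robots.txt"
```
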
- Add is_binary_file() method to URLHandler to detect 40+ binary extensions (see the sketch below)
- Update RecursiveCrawlStrategy to filter binary URLs before crawl queue
- Add comprehensive unit tests for binary file detection
- Prevents net::ERR_ABORTED errors when the crawler encounters ZIP, PDF, and other binary files
This fixes the issue where the crawler was treating binary file URLs
(like .zip downloads) as navigable web pages, causing errors in crawl4ai.
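
A minimal sketch of the extension check, assuming a small subset of the 40+ extensions; the real is_binary_file() list and logic may differ:

```python
# Illustrative sketch; the project's extension list is larger than this subset.
from urllib.parse import urlparse

BINARY_EXTENSIONS = {
    ".zip", ".tar", ".gz", ".7z", ".rar",
    ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx",
    ".png", ".jpg", ".jpeg", ".gif", ".webp", ".ico",
    ".mp3", ".mp4", ".avi", ".mov", ".exe", ".dmg", ".iso",
}

def is_binary_file(url: str) -> bool:
    """Return True when the URL path ends in a known binary extension."""
    path = urlparse(url).path.lower()
    return any(path.endswith(ext) for ext in BINARY_EXTENSIONS)

# The recursive strategy skips these before queueing, avoiding net::ERR_ABORTED in crawl4ai.
assert is_binary_file("https://example.com/downloads/release-1.2.zip")
assert not is_binary_file("https://example.com/docs/getting-started")
```
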