281 Commits

Author SHA1 Message Date
leex279
cdf4323534 feat: Implement llms.txt link following with discovery priority fix
Implements complete llms.txt link following functionality that crawls
linked llms.txt files on the same domain/subdomain, along with critical
bug fixes for discovery priority and variant detection.

Backend Core Functionality:
- Add _is_same_domain_or_subdomain method for subdomain matching
- Fix is_llms_variant to detect .txt files in /llms/ directories
- Implement llms.txt link extraction and following logic
- Add two-phase discovery: prioritize ALL llms.txt before sitemaps
- Enhanced progress reporting with discovery metadata
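
A minimal sketch of the subdomain check listed above, assuming it compares hostnames only and accepts a host that equals the base host or ends with a dot plus the base host. The function is a hypothetical stand-in for `_is_same_domain_or_subdomain`, not the code in crawling_service.py.

```python
from urllib.parse import urlparse


def is_same_domain_or_subdomain(url: str, base_url: str) -> bool:
    """Return True when url's host equals base_url's host or is a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    base = (urlparse(base_url).hostname or "").lower()
    if not host or not base:
        return False
    # The leading dot keeps "evilsupabase.com" from matching "supabase.com".
    return host == base or host.endswith("." + base)
```

Comparing on a dot boundary, rather than with a plain suffix test, is what keeps look-alike domains out of the crawl.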

Critical Bug Fixes:
- Discovery priority: Fixed sitemap.xml being found before llms.txt
- is_llms_variant: Now matches /llms/guides.txt, /llms/swift.txt, etc.
- These were blocking bugs preventing link following from working
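
The variant-detection fix could look roughly like the following. This is a sketch under assumptions: the set of recognised filenames is taken from the priority list elsewhere in this log, and the `/llms/` directory rule is inferred from the examples above. The real `is_llms_variant` in url_handler.py may differ.

```python
from urllib.parse import urlparse

# Assumed set of well-known llms filenames (hypothetical constant name).
LLMS_FILENAMES = {"llms.txt", "llms-full.txt", "llms.md", "llms.mdx"}


def is_llms_variant(url: str) -> bool:
    """Return True for llms.txt-style files, including .txt files under /llms/."""
    path = urlparse(url).path.lower()
    filename = path.rsplit("/", 1)[-1]
    if filename in LLMS_FILENAMES:
        return True
    # Linked files such as /llms/guides.txt or /llms/swift.txt.
    return "/llms/" in path and filename.endswith(".txt")
```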

Frontend UI:
- Add discovery and linked files display to CrawlingProgress component
- Update progress types to include discoveredFile, linkedFiles fields
- Add new crawl types: llms_txt_with_linked_files, discovery_*
- Add "discovery" to ProgressStatus enum and active statuses

Testing:
- 8 subdomain matching unit tests (test_crawling_service_subdomain.py)
- 7 integration tests for link following (test_llms_txt_link_following.py)
- All 15 tests passing
- Validated against real Supabase llms.txt structure (1 main + 8 linked)

Files Modified:
Backend:
- crawling_service.py: Core link following logic (lines 744-788, 862-920)
- url_handler.py: Fixed variant detection (lines 633-665)
- discovery_service.py: Two-phase discovery (lines 137-214)
- 2 new comprehensive test files

Frontend:
- progress/types/progress.ts: Updated types with new fields
- progress/components/CrawlingProgress.tsx: Added UI sections

Real-world testing: Crawling supabase.com/docs now discovers
/docs/llms.txt and automatically follows 8 linked llms.txt files,
indexing complete documentation from all files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-17 22:05:15 +02:00
leex279
a03ce1e4fd fix: Respect llms.txt priority over robots.txt sitemap declarations
Remove the special case that gave robots.txt sitemap declarations highest
priority, which incorrectly overrode the global priority order. Now properly
respects the intended priority: llms-full.txt > llms.txt > llms.md > llms.mdx >
sitemap.xml > robots.txt.

This fixes the issue where supabase.com/docs would return sitemap.xml instead
of llms.txt even though both files exist at /docs/ and llms.txt should have
higher priority.

Changes:
- Removed robots.txt early return that bypassed priority order
- Updated test to verify llms files take precedence over robots.txt sitemaps
- All discovery now follows consistent DISCOVERY_PRIORITY order
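
A consistent priority order can be sketched as a single pass over an ordered list. The list contents mirror the order stated above; the function name and the shape of the `found` mapping are assumptions for illustration.

```python
from typing import Dict, Optional

# Order taken from the commit message; highest priority first.
DISCOVERY_PRIORITY = [
    "llms-full.txt",
    "llms.txt",
    "llms.md",
    "llms.mdx",
    "sitemap.xml",
    "robots.txt",
]


def pick_best(found: Dict[str, str]) -> Optional[str]:
    """Return the URL of the highest-priority discovered file, or None.

    `found` maps a filename to the URL where it was discovered (hypothetical shape).
    """
    for name in DISCOVERY_PRIORITY:
        if name in found:
            return found[name]
    return None
```

With no early return for robots.txt, a sitemap declared there can only win when no llms file was found.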

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-17 19:37:14 +02:00
leex279
8777e9456c feat: Prioritize same-directory discovery for llms.txt and sitemaps
Improve discovery logic to check the same directory as the base URL first before
falling back to root-level and subdirectories. This ensures files like
https://supabase.com/docs/llms.txt are found when crawling
https://supabase.com/docs.

Changes:
- Check same directory as base_url first (e.g., /docs/llms.txt for /docs URL)
- Fall back to root-level urljoin behavior
- Include base directory name in subdirectory checks (e.g., /docs subdirectory)
- Maintain priority order: same-dir > root > subdirectories
- Log discovery location for better debugging

This addresses cases where documentation directories contain their own llms.txt
or sitemap files that should take precedence over root-level files.
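
The same-dir > root > subdirectories order might be generated as below. This is only a sketch: the function name and the default `subdirs` values are hypothetical, and the actual discovery_service.py logic may build its candidates differently.

```python
from typing import List, Sequence
from urllib.parse import urljoin, urlparse


def candidate_urls(base_url: str, filename: str,
                   subdirs: Sequence[str] = ("static", "public")) -> List[str]:
    """List candidate locations for filename, most specific first."""
    parsed = urlparse(base_url)
    origin = f"{parsed.scheme}://{parsed.netloc}"
    base_path = parsed.path.rstrip("/")
    ordered = []
    if base_path:
        # 1. Same directory as the base URL, e.g. /docs/llms.txt for /docs.
        ordered.append(f"{origin}{base_path}/{filename}")
    # 2. urljoin drops the last path segment, so /docs resolves to the root.
    ordered.append(urljoin(base_url, filename))
    # 3. Subdirectories, with the base directory name checked first.
    base_name = base_path.rsplit("/", 1)[-1] if base_path else ""
    for sub in ([base_name] if base_name else []) + list(subdirs):
        ordered.append(f"{origin}/{sub}/{filename}")
    # Remove duplicates while keeping the priority order.
    return list(dict.fromkeys(ordered))
```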

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-17 19:26:24 +02:00
leex279
e5160dde5c fix: Address CodeRabbit feedback for discovery service
- Preserve URL case in robots.txt parsing by only lowercasing the sitemap: prefix check
- Add support for relative sitemap paths in robots.txt using urljoin()
- Fix HTML meta tag parsing to use case-insensitive regex instead of lowercasing content
- Add URL scheme validation for discovered sitemaps (http/https only)
- Fix discovery target domain filtering to use discovered URL's domain instead of input URL
- Clean up whitespace and improve dict comprehension usage

These changes improve discovery reliability and prevent URL corruption while maintaining
backward compatibility with existing discovery behavior.
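
The robots.txt points above can be illustrated with a small parser. It is a sketch, not the discovery service's code: the function name is hypothetical, and it covers only the case-preserving prefix check, `urljoin()` for relative paths, and the http/https scheme filter.

```python
from typing import List
from urllib.parse import urljoin, urlparse


def extract_sitemaps(robots_txt: str, base_url: str) -> List[str]:
    """Collect sitemap URLs from robots.txt without altering their case."""
    sitemaps = []
    for line in robots_txt.splitlines():
        stripped = line.strip()
        # Lowercase only for the prefix check; the URL itself stays untouched.
        if not stripped.lower().startswith("sitemap:"):
            continue
        value = stripped[len("sitemap:"):].strip()
        if not value:
            continue
        # Relative paths are resolved against the base URL.
        absolute = urljoin(base_url, value)
        if urlparse(absolute).scheme in ("http", "https"):
            sitemaps.append(absolute)
    return sitemaps
```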

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-17 19:03:25 +02:00
leex279
968e5b73fe Add SSL verification and response size limits to discovery service
- Enable SSL certificate verification (verify=True) for all HTTP requests
- Implement streaming with size limits (10MB default) to prevent memory exhaustion
- Add _read_response_with_limit() helper for secure response reading
- Update all test mocks to support streaming API with iter_content()
- Fix test assertions to expect new security parameters
- Enforce deterministic rounding in progress mapper tests

Security improvements:
- Prevents MITM attacks through SSL verification
- Guards against DoS via oversized responses
- Ensures proper resource cleanup with response.close()
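
A size-limited read could be sketched as follows, assuming a requests-style response object that exposes `iter_content()` and `close()`. The function is a hypothetical stand-in for `_read_response_with_limit()`; the error type and chunk size are assumptions.

```python
def read_response_with_limit(response, max_bytes: int = 10 * 1024 * 1024,
                             chunk_size: int = 8192) -> bytes:
    """Read a streamed response body, failing once it exceeds max_bytes."""
    chunks = []
    total = 0
    try:
        for chunk in response.iter_content(chunk_size=chunk_size):
            total += len(chunk)
            if total > max_bytes:
                raise ValueError(f"response exceeds {max_bytes} bytes")
            chunks.append(chunk)
    finally:
        # Release the connection on success and on failure.
        response.close()
    return b"".join(chunks)
```

With requests, the response would need to be opened with `stream=True` so that the body is not loaded into memory before the limit is checked.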

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-14 22:31:19 +02:00
leex279
d696918ff0 Merge main into feature/automatic-discovery-llms-sitemap-430
Resolved merge conflicts by integrating features from both branches:
- Added page_storage_ops service initialization from main
- Merged link text extraction with discovery mode features
- Preserved discovery single-file mode and domain filtering
- Maintained link text fallbacks for title extraction

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-11 09:31:24 +02:00
sean-eskerium
4050c3540a Merge pull request #777 from coleam00/refactor/projects-ui
Refactor the UI and add Documents back.
2025-10-10 21:58:25 -04:00
Developer
ef4262681f Code rabbit issues fix again 2025-10-10 21:54:04 -04:00
Cole Medin
77e9342c27 Updating title extraction for llms.txt 2025-10-10 18:16:03 -05:00
Cole Medin
4a9ed51cff Adjusting table creation order in complete_setup.sql 2025-10-10 17:55:30 -05:00
Cole Medin
571e7c18c4 Correcting migrations in complete_setup.sql 2025-10-10 17:52:14 -05:00
Cole Medin
710909eecd Fixing up migration order 2025-10-10 17:50:41 -05:00
Developer
913f47ba62 code rabbit feedback 2025-10-10 18:40:25 -04:00
Developer
20c57acb00 Code rabbit feedback 2025-10-10 18:30:12 -04:00
DIY Smart Code
3168c8b69f fix: Set explicit PLAYWRIGHT_BROWSERS_PATH to fix browser installation (#765)
* fix: Set explicit PLAYWRIGHT_BROWSERS_PATH to fix browser installation

Fixes Playwright browser not found error during web crawling.

The issue was introduced in the uv migration (9f22659) where the
browser installation path was not explicitly set as a persistent
environment variable.

Changes:
- Add ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
- Add --with-deps flag to playwright install command
- Add comprehensive root cause analysis document

Without this fix, Playwright installed browsers to a default location
at build time but couldn't find them at runtime, causing crawling
operations to fail with "Executable doesn't exist" errors.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fix: Remove --with-deps flag to prevent build conflicts

The --with-deps flag was causing build failures on some systems because:
- We already manually install all Playwright dependencies (lines 26-49)
- --with-deps attempts to reinstall these packages
- This causes package conflicts and build failures on Windows/WSL

The core fix (ENV PLAYWRIGHT_BROWSERS_PATH) remains the same.

* Delete PLAYWRIGHT_FIX_ANALYSIS.md

---------

Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Cole Medin <cole@dynamous.ai>
2025-10-10 17:11:52 -05:00
sean-eskerium
7c3823e08f Fixes: crawl code storage issue with <think> tags for ollama models. (#775)
* Fixes: crawl code storage issue with <think> tags for ollama models.

* updates from code rabbit review
2025-10-10 17:09:53 -05:00
Developer
8ff39fa1d5 Merge branch 'main' into refactor/projects-ui
Merged in PR #776 (refactor/knowledge-ui) from main.
No conflicts - different features.
2025-10-10 17:08:05 -04:00
sean-eskerium
94e28f85fd Merge pull request #776 from coleam00/refactor/knowledge-ui
Refactoring the UI for consistent styling
2025-10-10 17:03:05 -04:00
sean-eskerium
e22c6c3836 fix code rabbit suggestions. 2025-10-10 14:42:01 -04:00
sean-eskerium
a860b27848 Refactor the UI and add Documents back. 2025-10-10 14:24:09 -04:00
sean-eskerium
691adccc12 Refactoring the UI for consistent styling 2025-10-10 03:36:35 -04:00
sean-eskerium
4ad1fb0808 Merge pull request #772 from coleam00/feature/ui-style-guide
Feature/UI style guide
2025-10-09 21:21:17 -04:00
sean-eskerium
88cb8d7f03 Update archon-ui-main/src/features/style-guide/layouts/ProjectsLayoutExample.tsx
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-10-09 21:17:00 -04:00
sean-eskerium
f0030699a8 Update archon-ui-main/src/features/style-guide/layouts/ProjectsLayoutExample.tsx
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
2025-10-09 21:15:14 -04:00
sean-eskerium
59f4568fda another round of code rabbit feedback 2025-10-09 21:05:12 -04:00
sean-eskerium
0013336ee3 Merge branch 'main' of https://github.com/coleam00/Archon into feature/ui-style-guide 2025-10-09 20:42:08 -04:00
sean-eskerium
ad82f6e9f6 Another round of Coderabbit feedback. 2025-10-09 20:40:47 -04:00
Cole Medin
bfd0a84f64 RAG Enhancements (Page Level Retrieval) (#767)
* Initial commit for RAG by document

* Phase 2

* Adding migrations

* Fixing page IDs for chunk metadata

* Fixing unit tests, adding tool to list pages for source

* Fixing page storage upsert issues

* Max file length for retrieval

* Fixing title issue

* Fixing tests
2025-10-09 19:39:27 -05:00
sean-eskerium
c3f42504ea code rabbit updates 2025-10-09 20:19:51 -04:00
sean-eskerium
98946817b4 Merge remote-tracking branch 'origin/main' into feature/ui-style-guide 2025-10-09 17:43:43 -04:00
sean-eskerium
02533dc37c Fixing Code Rabbit suggestions. 2025-10-09 16:23:32 -04:00
DIY Smart Code
e6d538fdd8 Merge pull request #769 from coleam00/crawl4ai-update
chore: update crawl4ai from 0.6.2 to 0.7.4
2025-10-09 21:52:36 +02:00
sean-eskerium
daf915c083 Fixes from biome and consistency review. 2025-10-09 14:26:37 -04:00
sean-eskerium
4e6116fa2f Fix consistency and biome formatting issues 2025-10-09 13:49:12 -04:00
sean-eskerium
9e4c7eaf4e Updating documentation and the review command refinement. 2025-10-09 13:35:24 -04:00
sean-eskerium
db538a5f46 Remove dead code 2025-10-09 12:14:36 -04:00
sean-eskerium
5c7924f43d Merge main into feature/ui-style-guide
- Resolved package-lock.json conflict
- Kept Tailwind 4.1.2 upgrade from feature branch
- Merged main's updates (react-icons, file reorganization, new features)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-09 11:53:27 -04:00
sean-eskerium
6f173e403d remove prp docs 2025-10-09 11:49:41 -04:00
sean-eskerium
bebe4c1037 candidate for release 2025-10-09 11:49:03 -04:00
Wirasm
489415d723 Fix: Database timeout when deleting large sources (#737)
* fix: implement CASCADE DELETE for source deletion timeout issue

- Add migration 009 to add CASCADE DELETE constraints to foreign keys
- Simplify delete_source() to only delete parent record
- Database now handles cascading deletes efficiently
- Fixes timeout issues when deleting sources with thousands of pages
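
The effect of the CASCADE DELETE constraint can be demonstrated with an in-memory SQLite database. This is only an illustration: the table and column names are hypothetical, and SQLite stands in for the project's actual database so that the sketch runs on its own.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a parent/child schema with ON DELETE CASCADE and sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.execute("CREATE TABLE sources (source_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE pages ("
        " id INTEGER PRIMARY KEY,"
        " source_id TEXT NOT NULL"
        " REFERENCES sources(source_id) ON DELETE CASCADE)"
    )
    conn.execute("INSERT INTO sources VALUES ('demo')")
    conn.executemany("INSERT INTO pages (source_id) VALUES (?)", [("demo",)] * 1000)
    return conn


def delete_source(conn: sqlite3.Connection, source_id: str) -> None:
    """Delete only the parent row; the database removes the child rows."""
    conn.execute("DELETE FROM sources WHERE source_id = ?", (source_id,))
    conn.commit()


conn = build_demo_db()
delete_source(conn, "demo")
remaining = conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0]
```

The application issues a single statement, and the dependent rows are removed by the database inside that same operation.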

* chore: update complete_setup.sql to include CASCADE DELETE constraints

- Add ON DELETE CASCADE to foreign keys in initial setup
- Include migration 009 in the migrations tracking
- Ensures new installations have CASCADE DELETE from the start
2025-10-09 17:52:06 +03:00
DIY Smart Code
00fe2599ad Delete python/test_url_resolution_fix.py 2025-10-09 16:05:37 +02:00
DIY Smart Code
f9a506b9c9 Delete CRAWL4AI_UPDATE.md 2025-10-09 16:04:58 +02:00
sean-eskerium
2e68403db0 update styles of the primitives. 2025-10-09 09:51:50 -04:00
sean-eskerium
80992ca975 Upgrade to Tailwind 4 2025-10-09 09:31:47 -04:00
sean-eskerium
70b6e70a95 trying to make the ui reviews programmatic 2025-10-09 07:59:54 -04:00
sean-eskerium
4cb7c46d6e fixing document browser and updating primitive tab styles. 2025-10-09 00:15:29 -04:00
sean-eskerium
17ca62ceb4 refining 2025-10-08 23:43:43 -04:00
sean-eskerium
5b839a1465 command for UI review, and settings to use primitives. 2025-10-08 18:38:12 -04:00
sean-eskerium
0727245c9d Update the projects layout and style guide. 2025-10-08 17:37:29 -04:00
leex279
8deee6fd7a chore: update crawl4ai from 0.6.2 to 0.7.4
Updates crawl4ai dependency to latest stable version with performance
and stability improvements.

Key improvements in 0.7.4:
- LLM-powered table extraction with intelligent chunking
- Fixed dispatcher bug for better concurrent processing
- Resolved browser manager race conditions
- Enhanced URL processing and proxy support

All existing tests pass (18/18). No breaking changes identified.
API remains backward compatible.

⚠️ IMPORTANT: URL Resolution Bug Status
A critical bug in v0.6.2 where ../../ paths only go up ONE directory
instead of TWO has been documented (see crawler-test branch). Status
in v0.7.4 is UNKNOWN - testing required before production deployment.
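
For reference, the expected result of a `../../` path can be checked against the Python standard library. This sketch does not exercise crawl4ai itself and is not the test script mentioned below; the example URL is made up.

```python
from urllib.parse import urljoin

# Page located three directories deep: /docs/guides/auth/
base = "https://example.com/docs/guides/auth/page.html"

# "../../" should climb two directories, from /docs/guides/auth/ to /docs/.
resolved = urljoin(base, "../../index.html")

# Climbing only one directory, as the reported bug does, would give this instead.
one_level_only = urljoin(base, "../index.html")
```

A crawler's resolved links can then be compared with `resolved`; a match with `one_level_only` would indicate that the bug is still present.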

Test script provided: python/test_url_resolution_fix.py

Related issues fixed in v0.7.x:
- #570: General relative URL handling
- #1268: URLs after redirects
- #1323: Trailing slash base URL handling

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-08 22:27:15 +02:00