Improve discovery logic to check the same directory as the base URL first before
falling back to the root level and subdirectories. This ensures files like
https://supabase.com/docs/llms.txt are found when crawling
https://supabase.com/docs.
Changes:
- Check same directory as base_url first (e.g., /docs/llms.txt for /docs URL)
- Fall back to root-level urljoin behavior
- Include base directory name in subdirectory checks (e.g., /docs subdirectory)
- Maintain priority order: same-dir > root > subdirectories (sketched below)
- Log discovery location for better debugging
This addresses cases where documentation directories contain their own llms.txt
or sitemap files that should take precedence over root-level files.
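A minimal sketch of that priority order (hypothetical helper; the name and the probed subdirectory list are assumptions, not the actual implementation):

```python
from urllib.parse import urlparse, urljoin

def candidate_urls(base_url: str, filename: str = "llms.txt") -> list[str]:
    """Build discovery candidates in priority order:
    same directory as base_url > site root > subdirectories."""
    parsed = urlparse(base_url)
    root = f"{parsed.scheme}://{parsed.netloc}"
    # Treat the base URL's path as a directory (e.g. /docs -> /docs/)
    base_dir = parsed.path if parsed.path.endswith("/") else parsed.path + "/"
    candidates = [
        urljoin(root + base_dir, filename),  # 1. same dir: /docs/llms.txt
        urljoin(root + "/", filename),       # 2. root:     /llms.txt
    ]
    # 3. subdirectories, including the base directory's own name
    subdirs = [base_dir.strip("/").split("/")[0], "static", ".well-known"]
    for sub in filter(None, subdirs):
        candidates.append(f"{root}/{sub}/{filename}")
    # Deduplicate while preserving priority order
    return list(dict.fromkeys(candidates))

print(candidate_urls("https://supabase.com/docs"))
# ['https://supabase.com/docs/llms.txt', 'https://supabase.com/llms.txt',
#  'https://supabase.com/static/llms.txt', 'https://supabase.com/.well-known/llms.txt']
```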
- Preserve URL case in robots.txt parsing by only lowercasing the sitemap: prefix check
- Add support for relative sitemap paths in robots.txt using urljoin() (see the sketch below)
- Fix HTML meta tag parsing to use case-insensitive regex instead of lowercasing content
- Add URL scheme validation for discovered sitemaps (http/https only)
- Fix discovery target domain filtering to use discovered URL's domain instead of input URL
- Clean up whitespace and improve dict comprehension usage
These changes improve discovery reliability and prevent URL corruption while maintaining
backward compatibility with existing discovery behavior.
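A sketch of the parsing rules described above (hypothetical helper name; the real code may differ):

```python
from urllib.parse import urljoin, urlparse

def sitemaps_from_robots(robots_txt: str, robots_url: str) -> list[str]:
    found = []
    for line in robots_txt.splitlines():
        # Lowercase only the prefix check so the URL's case is preserved
        if line.strip().lower().startswith("sitemap:"):
            value = line.split(":", 1)[1].strip()
            if not value:
                continue
            # Resolve relative paths (e.g. "Sitemap: /sitemap.xml")
            absolute = urljoin(robots_url, value)
            # Keep only http/https results
            if urlparse(absolute).scheme in ("http", "https"):
                found.append(absolute)
    return found

robots = "User-agent: *\nSitemap: /Sitemap_Index.XML\nSITEMAP: https://CDN.example.com/Docs/sitemap.xml"
print(sitemaps_from_robots(robots, "https://example.com/robots.txt"))
# ['https://example.com/Sitemap_Index.XML', 'https://CDN.example.com/Docs/sitemap.xml']
```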
- Enable SSL certificate verification (verify=True) for all HTTP requests
- Implement streaming with size limits (10MB default) to prevent memory exhaustion
- Add _read_response_with_limit() helper for secure response reading (sketched below)
- Update all test mocks to support streaming API with iter_content()
- Fix test assertions to expect new security parameters
- Enforce deterministic rounding in progress mapper tests
Security improvements:
- Prevents MITM attacks through SSL verification
- Guards against DoS via oversized responses
- Ensures proper resource cleanup with response.close()
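Roughly how the limit-enforcing reader could look (the helper name comes from the commit; the signature and chunk size are assumptions):

```python
import requests

MAX_BODY_BYTES = 10 * 1024 * 1024  # 10MB default limit

def _read_response_with_limit(response: requests.Response,
                              limit: int = MAX_BODY_BYTES) -> bytes:
    """Read a streamed response in chunks and abort past the size cap."""
    chunks, total = [], 0
    try:
        for chunk in response.iter_content(chunk_size=64 * 1024):
            total += len(chunk)
            if total > limit:
                raise ValueError(f"response exceeded {limit} bytes")
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        response.close()  # ensure the connection is always released

# Usage: stream with SSL verification on, then read under the cap
resp = requests.get("https://example.com/llms.txt",
                    stream=True, verify=True, timeout=30)
body = _read_response_with_limit(resp)
```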
Resolved merge conflicts by integrating features from both branches:
- Added page_storage_ops service initialization from main
- Merged link text extraction with discovery mode features
- Preserved discovery single-file mode and domain filtering
- Maintained link text fallbacks for title extraction
* fix: Set explicit PLAYWRIGHT_BROWSERS_PATH to fix browser installation
Fixes Playwright browser not found error during web crawling.
The issue was introduced in the uv migration (9f22659) where the
browser installation path was not explicitly set as a persistent
environment variable.
Changes:
- Add ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
- Add --with-deps flag to playwright install command
- Add comprehensive root cause analysis document
Without this fix, Playwright installed browsers to a default location
at build time but couldn't find them at runtime, causing crawling
operations to fail with "Executable doesn't exist" errors.
* fix: Remove --with-deps flag to prevent build conflicts
The --with-deps flag was causing build failures on some systems because:
- We already manually install all Playwright dependencies (lines 26-49)
- --with-deps attempts to reinstall these packages
- This causes package conflicts and build failures on Windows/WSL
The core fix (ENV PLAYWRIGHT_BROWSERS_PATH) remains the same.
* Delete PLAYWRIGHT_FIX_ANALYSIS.md
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Cole Medin <cole@dynamous.ai>
* Initial commit for RAG by document
* Phase 2
* Adding migrations
* Fixing page IDs for chunk metadata
* Fixing unit tests, adding tool to list pages for source
* Fixing page storage upsert issues
* Max file length for retrieval
* Fixing title issue
* Fixing tests
* fix: implement CASCADE DELETE for source deletion timeout issue
- Add migration 009 to add CASCADE DELETE constraints to foreign keys
- Simplify delete_source() to only delete the parent record (sketched below)
- Database now handles cascading deletes efficiently
- Fixes timeout issues when deleting sources with thousands of pages
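A minimal sketch of the simplified deletion, assuming a supabase-py client and a hypothetical archon_sources table/column name; with ON DELETE CASCADE in place, deleting the parent row is the whole operation:

```python
from supabase import create_client

supabase = create_client("https://YOUR-PROJECT.supabase.co", "YOUR-SERVICE-KEY")

def delete_source(source_id: str) -> None:
    """Delete only the parent source row. The ON DELETE CASCADE
    constraints added in migration 009 let the database remove the
    dependent pages, chunks, and code examples in one pass, instead
    of the thousands of client-side deletes that previously timed out."""
    supabase.table("archon_sources").delete().eq("source_id", source_id).execute()
```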
* chore: update complete_setup.sql to include CASCADE DELETE constraints
- Add ON DELETE CASCADE to foreign keys in initial setup
- Include migration 009 in the migrations tracking
- Ensures new installations have CASCADE DELETE from the start
Updates crawl4ai dependency to latest stable version with performance
and stability improvements.
Key improvements in 0.7.4:
- LLM-powered table extraction with intelligent chunking
- Fixed dispatcher bug for better concurrent processing
- Resolved browser manager race conditions
- Enhanced URL processing and proxy support
All existing tests pass (18/18). No breaking changes identified.
API remains backward compatible.
⚠️ IMPORTANT: URL Resolution Bug Status
A critical bug in v0.6.2 where ../../ paths only go up ONE directory
instead of TWO has been documented (see crawler-test branch). Status
in v0.7.4 is UNKNOWN - testing required before production deployment.
Test script provided: python/test_url_resolution_fix.py
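For reference, correct resolution per the URL standard, which the test script can assert against (stdlib urljoin shown; crawl4ai's internal resolver is the code under suspicion):

```python
from urllib.parse import urljoin

# Correct behavior: "../../" climbs TWO directories from the base page.
base = "https://example.com/docs/guide/advanced/page.html"
assert urljoin(base, "../../intro.html") == "https://example.com/docs/intro.html"
# The v0.6.2 bug resolved it one level too shallow:
#   https://example.com/docs/guide/intro.html  (wrong)
```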
Related issues fixed in v0.7.x:
- #570: General relative URL handling
- #1268: URLs after redirects
- #1323: Trailing slash base URL handling
* Add Anthropic and Grok provider support
* feat: Add crucial GPT-5 and reasoning model support for OpenRouter
- Add requires_max_completion_tokens() function for GPT-5, o1, o3, Grok-3 series
- Add prepare_chat_completion_params() for reasoning model compatibility
- Implement max_tokens → max_completion_tokens conversion for reasoning models (see the sketch after this list)
- Add temperature handling for reasoning models (must default to 1.0)
- Enhanced provider validation and API key security in provider endpoints
- Streamlined retry logic (3→2 attempts) for faster issue detection
- Add failure tracking and circuit breaker analysis for debugging
- Support OpenRouter format detection (openai/gpt-5-nano, openai/o1-mini)
- Improved Grok provider empty response handling with structured fallbacks
- Enhanced contextual embedding with provider-aware model selection
Core provider functionality:
- OpenRouter, Grok, Anthropic provider support with full embedding integration
- Provider-specific model defaults and validation
- Secure API connectivity testing endpoints
- Provider context passing for code generation workflows
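An illustrative sketch of the conversion described above (the function names come from the commit; the bodies and the model-prefix list are assumptions):

```python
def requires_max_completion_tokens(model: str) -> bool:
    """Reasoning models reject max_tokens and expect max_completion_tokens.
    Handles plain and OpenRouter-prefixed names (e.g. 'openai/o1-mini')."""
    name = model.split("/")[-1].lower()
    return name.startswith(("gpt-5", "o1", "o3", "grok-3"))

def prepare_chat_completion_params(model: str, params: dict) -> dict:
    """Rewrite request params so reasoning models accept them."""
    params = dict(params)  # don't mutate the caller's dict
    if requires_max_completion_tokens(model):
        if "max_tokens" in params:
            params["max_completion_tokens"] = params.pop("max_tokens")
        # Reasoning models only accept the default temperature of 1.0
        params["temperature"] = 1.0
    return params

print(prepare_chat_completion_params("openai/gpt-5-nano",
                                     {"max_tokens": 1024, "temperature": 0.2}))
# {'max_completion_tokens': 1024, 'temperature': 1.0}
```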
* fully working model providers, addressing security and code-related concerns, thoroughly hardening our code
* added multi-provider support and embeddings model support, cleaned up the PR; still need to fix the health check, asyncio task errors, and a contextual embeddings error
* fixed contextual embeddings issue
* Added inspect-aware shutdown handling so get_llm_client always closes the underlying AsyncOpenAI / httpx.AsyncClient while the loop is still alive, with defensive logging if shutdown happens late (python/src/server/services/llm_provider_service.py:14, python/src/server/services/llm_provider_service.py:520).
* Restructured get_llm_client so client creation and usage live in separate try/finally blocks; fallback clients now close without logging a spurious "Error creating LLM client" when downstream code raises (python/src/server/services/llm_provider_service.py:335-556).
- Close logic now sanitizes provider names consistently and awaits whichever aclose/close coroutine the SDK exposes, keeping loop shutdown clean (python/src/server/services/llm_provider_service.py:530-559).
- Robust JSON parsing: added _extract_json_payload to strip code fences and extra text returned by Ollama before json.loads runs, averting the markdown-induced decode errors seen in logs (python/src/server/services/storage/code_storage_service.py:40-63); see the sketch below.
- Swapped the direct parse call for the sanitized payload and emit a debug preview when cleanup alters the content (python/src/server/services/storage/code_storage_service.py:858-864).
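A minimal sketch of what _extract_json_payload likely does (the name is from the commit; the regex and fallback are assumptions):

```python
import json
import re

def _extract_json_payload(content: str) -> str:
    """Strip markdown code fences and surrounding prose so json.loads
    receives only the JSON object the model was asked for."""
    # Prefer an explicit ```json ... ``` (or bare ```) fenced block
    fence = re.search(r"```(?:json)?\s*(.*?)```", content, re.DOTALL)
    if fence:
        return fence.group(1).strip()
    # Otherwise fall back to the outermost {...} span, if any
    start, end = content.find("{"), content.rfind("}")
    if start != -1 and end > start:
        return content[start:end + 1]
    return content.strip()

raw = "Here you go:\n```json\n{\"examples\": []}\n```\nLet me know!"
print(json.loads(_extract_json_payload(raw)))  # {'examples': []}
```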
* added provider connection support
* added a warning when a provider API key is not configured
* Updated get_llm_client so missing OpenAI keys automatically fall back to Ollama (matching existing tests) and so unsupported providers still raise the legacy ValueError the suite expects. The fallback now reuses _get_optimal_ollama_instance and rethrows ValueError("OpenAI API key not found and Ollama fallback failed") when it can't connect. Adjusted test_code_extraction_source_id.py to accept the new optional argument on the mocked extractor (and confirm it's None when present).
* Resolved several CodeRabbit suggestions:
- Updated the knowledge API key validation to call create_embedding with the provider argument and removed the hard-coded OpenAI fallback (python/src/server/api_routes/knowledge_api.py).
- Broadened embedding provider detection so prefixed OpenRouter/OpenAI model names route through the correct client (python/src/server/services/embeddings/embedding_service.py, python/src/server/services/llm_provider_service.py).
- Removed the duplicate helper definitions from llm_provider_service.py, eliminating the stray docstring that was causing the import-time syntax error.
* updated via CodeRabbit PR review; CodeRabbit in my IDE found no issues and no nitpicks with the updates! What was done:
- Credential service now persists the provider under the uppercase key LLM_PROVIDER, matching the read path (no new EMBEDDING_PROVIDER usage introduced).
- Embedding batch creation stops inserting blank strings, logging failures and skipping invalid items before they ever hit the provider (python/src/server/services/embeddings/embedding_service.py).
- Contextual embedding prompts use real newline characters everywhere, both when constructing the batch prompt and when parsing the model's response (python/src/server/services/embeddings/contextual_embedding_service.py).
- Embedding provider routing already recognizes OpenRouter-prefixed OpenAI models via is_openai_embedding_model; no further change needed there.
- Embedding insertion now skips unsupported vector dimensions instead of forcing them into the 1536-dimension column, and the backoff loop uses await asyncio.sleep so we no longer block the event loop (python/src/server/services/storage/code_storage_service.py); see the sketch below.
- RAG settings props were extended to include LLM_INSTANCE_NAME and OLLAMA_EMBEDDING_INSTANCE_NAME, and the debug log no longer prints API-key prefixes (the rest of the TanStack refactor / EMBEDDING_PROVIDER support remains deferred).
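An illustrative sketch of the dimension guard and non-blocking backoff (names, the dimension set, and the retry policy are assumptions):

```python
import asyncio

SUPPORTED_DIMENSIONS = {768, 1024, 1536, 3072}  # assumed column sizes

async def insert_embedding(store, item_id: str, vector: list[float],
                           max_attempts: int = 3) -> bool:
    """Skip vectors with no matching column; retry transient failures
    with a backoff that never blocks the event loop."""
    if len(vector) not in SUPPORTED_DIMENSIONS:
        print(f"skipping {item_id}: unsupported dimension {len(vector)}")
        return False
    for attempt in range(max_attempts):
        try:
            await store(item_id, vector)
            return True
        except Exception:  # narrow to provider/database errors in real code
            if attempt == max_attempts - 1:
                raise
            # await (not time.sleep) so other tasks keep running
            await asyncio.sleep(2 ** attempt)
    return False
```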
* test fix
* enhanced OpenRouter's parsing logic to automatically detect reasoning models and parse their output whether or not it is JSON. This commit makes Archon's parsing work thoroughly with OpenRouter automatically, regardless of the model you're using, to ensure proper functionality without breaking any generation capabilities!
* updated the UI LLM interface, added a separate embeddings provider, and made the system fully capable of mixing and matching LLM providers (local and non-local) for chat & embeddings. Updated mainly the RAGSettings.tsx UI, along with core functionality
* added warning labels and updated Ollama health checks
* ready for review; fixed some error warnings and consolidated Ollama status health checks
* fixed FAILED test_async_embedding_service.py
* CodeRabbit fixes
* Separated the code-summary LLM provider from the embedding provider, so code example storage now forwards a dedicated embedding provider override end-to-end without hijacking the embedding pipeline. This addresses CodeRabbit's "Preserve provider override in create_embeddings_batch" suggestion.
* - Swapped API credential storage to booleans so decrypted keys never sit in React state (archon-ui-main/src/components/settings/RAGSettings.tsx).
- Normalized Ollama instance URLs and gated the metrics effect on real state changes to avoid miscounts and duplicate fetches (RAGSettings.tsx).
- Tightened crawl progress scaling and indented-block parsing to handle min_length=None safely (python/src/server/services/crawling/code_extraction_service.py:160, python/src/server/services/crawling/code_extraction_service.py:911).
- Added provider-agnostic embedding rate-limit retries so Google and friends back off gracefully (python/src/server/services/embeddings/embedding_service.py:427); see the sketch after this list.
- Made the orchestration registry async + thread-safe and updated every caller to await it (python/src/server/services/crawling/crawling_service.py:34, python/src/server/api_routes/knowledge_api.py:1291).
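A hedged sketch of provider-agnostic rate-limit detection (the marker list and helper names are assumptions):

```python
import asyncio
import random

RATE_LIMIT_MARKERS = ("rate limit", "429", "resource exhausted", "quota")

def is_rate_limited(exc: Exception) -> bool:
    """Inspect the error text rather than provider-specific exception
    classes, so OpenAI, Google, and others all trigger the same backoff."""
    message = str(exc).lower()
    return any(marker in message for marker in RATE_LIMIT_MARKERS)

async def embed_with_backoff(embed, texts: list[str], max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return await embed(texts)
        except Exception as exc:
            if not is_rate_limited(exc) or attempt == max_attempts - 1:
                raise
            # exponential backoff with jitter, non-blocking for the loop
            await asyncio.sleep(2 ** attempt + random.random())
```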
* Update RAGSettings.tsx - header for 'LLM Settings' is now 'LLM Provider Settings'
* (RAG Settings)
- Ollama Health Checks & Metrics
- Added a 10-second timeout to the health fetch so it doesn't hang.
- Adjusted logic so metric refreshes run for embedding-only Ollama setups too.
- Initial page load now checks Ollama if either chat or embedding provider uses it.
- Metrics and alerts now respect which provider (chat/embedding) is currently selected.
- Provider Sync & Alerts
- Fixed a sync bug so the very first provider change updates settings as expected.
- Alerts now track the active provider (chat vs embedding) rather than only the LLM provider.
- Warnings about missing credentials now skip whichever provider is currently selected.
- Modals & Types
- Normalize URLs before handing them to selection modals to keep consistent data.
- Strengthened helper function types (getDisplayedChatModel, getModelPlaceholder, etc.).
(Crawling Service)
- Made the orchestration registry lock lazy-initialized to avoid issues in Python 3.12 and wrapped registry commands (register, unregister) in async calls. This keeps things thread-safe even during concurrent crawling and cancellation; a sketch follows.
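A sketch of a lazily locked registry under those constraints (class and method names are assumptions):

```python
import asyncio
from typing import Optional

class OrchestrationRegistry:
    """Tracks active crawl orchestrations behind an asyncio.Lock that is
    created lazily, on first use inside a running loop, rather than at
    import time where no event loop may exist yet."""

    def __init__(self) -> None:
        self._lock: Optional[asyncio.Lock] = None
        self._active: dict[str, object] = {}

    def _get_lock(self) -> asyncio.Lock:
        if self._lock is None:
            self._lock = asyncio.Lock()
        return self._lock

    async def register(self, progress_id: str, orchestration: object) -> None:
        async with self._get_lock():
            self._active[progress_id] = orchestration

    async def unregister(self, progress_id: str) -> None:
        async with self._get_lock():
            self._active.pop(progress_id, None)
```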
* - migration/complete_setup.sql:101 seeds Google/OpenRouter/Anthropic/Grok API key rows so fresh databases expose every provider by default.
- migration/0.1.0/009_add_provider_placeholders.sql:1 backfills the same rows for existing Supabase instances and records the migration.
- archon-ui-main/src/components/settings/RAGSettings.tsx:121 introduces a shared credential-to-provider map, reloadApiCredentials runs through all five providers, and the status poller includes the new keys.
- archon-ui-main/src/components/settings/RAGSettings.tsx:353 subscribes to the archon:credentials-updated browser event so adding/removing a key immediately refetches credential status and pings the corresponding connectivity test.
- archon-ui-main/src/components/settings/RAGSettings.tsx:926 now treats missing Anthropic/OpenRouter/Grok keys as missing, preventing stale "connected" badges when a key is removed.
* - archon-ui-main/src/components/settings/RAGSettings.tsx:90 adds a simple display-name map and reuses one red alert style.
- archon-ui-main/src/components/settings/RAGSettings.tsx:1016 now shows exactly one red banner when the active provider's API key is missing.
- Removed the old duplicate Missing API Key Configuration block, so the panel no longer stacks two warnings.
* Update credentialsService.ts default model
* updated the Google embedding adapter for multi-dimensional RAG querying
* thought this micro-fix to the Google embedding adapter shipped with the embedding update the other day; it didn't. Pushing it now
---------
Co-authored-by: Chillbruhhh <joshchesser97@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>