archon/python/tests/test_code_extraction_source_id.py
Josh a580fdfe66 Feature/LLM-Providers-UI-Polished (#736)
* Add Anthropic and Grok provider support

* feat: Add crucial GPT-5 and reasoning model support for OpenRouter

- Add requires_max_completion_tokens() function for GPT-5, o1, o3, Grok-3 series
- Add prepare_chat_completion_params() for reasoning model compatibility
- Implement max_tokens → max_completion_tokens conversion for reasoning models
- Add temperature handling for reasoning models (must be 1.0 default)
- Enhanced provider validation and API key security in provider endpoints
- Streamlined retry logic (3→2 attempts) for faster issue detection
- Add failure tracking and circuit breaker analysis for debugging
- Support OpenRouter format detection (openai/gpt-5-nano, openai/o1-mini)
- Improved Grok provider empty response handling with structured fallbacks
- Enhanced contextual embedding with provider-aware model selection
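
A minimal sketch of the max_tokens → max_completion_tokens conversion described above. The function names come from this commit message, but the bodies, the REASONING_MODEL_PREFIXES tuple, and the choice to force temperature to 1.0 are assumptions for illustration, not the code in llm_provider_service.py:

```python
from typing import Any

# Hypothetical prefix list; the real detection rules are not shown in this commit message.
REASONING_MODEL_PREFIXES = ("gpt-5", "o1", "o3", "grok-3")


def requires_max_completion_tokens(model: str) -> bool:
    """Return True for reasoning-style models, including OpenRouter names like 'openai/o1-mini'."""
    # Drop an OpenRouter-style vendor prefix such as "openai/" before matching.
    bare_name = model.lower().split("/")[-1]
    return bare_name.startswith(REASONING_MODEL_PREFIXES)


def prepare_chat_completion_params(model: str, params: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of params adjusted for reasoning models; other models pass through unchanged."""
    prepared = dict(params)
    if not requires_max_completion_tokens(model):
        return prepared
    # Reasoning models reject max_tokens, so rename the key rather than dropping the limit.
    if "max_tokens" in prepared:
        prepared["max_completion_tokens"] = prepared.pop("max_tokens")
    # Assumption: a caller-supplied temperature is reset to the 1.0 default.
    if "temperature" in prepared:
        prepared["temperature"] = 1.0
    return prepared
```

Returning a copy keeps the caller's dict untouched, so the same params can be reused for a fallback provider that still expects max_tokens.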

Core provider functionality:
- OpenRouter, Grok, Anthropic provider support with full embedding integration
- Provider-specific model defaults and validation
- Secure API connectivity testing endpoints
- Provider context passing for code generation workflows

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

* fully working model providers, addressing security and code-related concerns, thoroughly hardening our code

* added multi-provider support, embeddings model support, cleaned the PR; still need to fix the health check, asyncio task errors, and contextual embeddings error

* fixed contextual embeddings issue

* - Added inspect-aware shutdown handling so get_llm_client always closes the underlying AsyncOpenAI / httpx.AsyncClient while the loop is still alive, with defensive logging if shutdown happens late (python/src/server/services/llm_provider_service.py:14, python/src/server/services/llm_provider_service.py:520).

* - Restructured get_llm_client so client creation and usage live in separate try/finally blocks; fallback clients now close without logging a spurious "Error creating LLM client" when downstream code raises (python/src/server/services/llm_provider_service.py:335-556).
  - Close logic now sanitizes provider names consistently and awaits whichever aclose/close coroutine the SDK exposes, keeping the loop shut down cleanly (python/src/server/services/llm_provider_service.py:530-559).

  Robust JSON Parsing

  - Added _extract_json_payload to strip code fences / extra text returned by Ollama before json.loads runs, averting the markdown-induced decode errors you saw in logs (python/src/server/services/storage/code_storage_service.py:40-63).
  - Swapped the direct parse call for the sanitized payload and emit a debug preview when cleanup alters the content (python/src/server/services/storage/code_storage_service.py:858-864).

* added provider connection support

* added provider api key not being configured warning

* Updated get_llm_client so missing OpenAI keys automatically fall back to Ollama (matching existing tests) and so unsupported providers still raise the legacy ValueError the suite expects. The fallback now reuses _get_optimal_ollama_instance and rethrows ValueError("OpenAI API key not found and Ollama fallback failed") when it can't connect.

  Adjusted test_code_extraction_source_id.py to accept the new optional argument on the mocked extractor (and confirm it's None when present).

* Resolved a few needed CodeRabbit suggestions:
  - Updated the knowledge API key validation to call create_embedding with the provider argument and removed the hard-coded OpenAI fallback (python/src/server/api_routes/knowledge_api.py).
  - Broadened embedding provider detection so prefixed OpenRouter/OpenAI model names route through the correct client (python/src/server/services/embeddings/embedding_service.py, python/src/server/services/llm_provider_service.py).
  - Removed the duplicate helper definitions from llm_provider_service.py, eliminating the stray docstring that was causing the import-time syntax error.

* updated via CodeRabbit PR review; CodeRabbit in my IDE found no issues and no nitpicks with the updates! What was done:
  - Credential service now persists the provider under the uppercase key LLM_PROVIDER, matching the read path (no new EMBEDDING_PROVIDER usage introduced).
  - Embedding batch creation stops inserting blank strings, logging failures and skipping invalid items before they ever hit the provider (python/src/server/services/embeddings/embedding_service.py).
  - Contextual embedding prompts use real newline characters everywhere, both when constructing the batch prompt and when parsing the model's response (python/src/server/services/embeddings/contextual_embedding_service.py).
  - Embedding provider routing already recognizes OpenRouter-prefixed OpenAI models via is_openai_embedding_model; no further change needed there.
  - Embedding insertion now skips unsupported vector dimensions instead of forcing them into the 1536-dimension column, and the backoff loop uses await asyncio.sleep so we no longer block the event loop (python/src/server/services/storage/code_storage_service.py).
  - RAG settings props were extended to include LLM_INSTANCE_NAME and OLLAMA_EMBEDDING_INSTANCE_NAME, and the debug log no longer prints API-key prefixes (the rest of the TanStack refactor/EMBEDDING_PROVIDER support remains deferred).
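
To illustrate why the backoff loop uses await asyncio.sleep rather than a blocking sleep, here is a small sketch. retry_with_backoff and its parameters are hypothetical names for illustration, not the loop in code_storage_service.py:

```python
import asyncio


async def retry_with_backoff(operation, max_attempts: int = 3, base_delay: float = 0.5):
    """Await operation(), retrying failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_attempts:
                raise
            # asyncio.sleep yields to the event loop, so other tasks keep running
            # while this one waits; time.sleep here would stall every coroutine.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

A real loop would normally catch only the provider's retryable errors instead of a bare Exception.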

* test fix

* enhanced OpenRouter's parsing logic to automatically detect reasoning models and parse their output whether or not it is JSON. This commit gives Archon a robust way to parse OpenRouter responses automatically, regardless of the model you're using, without breaking any generation capabilities!

* updated the LLM UI, added a separate embeddings provider, and made the system fully capable of mixing and matching LLM providers (local and non-local) for chat & embeddings. Mainly updated the RAGSettings.tsx UI, along with core functionality

* added warning labels and updated ollama health checks

* ready for review, fixed some error warnings and consolidated Ollama status health checks

* fixed FAILED test_async_embedding_service.py

* code rabbit fixes

* Separated the code-summary LLM provider from the embedding provider, so code example storage now forwards a dedicated embedding provider override end-to-end without hijacking the embedding pipeline. This addresses CodeRabbit's suggestion (Preserve provider override in create_embeddings_batch)

* - Swapped API credential storage to booleans so decrypted keys never sit in React state (archon-ui-main/src/components/settings/RAGSettings.tsx).
  - Normalized Ollama instance URLs and gated the metrics effect on real state changes to avoid mis-counts and duplicate fetches (RAGSettings.tsx).
  - Tightened crawl progress scaling and indented-block parsing to handle min_length=None safely (python/src/server/services/crawling/code_extraction_service.py:160, python/src/server/services/crawling/code_extraction_service.py:911).
  - Added provider-agnostic embedding rate-limit retries so Google and friends back off gracefully (python/src/server/services/embeddings/embedding_service.py:427).
  - Made the orchestration registry async + thread-safe and updated every caller to await it (python/src/server/services/crawling/crawling_service.py:34, python/src/server/api_routes/knowledge_api.py:1291).

* Update RAGSettings.tsx - header for 'LLM Settings' is now 'LLM Provider Settings'

* (RAG Settings)

  - Ollama Health Checks & Metrics
      - Added a 10-second timeout to the health fetch so it doesn't hang.
      - Adjusted logic so metric refreshes run for embedding-only Ollama setups too.
      - Initial page load now checks Ollama if either chat or embedding provider uses it.
      - Metrics and alerts now respect which provider (chat/embedding) is currently selected.
  - Provider Sync & Alerts
      - Fixed a sync bug so the very first provider change updates settings as expected.
      - Alerts now track the active provider (chat vs embedding) rather than only the LLM provider.
      - Warnings about missing credentials now skip whichever provider is currently selected.
  - Modals & Types
      - Normalize URLs before handing them to selection modals to keep consistent data.
      - Strengthened helper function types (getDisplayedChatModel, getModelPlaceholder, etc.).

 (Crawling Service)

  - Made the orchestration registry lock lazy-initialized to avoid issues in Python 3.12 and wrapped registry commands
  (register, unregister) in async calls. This keeps things thread-safe even during concurrent crawling and cancellation.

* - migration/complete_setup.sql:101 seeds Google/OpenRouter/Anthropic/Grok API key rows so fresh databases expose every
  provider by default.
  - migration/0.1.0/009_add_provider_placeholders.sql:1 backfills the same rows for existing Supabase instances and
  records the migration.
  - archon-ui-main/src/components/settings/RAGSettings.tsx:121 introduces a shared credential-to-provider map,
  reloadApiCredentials runs through all five providers, and the status poller includes the new keys.
  - archon-ui-main/src/components/settings/RAGSettings.tsx:353 subscribes to the archon:credentials-updated browser event
  so adding/removing a key immediately refetches credential status and pings the corresponding connectivity test.
  - archon-ui-main/src/components/settings/RAGSettings.tsx:926 now treats missing Anthropic/OpenRouter/Grok keys as
  missing, preventing stale connected badges when a key is removed.

* - archon-ui-main/src/components/settings/RAGSettings.tsx:90 adds a simple display-name map and reuses one red alert
  style.
  - archon-ui-main/src/components/settings/RAGSettings.tsx:1016 now shows exactly one red banner when the active provider
  - Removed the old duplicate Missing API Key Configuration block, so the panel no longer stacks two warnings.

* Update credentialsService.ts default model

* updated the Google embedding adapter for multi-dimensional RAG querying

* thought this micro fix in the Google embedding adapter was pushed with the embedding update the other day; it wasn't. Pushing now

---------

Co-authored-by: Chillbruhhh <joshchesser97@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
2025-10-05 13:49:09 -05:00

180 lines
7.4 KiB
Python

"""
Test that code extraction uses the correct source_id.

This test ensures that the fix for using hash-based source_ids
instead of domain-based source_ids works correctly.
"""

import pytest
from unittest.mock import Mock, AsyncMock, patch, MagicMock

from src.server.services.crawling.code_extraction_service import CodeExtractionService
from src.server.services.crawling.document_storage_operations import DocumentStorageOperations


class TestCodeExtractionSourceId:
    """Test that code extraction properly uses the provided source_id."""

    @pytest.mark.asyncio
    async def test_code_extraction_uses_provided_source_id(self):
        """Test that code extraction uses the hash-based source_id, not domain."""
        # Create mock supabase client
        mock_supabase = Mock()
        mock_supabase.table.return_value.select.return_value.eq.return_value.execute.return_value.data = []

        # Create service instance
        code_service = CodeExtractionService(mock_supabase)

        # Track what gets passed to the internal extraction method
        extracted_blocks = []

        async def mock_extract_blocks(crawl_results, source_id, progress_callback=None, start=0, end=100, cancellation_check=None):
            # Simulate finding code blocks and verify source_id is passed correctly
            for doc in crawl_results:
                extracted_blocks.append({
                    "block": {"code": "print('hello')", "language": "python"},
                    "source_url": doc["url"],
                    "source_id": source_id  # This should be the provided source_id
                })
            return extracted_blocks

        code_service._extract_code_blocks_from_documents = mock_extract_blocks
        code_service._generate_code_summaries = AsyncMock(return_value=[{"summary": "Test code"}])
        code_service._prepare_code_examples_for_storage = Mock(return_value=[
            {"source_id": extracted_blocks[0]["source_id"] if extracted_blocks else None}
        ])
        code_service._store_code_examples = AsyncMock(return_value=1)

        # Test data
        crawl_results = [
            {
                "url": "https://docs.mem0.ai/example",
                "markdown": "```python\nprint('hello')\n```"
            }
        ]
        url_to_full_document = {
            "https://docs.mem0.ai/example": "Full content with code"
        }

        # The correct hash-based source_id
        correct_source_id = "393224e227ba92eb"

        # Call the method with the correct source_id
        result = await code_service.extract_and_store_code_examples(
            crawl_results,
            url_to_full_document,
            correct_source_id,
            None
        )

        # Verify that extracted blocks use the correct source_id
        assert len(extracted_blocks) > 0, "Should have extracted at least one code block"
        for block in extracted_blocks:
            # Check that it's using the hash-based source_id, not the domain
            assert block["source_id"] == correct_source_id, \
                f"Should use hash-based source_id '{correct_source_id}', not domain"
            assert block["source_id"] != "docs.mem0.ai", \
                "Should NOT use domain-based source_id"

    @pytest.mark.asyncio
    async def test_document_storage_passes_source_id(self):
        """Test that DocumentStorageOperations passes source_id to code extraction."""
        # Create mock supabase client
        mock_supabase = Mock()

        # Create DocumentStorageOperations instance
        doc_storage = DocumentStorageOperations(mock_supabase)

        # Mock the code extraction service
        mock_extract = AsyncMock(return_value=5)
        doc_storage.code_extraction_service.extract_and_store_code_examples = mock_extract

        # Test data
        crawl_results = [{"url": "https://example.com", "markdown": "test"}]
        url_to_full_document = {"https://example.com": "test content"}
        source_id = "abc123def456"

        # Call the wrapper method
        result = await doc_storage.extract_and_store_code_examples(
            crawl_results,
            url_to_full_document,
            source_id,
            None
        )

        # Verify the correct source_id was passed (now with cancellation_check parameter)
        mock_extract.assert_called_once()
        args, kwargs = mock_extract.call_args
        assert args[0] == crawl_results
        assert args[1] == url_to_full_document
        assert args[2] == source_id
        assert args[3] is None
        assert args[4] is None
        assert args[5] is None
        assert args[6] is None
        assert result == 5

    @pytest.mark.asyncio
    async def test_no_domain_extraction_from_url(self):
        """Test that we're NOT extracting domain from URL anymore."""
        mock_supabase = Mock()
        mock_supabase.table.return_value.select.return_value.eq.return_value.execute.return_value.data = []
        code_service = CodeExtractionService(mock_supabase)

        # Patch internal methods
        code_service._get_setting = AsyncMock(return_value=True)

        # Create a mock that will track what source_id is used
        source_ids_seen = []
        original_extract = code_service._extract_code_blocks_from_documents

        async def track_source_id(crawl_results, source_id, progress_callback=None, cancellation_check=None):
            source_ids_seen.append(source_id)
            return []  # Return empty list to skip further processing

        code_service._extract_code_blocks_from_documents = track_source_id

        # Test with various URLs that would produce different domains
        test_cases = [
            ("https://github.com/example/repo", "github123abc"),
            ("https://docs.python.org/guide", "python456def"),
            ("https://api.openai.com/v1", "openai789ghi"),
        ]
        for url, expected_source_id in test_cases:
            source_ids_seen.clear()
            crawl_results = [{"url": url, "markdown": "# Test"}]
            url_to_full_document = {url: "Full content"}
            await code_service.extract_and_store_code_examples(
                crawl_results,
                url_to_full_document,
                expected_source_id,
                None
            )

            # Verify the provided source_id was used
            assert len(source_ids_seen) == 1
            assert source_ids_seen[0] == expected_source_id

            # Verify it's NOT the domain
            assert "github.com" not in source_ids_seen[0]
            assert "python.org" not in source_ids_seen[0]
            assert "openai.com" not in source_ids_seen[0]

    def test_urlparse_not_imported(self):
        """Test that urlparse is not imported in code_extraction_service."""
        import src.server.services.crawling.code_extraction_service as module

        # Check that urlparse is not in the module's namespace
        assert not hasattr(module, 'urlparse'), \
            "urlparse should not be imported in code_extraction_service"

        # Check the module's actual imports
        import inspect
        source = inspect.getsource(module)
        assert "from urllib.parse import urlparse" not in source, \
            "Should not import urlparse since we don't extract domain from URL anymore"