sauce aow

Rasmus Widing
2025-10-08 21:39:04 +03:00
parent 3f0815b686
commit 9a60d6ae89
84 changed files with 17939 additions and 2 deletions

PRPs/PRD.md (new file, 1780 lines)

File diff suppressed because it is too large


@@ -0,0 +1,89 @@
# CLI reference
> Complete reference for Claude Code command-line interface, including commands and flags.
## CLI commands
| Command | Description | Example |
| :--------------------------------- | :--------------------------------------------- | :----------------------------------------------------------------- |
| `claude` | Start interactive REPL | `claude` |
| `claude "query"` | Start REPL with initial prompt | `claude "explain this project"` |
| `claude -p "query"` | Query via SDK, then exit | `claude -p "explain this function"` |
| `cat file \| claude -p "query"` | Process piped content | `cat logs.txt \| claude -p "explain"` |
| `claude -c` | Continue most recent conversation | `claude -c` |
| `claude -c -p "query"` | Continue via SDK | `claude -c -p "Check for type errors"` |
| `claude -r "<session-id>" "query"` | Resume session by ID | `claude -r "abc123" "Finish this PR"` |
| `claude update` | Update to latest version | `claude update` |
| `claude mcp` | Configure Model Context Protocol (MCP) servers | See the [Claude Code MCP documentation](/en/docs/claude-code/mcp). |
## CLI flags
Customize Claude Code's behavior with these command-line flags:
| Flag | Description | Example |
| :------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------- |
| `--add-dir` | Add additional working directories for Claude to access (validates each path exists as a directory) | `claude --add-dir ../apps ../lib` |
| `--agents` | Define custom [subagents](/en/docs/claude-code/sub-agents) dynamically via JSON (see below for format) | `claude --agents '{"reviewer":{"description":"Reviews code","prompt":"You are a code reviewer"}}'` |
| `--allowedTools` | A list of tools that should be allowed without prompting the user for permission, in addition to [settings.json files](/en/docs/claude-code/settings) | `"Bash(git log:*)" "Bash(git diff:*)" "Read"` |
| `--disallowedTools` | A list of tools that should be disallowed without prompting the user for permission, in addition to [settings.json files](/en/docs/claude-code/settings) | `"Bash(git log:*)" "Bash(git diff:*)" "Edit"` |
| `--print`, `-p` | Print response without interactive mode (see [SDK documentation](/en/docs/claude-code/sdk) for programmatic usage details) | `claude -p "query"` |
| `--append-system-prompt` | Append to system prompt (only with `--print`) | `claude --append-system-prompt "Custom instruction"` |
| `--output-format` | Specify output format for print mode (options: `text`, `json`, `stream-json`) | `claude -p "query" --output-format json` |
| `--input-format` | Specify input format for print mode (options: `text`, `stream-json`) | `claude -p --output-format json --input-format stream-json` |
| `--include-partial-messages` | Include partial streaming events in output (requires `--print` and `--output-format=stream-json`) | `claude -p --output-format stream-json --include-partial-messages "query"` |
| `--verbose` | Enable verbose logging, shows full turn-by-turn output (helpful for debugging in both print and interactive modes) | `claude --verbose` |
| `--max-turns` | Limit the number of agentic turns in non-interactive mode | `claude -p --max-turns 3 "query"` |
| `--model` | Sets the model for the current session with an alias for the latest model (`sonnet` or `opus`) or a model's full name | `claude --model claude-sonnet-4-5-20250929` |
| `--permission-mode` | Begin in a specified [permission mode](iam#permission-modes) | `claude --permission-mode plan` |
| `--permission-prompt-tool` | Specify an MCP tool to handle permission prompts in non-interactive mode | `claude -p --permission-prompt-tool mcp_auth_tool "query"` |
| `--resume` | Resume a specific session by ID, or by choosing in interactive mode | `claude --resume abc123 "query"` |
| `--continue` | Load the most recent conversation in the current directory | `claude --continue` |
| `--dangerously-skip-permissions` | Skip permission prompts (use with caution) | `claude --dangerously-skip-permissions` |
<Tip>
The `--output-format json` flag is particularly useful for scripting and
automation, allowing you to parse Claude's responses programmatically.
</Tip>
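For scripting from Python, the same pattern looks roughly like this sketch; treat the `result` and `session_id` field names as assumptions about the JSON schema rather than a guaranteed contract:

```python
import json
import subprocess

# Run a one-shot query in print mode and capture structured JSON output.
proc = subprocess.run(
    ["claude", "-p", "explain this project", "--output-format", "json"],
    capture_output=True,
    text=True,
    check=True,
)

# Field names below are assumptions for illustration; check your CLI version's schema.
response = json.loads(proc.stdout)
print(response.get("result"))
print(response.get("session_id"))
```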
### Agents flag format
The `--agents` flag accepts a JSON object that defines one or more custom subagents. Each subagent requires a unique name (as the key) and a definition object with the following fields:
| Field | Required | Description |
| :------------ | :------- | :-------------------------------------------------------------------------------------------------------------- |
| `description` | Yes | Natural language description of when the subagent should be invoked |
| `prompt` | Yes | The system prompt that guides the subagent's behavior |
| `tools` | No | Array of specific tools the subagent can use (e.g., `["Read", "Edit", "Bash"]`). If omitted, inherits all tools |
| `model` | No | Model alias to use: `sonnet`, `opus`, or `haiku`. If omitted, uses the default subagent model |
Example:
```bash
claude --agents '{
"code-reviewer": {
"description": "Expert code reviewer. Use proactively after code changes.",
"prompt": "You are a senior code reviewer. Focus on code quality, security, and best practices.",
"tools": ["Read", "Grep", "Glob", "Bash"],
"model": "sonnet"
},
"debugger": {
"description": "Debugging specialist for errors and test failures.",
"prompt": "You are an expert debugger. Analyze errors, identify root causes, and provide fixes."
}
}'
```
For more details on creating and using subagents, see the [subagents documentation](/en/docs/claude-code/sub-agents).
For detailed information about print mode (`-p`) including output formats,
streaming, verbose logging, and programmatic usage, see the
[SDK documentation](/en/docs/claude-code/sdk).
## See also
- [Interactive mode](/en/docs/claude-code/interactive-mode) - Shortcuts, input modes, and interactive features
- [Slash commands](/en/docs/claude-code/slash-commands) - Interactive session commands
- [Quickstart guide](/en/docs/claude-code/quickstart) - Getting started with Claude Code
- [Common workflows](/en/docs/claude-code/common-workflows) - Advanced workflows and patterns
- [Settings](/en/docs/claude-code/settings) - Configuration options
- [SDK documentation](/en/docs/claude-code/sdk) - Programmatic usage and integrations

PRPs/prd-types.md (new file, 660 lines)

@@ -0,0 +1,660 @@
# Data Models for Agent Work Order System
**Purpose:** This document defines all data models needed for the agent work order feature in plain English.
**Philosophy:** Git-first architecture - store minimal state in database, compute everything else from git.
---
## Table of Contents
1. [Core Work Order Models](#core-work-order-models)
2. [Workflow & Phase Tracking](#workflow--phase-tracking)
3. [Sandbox Models](#sandbox-models)
4. [GitHub Integration](#github-integration)
5. [Agent Execution](#agent-execution)
6. [Logging & Observability](#logging--observability)
---
## Core Work Order Models
### AgentWorkOrderStateMinimal
**What it is:** The absolute minimum state we persist in database/Supabase.
**Purpose:** Following git-first philosophy - only store identifiers, query everything else from git.
**Where stored:**
- Phase 1: In-memory Python dictionary
- Phase 2+: Supabase database
**Fields:**
| Field Name | Type | Required | Description | Example |
|------------|------|----------|-------------|---------|
| `agent_work_order_id` | string | Yes | Unique identifier for this work order | `"wo-a1b2c3d4"` |
| `repository_url` | string | Yes | GitHub repository URL | `"https://github.com/user/repo.git"` |
| `sandbox_identifier` | string | Yes | Execution environment identifier | `"git-worktree-wo-a1b2c3d4"` or `"e2b-sb-xyz789"` |
| `git_branch_name` | string | No | Git branch created for this work order | `"feat-issue-42-wo-a1b2c3d4"` |
| `agent_session_id` | string | No | Claude Code session ID (for resumption) | `"session-xyz789"` |
**Why `sandbox_identifier` is separate from `git_branch_name`:**
- `git_branch_name` = Git concept (what branch the code is on)
- `sandbox_identifier` = Execution environment ID (where the agent runs)
- Git worktree: `sandbox_identifier = "/Users/user/.worktrees/wo-abc123"` (path to worktree)
- E2B: `sandbox_identifier = "e2b-sb-xyz789"` (E2B's sandbox ID)
- Dagger: `sandbox_identifier = "dagger-container-abc123"` (container ID)
**What we DON'T store:** Current phase, commit count, files changed, PR URL, test results, sandbox state (is_active) - all computed from git or sandbox APIs.
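As a rough Pydantic sketch (Pydantic is assumed because the repository's other models use `Field`; this is an illustration, not the shipped class):

```python
from pydantic import BaseModel, Field


class AgentWorkOrderStateMinimal(BaseModel):
    """The only state persisted to the database; everything else is computed from git."""

    agent_work_order_id: str = Field(..., description="Unique ID, e.g. 'wo-a1b2c3d4'")
    repository_url: str = Field(..., description="GitHub repository URL")
    sandbox_identifier: str = Field(..., description="Execution environment ID (worktree path, E2B sandbox ID, ...)")
    git_branch_name: str | None = Field(None, description="Branch created for this work order")
    agent_session_id: str | None = Field(None, description="Claude Code session ID, used for resumption")
```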
---
### AgentWorkOrder (Full Model)
**What it is:** Complete work order model combining database state + computed fields from git/GitHub.
**Purpose:** Used for API responses and UI display.
**Fields:**
**Core Identifiers (from database):**
- `agent_work_order_id` - Unique ID
- `repository_url` - GitHub repo URL
- `sandbox_identifier` - Execution environment ID (e.g., worktree path, E2B sandbox ID)
- `git_branch_name` - Branch name (null until created)
- `agent_session_id` - Claude session ID (null until started)
**Metadata (from database):**
- `workflow_type` - Which workflow to run (plan/implement/validate/plan_implement/plan_implement_validate)
- `sandbox_type` - Execution environment (git_branch/git_worktree/e2b/dagger)
- `agent_model_type` - Claude model (sonnet/opus/haiku)
- `status` - Current status (pending/initializing/running/completed/failed/cancelled)
- `github_issue_number` - Optional issue number
- `created_at` - When work order was created
- `updated_at` - Last update timestamp
- `execution_started_at` - When execution began
- `execution_completed_at` - When execution finished
- `error_message` - Error if failed
- `error_details` - Detailed error info
- `created_by_user_id` - User who created it (Phase 2+)
**Computed Fields (from git/GitHub - NOT in database; a derivation sketch follows this list):**
- `current_phase` - Current workflow phase (planning/implementing/validating/completed) - **computed by inspecting git commits**
- `github_pull_request_url` - PR URL - **computed from GitHub API**
- `github_pull_request_number` - PR number
- `git_commit_count` - Number of commits - **computed from `git log --oneline | wc -l`**
- `git_files_changed` - Files changed - **computed from `git diff --stat`**
- `git_lines_added` - Lines added - **computed from `git diff --stat`**
- `git_lines_removed` - Lines removed - **computed from `git diff --stat`**
- `latest_git_commit_sha` - Latest commit SHA
- `latest_git_commit_message` - Latest commit message
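A sketch of how these fields can be derived on demand (the git commands come from the list above; the helper names and the `main` base branch are illustrative assumptions):

```python
import subprocess


def _git(args: list[str], cwd: str) -> str:
    """Run a git command in the work order checkout and return stdout."""
    return subprocess.run(
        ["git", *args], cwd=cwd, capture_output=True, text=True, check=True
    ).stdout


def compute_git_fields(cwd: str, base_branch: str = "main") -> dict:
    """Derive the computed fields from git alone - nothing read from a database."""
    commits = _git(["log", "--oneline", f"{base_branch}..HEAD"], cwd).splitlines()
    # --shortstat prints e.g. "3 files changed, 42 insertions(+), 7 deletions(-)"
    shortstat = _git(["diff", "--shortstat", base_branch], cwd).strip()
    sha, _, message = _git(["log", "-1", "--format=%H %s"], cwd).strip().partition(" ")
    return {
        "git_commit_count": len(commits),
        "git_diff_shortstat": shortstat,  # parse files/added/removed from this line
        "latest_git_commit_sha": sha,
        "latest_git_commit_message": message,
    }
```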
---
### CreateAgentWorkOrderRequest
**What it is:** Request payload to create a new work order.
**Purpose:** Sent from frontend to backend to initiate work order.
**Fields:**
- `repository_url` - GitHub repo URL to work on
- `sandbox_type` - Which sandbox to use (git_branch/git_worktree/e2b/dagger)
- `workflow_type` - Which workflow to execute
- `agent_model_type` - Which Claude model to use (default: sonnet)
- `github_issue_number` - Optional issue to work on
- `initial_prompt` - Optional initial prompt to send to agent
---
### AgentWorkOrderResponse
**What it is:** Response after creating or fetching a work order.
**Purpose:** Returned by API endpoints.
**Fields:**
- `agent_work_order` - Full AgentWorkOrder object
- `logs_url` - URL to fetch execution logs
---
### ListAgentWorkOrdersRequest
**What it is:** Request to list work orders with filters.
**Purpose:** Support filtering and pagination in UI.
**Fields:**
- `status_filter` - Filter by status (array)
- `sandbox_type_filter` - Filter by sandbox type (array)
- `workflow_type_filter` - Filter by workflow type (array)
- `limit` - Results per page (default 50, max 100)
- `offset` - Pagination offset
- `sort_by` - Field to sort by (default: created_at)
- `sort_order` - asc or desc (default: desc)
---
### ListAgentWorkOrdersResponse
**What it is:** Response containing list of work orders.
**Fields:**
- `agent_work_orders` - Array of AgentWorkOrder objects
- `total_count` - Total matching work orders
- `has_more` - Whether more results available
- `offset` - Current offset
- `limit` - Current limit
---
## Workflow & Phase Tracking
### WorkflowPhaseHistoryEntry
**What it is:** Single phase execution record in workflow history.
**Purpose:** Track timing and commits for each workflow phase.
**How created:** Computed by analyzing git commits, not stored directly.
**Fields:**
- `phase_name` - Which phase (planning/implementing/validating/completed)
- `phase_started_at` - When phase began
- `phase_completed_at` - When phase finished (null if still running)
- `phase_duration_seconds` - Duration (if completed)
- `git_commits_in_phase` - Number of commits during this phase
- `git_commit_shas` - Array of commit SHAs from this phase
**Example:** "Planning phase started at 10:00:00, completed at 10:02:30, duration 150 seconds, 1 commit (abc123)"
---
### GitProgressSnapshot
**What it is:** Point-in-time snapshot of work order progress via git inspection.
**Purpose:** Polled by frontend every 3 seconds to show progress without streaming.
**How created:** Backend queries git to compute current state.
**Fields:**
- `agent_work_order_id` - Work order ID
- `current_phase` - Current workflow phase (computed from commits)
- `git_commit_count` - Total commits on branch
- `git_files_changed` - Total files changed
- `git_lines_added` - Total lines added
- `git_lines_removed` - Total lines removed
- `latest_commit_message` - Most recent commit message
- `latest_commit_sha` - Most recent commit SHA
- `latest_commit_timestamp` - When latest commit was made
- `snapshot_timestamp` - When this snapshot was taken
- `phase_history` - Array of WorkflowPhaseHistoryEntry objects
**Example UI usage:** Frontend polls `/api/agent-work-orders/{id}/git-progress` every 3 seconds to update progress bar.
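A sketch of that polling loop (Python is used for consistency with the rest of this commit; the endpoint path comes from the example above, everything else is illustrative):

```python
import time

import requests  # any HTTP client works; requests is assumed for brevity


def poll_git_progress(base_url: str, work_order_id: str, interval_seconds: float = 3.0) -> None:
    """Poll the git-progress endpoint until the work order reports completion."""
    while True:
        snapshot = requests.get(
            f"{base_url}/api/agent-work-orders/{work_order_id}/git-progress"
        ).json()
        print(snapshot["current_phase"], snapshot["git_commit_count"], snapshot["latest_commit_message"])
        if snapshot["current_phase"] == "completed":
            break
        time.sleep(interval_seconds)
```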
---
## Sandbox Models
### SandboxConfiguration
**What it is:** Configuration for creating a sandbox instance.
**Purpose:** Passed to sandbox factory to create appropriate sandbox type.
**Fields:**
**Common (all sandbox types):**
- `sandbox_type` - Type of sandbox (git_branch/git_worktree/e2b/dagger)
- `sandbox_identifier` - Unique ID (usually work order ID)
- `repository_url` - Repo to clone
- `git_branch_name` - Branch to create/use
- `environment_variables` - Env vars to set in sandbox (dict)
**E2B specific (Phase 2+):**
- `e2b_template_id` - E2B template ID
- `e2b_timeout_seconds` - Sandbox timeout
**Dagger specific (Phase 2+):**
- `dagger_image_name` - Docker image name
- `dagger_container_config` - Additional Dagger config (dict)
---
### SandboxState
**What it is:** Current state of an active sandbox.
**Purpose:** Query sandbox status, returned by `sandbox.get_current_state()`.
**Fields:**
- `sandbox_identifier` - Sandbox ID
- `sandbox_type` - Type of sandbox
- `is_active` - Whether sandbox is currently active
- `git_branch_name` - Current git branch
- `working_directory` - Current working directory in sandbox
- `sandbox_created_at` - When sandbox was created
- `last_activity_at` - Last activity timestamp
- `sandbox_metadata` - Sandbox-specific state (dict) - e.g., E2B sandbox ID, Docker container ID
---
### CommandExecutionResult
**What it is:** Result of executing a command in a sandbox.
**Purpose:** Returned by `sandbox.execute_command(command)`.
**Fields:**
- `command` - Command that was executed
- `exit_code` - Command exit code (0 = success)
- `stdout_output` - Standard output
- `stderr_output` - Standard error output
- `execution_success` - Whether command succeeded (exit_code == 0)
- `execution_duration_seconds` - How long command took
- `execution_timestamp` - When command was executed
---
## GitHub Integration
### GitHubRepository
**What it is:** GitHub repository information and access status.
**Purpose:** Store repository metadata after verification.
**Fields:**
- `repository_owner` - Owner username (e.g., "user")
- `repository_name` - Repo name (e.g., "repo")
- `repository_url` - Full URL (e.g., "https://github.com/user/repo.git")
- `repository_clone_url` - Git clone URL
- `default_branch` - Default branch name (usually "main")
- `is_accessible` - Whether we verified access
- `is_private` - Whether repo is private
- `access_verified_at` - When access was last verified
- `repository_description` - Repo description
---
### GitHubRepositoryVerificationRequest
**What it is:** Request to verify repository access.
**Purpose:** Frontend asks backend to verify it can access a repo.
**Fields:**
- `repository_url` - Repo URL to verify
---
### GitHubRepositoryVerificationResponse
**What it is:** Response from repository verification.
**Purpose:** Tell frontend whether repo is accessible.
**Fields:**
- `repository` - GitHubRepository object with details
- `verification_success` - Whether verification succeeded
- `error_message` - Error message if failed
- `error_details` - Detailed error info (dict)
---
### GitHubPullRequest
**What it is:** Pull request model.
**Purpose:** Represent a created PR.
**Fields:**
- `pull_request_number` - PR number
- `pull_request_title` - PR title
- `pull_request_body` - PR description
- `pull_request_url` - PR URL
- `pull_request_state` - State (open/closed/merged)
- `pull_request_head_branch` - Source branch
- `pull_request_base_branch` - Target branch
- `pull_request_author` - GitHub user who created PR
- `pull_request_created_at` - When created
- `pull_request_updated_at` - When last updated
- `pull_request_merged_at` - When merged (if applicable)
- `pull_request_is_draft` - Whether it's a draft PR
---
### CreateGitHubPullRequestRequest
**What it is:** Request to create a pull request.
**Purpose:** Backend creates PR after work order completes.
**Fields:**
- `repository_owner` - Repo owner
- `repository_name` - Repo name
- `pull_request_title` - PR title
- `pull_request_body` - PR description
- `pull_request_head_branch` - Source branch (work order branch)
- `pull_request_base_branch` - Target branch (usually "main")
- `pull_request_is_draft` - Create as draft (default: false)
---
### GitHubIssue
**What it is:** GitHub issue model.
**Purpose:** Link work orders to GitHub issues.
**Fields:**
- `issue_number` - Issue number
- `issue_title` - Issue title
- `issue_body` - Issue description
- `issue_state` - State (open/closed)
- `issue_author` - User who created issue
- `issue_assignees` - Assigned users (array)
- `issue_labels` - Labels (array)
- `issue_created_at` - When created
- `issue_updated_at` - When last updated
- `issue_closed_at` - When closed
- `issue_url` - Issue URL
---
## Agent Execution
### AgentCommandDefinition
**What it is:** Represents a Claude Code slash command loaded from `.claude/commands/*.md`.
**Purpose:** Catalog available commands for workflows.
**Fields:**
- `command_name` - Command name (e.g., "/agent_workflow_plan")
- `command_file_path` - Path to .md file
- `command_description` - Description from file
- `command_arguments` - Expected arguments (array)
- `command_content` - Full file content
**How loaded:** Scan `.claude/commands/` directory at startup, parse markdown files.
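A sketch of that startup scan (the layout assumed here is an illustration; only the directory convention comes from this document):

```python
from pathlib import Path


def load_agent_commands(commands_dir: str = ".claude/commands") -> list[dict]:
    """Scan the commands directory and build AgentCommandDefinition-like dicts."""
    definitions = []
    for path in sorted(Path(commands_dir).glob("*.md")):
        content = path.read_text()
        # Assumption: the first non-empty line doubles as the command description.
        first_line = next((line for line in content.splitlines() if line.strip()), "")
        definitions.append({
            "command_name": f"/{path.stem}",
            "command_file_path": str(path),
            "command_description": first_line.lstrip("# ").strip(),
            "command_content": content,
        })
    return definitions
```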
---
### AgentCommandBuildRequest
**What it is:** Request to build a Claude Code CLI command string.
**Purpose:** Convert high-level command to actual CLI string.
**Fields:**
- `command_name` - Command to execute (e.g., "/plan")
- `command_arguments` - Arguments (array)
- `agent_model_type` - Claude model (sonnet/opus/haiku)
- `output_format` - CLI output format (text/json/stream-json)
- `dangerously_skip_permissions` - Skip permission prompts (required for automation)
- `working_directory` - Directory to run in
- `timeout_seconds` - Command timeout (default 300, max 3600)
---
### AgentCommandBuildResult
**What it is:** Built CLI command ready to execute.
**Purpose:** Actual command string to run via subprocess.
**Fields:**
- `cli_command_string` - Complete CLI command (e.g., `"claude -p '/plan Issue #42' --model sonnet --output-format stream-json"`)
- `working_directory` - Directory to run in
- `timeout_seconds` - Timeout value
---
### AgentCommandExecutionRequest
**What it is:** High-level request to execute an agent command.
**Purpose:** Frontend or orchestrator requests command execution.
**Fields:**
- `agent_work_order_id` - Work order this is for
- `command_name` - Command to execute
- `command_arguments` - Arguments (array)
- `agent_model_type` - Model to use
- `working_directory` - Execution directory
---
### AgentCommandExecutionResult
**What it is:** Result of executing a Claude Code command.
**Purpose:** Capture stdout/stderr, parse session ID, track timing.
**Fields:**
**Execution metadata:**
- `command_name` - Command executed
- `command_arguments` - Arguments used
- `execution_success` - Whether succeeded
- `exit_code` - Exit code
- `execution_duration_seconds` - How long it took
- `execution_started_at` - Start time
- `execution_completed_at` - End time
- `agent_work_order_id` - Work order ID
**Output:**
- `stdout_output` - Standard output (may be JSONL)
- `stderr_output` - Standard error
- `agent_session_id` - Claude session ID (parsed from output)
**Parsed results (from JSONL output):**
- `parsed_result_text` - Result text extracted from JSONL
- `parsed_result_is_error` - Whether result indicates error
- `parsed_result_total_cost_usd` - Total cost
- `parsed_result_duration_ms` - Duration from result message
**Example JSONL parsing:** Last line of stdout contains result message with session_id, cost, duration.
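A sketch of that parse step (the message keys are assumptions based on the field list above, not a confirmed schema):

```python
import json


def parse_result_line(stdout_output: str) -> dict:
    """Extract the final result message from stream-json (JSONL) stdout."""
    lines = [line for line in stdout_output.splitlines() if line.strip()]
    result = json.loads(lines[-1])  # the last line carries the result message
    return {
        "agent_session_id": result.get("session_id"),
        "parsed_result_text": result.get("result"),
        "parsed_result_is_error": result.get("is_error", False),
        "parsed_result_total_cost_usd": result.get("total_cost_usd"),
        "parsed_result_duration_ms": result.get("duration_ms"),
    }
```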
---
### SendAgentPromptRequest
**What it is:** Request to send interactive prompt to running agent.
**Purpose:** Allow user to interact with agent mid-execution.
**Fields:**
- `agent_work_order_id` - Active work order
- `prompt_text` - Prompt to send (e.g., "Now implement the auth module")
- `continue_session` - Continue existing session vs start new (default: true)
---
### SendAgentPromptResponse
**What it is:** Response after sending prompt.
**Purpose:** Confirm prompt was accepted.
**Fields:**
- `agent_work_order_id` - Work order ID
- `prompt_accepted` - Whether prompt was accepted and queued
- `execution_started` - Whether execution has started
- `message` - Status message
- `error_message` - Error if rejected
---
## Logging & Observability
### AgentExecutionLogEntry
**What it is:** Single structured log entry from work order execution.
**Purpose:** Observability - track everything that happens during execution.
**Fields:**
- `log_entry_id` - Unique log ID
- `agent_work_order_id` - Work order this belongs to
- `log_timestamp` - When log was created
- `log_level` - Level (debug/info/warning/error/critical)
- `event_name` - Structured event name (e.g., "agent_command_started", "git_commit_created")
- `log_message` - Human-readable message
- `log_context` - Additional context data (dict)
**Storage:**
- Phase 1: Console output (pretty-print in dev)
- Phase 2+: JSONL files + Supabase table
**Example log events:**
```
event_name: "agent_work_order_created"
event_name: "git_branch_created"
event_name: "agent_command_started"
event_name: "agent_command_completed"
event_name: "workflow_phase_started"
event_name: "workflow_phase_completed"
event_name: "git_commit_created"
event_name: "github_pull_request_created"
```
---
### ListAgentExecutionLogsRequest
**What it is:** Request to fetch execution logs.
**Purpose:** UI can display logs for debugging.
**Fields:**
- `agent_work_order_id` - Work order to get logs for
- `log_level_filter` - Filter by levels (array)
- `event_name_filter` - Filter by event names (array)
- `limit` - Results per page (default 100, max 1000)
- `offset` - Pagination offset
---
### ListAgentExecutionLogsResponse
**What it is:** Response containing log entries.
**Fields:**
- `agent_work_order_id` - Work order ID
- `log_entries` - Array of AgentExecutionLogEntry objects
- `total_count` - Total log entries
- `has_more` - Whether more available
---
## Enums (Type Definitions)
### AgentWorkOrderStatus
**What it is:** Possible work order statuses.
**Values:**
- `pending` - Created, waiting to start
- `initializing` - Setting up sandbox
- `running` - Currently executing
- `completed` - Finished successfully
- `failed` - Execution failed
- `cancelled` - User cancelled (Phase 2+)
- `paused` - Paused by user (Phase 3+)
---
### AgentWorkflowType
**What it is:** Supported workflow types.
**Values:**
- `agent_workflow_plan` - Planning only
- `agent_workflow_implement` - Implementation only
- `agent_workflow_validate` - Validation/testing only
- `agent_workflow_plan_implement` - Plan + Implement
- `agent_workflow_plan_implement_validate` - Full workflow
- `agent_workflow_custom` - User-defined (Phase 3+)
---
### AgentWorkflowPhase
**What it is:** Workflow execution phases (computed from git, not stored).
**Values:**
- `initializing` - Setting up environment
- `planning` - Creating implementation plan
- `implementing` - Writing code
- `validating` - Running tests/validation
- `completed` - All phases done
**How detected:** By analyzing commit messages in git log.
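A sketch of that detection; the phase-prefix commit convention is an assumption, since this document only states that phases are inferred from commit messages:

```python
# Assumed convention: workflow commits are prefixed with their phase name,
# e.g. "planning: add implementation plan" or "implementing: add auth module".
PHASE_PREFIXES = ("planning", "implementing", "validating", "completed")


def detect_current_phase(commit_messages: list[str]) -> str:
    """Infer the current phase from the newest phase-prefixed commit message."""
    for message in reversed(commit_messages):  # list is oldest-first; walk newest-first
        for phase in PHASE_PREFIXES:
            if message.lower().startswith(phase):
                return phase
    return "initializing"  # no phase commits yet
```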
---
### SandboxType
**What it is:** Available sandbox environments.
**Values:**
- `git_branch` - Isolated git branch (Phase 1)
- `git_worktree` - Git worktree (Phase 1) - better for parallel work orders
- `e2b` - E2B cloud sandbox (Phase 2+) - primary cloud target
- `dagger` - Dagger container (Phase 2+) - primary container target
- `local_docker` - Local Docker (Phase 3+)
---
### AgentModelType
**What it is:** Claude model options.
**Values:**
- `sonnet` - Latest Sonnet model (balanced, default)
- `opus` - Latest Opus model (highest quality)
- `haiku` - Latest Haiku model (fastest)
---
## Summary: What Gets Stored vs Computed
### Stored in Database (Minimal State)
**5 core fields:**
1. `agent_work_order_id` - Unique ID
2. `repository_url` - Repo URL
3. `sandbox_identifier` - Execution environment ID (worktree path, E2B sandbox ID, etc.)
4. `git_branch_name` - Branch name
5. `agent_session_id` - Claude session
**Metadata (for queries/filters):**
- `workflow_type`, `sandbox_type`, `agent_model_type`
- `status`, `github_issue_number`
- `created_at`, `updated_at`, `execution_started_at`, `execution_completed_at`
- `error_message`, `error_details`
- `created_by_user_id` (Phase 2+)
### Computed from Git/Sandbox APIs (NOT in database)
**Everything else:**
- `current_phase` → Analyze git commits
- `git_commit_count` → `git log --oneline | wc -l`
- `git_files_changed` → `git diff --stat`
- `git_lines_added/removed` → `git diff --stat`
- `latest_commit_sha/message` → `git log -1`
- `phase_history` → Analyze commit timestamps and messages
- `github_pull_request_url` → Query GitHub API
- `sandbox_state` (is_active, etc.) → Query sandbox API or check filesystem
- Test results → Read committed test_results.json file
**This is the key insight:** Git is our database for work progress; sandbox APIs tell us execution state. We only store the identifiers needed to find the right sandbox and git branch.
---
**End of Data Models Document**


@@ -0,0 +1,643 @@
# Feature: Add User Request Field to Agent Work Orders
## Feature Description
Add a required `user_request` field to the Agent Work Orders API to enable users to provide custom prompts describing the work they want done. This field will be the primary input to the classification and planning workflow, replacing the current dependency on GitHub issue numbers. The system will intelligently parse the user request to extract GitHub issue references if present, or use the request content directly for classification and planning.
## User Story
As a developer using the Agent Work Orders system
I want to provide a natural language description of the work I need done
So that the AI agents can understand my requirements and create an appropriate implementation plan without requiring a GitHub issue
## Problem Statement
Currently, the `CreateAgentWorkOrderRequest` API only accepts a `github_issue_number` parameter, with no way to provide a custom user request. This causes several critical issues:
1. **Empty Context**: When a work order is created, the `issue_json` passed to the classifier is empty (`{}`), causing agents to lack context
2. **GitHub Dependency**: Users must create a GitHub issue before using the system, adding unnecessary friction
3. **Limited Flexibility**: Users cannot describe ad-hoc tasks or provide additional context beyond what's in a GitHub issue
4. **Broken Classification**: The classifier receives empty input and makes arbitrary classifications without understanding the actual work needed
5. **Failed Planning**: Planners cannot create meaningful plans without understanding what the user wants
**Current Flow (Broken):**
```
API Request → {github_issue_number: "1"}
Workflow: github_issue_json = None → defaults to "{}"
Classifier receives: "{}" (empty)
Planner receives: "/feature" but no context about what feature to build
```
## Solution Statement
Add a required `user_request` field to `CreateAgentWorkOrderRequest` that accepts natural language descriptions of the work to be done. The workflow will:
1. **Accept User Requests**: Users provide a clear description like "Add login authentication with JWT tokens" or "Fix the bug where users can't save their profile" or "Implement GitHub issue #42"
2. **Classify Based on Content**: The classifier receives the full user request and classifies it as feature/bug/chore based on the actual content
3. **Optionally Fetch GitHub Issues**: If the user mentions a GitHub issue (e.g., "implement issue #42"), the system fetches the issue details and merges them with the user request
4. **Provide Full Context**: All workflow steps receive the complete user request and any fetched issue data, enabling meaningful planning and implementation
**Intended Flow (Fixed):**
```
API Request → {user_request: "Add login feature with JWT authentication"}
Classifier receives: "Add login feature with JWT authentication"
Classifier returns: "/feature" (based on actual content)
IF user request mentions "issue #N" or "GitHub issue N":
    → Fetch issue details from GitHub
    → Merge with user request
ELSE:
    → Use user request as-is
Planner receives: Full context about what to build
Planner creates: Detailed implementation plan based on user request
```
## Relevant Files
Use these files to implement the feature:
**Core Models** - Add user_request field
- `python/src/agent_work_orders/models.py`:100-107 - `CreateAgentWorkOrderRequest` needs `user_request: str` field added
**API Routes** - Pass user request to workflow
- `python/src/agent_work_orders/api/routes.py`:54-124 - `create_agent_work_order()` needs to pass `user_request` to orchestrator
**Workflow Orchestrator** - Accept and process user request
- `python/src/agent_work_orders/workflow_engine/workflow_orchestrator.py`:48-56 - `execute_workflow()` signature needs `user_request` parameter
- `python/src/agent_work_orders/workflow_engine/workflow_orchestrator.py`:96-103 - Classification step needs to receive `user_request` instead of empty JSON
**GitHub Client** - Add method to fetch issue details
- `python/src/agent_work_orders/github_integration/github_client.py` - Add `get_issue()` method to fetch issue by number
**Workflow Operations** - Update classification to use user request
- `python/src/agent_work_orders/workflow_engine/workflow_operations.py`:26-79 - `classify_issue()` may need parameter name updates for clarity
**Tests** - Update and add test coverage
- `python/tests/agent_work_orders/test_api.py` - Update all API tests to include `user_request` field
- `python/tests/agent_work_orders/test_models.py` - Add tests for `user_request` field validation
- `python/tests/agent_work_orders/test_github_integration.py` - Add tests for `get_issue()` method
- `python/tests/agent_work_orders/test_workflow_operations.py` - Update mocks to use `user_request` content
### New Files
No new files needed - all changes are modifications to existing files.
## Implementation Plan
### Phase 1: Foundation - Model and API Updates
Add the `user_request` field to the request model and update the API to accept it. This is backward-compatible if we keep `github_issue_number` optional.
### Phase 2: Core Implementation - Workflow Integration
Update the workflow orchestrator to receive and use the user request for classification and planning. Add logic to detect and fetch GitHub issues if mentioned.
### Phase 3: Integration - GitHub Issue Fetching
Add capability to fetch GitHub issue details when referenced in the user request, and merge that context with the user's description.
## Step by Step Tasks
IMPORTANT: Execute every step in order, top to bottom.
### Add user_request Field to CreateAgentWorkOrderRequest Model
- Open `python/src/agent_work_orders/models.py`
- Locate the `CreateAgentWorkOrderRequest` class (line 100)
- Add new required field after `workflow_type`:
```python
user_request: str = Field(..., description="User's description of the work to be done")
```
- Update the docstring to explain that `user_request` is the primary input
- Make `github_issue_number` truly optional (it already is, but update docs to clarify it's only needed for reference)
- Save the file
### Add get_issue() Method to GitHubClient
- Open `python/src/agent_work_orders/github_integration/github_client.py`
- Add new method after `get_repository_info()`:
```python
async def get_issue(self, repository_url: str, issue_number: str) -> dict:
    """Get GitHub issue details

    Args:
        repository_url: GitHub repository URL
        issue_number: Issue number

    Returns:
        Issue details as JSON dict

    Raises:
        GitHubOperationError: If unable to fetch issue
    """
    self._logger.info(
        "github_issue_fetch_started",
        repository_url=repository_url,
        issue_number=issue_number,
    )
    try:
        owner, repo = self._parse_repository_url(repository_url)
        repo_path = f"{owner}/{repo}"
        process = await asyncio.create_subprocess_exec(
            self.gh_cli_path,
            "issue",
            "view",
            issue_number,
            "--repo",
            repo_path,
            "--json",
            "number,title,body,state,url",
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )
        stdout, stderr = await asyncio.wait_for(process.communicate(), timeout=30)
        if process.returncode != 0:
            error = stderr.decode() if stderr else "Unknown error"
            raise GitHubOperationError(f"Failed to fetch issue: {error}")
        issue_data = json.loads(stdout.decode())
        self._logger.info("github_issue_fetched", issue_number=issue_number)
        return issue_data
    except GitHubOperationError:
        raise  # already logged and wrapped above; avoid double-wrapping
    except Exception as e:
        self._logger.error("github_issue_fetch_failed", error=str(e), exc_info=True)
        raise GitHubOperationError(f"Failed to fetch GitHub issue: {e}") from e
```
- Save the file
### Update execute_workflow() Signature
- Open `python/src/agent_work_orders/workflow_engine/workflow_orchestrator.py`
- Locate the `execute_workflow()` method (line 48)
- Add `user_request` parameter after `sandbox_type`:
```python
async def execute_workflow(
    self,
    agent_work_order_id: str,
    workflow_type: AgentWorkflowType,
    repository_url: str,
    sandbox_type: SandboxType,
    user_request: str,  # NEW: Add this parameter
    github_issue_number: str | None = None,
    github_issue_json: str | None = None,
) -> None:
```
- Update the docstring to include `user_request` parameter documentation
- Save the file
### Add Logic to Parse GitHub Issue References from User Request
- Still in `python/src/agent_work_orders/workflow_engine/workflow_orchestrator.py`
- After line 87 (after updating status to RUNNING), add logic to detect GitHub issues:
```python
# Parse GitHub issue from user request if mentioned
import re

issue_match = re.search(r'(?:issue|#)\s*#?(\d+)', user_request, re.IGNORECASE)
if issue_match and not github_issue_number:
    github_issue_number = issue_match.group(1)
    bound_logger.info("github_issue_detected_in_request", issue_number=github_issue_number)

# Fetch GitHub issue if number provided
if github_issue_number and not github_issue_json:
    try:
        issue_data = await self.github_client.get_issue(repository_url, github_issue_number)
        github_issue_json = json.dumps(issue_data)
        bound_logger.info("github_issue_fetched", issue_number=github_issue_number)
    except Exception as e:
        bound_logger.warning("github_issue_fetch_failed", error=str(e))
        # Continue without issue data - use user_request only

# Prepare classification input: merge user request with issue data if available
classification_input = user_request
if github_issue_json:
    issue_data = json.loads(github_issue_json)
    classification_input = (
        f"User Request: {user_request}\n\n"
        f"GitHub Issue Details:\n"
        f"Title: {issue_data.get('title', '')}\n"
        f"Body: {issue_data.get('body', '')}"
    )
```
- Add `import json` at the top of the file if not already present
- Update the classify_issue call (line 97-103) to use `classification_input`:
```python
classify_result = await workflow_operations.classify_issue(
    self.agent_executor,
    self.command_loader,
    classification_input,  # Use classification_input instead of github_issue_json or "{}"
    agent_work_order_id,
    sandbox.working_dir,
)
```
- Save the file
### Update API Route to Pass user_request
- Open `python/src/agent_work_orders/api/routes.py`
- Locate `create_agent_work_order()` function (line 54)
- Update the `orchestrator.execute_workflow()` call (line 101-109) to include `user_request`:
```python
asyncio.create_task(
    orchestrator.execute_workflow(
        agent_work_order_id=agent_work_order_id,
        workflow_type=request.workflow_type,
        repository_url=request.repository_url,
        sandbox_type=request.sandbox_type,
        user_request=request.user_request,  # NEW: Add this line
        github_issue_number=request.github_issue_number,
    )
)
```
- Save the file
### Update Model Tests for user_request Field
- Open `python/tests/agent_work_orders/test_models.py`
- Find or add test for `CreateAgentWorkOrderRequest`:
```python
def test_create_agent_work_order_request_with_user_request():
    """Test CreateAgentWorkOrderRequest with user_request field"""
    request = CreateAgentWorkOrderRequest(
        repository_url="https://github.com/owner/repo",
        sandbox_type=SandboxType.GIT_BRANCH,
        workflow_type=AgentWorkflowType.PLAN,
        user_request="Add user authentication with JWT tokens",
    )
    assert request.user_request == "Add user authentication with JWT tokens"
    assert request.repository_url == "https://github.com/owner/repo"
    assert request.github_issue_number is None


def test_create_agent_work_order_request_with_github_issue():
    """Test CreateAgentWorkOrderRequest with both user_request and issue number"""
    request = CreateAgentWorkOrderRequest(
        repository_url="https://github.com/owner/repo",
        sandbox_type=SandboxType.GIT_BRANCH,
        workflow_type=AgentWorkflowType.PLAN,
        user_request="Implement the feature described in issue #42",
        github_issue_number="42",
    )
    assert request.user_request == "Implement the feature described in issue #42"
    assert request.github_issue_number == "42"
```
- Save the file
### Add GitHub Client Tests for get_issue()
- Open `python/tests/agent_work_orders/test_github_integration.py`
- Add new test function:
```python
@pytest.mark.asyncio
async def test_get_issue_success():
    """Test successful GitHub issue fetch"""
    client = GitHubClient()

    # Mock subprocess
    mock_process = MagicMock()
    mock_process.returncode = 0
    issue_json = json.dumps({
        "number": 42,
        "title": "Add login feature",
        "body": "Users need to log in with email and password",
        "state": "open",
        "url": "https://github.com/owner/repo/issues/42",
    })
    mock_process.communicate = AsyncMock(return_value=(issue_json.encode(), b""))

    with patch("asyncio.create_subprocess_exec", return_value=mock_process):
        issue_data = await client.get_issue("https://github.com/owner/repo", "42")

    assert issue_data["number"] == 42
    assert issue_data["title"] == "Add login feature"
    assert issue_data["state"] == "open"


@pytest.mark.asyncio
async def test_get_issue_failure():
    """Test failed GitHub issue fetch"""
    client = GitHubClient()

    # Mock subprocess
    mock_process = MagicMock()
    mock_process.returncode = 1
    mock_process.communicate = AsyncMock(return_value=(b"", b"Issue not found"))

    with patch("asyncio.create_subprocess_exec", return_value=mock_process):
        with pytest.raises(GitHubOperationError, match="Failed to fetch issue"):
            await client.get_issue("https://github.com/owner/repo", "999")
```
- Add necessary imports at the top (json, AsyncMock if not present)
- Save the file
### Update API Tests to Include user_request
- Open `python/tests/agent_work_orders/test_api.py`
- Find all tests that create work orders and add `user_request` field
- Update `test_create_agent_work_order()`:
```python
response = client.post(
    "/agent-work-orders",
    json={
        "repository_url": "https://github.com/owner/repo",
        "sandbox_type": "git_branch",
        "workflow_type": "agent_workflow_plan",
        "user_request": "Add user authentication feature",  # ADD THIS
        "github_issue_number": "42",
    },
)
```
- Update `test_create_agent_work_order_without_issue()`:
```python
response = client.post(
    "/agent-work-orders",
    json={
        "repository_url": "https://github.com/owner/repo",
        "sandbox_type": "git_branch",
        "workflow_type": "agent_workflow_plan",
        "user_request": "Fix the login bug where users can't sign in",  # ADD THIS
    },
)
```
- Update any other test cases that create work orders
- Save the file
### Update Workflow Operations Tests
- Open `python/tests/agent_work_orders/test_workflow_operations.py`
- Update `test_classify_issue_success()` to use meaningful user request:
```python
result = await workflow_operations.classify_issue(
    mock_executor,
    mock_loader,
    "Add user authentication with JWT tokens and refresh token support",  # Meaningful request
    "wo-test",
    "/tmp/working",
)
```
- Update other test cases to use meaningful user requests instead of empty JSON
- Save the file
### Run Model Unit Tests
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_models.py -v`
- Verify new `user_request` tests pass
- Fix any failures
### Run GitHub Client Tests
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_github_integration.py -v`
- Verify `get_issue()` tests pass
- Fix any failures
### Run API Tests
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_api.py -v`
- Verify all API tests pass with `user_request` field
- Fix any failures
### Run All Agent Work Orders Tests
- Execute: `cd python && uv run pytest tests/agent_work_orders/ -v`
- Target: 100% of tests pass
- Fix any failures
### Run Type Checking
- Execute: `cd python && uv run mypy src/agent_work_orders/`
- Verify no type errors
- Fix any issues
### Run Linting
- Execute: `cd python && uv run ruff check src/agent_work_orders/`
- Verify no linting issues
- Fix any issues
### Manual End-to-End Test
- Start server: `cd python && uv run uvicorn src.agent_work_orders.main:app --port 8888 &`
- Wait: `sleep 5`
- Test with user request only:
```bash
curl -X POST http://localhost:8888/agent-work-orders \
  -H "Content-Type: application/json" \
  -d '{
    "repository_url": "https://github.com/Wirasm/dylan.git",
    "sandbox_type": "git_branch",
    "workflow_type": "agent_workflow_plan",
    "user_request": "Add a new feature for user profile management with avatar upload"
  }' | jq
```
- Get work order ID from response
- Wait: `sleep 30`
- Check status: `curl http://localhost:8888/agent-work-orders/{WORK_ORDER_ID} | jq`
- Check steps: `curl http://localhost:8888/agent-work-orders/{WORK_ORDER_ID}/steps | jq`
- Verify:
- Classifier received full user request (not empty JSON)
- Classifier returned appropriate classification
- Planner received the user request context
- Workflow progressed normally
- Test with GitHub issue reference:
```bash
curl -X POST http://localhost:8888/agent-work-orders \
  -H "Content-Type: application/json" \
  -d '{
    "repository_url": "https://github.com/Wirasm/dylan.git",
    "sandbox_type": "git_branch",
    "workflow_type": "agent_workflow_plan",
    "user_request": "Implement the feature described in GitHub issue #1"
  }' | jq
```
- Verify:
- System detected issue reference
- Issue details were fetched
- Both user request and issue context passed to agents
- Stop server: `pkill -f "uvicorn.*8888"`
## Testing Strategy
### Unit Tests
**Model Tests:**
- Test `user_request` field accepts string values
- Test `user_request` field is required (validation fails if missing)
- Test `github_issue_number` remains optional
- Test model serialization with all fields
**GitHub Client Tests:**
- Test `get_issue()` with valid issue number
- Test `get_issue()` with invalid issue number
- Test `get_issue()` with network timeout
- Test `get_issue()` returns correct JSON structure
**Workflow Orchestrator Tests:**
- Test GitHub issue regex detection from user request
- Test fetching GitHub issue when detected
- Test fallback to user request only if issue fetch fails
- Test classification input merges user request with issue data
### Integration Tests
**Full Workflow Tests:**
- Test complete workflow with user request only (no GitHub issue)
- Test complete workflow with explicit GitHub issue number
- Test complete workflow with GitHub issue mentioned in user request
- Test workflow handles GitHub API failures gracefully
**API Integration Tests:**
- Test POST /agent-work-orders with user_request field
- Test POST /agent-work-orders validates user_request is required
- Test POST /agent-work-orders accepts both user_request and github_issue_number
### Edge Cases
**User Request Parsing** (a regex sketch follows this list):
- User request mentions "issue #42"
- User request mentions "GitHub issue 42"
- User request mentions "issue#42" (no space)
- User request contains multiple issue references (use first one)
- User request doesn't mention any issues
- Very long user requests (>10KB)
- Empty user request (should fail validation)
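The sketch below exercises the detection regex from the orchestrator changes against these cases (comments show the expected captures):

```python
import re

ISSUE_PATTERN = re.compile(r"(?:issue|#)\s*#?(\d+)", re.IGNORECASE)

cases = [
    "Implement issue #42",          # -> 42
    "Fix GitHub issue 42",          # -> 42
    "Resolve issue#42 in the API",  # -> 42 (no space)
    "Fix issue #1 and issue #2",    # -> 1 (first reference wins)
    "Add login feature",            # -> None (no issue detected)
]
for text in cases:
    match = ISSUE_PATTERN.search(text)
    print(text, "->", match.group(1) if match else None)
```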
**GitHub Issue Handling:**
- Issue number provided but fetch fails
- Issue exists but is closed
- Issue exists but has no body
- Issue number is invalid (non-numeric)
- Repository doesn't have issues enabled
**Backward Compatibility:**
- Existing tests must still pass (with user_request added)
- API accepts requests without github_issue_number
## Acceptance Criteria
**Core Functionality:**
- ✅ `user_request` field added to `CreateAgentWorkOrderRequest` model
- ✅ `user_request` field is required and validated
- ✅ `github_issue_number` field remains optional
- ✅ API accepts and passes `user_request` to workflow
- ✅ Workflow uses `user_request` for classification (not empty JSON)
- ✅ GitHub issue references auto-detected from user request
- ✅ `get_issue()` method fetches GitHub issue details via gh CLI
- ✅ Classification input merges user request with issue data when available
**Test Coverage:**
- ✅ All existing tests pass with zero regressions
- ✅ New model tests for `user_request` field
- ✅ New GitHub client tests for `get_issue()` method
- ✅ Updated API tests include `user_request` field
- ✅ Updated workflow tests use meaningful user requests
**Code Quality:**
- ✅ Type checking passes (mypy)
- ✅ Linting passes (ruff)
- ✅ Code follows existing patterns
- ✅ Comprehensive docstrings
**End-to-End Validation:**
- ✅ User can create work order with custom request (no GitHub issue)
- ✅ Classifier receives full user request context
- ✅ Planner receives full user request context
- ✅ Workflow progresses successfully with user request
- ✅ System detects GitHub issue references in user request
- ✅ System fetches and merges GitHub issue data when detected
- ✅ Workflow handles missing GitHub issues gracefully
## Validation Commands
Execute every command to validate the feature works correctly with zero regressions.
```bash
# Unit Tests
cd python && uv run pytest tests/agent_work_orders/test_models.py -v
cd python && uv run pytest tests/agent_work_orders/test_github_integration.py -v
cd python && uv run pytest tests/agent_work_orders/test_api.py -v
cd python && uv run pytest tests/agent_work_orders/test_workflow_operations.py -v
# Full Test Suite
cd python && uv run pytest tests/agent_work_orders/ -v --tb=short
cd python && uv run pytest tests/agent_work_orders/ --cov=src/agent_work_orders --cov-report=term-missing
cd python && uv run pytest # All backend tests
# Quality Checks
cd python && uv run mypy src/agent_work_orders/
cd python && uv run ruff check src/agent_work_orders/
# End-to-End Test
cd python && uv run uvicorn src.agent_work_orders.main:app --port 8888 &
sleep 5
curl http://localhost:8888/health | jq
# Test 1: User request only (no GitHub issue)
WORK_ORDER=$(curl -X POST http://localhost:8888/agent-work-orders \
-H "Content-Type: application/json" \
-d '{"repository_url":"https://github.com/Wirasm/dylan.git","sandbox_type":"git_branch","workflow_type":"agent_workflow_plan","user_request":"Add user profile management with avatar upload functionality"}' \
| jq -r '.agent_work_order_id')
echo "Work Order 1: $WORK_ORDER"
sleep 30
# Verify classifier received user request
curl http://localhost:8888/agent-work-orders/$WORK_ORDER/steps | jq '.steps[] | {step, success, output}'
# Test 2: User request with GitHub issue reference
WORK_ORDER2=$(curl -X POST http://localhost:8888/agent-work-orders \
-H "Content-Type: application/json" \
-d '{"repository_url":"https://github.com/Wirasm/dylan.git","sandbox_type":"git_branch","workflow_type":"agent_workflow_plan","user_request":"Implement the feature described in GitHub issue #1"}' \
| jq -r '.agent_work_order_id')
echo "Work Order 2: $WORK_ORDER2"
sleep 30
# Verify issue was fetched and merged with user request
curl http://localhost:8888/agent-work-orders/$WORK_ORDER2/steps | jq '.steps[] | {step, success, output}'
# Cleanup
pkill -f "uvicorn.*8888"
```
## Notes
**Design Decisions:**
- `user_request` is required because it's the primary input to the system
- `github_issue_number` remains optional for backward compatibility and explicit issue references
- GitHub issue auto-detection uses regex to find common patterns ("issue #42", "GitHub issue 42")
- If both explicit `github_issue_number` and detected issue exist, explicit takes precedence
- If GitHub issue fetch fails, workflow continues with user request only (resilient design)
- Classification input merges user request with issue data to provide maximum context
**Why This Fixes the Problem:**
```
BEFORE:
- No way to provide custom user requests
- issue_json = "{}" (empty)
- Classifier has no context
- Planner has no context
- Workflow fails or produces irrelevant output
AFTER:
- user_request field provides clear description
- issue_json populated from user request + optional GitHub issue
- Classifier receives: "Add user authentication with JWT tokens"
- Planner receives: Full context about what to build
- Workflow succeeds with meaningful output
```
**GitHub Issue Detection Examples:**
- "Implement issue #42" → Detects issue 42
- "Fix GitHub issue 123" → Detects issue 123
- "Resolve issue#456 in the API" → Detects issue 456
- "Add login feature" → No issue detected, uses request as-is
**Future Enhancements:**
- Support multiple GitHub issue references
- Support GitHub PR references
- Add user_request to work order state for historical tracking
- Support Jira, Linear, or other issue tracker references
- Add user_request validation (min/max length, profanity filter)
- Support rich text formatting in user requests
- Add example user requests in API documentation

4 file diffs suppressed because they are too large

@@ -0,0 +1,365 @@
# Feature: Fix Claude CLI Integration for Agent Work Orders
## Feature Description
Fix the Claude CLI integration in the Agent Work Orders system to properly execute agent workflows using the Claude Code CLI. The current implementation is missing the required `--verbose` flag and lacks other important CLI configuration options for reliable, automated agent execution.
The system currently fails with error: `"Error: When using --print, --output-format=stream-json requires --verbose"` because the CLI command builder is incomplete. This feature will add all necessary CLI flags, improve error handling, and ensure robust integration with Claude Code CLI for automated agent workflows.
## User Story
As a developer using the Agent Work Orders system
I want the system to properly execute Claude CLI commands with all required flags
So that agent workflows complete successfully and I can automate development tasks reliably
## Problem Statement
The current CLI integration has several issues:
1. **Missing `--verbose` flag**: When using `--print` with `--output-format=stream-json`, the `--verbose` flag is required by Claude Code CLI but not included in the command
2. **No turn limits**: Workflows can run indefinitely without a safety mechanism to limit agentic turns
3. **No permission handling**: Interactive permission prompts block automated workflows
4. **Incomplete configuration**: Missing flags for model selection, working directories, and other important options
5. **Test misalignment**: Tests were written expecting `-f` flag pattern but implementation uses stdin, causing confusion
6. **Limited error context**: Error messages don't provide enough information for debugging CLI failures
These issues prevent agent work orders from executing successfully and make the system unusable in its current state.
## Solution Statement
Implement a complete CLI integration by:
1. **Add missing `--verbose` flag** to enable stream-json output format
2. **Add safety limits** with `--max-turns` to prevent runaway executions
3. **Enable automation** with `--dangerously-skip-permissions` for non-interactive operation
4. **Add configuration options** for working directories and model selection
5. **Update tests** to match the stdin-based implementation pattern
6. **Improve error handling** with better error messages and validation
7. **Add configuration** for customizable CLI flags via environment variables
The solution maintains the existing architecture while fixing the CLI command builder and adding proper configuration management.
## Relevant Files
**Core Implementation Files:**
- `python/src/agent_work_orders/agent_executor/agent_cli_executor.py` (lines 24-58) - CLI command builder that needs fixing
- Currently missing `--verbose` flag
- Needs additional flags for safety and automation
- Error handling could be improved
**Configuration:**
- `python/src/agent_work_orders/config.py` (lines 17-30) - Configuration management
- Needs new configuration options for CLI flags
- Should support environment variable overrides
**Tests:**
- `python/tests/agent_work_orders/test_agent_executor.py` (lines 10-44) - Unit tests for CLI executor
- Tests expect `-f` flag pattern but implementation uses stdin
- Need to update tests to match current implementation
- Add tests for new CLI flags
**Workflow Integration:**
- `python/src/agent_work_orders/workflow_engine/workflow_orchestrator.py` (lines 98-104) - Calls CLI executor
- Verify integration works with updated CLI command
- Ensure proper error propagation
**Documentation:**
- `PRPs/ai_docs/cc_cli_ref.md` - Claude CLI reference documentation
- Contains complete flag reference
- Guides implementation
### New Files
None - this is a fix to existing implementation.
## Implementation Plan
### Phase 1: Foundation - Fix Core CLI Command Builder
Add the missing `--verbose` flag and implement basic safety flags to make the CLI integration functional. This unblocks agent workflow execution.
**Changes:**
- Add `--verbose` flag to command builder (required for stream-json)
- Add `--max-turns` flag with default limit (safety)
- Add `--dangerously-skip-permissions` flag (automation)
- Update configuration with new options
### Phase 2: Enhanced Configuration
Add comprehensive configuration management for CLI flags, allowing operators to customize behavior via environment variables or config files.
**Changes:**
- Add configuration options for all CLI flags
- Support environment variable overrides
- Add validation for configuration values
- Document configuration options
### Phase 3: Testing and Validation
Update tests to match the current stdin-based implementation and add comprehensive test coverage for new CLI flags.
**Changes:**
- Fix existing tests to match stdin pattern
- Add tests for new CLI flags
- Add integration tests for full workflow execution
- Add error handling tests
## Step by Step Tasks
### Fix CLI Command Builder
- Read the current implementation in `python/src/agent_work_orders/agent_executor/agent_cli_executor.py`
- Update the `build_command` method to include the `--verbose` flag after `--output-format stream-json`
- Add `--max-turns` flag with configurable value (default: 20)
- Add `--dangerously-skip-permissions` flag for automation
- Ensure command parts are joined correctly with proper spacing
- Update the docstring to document all flags being added
- Verify the command string format matches CLI expectations
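A minimal sketch of the flag assembly under the defaults above (illustrative only; the real method also loads the command file and prepares the stdin prompt):

```python
def build_cli_command(max_turns: int = 20, skip_permissions: bool = True) -> str:
    """Assemble the Claude CLI invocation with the flags stream-json output requires."""
    parts = ["claude", "--print", "--output-format", "stream-json", "--verbose"]
    parts += ["--max-turns", str(max_turns)]
    if skip_permissions:
        parts.append("--dangerously-skip-permissions")
    # Prompt text is written to stdin, so no -f flag or prompt argument appears here
    return " ".join(parts)


assert build_cli_command() == (
    "claude --print --output-format stream-json --verbose "
    "--max-turns 20 --dangerously-skip-permissions"
)
```

This matches the command string the validation section expects to see in the server logs.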
### Add Configuration Options
- Read `python/src/agent_work_orders/config.py`
- Add `CLAUDE_CLI_MAX_TURNS` config option (default: 20)
- Add `CLAUDE_CLI_SKIP_PERMISSIONS` config option (default: True for automation)
- Add `CLAUDE_CLI_VERBOSE` config option (default: True, required for stream-json)
- Add docstrings explaining each configuration option
- Ensure all config options support environment variable overrides
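One possible shape for these options, assuming a plain environment-variable-backed config module (the `_env_bool` helper is illustrative, not existing code):

```python
import os


def _env_bool(name: str, default: bool) -> bool:
    """Parse a boolean environment variable, falling back to the default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


# Maximum agent turns per CLI invocation (safety limit against runaway executions)
CLAUDE_CLI_MAX_TURNS = int(os.environ.get("CLAUDE_CLI_MAX_TURNS", "20"))

# Skip interactive permission prompts (required for non-interactive automation)
CLAUDE_CLI_SKIP_PERMISSIONS = _env_bool("CLAUDE_CLI_SKIP_PERMISSIONS", True)

# Verbose output (required when --print is combined with --output-format stream-json)
CLAUDE_CLI_VERBOSE = _env_bool("CLAUDE_CLI_VERBOSE", True)
```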
### Update CLI Executor to Use Config
- Update `agent_cli_executor.py` to read configuration values
- Pass configuration to `build_command` method
- Make flags configurable rather than hardcoded
- Add parameter documentation for new options
- Maintain backward compatibility with existing code
### Improve Error Handling
- Add validation for command file path existence before reading
- Add better error messages when CLI execution fails
- Include the full command in error logs (without sensitive data)
- Add timeout context to error messages
- Log CLI stdout/stderr even on success for debugging
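To make timeout failures self-explanatory, the execution call can be wrapped along these lines (a sketch; the function name, the `RuntimeError` choice, and the timeout plumbing are assumptions rather than existing code):

```python
import asyncio


async def run_cli_with_timeout(
    command: str, prompt_text: str, timeout_seconds: float
) -> tuple[bytes, bytes, int]:
    """Run the CLI command, raising an error that includes timeout context on expiry."""
    process = await asyncio.create_subprocess_shell(
        command,
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(
            process.communicate(prompt_text.encode()), timeout=timeout_seconds
        )
    except asyncio.TimeoutError:
        process.kill()
        raise RuntimeError(
            f"Claude CLI timed out after {timeout_seconds}s running: {command}"
        )
    return stdout, stderr, process.returncode or 0
```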
### Update Unit Tests
- Read `python/tests/agent_work_orders/test_agent_executor.py`
- Update `test_build_command` to verify `--verbose` flag is included
- Update `test_build_command` to verify `--max-turns` flag is included
- Update `test_build_command` to verify `--dangerously-skip-permissions` flag is included
- Remove or update tests expecting `-f` flag pattern (no longer used)
- Update test assertions to match stdin-based implementation
- Add test for command with all flags enabled
- Add test for command with custom max-turns value
### Add Integration Tests
- Create new test `test_build_command_with_config` that verifies configuration is used
- Create test `test_execute_with_valid_command_file` that mocks file reading
- Create test `test_execute_with_missing_command_file` that verifies error handling
- Create test `test_cli_flags_in_correct_order` to ensure proper flag ordering
- Verify all tests pass with `cd python && uv run pytest tests/agent_work_orders/test_agent_executor.py -v`
### Test End-to-End Workflow
- Start the agent work orders server with `cd python && uv run uvicorn src.agent_work_orders.main:app --host 0.0.0.0 --port 8888`
- Create a test work order via curl: `curl -X POST http://localhost:8888/agent-work-orders -H "Content-Type: application/json" -d '{"repository_url": "https://github.com/anthropics/claude-code", "sandbox_type": "git_branch", "workflow_type": "agent_workflow_plan", "github_issue_number": "123"}'`
- Monitor server logs to verify the CLI command includes all required flags
- Verify the error message no longer appears: "Error: When using --print, --output-format=stream-json requires --verbose"
- Check that workflow executes successfully or fails with a different (expected) error
- Verify session ID extraction works from CLI output
### Update Documentation
- Update inline code comments in `agent_cli_executor.py` explaining why each flag is needed
- Add comments documenting the Claude CLI requirements
- Reference the CLI documentation file `PRPs/ai_docs/cc_cli_ref.md` in code comments
- Ensure configuration options are documented with examples
### Run Validation Commands
Execute all validation commands listed in the Validation Commands section to ensure zero regressions and complete functionality.
## Testing Strategy
### Unit Tests
**CLI Command Builder Tests:**
- Verify `--verbose` flag is present in built command
- Verify `--max-turns` flag is present with correct value
- Verify `--dangerously-skip-permissions` flag is present
- Verify flags are in correct order (order may matter for CLI parsing)
- Verify command parts are properly space-separated
- Verify prompt text is correctly prepared for stdin
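As a sketch of the first few checks, assuming the `build_command(command_file_path, ...)` interface used by the tests later in this document (the file-on-disk setup and `(command, prompt)` return are assumptions):

```python
import os
import tempfile


def test_build_command_includes_required_flags():
    """Sketch: the built command should carry the flags stream-json output requires."""
    executor = AgentCLIExecutor()
    with tempfile.NamedTemporaryFile(mode="w", suffix=".md", delete=False) as f:
        f.write("Do the task")
        temp_file = f.name
    try:
        command, _prompt = executor.build_command(temp_file)
        assert "--verbose" in command
        assert "--output-format stream-json" in command
        assert "--max-turns" in command
        assert "--dangerously-skip-permissions" in command
    finally:
        os.unlink(temp_file)
```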
**Configuration Tests:**
- Verify default configuration values are correct
- Verify environment variables override defaults
- Verify configuration validation works for invalid values
**Error Handling Tests:**
- Test with non-existent command file path
- Test with invalid configuration values
- Test with CLI execution failures
- Test with timeout scenarios
### Integration Tests
**Full Workflow Tests:**
- Test creating work order triggers CLI execution
- Test CLI command includes all required flags
- Test session ID extraction from CLI output
- Test error propagation from CLI to API response
**Sandbox Integration:**
- Test CLI executes in correct working directory
- Test prompt text is passed via stdin correctly
- Test output parsing works with actual CLI format
### Edge Cases
**Command Building:**
- Empty args list
- Very long prompt text (test stdin limits)
- Special characters in args
- Non-existent command file path
- Command file with no content
**Configuration:**
- Max turns = 0 (should error or use sensible minimum)
- Max turns = 1000 (should cap at reasonable maximum)
- Invalid boolean values for skip_permissions
- Missing environment variables (should use defaults)
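A simple clamp covers both bounds cases above (the limits here are illustrative, not mandated anywhere):

```python
def clamp_max_turns(value: int, minimum: int = 1, maximum: int = 100) -> int:
    """Keep max-turns inside a sane range instead of failing outright."""
    return max(minimum, min(value, maximum))


assert clamp_max_turns(0) == 1       # too low: raised to the minimum
assert clamp_max_turns(1000) == 100  # too high: capped at the maximum
assert clamp_max_turns(20) == 20     # in range: unchanged
```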
**CLI Execution:**
- CLI command times out
- CLI command exits with non-zero code
- CLI output contains no session ID
- CLI output is malformed JSON
- Claude CLI not installed or not in PATH
## Acceptance Criteria
**CLI Integration:**
- ✅ Agent work orders execute without "requires --verbose" error
- ✅ CLI command includes `--verbose` flag
- ✅ CLI command includes `--max-turns` flag with configurable value
- ✅ CLI command includes `--dangerously-skip-permissions` flag
- ✅ Configuration options support environment variable overrides
- ✅ Error messages include helpful context for debugging
**Testing:**
- ✅ All existing unit tests pass
- ✅ New tests verify CLI flags are included
- ✅ Integration test verifies end-to-end workflow
- ✅ Test coverage for error handling scenarios
**Functionality:**
- ✅ Work orders can be created via API
- ✅ Background workflow execution starts
- ✅ CLI command executes with proper flags
- ✅ Session ID is extracted from CLI output
- ✅ Errors are properly logged and returned to API
**Documentation:**
- ✅ Code comments explain CLI requirements
- ✅ Configuration options are documented
- ✅ Error messages are clear and actionable
## Validation Commands
Execute every command to validate the feature works correctly with zero regressions.
```bash
# Run all agent work orders tests
cd python && uv run pytest tests/agent_work_orders/ -v
# Run specific CLI executor tests
cd python && uv run pytest tests/agent_work_orders/test_agent_executor.py -v
# Run type checking
cd python && uv run mypy src/agent_work_orders/agent_executor/
# Run linting
cd python && uv run ruff check src/agent_work_orders/agent_executor/
cd python && uv run ruff check src/agent_work_orders/config.py
# Start server and test end-to-end
cd python && uv run uvicorn src.agent_work_orders.main:app --host 0.0.0.0 --port 8888 &
sleep 3
# Test health endpoint
curl -s http://localhost:8888/health | jq .
# Create test work order
curl -s -X POST http://localhost:8888/agent-work-orders \
-H "Content-Type: application/json" \
-d '{
"repository_url": "https://github.com/anthropics/claude-code",
"sandbox_type": "git_branch",
"workflow_type": "agent_workflow_plan",
"github_issue_number": "123"
}' | jq .
# Wait for background execution to start
sleep 5
# Check work order status
curl -s http://localhost:8888/agent-work-orders | jq '.[] | {id: .agent_work_order_id, status: .status, error: .error_message}'
# Verify logs show proper CLI command with all flags (check server stdout)
# Should see: claude --print --output-format stream-json --verbose --max-turns 20 --dangerously-skip-permissions
# Stop server
pkill -f "uvicorn src.agent_work_orders.main:app"
```
## Notes
### CLI Flag Requirements
Based on `PRPs/ai_docs/cc_cli_ref.md`:
- `--verbose` is **required** when using `--print` with `--output-format=stream-json`
- `--max-turns` should be set to prevent runaway executions (recommended: 10-50)
- `--dangerously-skip-permissions` is needed for non-interactive automation
- Flag order may matter - follow the order shown in documentation examples
### Configuration Philosophy
- Default values should enable successful automation
- Environment variables allow per-deployment customization
- Configuration should fail fast with clear errors
- Document all configuration with examples
### Future Enhancements (Out of Scope for This Feature)
- Add support for `--add-dir` flag for multi-directory workspaces
- Add support for `--agents` flag for custom subagents
- Add support for `--model` flag for model selection
- Add retry logic with exponential backoff for transient failures
- Add metrics/telemetry for CLI execution success rates
- Add support for resuming failed workflows with `--resume` flag
### Testing Notes
- Tests must not require actual Claude CLI installation
- Mock subprocess execution for unit tests
- Integration tests can assume Claude CLI is available
- Consider adding e2e tests that use a mock CLI script
- Validate session ID extraction with real CLI output examples
### Debugging Tips
When CLI execution fails:
1. Check server logs for full command string
2. Verify command file exists at expected path
3. Test CLI command manually in terminal
4. Check Claude CLI version (may have breaking changes)
5. Verify working directory has correct permissions
6. Check for prompt text issues (encoding, length)
### Related Documentation
- Claude Code CLI Reference: `PRPs/ai_docs/cc_cli_ref.md`
- Agent Work Orders PRD: `PRPs/specs/agent-work-orders-mvp-v2.md`
- SDK Documentation: https://docs.claude.com/claude-code/sdk


@@ -0,0 +1,742 @@
# Feature: Fix JSONL Result Extraction and Argument Passing
## Feature Description
Fix critical integration issues between Agent Work Orders system and Claude CLI that prevent workflow execution from completing successfully. The system currently fails to extract the actual result text from Claude CLI's JSONL output stream and doesn't properly pass arguments to command files using the $ARGUMENTS placeholder pattern.
These fixes enable the atomic workflow execution pattern to work end-to-end by ensuring clean data flow between workflow steps.
## User Story
As a developer using the Agent Work Orders system
I want workflows to execute successfully end-to-end
So that I can automate development tasks via GitHub issues without manual intervention
## Problem Statement
The first real-world test of the atomic workflow execution system (work order wo-18d08ae8, repository: https://github.com/Wirasm/dylan.git, issue #1) revealed two critical failures that prevent workflow completion:
**Problem 1: JSONL Result Not Extracted**
- `workflow_operations.py` uses `result.stdout.strip()` to get agent output
- `result.stdout` contains the entire JSONL stream (multiple lines of JSON messages)
- The actual agent result is in the "result" field of the final JSONL message with `type:"result"`
- Consequence: Downstream steps receive JSONL garbage instead of clean output
**Observed Example:**
```python
# What we're currently doing (WRONG):
issue_class = result.stdout.strip()
# Gets: '{"type":"session_started","session_id":"..."}\n{"type":"result","result":"/feature","is_error":false}'
# What we should do (CORRECT):
issue_class = result.result_text.strip()
# Gets: "/feature"
```
**Problem 2: $ARGUMENTS Placeholder Not Replaced**
- Command files use `$ARGUMENTS` placeholder for dynamic content (ADW pattern)
- `AgentCLIExecutor.build_command()` appends args to prompt but doesn't replace placeholder
- Claude CLI receives literal "$ARGUMENTS" text instead of actual issue JSON
- Consequence: Agents cannot access input data needed to perform their task
**Observed Failure:**
```
Step 1 (Classifier): ✅ Executed BUT ❌ Wrong Output
- Agent response: "I need to see the GitHub issue content. The $ARGUMENTS placeholder shows {}"
- Output: Full JSONL stream instead of "/feature", "/bug", or "/chore"
- Session ID: 06f225c7-bcd8-436c-8738-9fa744c8eee6
Step 2 (Planner): ❌ Failed Immediately
- Received JSONL as issue_class: {"type":"result"...}
- Error: "Unknown issue class: {JSONL output...}"
- Workflow halted - cannot proceed without clean classification
```
## Solution Statement
Implement two critical fixes to enable proper Claude CLI integration:
**Fix 1: Extract result_text from JSONL Output**
- Add `result_text` field to `CommandExecutionResult` model
- Extract the "result" field value from JSONL's final result message in `AgentCLIExecutor`
- Update all `workflow_operations.py` functions to use `result.result_text` instead of `result.stdout`
- Preserve `stdout` for debugging (contains full JSONL stream)
**Fix 2: Replace $ARGUMENTS and Positional Placeholders**
- Modify `AgentCLIExecutor.build_command()` to replace `$ARGUMENTS` with actual arguments
- Support both `$ARGUMENTS` (all args) and `$1`, `$2`, `$3` (positional args)
- Pre-process command file content before passing to Claude CLI
- Remove old code that appended "Arguments: ..." to end of prompt
This enables atomic workflows to execute correctly with clean data flow between steps.
## Relevant Files
Use these files to implement the feature:
**Core Models** - Add result extraction field
- `python/src/agent_work_orders/models.py`:180-190 - CommandExecutionResult model needs result_text field to store extracted result
**Agent Executor** - Implement JSONL parsing and argument replacement
- `python/src/agent_work_orders/agent_executor/agent_cli_executor.py`:25-88 - build_command() needs $ARGUMENTS replacement logic (line 61-62 currently just appends args)
- `python/src/agent_work_orders/agent_executor/agent_cli_executor.py`:90-236 - execute_async() needs result_text extraction (around line 170-175)
- `python/src/agent_work_orders/agent_executor/agent_cli_executor.py`:337-363 - _extract_result_message() already extracts result dict, need to get "result" field value
**Workflow Operations** - Use extracted result_text instead of stdout
- `python/src/agent_work_orders/workflow_engine/workflow_operations.py`:26-79 - classify_issue() line 51 uses `result.stdout.strip()`
- `python/src/agent_work_orders/workflow_engine/workflow_operations.py`:82-155 - build_plan() line 133 uses `result.stdout`
- `python/src/agent_work_orders/workflow_engine/workflow_operations.py`:158-213 - find_plan_file() line 185 uses `result.stdout`
- `python/src/agent_work_orders/workflow_engine/workflow_operations.py`:216-267 - implement_plan() line 245 uses `result.stdout`
- `python/src/agent_work_orders/workflow_engine/workflow_operations.py`:270-326 - generate_branch() line 299 uses `result.stdout`
- `python/src/agent_work_orders/workflow_engine/workflow_operations.py`:329-385 - create_commit() line 358 uses `result.stdout`
- `python/src/agent_work_orders/workflow_engine/workflow_operations.py`:388-444 - create_pull_request() line 417 uses `result.stdout`
**Tests** - Update and add test coverage
- `python/tests/agent_work_orders/test_models.py` - Add tests for CommandExecutionResult with result_text field
- `python/tests/agent_work_orders/test_agent_executor.py` - Add tests for result extraction and argument replacement
- `python/tests/agent_work_orders/test_workflow_operations.py`:1-398 - Update ALL mocks to include result_text field (currently missing)
**Command Files** - Examples using $ARGUMENTS that need to work
- `.claude/commands/agent-work-orders/classify_issue.md`:19-21 - Uses `$ARGUMENTS` placeholder
- `.claude/commands/agent-work-orders/feature.md` - Uses `$ARGUMENTS` placeholder
- `.claude/commands/agent-work-orders/bug.md` - Uses positional `$1`, `$2`, `$3`
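For orientation, the JSONL scan behind `_extract_result_message()` presumably looks something like this sketch, reconstructed from the behavior described above rather than copied from the source:

```python
import json


def extract_result_message(stdout_text: str) -> dict | None:
    """Return the last JSONL message with type == "result", or None if absent."""
    result_message: dict | None = None
    for line in stdout_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON noise in the stream
        if isinstance(message, dict) and message.get("type") == "result":
            result_message = message  # keep the most recent result message
    return result_message
```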
### New Files
No new files needed - all changes are modifications to existing files.
## Implementation Plan
### Phase 1: Foundation - Model Enhancement
Add the result_text field to CommandExecutionResult so we can store the extracted result value separately from the raw JSONL stdout. This is a backward-compatible change.
### Phase 2: Core Implementation - Result Extraction
Implement the logic to parse JSONL output and extract the "result" field value into result_text during command execution in AgentCLIExecutor.
### Phase 3: Core Implementation - Argument Replacement
Implement placeholder replacement logic in build_command() to support $ARGUMENTS and $1, $2, $3 patterns in command files.
### Phase 4: Integration - Update Workflow Operations
Update all 7 workflow operation functions to use result_text instead of stdout for cleaner data flow between atomic steps.
### Phase 5: Testing and Validation
Comprehensive test coverage for both fixes and end-to-end validation with actual workflow execution.
## Step by Step Tasks
IMPORTANT: Execute every step in order, top to bottom.
### Add result_text Field to CommandExecutionResult Model
- Open `python/src/agent_work_orders/models.py`
- Locate the `CommandExecutionResult` class (line 180)
- Add new optional field after stdout:
```python
result_text: str | None = None
```
- Add inline comment above the field: `# Extracted result text from JSONL "result" field (if available)`
- Verify the model definition is complete and properly formatted
- Save the file
### Implement Result Text Extraction in execute_async()
- Open `python/src/agent_work_orders/agent_executor/agent_cli_executor.py`
- Locate the `execute_async()` method
- Find the section around line 170-175 where `_extract_result_message()` is called
- After line 173 `result_message = self._extract_result_message(stdout_text)`, add:
```python
# Extract result text from JSONL result message
result_text: str | None = None
if result_message and "result" in result_message:
    result_value = result_message.get("result")
    # Convert result to string (handles both str and other types)
    result_text = str(result_value) if result_value is not None else None
```
- Update the `CommandExecutionResult` instantiation (around line 191) to include the new field:
```python
result = CommandExecutionResult(
    success=success,
    stdout=stdout_text,
    result_text=result_text,  # NEW: Add this line
    stderr=stderr_text,
    exit_code=process.returncode or 0,
    session_id=session_id,
    error_message=error_message,
    duration_seconds=duration,
)
```
- Add debug logging after extraction (before the result object is created):
```python
if result_text:
    self._logger.debug(
        "result_text_extracted",
        result_text_preview=result_text[:100],  # slicing is safe for short strings too
        work_order_id=work_order_id,
    )
```
- Save the file
### Implement $ARGUMENTS Placeholder Replacement in build_command()
- Still in `python/src/agent_work_orders/agent_executor/agent_cli_executor.py`
- Locate the `build_command()` method (line 25-88)
- Find the section around line 60-62 that handles arguments
- Replace the current args handling code:
```python
# OLD CODE TO REMOVE:
# if args:
#     prompt_text += f"\n\nArguments: {', '.join(args)}"

# NEW CODE:
# Replace argument placeholders in prompt text
if args:
    # Replace $ARGUMENTS with the single arg (or all args joined if multiple)
    prompt_text = prompt_text.replace(
        "$ARGUMENTS", args[0] if len(args) == 1 else ", ".join(args)
    )
    # Replace positional placeholders ($1, $2, $3, etc.)
    for i, arg in enumerate(args, start=1):
        prompt_text = prompt_text.replace(f"${i}", arg)
```
- Save the file
### Update classify_issue() to Use result_text
- Open `python/src/agent_work_orders/workflow_engine/workflow_operations.py`
- Locate the `classify_issue()` function (starts at line 26)
- Find line 50-51 that extracts issue_class
- Replace with:
```python
# OLD: if result.success and result.stdout:
#          issue_class = result.stdout.strip()
# NEW: Use result_text which contains the extracted result
if result.success and result.result_text:
    issue_class = result.result_text.strip()
```
- Verify the rest of the function logic remains unchanged
- Save the file
### Update build_plan() to Use result_text
- Still in `python/src/agent_work_orders/workflow_engine/workflow_operations.py`
- Locate the `build_plan()` function (starts at line 82)
- Find line 133 in the success case
- Replace `output=result.stdout or ""` with:
```python
output=result.result_text or result.stdout or ""
```
- Note: we fall back to stdout for backward compatibility during the transition
- Save the file
### Update find_plan_file() to Use result_text
- Still in `python/src/agent_work_orders/workflow_engine/workflow_operations.py`
- Locate the `find_plan_file()` function (starts at line 158)
- Find line 185 that checks stdout
- Replace with:
```python
# OLD: if result.success and result.stdout and result.stdout.strip() != "0":
#          plan_file_path = result.stdout.strip()
# NEW: Use result_text
if result.success and result.result_text and result.result_text.strip() != "0":
    plan_file_path = result.result_text.strip()
```
- Save the file
### Update implement_plan() to Use result_text
- Still in `python/src/agent_work_orders/workflow_engine/workflow_operations.py`
- Locate the `implement_plan()` function (starts at line 216)
- Find line 245 in the success case
- Replace `output=result.stdout or ""` with:
```python
output=result.result_text or result.stdout or ""
```
- Save the file
### Update generate_branch() to Use result_text
- Still in `python/src/agent_work_orders/workflow_engine/workflow_operations.py`
- Locate the `generate_branch()` function (starts at line 270)
- Find line 298-299 that extracts branch_name
- Replace with:
```python
# OLD: if result.success and result.stdout:
#          branch_name = result.stdout.strip()
# NEW: Use result_text
if result.success and result.result_text:
    branch_name = result.result_text.strip()
```
- Save the file
### Update create_commit() to Use result_text
- Still in `python/src/agent_work_orders/workflow_engine/workflow_operations.py`
- Locate the `create_commit()` function (starts at line 329)
- Find line 357-358 that extracts commit_message
- Replace with:
```python
# OLD: if result.success and result.stdout:
#          commit_message = result.stdout.strip()
# NEW: Use result_text
if result.success and result.result_text:
    commit_message = result.result_text.strip()
```
- Save the file
### Update create_pull_request() to Use result_text
- Still in `python/src/agent_work_orders/workflow_engine/workflow_operations.py`
- Locate the `create_pull_request()` function (starts at line 388)
- Find line 416-417 that extracts pr_url
- Replace with:
```python
# OLD: if result.success and result.stdout:
#          pr_url = result.stdout.strip()
# NEW: Use result_text
if result.success and result.result_text:
    pr_url = result.result_text.strip()
```
- Save the file
- Verify all 7 workflow operations now use result_text
### Add Model Tests for result_text Field
- Open `python/tests/agent_work_orders/test_models.py`
- Add new test function at the end of the file:
```python
def test_command_execution_result_with_result_text():
    """Test CommandExecutionResult includes result_text field"""
    result = CommandExecutionResult(
        success=True,
        stdout='{"type":"result","result":"/feature"}',
        result_text="/feature",
        stderr=None,
        exit_code=0,
        session_id="session-123",
    )
    assert result.result_text == "/feature"
    assert result.stdout == '{"type":"result","result":"/feature"}'
    assert result.success is True


def test_command_execution_result_without_result_text():
    """Test CommandExecutionResult works without result_text (backward compatibility)"""
    result = CommandExecutionResult(
        success=True,
        stdout="raw output",
        stderr=None,
        exit_code=0,
    )
    assert result.result_text is None
    assert result.stdout == "raw output"
```
- Save the file
### Add Agent Executor Tests for Result Extraction
- Open `python/tests/agent_work_orders/test_agent_executor.py`
- Add new test function:
```python
@pytest.mark.asyncio
async def test_execute_async_extracts_result_text():
    """Test that result text is extracted from JSONL output"""
    executor = AgentCLIExecutor()
    # Mock subprocess that returns JSONL with result
    jsonl_output = '{"type":"session_started","session_id":"test-123"}\n{"type":"result","result":"/feature","is_error":false}'
    with patch("asyncio.create_subprocess_shell") as mock_subprocess:
        mock_process = AsyncMock()
        mock_process.communicate = AsyncMock(return_value=(jsonl_output.encode(), b""))
        mock_process.returncode = 0
        mock_subprocess.return_value = mock_process
        result = await executor.execute_async(
            "claude --print",
            "/tmp/test",
            prompt_text="test prompt",
            work_order_id="wo-test",
        )
        assert result.success is True
        assert result.result_text == "/feature"
        assert result.session_id == "test-123"
        assert '{"type":"result"' in result.stdout
```
- Save the file
### Add Agent Executor Tests for Argument Replacement
- Still in `python/tests/agent_work_orders/test_agent_executor.py`
- Add new test functions:
```python
def test_build_command_replaces_arguments_placeholder():
    """Test that $ARGUMENTS placeholder is replaced with actual arguments"""
    import os
    import tempfile

    executor = AgentCLIExecutor()
    # Create temp command file with $ARGUMENTS
    with tempfile.NamedTemporaryFile(mode="w", suffix=".md", delete=False) as f:
        f.write("Classify this issue:\n\n$ARGUMENTS")
        temp_file = f.name
    try:
        command, prompt = executor.build_command(
            temp_file,
            args=['{"title": "Add feature", "body": "description"}'],
        )
        assert "$ARGUMENTS" not in prompt
        assert '{"title": "Add feature"' in prompt
        assert "Classify this issue:" in prompt
    finally:
        os.unlink(temp_file)


def test_build_command_replaces_positional_arguments():
    """Test that $1, $2, $3 are replaced with positional arguments"""
    import os
    import tempfile

    executor = AgentCLIExecutor()
    with tempfile.NamedTemporaryFile(mode="w", suffix=".md", delete=False) as f:
        f.write("Issue: $1\nWorkOrder: $2\nData: $3")
        temp_file = f.name
    try:
        command, prompt = executor.build_command(
            temp_file,
            args=["42", "wo-test", '{"title":"Test"}'],
        )
        assert "$1" not in prompt
        assert "$2" not in prompt
        assert "$3" not in prompt
        assert "Issue: 42" in prompt
        assert "WorkOrder: wo-test" in prompt
        assert 'Data: {"title":"Test"}' in prompt
    finally:
        os.unlink(temp_file)
```
- Save the file
### Update All Workflow Operations Test Mocks
- Open `python/tests/agent_work_orders/test_workflow_operations.py`
- Find every `CommandExecutionResult` mock and add `result_text` field
- Update test_classify_issue_success (line 27-34):
```python
mock_executor.execute_async = AsyncMock(
    return_value=CommandExecutionResult(
        success=True,
        stdout='{"type":"result","result":"/feature"}',
        result_text="/feature",  # ADD THIS
        stderr=None,
        exit_code=0,
        session_id="session-123",
    )
)
```
- Repeat for all other test functions:
- test_build_plan_feature_success (line 93-100) - add `result_text="Plan created successfully"`
- test_build_plan_bug_success (line 128-135) - add `result_text="Bug plan created"`
- test_find_plan_file_success (line 180-187) - add `result_text="specs/issue-42-wo-test-planner-feature.md"`
- test_find_plan_file_not_found (line 213-220) - add `result_text="0"`
- test_implement_plan_success (line 243-250) - add `result_text="Implementation completed"`
- test_generate_branch_success (line 274-281) - add `result_text="feat-issue-42-wo-test-add-feature"`
- test_create_commit_success (line 307-314) - add `result_text="implementor: feat: add user authentication"`
- test_create_pull_request_success (line 339-346) - add `result_text="https://github.com/owner/repo/pull/123"`
- Save the file
### Run Model Unit Tests
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_models.py::test_command_execution_result_with_result_text -v`
- Verify test passes
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_models.py::test_command_execution_result_without_result_text -v`
- Verify test passes
### Run Agent Executor Unit Tests
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_agent_executor.py::test_execute_async_extracts_result_text -v`
- Verify result extraction test passes
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_agent_executor.py::test_build_command_replaces_arguments_placeholder -v`
- Verify $ARGUMENTS replacement test passes
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_agent_executor.py::test_build_command_replaces_positional_arguments -v`
- Verify positional argument test passes
### Run Workflow Operations Unit Tests
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_workflow_operations.py -v`
- Verify all 9+ tests pass with updated mocks
- Check for any assertion failures related to result_text
### Run Full Test Suite
- Execute: `cd python && uv run pytest tests/agent_work_orders/ -v`
- Target: 100% of tests pass
- If any tests fail, fix them immediately before proceeding
- Execute: `cd python && uv run pytest tests/agent_work_orders/ --cov=src/agent_work_orders --cov-report=term-missing`
- Verify >80% coverage for modified files
### Run Type Checking
- Execute: `cd python && uv run mypy src/agent_work_orders/models.py`
- Verify no type errors in models
- Execute: `cd python && uv run mypy src/agent_work_orders/agent_executor/agent_cli_executor.py`
- Verify no type errors in executor
- Execute: `cd python && uv run mypy src/agent_work_orders/workflow_engine/workflow_operations.py`
- Verify no type errors in workflow operations
### Run Linting
- Execute: `cd python && uv run ruff check src/agent_work_orders/models.py`
- Execute: `cd python && uv run ruff check src/agent_work_orders/agent_executor/agent_cli_executor.py`
- Execute: `cd python && uv run ruff check src/agent_work_orders/workflow_engine/workflow_operations.py`
- Fix any linting issues if found
### Run End-to-End Integration Test
- Start server: `cd python && uv run uvicorn src.agent_work_orders.main:app --port 8888 &`
- Wait for startup: `sleep 5`
- Test health: `curl http://localhost:8888/health`
- Create work order:
```bash
WORK_ORDER_ID=$(curl -X POST http://localhost:8888/agent-work-orders \
-H "Content-Type: application/json" \
-d '{
"repository_url": "https://github.com/Wirasm/dylan.git",
"sandbox_type": "git_branch",
"workflow_type": "agent_workflow_plan",
"github_issue_number": "1"
}' | jq -r '.agent_work_order_id')
echo "Work Order ID: $WORK_ORDER_ID"
```
- Monitor: `sleep 30`
- Check status: `curl http://localhost:8888/agent-work-orders/$WORK_ORDER_ID | jq`
- Check steps: `curl http://localhost:8888/agent-work-orders/$WORK_ORDER_ID/steps | jq '.steps[] | {step: .step, agent: .agent_name, success: .success, output: .output[:50]}'`
- Verify:
- Classifier step shows `output: "/feature"` (NOT JSONL)
- Planner step succeeded (received clean classification)
- All subsequent steps executed
- Final status is "completed" or shows specific error
- Inspect logs: `ls -la /tmp/agent-work-orders/*/`
- Check artifacts: `cat /tmp/agent-work-orders/$WORK_ORDER_ID/outputs/*.jsonl | grep '"result"'`
- Stop server: `pkill -f "uvicorn.*8888"`
### Validation Commands
Execute every command to validate the feature works correctly with zero regressions.
- `cd python && uv run pytest tests/agent_work_orders/test_models.py -v` - Verify model tests pass
- `cd python && uv run pytest tests/agent_work_orders/test_agent_executor.py -v` - Verify executor tests pass
- `cd python && uv run pytest tests/agent_work_orders/test_workflow_operations.py -v` - Verify workflow operations tests pass
- `cd python && uv run pytest tests/agent_work_orders/ -v` - All agent work orders tests
- `cd python && uv run pytest` - Entire backend test suite (zero regressions)
- `cd python && uv run mypy src/agent_work_orders/` - Type check all modified code
- `cd python && uv run ruff check src/agent_work_orders/` - Lint all modified code
- End-to-end test: Start server and create work order as documented above
- Verify classifier returns clean "/feature" not JSONL
- Verify planner receives correct classification
- Verify workflow completes successfully
## Testing Strategy
### Unit Tests
**CommandExecutionResult Model**
- Test result_text field accepts string values
- Test result_text field accepts None (optional)
- Test model serialization with result_text
- Test backward compatibility (result_text=None works)
**AgentCLIExecutor Result Extraction**
- Test extraction from valid JSONL with result field
- Test extraction when result is string
- Test extraction when result is number (should stringify)
- Test extraction when result is object (should stringify)
- Test no extraction when JSONL has no result message
- Test no extraction when result message missing "result" field
- Test handles malformed JSONL gracefully
**AgentCLIExecutor Argument Replacement**
- Test $ARGUMENTS with single argument
- Test $ARGUMENTS with multiple arguments
- Test $1, $2, $3 positional replacement
- Test mixed placeholders in one file
- Test no replacement when args is None
- Test no replacement when args is empty
- Test command without placeholders
**Workflow Operations**
- Test each operation uses result_text
- Test each operation handles None result_text
- Test fallback to stdout works
- Test clean output flows to next step
### Integration Tests
**Complete Workflow**
- Test full workflow with real JSONL parsing
- Test classifier → planner data flow
- Test each step receives clean input
- Test step history contains result_text values
- Test error handling when result_text is None
**Error Scenarios**
- Test malformed JSONL output
- Test missing result field in JSONL
- Test agent returns error in result
- Test $ARGUMENTS not in command file (should still work)
### Edge Cases
**JSONL Parsing**
- Result message not last in stream
- Multiple result messages
- Result with is_error:true
- Result value is null
- Result value is boolean true/false
- Result value is large object
- Result value contains newlines
**Argument Replacement**
- $ARGUMENTS appears multiple times
- Positional args exceed provided args count
- Args contain special characters
- Args contain literal $ character
- Very long arguments (>10KB)
- Empty string arguments
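One subtlety in this list: the naive `str.replace` loop shown earlier substitutes `$1` before it can see `$10`, so the prefix gets clobbered once there are ten or more arguments. A single-pass regex substitution avoids that; a sketch (not existing code):

```python
import re


def replace_positional_args(prompt_text: str, args: list[str]) -> str:
    """Replace $1..$N in one pass so $1 never swallows the prefix of $10."""
    def substitute(match: re.Match) -> str:
        index = int(match.group(1))
        # Leave placeholders without a matching argument untouched
        return args[index - 1] if 1 <= index <= len(args) else match.group(0)

    return re.sub(r"\$(\d+)", substitute, prompt_text)


assert replace_positional_args("a=$1 b=$10", ["x"]) == "a=x b=$10"
```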
**Backward Compatibility**
- Old commands without placeholders
- Workflow handles result_text=None gracefully
- stdout still accessible for debugging
## Acceptance Criteria
**Core Functionality:**
- ✅ CommandExecutionResult model has result_text field
- ✅ result_text extracted from JSONL "result" field
- ✅ $ARGUMENTS placeholder replaced with arguments
- ✅ $1, $2, $3 positional placeholders replaced
- ✅ All 7 workflow operations use result_text
- ✅ stdout preserved for debugging (backward compatible)
**Test Results:**
- ✅ All existing tests pass (zero regressions)
- ✅ New model tests pass
- ✅ New executor tests pass
- ✅ Updated workflow operations tests pass
- ✅ >80% test coverage for modified files
**Code Quality:**
- ✅ Type checking passes with no errors
- ✅ Linting passes with no warnings
- ✅ Code follows existing patterns
- ✅ Docstrings updated where needed
**End-to-End:**
- ✅ Classifier returns clean output: "/feature", "/bug", or "/chore"
- ✅ Planner receives correct issue class (not JSONL)
- ✅ All workflow steps execute successfully
- ✅ Step history shows clean result_text values
- ✅ Logs show result extraction working
- ✅ Complete workflow creates PR
## Validation Commands
```bash
# Unit Tests
cd python && uv run pytest tests/agent_work_orders/test_models.py -v
cd python && uv run pytest tests/agent_work_orders/test_agent_executor.py -v
cd python && uv run pytest tests/agent_work_orders/test_workflow_operations.py -v
# Full Suite
cd python && uv run pytest tests/agent_work_orders/ -v --tb=short
cd python && uv run pytest tests/agent_work_orders/ --cov=src/agent_work_orders --cov-report=term-missing
cd python && uv run pytest # All backend tests
# Quality Checks
cd python && uv run mypy src/agent_work_orders/
cd python && uv run ruff check src/agent_work_orders/
# Integration Test
cd python && uv run uvicorn src.agent_work_orders.main:app --port 8888 &
sleep 5
curl http://localhost:8888/health | jq
# Create test work order
WORK_ORDER=$(curl -X POST http://localhost:8888/agent-work-orders \
-H "Content-Type: application/json" \
-d '{"repository_url":"https://github.com/Wirasm/dylan.git","sandbox_type":"git_branch","workflow_type":"agent_workflow_plan","github_issue_number":"1"}' \
| jq -r '.agent_work_order_id')
echo "Work Order: $WORK_ORDER"
sleep 20
# Check execution
curl http://localhost:8888/agent-work-orders/$WORK_ORDER | jq
curl http://localhost:8888/agent-work-orders/$WORK_ORDER/steps | jq '.steps[] | {step, agent_name, success, output}'
# Verify logs
ls /tmp/agent-work-orders/*/outputs/
cat /tmp/agent-work-orders/*/outputs/*.jsonl | grep '"result"'
# Cleanup
pkill -f "uvicorn.*8888"
```
## Notes
**Design Decisions:**
- Preserve `stdout` containing raw JSONL for debugging
- `result_text` is the new preferred field for clean output
- Fallback to `stdout` in some workflow operations (defensive)
- Support both `$ARGUMENTS` and `$1, $2, $3` for flexibility
- Backward compatible - optional fields, graceful fallbacks
**Why This Fixes the Issue:**
```
Before Fix:
Classifier stdout: '{"type":"result","result":"/feature","is_error":false}'
Planner receives: '{"type":"result","result":"/feature","is_error":false}' ❌
Error: "Unknown issue class: {JSONL...}"
After Fix:
Classifier stdout: '{"type":"result","result":"/feature","is_error":false}'
Classifier result_text: "/feature"
Planner receives: "/feature" ✅
Success: Clean classification flows to next step
```
**Claude CLI JSONL Format:**
```json
{"type":"session_started","session_id":"abc-123"}
{"type":"text","text":"I'm analyzing..."}
{"type":"result","result":"/feature","is_error":false}
```
**Future Improvements:**
- Add result_json field for structured data
- Support more placeholder patterns (${ISSUE_NUMBER}, etc.)
- Validate command files have required placeholders
- Add metrics for result_text extraction success rate
- Consider streaming result extraction for long-running agents
**Migration Path:**
1. Add result_text field (backward compatible)
2. Extract in executor (backward compatible)
3. Update workflow operations (backward compatible - fallback)
4. Deploy and validate
5. Future: Remove stdout usage entirely


@@ -0,0 +1,724 @@
# Feature: Incremental Step History Tracking for Real-Time Workflow Observability
## Feature Description
Enable real-time progress visibility for Agent Work Orders by saving step history incrementally after each workflow step completes, rather than waiting until the end. This critical observability fix allows users to monitor workflow execution in real-time via the `/agent-work-orders/{id}/steps` API endpoint, providing immediate feedback on which steps have completed, which are in progress, and which have failed.
Currently, step history is only saved at two points: when the entire workflow completes successfully (line 260 in orchestrator) or when the workflow fails with an exception (line 269). This means users polling the steps endpoint see zero progress information until the workflow reaches one of these terminal states, creating a black-box execution experience that can last several minutes.
## User Story
As a developer using the Agent Work Orders system
I want to see real-time progress as each workflow step completes
So that I can monitor execution, debug failures quickly, and understand what the system is doing without waiting for the entire workflow to finish
## Problem Statement
The current implementation has a critical observability gap that prevents real-time progress tracking:
**Root Cause:**
- Step history is initialized at workflow start: `step_history = StepHistory(agent_work_order_id=agent_work_order_id)` (line 82)
- After each step executes, results are appended: `step_history.steps.append(result)` (lines 130, 150, 166, 186, 205, 224, 241)
- **BUT** step history is only saved to state at:
- Line 260: `await self.state_repository.save_step_history(...)` - After ALL 7 steps complete successfully
- Line 269: `await self.state_repository.save_step_history(...)` - In exception handler when workflow fails
**Impact:**
1. **Zero Real-Time Visibility**: Users polling `/agent-work-orders/{id}/steps` see an empty array until workflow completes or fails
2. **Poor Debugging Experience**: Cannot see which step failed until the entire workflow terminates
3. **Uncertain Progress**: Long-running workflows (3-5 minutes) appear frozen with no progress indication
4. **Wasted API Calls**: Clients poll repeatedly but get no new information until terminal state
5. **Bad User Experience**: Cannot show meaningful progress bars, step indicators, or real-time status updates in UI
**Example Scenario:**
```
User creates work order → Polls /steps endpoint every 3 seconds
0s: [] (empty)
3s: [] (empty)
6s: [] (empty)
... workflow running ...
120s: [] (empty)
123s: [] (empty)
... workflow running ...
180s: [all 7 steps] (suddenly all appear at once)
```
This creates a frustrating experience where users have no insight into what's happening for minutes at a time.
## Solution Statement
Implement incremental step history persistence by adding a single `await self.state_repository.save_step_history()` call immediately after each step result is appended to the history. This simple change enables real-time progress tracking with minimal code modification and zero performance impact.
**Implementation:**
- After each `step_history.steps.append(result)` call, immediately save: `await self.state_repository.save_step_history(agent_work_order_id, step_history)`
- Apply this pattern consistently across all 7 workflow steps
- Preserve existing end-of-workflow and error-handler saves for robustness
- No changes needed to API, models, or state repository (already supports incremental saves)
**Result:**
```
User creates work order → Polls /steps endpoint every 3 seconds
0s: [] (empty - workflow starting)
3s: [{classify step}] (classification complete!)
10s: [{classify}, {plan}] (planning complete!)
20s: [{classify}, {plan}, {find_plan}] (plan file found!)
... progress visible at each step ...
180s: [all 7 steps] (complete with full history)
```
This provides immediate feedback, enables meaningful progress UIs, and dramatically improves the developer experience.
## Relevant Files
Use these files to implement the feature:
**Core Implementation:**
- `python/src/agent_work_orders/workflow_engine/workflow_orchestrator.py` (lines 122-269)
- Main orchestration logic where step history is managed
- Currently appends to step_history but doesn't save incrementally
- Need to add `save_step_history()` calls after each step completion (7 locations)
- Lines to modify: 130, 150, 166, 186, 205, 224, 241 (add save call after each append)
**State Management (No Changes Needed):**
- `python/src/agent_work_orders/state_manager/work_order_repository.py` (lines 147-163)
- Already implements `save_step_history()` method with proper locking
- Thread-safe with asyncio.Lock for concurrent access
- Logs each save operation for observability
- Works perfectly for incremental saves - no modifications required
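For context, the save/get pair presumably follows this shape; the sketch below assumes an in-memory dict and Pydantic v2 models, and omits the logging the real method performs:

```python
import asyncio

from src.agent_work_orders.models import StepHistory


class WorkOrderRepositorySketch:
    """Sketch of the lock-guarded save/get pair described above."""

    def __init__(self) -> None:
        self._step_histories: dict[str, StepHistory] = {}
        self._lock = asyncio.Lock()

    async def save_step_history(self, work_order_id: str, history: StepHistory) -> None:
        # Store a deep copy under the lock so concurrent readers never see a partial update
        async with self._lock:
            self._step_histories[work_order_id] = history.model_copy(deep=True)

    async def get_step_history(self, work_order_id: str) -> StepHistory | None:
        async with self._lock:
            return self._step_histories.get(work_order_id)
```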
**API Layer (No Changes Needed):**
- `python/src/agent_work_orders/api/routes.py` (lines 220-240)
- Already implements `GET /agent-work-orders/{id}/steps` endpoint
- Returns step history from state repository
- Will automatically return incremental results once orchestrator saves them
**Models (No Changes Needed):**
- `python/src/agent_work_orders/models.py` (lines 213-246)
- `StepHistory` model suits snapshot-style saves (each save persists the complete history so far)
- `StepExecutionResult` captures all step details
- Models already support incremental history updates
### New Files
No new files needed - this is a simple enhancement to existing workflow orchestrator.
## Implementation Plan
### Phase 1: Foundation - Add Incremental Saves After Each Step
Add `save_step_history()` calls immediately after each step result is appended to enable real-time progress tracking. This is the core fix.
### Phase 2: Testing - Verify Real-Time Updates
Create comprehensive tests to verify step history is saved incrementally and accessible via API throughout workflow execution.
### Phase 3: Validation - End-to-End Testing
Validate with real workflow execution that step history appears incrementally when polling the steps endpoint.
## Step by Step Tasks
IMPORTANT: Execute every step in order, top to bottom.
### Read Current Implementation
- Open `python/src/agent_work_orders/workflow_engine/workflow_orchestrator.py`
- Review the workflow execution flow from lines 122-269
- Identify all 7 locations where `step_history.steps.append()` is called
- Understand the pattern: append result → log completion → (currently missing: save history)
- Note that `save_step_history()` already exists in state_repository and is thread-safe
### Add Incremental Save After Classify Step
- Locate line 130: `step_history.steps.append(classify_result)`
- Immediately after line 130, add:
```python
await self.state_repository.save_step_history(agent_work_order_id, step_history)
```
- This enables visibility of classification result in real-time
- Save the file
### Add Incremental Save After Plan Step
- Locate line 150: `step_history.steps.append(plan_result)`
- Immediately after line 150, add:
```python
await self.state_repository.save_step_history(agent_work_order_id, step_history)
```
- This enables visibility of planning result in real-time
- Save the file
### Add Incremental Save After Find Plan Step
- Locate line 166: `step_history.steps.append(plan_finder_result)`
- Immediately after line 166, add:
```python
await self.state_repository.save_step_history(agent_work_order_id, step_history)
```
- This enables visibility of plan file discovery in real-time
- Save the file
### Add Incremental Save After Branch Generation Step
- Locate line 186: `step_history.steps.append(branch_result)`
- Immediately after line 186, add:
```python
await self.state_repository.save_step_history(agent_work_order_id, step_history)
```
- This enables visibility of branch creation in real-time
- Save the file
### Add Incremental Save After Implementation Step
- Locate line 205: `step_history.steps.append(implement_result)`
- Immediately after line 205, add:
```python
await self.state_repository.save_step_history(agent_work_order_id, step_history)
```
- This enables visibility of implementation progress in real-time
- This is especially important as implementation can take 1-2 minutes
- Save the file
### Add Incremental Save After Commit Step
- Locate line 224: `step_history.steps.append(commit_result)`
- Immediately after line 224, add:
```python
await self.state_repository.save_step_history(agent_work_order_id, step_history)
```
- This enables visibility of commit creation in real-time
- Save the file
### Add Incremental Save After PR Creation Step
- Locate line 241: `step_history.steps.append(pr_result)`
- Immediately after line 241, add:
```python
await self.state_repository.save_step_history(agent_work_order_id, step_history)
```
- This enables visibility of PR creation result in real-time
- Save the file
- Verify all 7 locations now have incremental saves
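If repeating the save call seven times feels error-prone, the append-then-save pattern could be captured in a small helper; this refactor is optional and the helper below is hypothetical, not part of the steps above:

```python
from src.agent_work_orders.models import StepExecutionResult, StepHistory


async def record_step(
    state_repository,
    agent_work_order_id: str,
    step_history: StepHistory,
    result: StepExecutionResult,
) -> None:
    """Append a step result and persist the updated history immediately."""
    step_history.steps.append(result)
    await state_repository.save_step_history(agent_work_order_id, step_history)
```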
### Add Comprehensive Unit Test for Incremental Saves
- Open `python/tests/agent_work_orders/test_workflow_engine.py`
- Add new test function at the end of file:
```python
@pytest.mark.asyncio
async def test_orchestrator_saves_step_history_incrementally():
    """Test that step history is saved after each step, not just at the end"""
    from src.agent_work_orders.models import (
        CommandExecutionResult,
        StepExecutionResult,
        WorkflowStep,
    )
    from src.agent_work_orders.workflow_engine.agent_names import CLASSIFIER

    # Create mocks
    mock_executor = MagicMock()
    mock_sandbox_factory = MagicMock()
    mock_github_client = MagicMock()
    mock_phase_tracker = MagicMock()
    mock_command_loader = MagicMock()
    mock_state_repository = MagicMock()

    # Track save_step_history calls
    save_calls = []

    async def track_save(wo_id, history):
        save_calls.append(len(history.steps))

    mock_state_repository.save_step_history = AsyncMock(side_effect=track_save)
    mock_state_repository.update_status = AsyncMock()
    mock_state_repository.update_git_branch = AsyncMock()

    # Mock sandbox
    mock_sandbox = MagicMock()
    mock_sandbox.working_dir = "/tmp/test"
    mock_sandbox.setup = AsyncMock()
    mock_sandbox.cleanup = AsyncMock()
    mock_sandbox_factory.create_sandbox = MagicMock(return_value=mock_sandbox)

    # Mock GitHub client
    mock_github_client.get_issue = AsyncMock(return_value={
        "title": "Test Issue",
        "body": "Test body",
    })

    # Create orchestrator
    orchestrator = WorkflowOrchestrator(
        agent_executor=mock_executor,
        sandbox_factory=mock_sandbox_factory,
        github_client=mock_github_client,
        phase_tracker=mock_phase_tracker,
        command_loader=mock_command_loader,
        state_repository=mock_state_repository,
    )

    # Mock workflow operations to return success for all steps
    # (a single parenthesized `with` avoids seven levels of nesting)
    with (
        patch("src.agent_work_orders.workflow_engine.workflow_operations.classify_issue") as mock_classify,
        patch("src.agent_work_orders.workflow_engine.workflow_operations.build_plan") as mock_plan,
        patch("src.agent_work_orders.workflow_engine.workflow_operations.find_plan_file") as mock_find,
        patch("src.agent_work_orders.workflow_engine.workflow_operations.generate_branch") as mock_branch,
        patch("src.agent_work_orders.workflow_engine.workflow_operations.implement_plan") as mock_implement,
        patch("src.agent_work_orders.workflow_engine.workflow_operations.create_commit") as mock_commit,
        patch("src.agent_work_orders.workflow_engine.workflow_operations.create_pull_request") as mock_pr,
    ):
        # Mock successful results for each step
        mock_classify.return_value = StepExecutionResult(
            step=WorkflowStep.CLASSIFY,
            agent_name=CLASSIFIER,
            success=True,
            output="/feature",
            duration_seconds=1.0,
        )
        mock_plan.return_value = StepExecutionResult(
            step=WorkflowStep.PLAN,
            agent_name="planner",
            success=True,
            output="Plan created",
            duration_seconds=2.0,
        )
        mock_find.return_value = StepExecutionResult(
            step=WorkflowStep.FIND_PLAN,
            agent_name="plan_finder",
            success=True,
            output="specs/plan.md",
            duration_seconds=0.5,
        )
        mock_branch.return_value = StepExecutionResult(
            step=WorkflowStep.GENERATE_BRANCH,
            agent_name="branch_generator",
            success=True,
            output="feat-issue-1-wo-test",
            duration_seconds=1.0,
        )
        mock_implement.return_value = StepExecutionResult(
            step=WorkflowStep.IMPLEMENT,
            agent_name="implementor",
            success=True,
            output="Implementation complete",
            duration_seconds=5.0,
        )
        mock_commit.return_value = StepExecutionResult(
            step=WorkflowStep.COMMIT,
            agent_name="committer",
            success=True,
            output="Commit created",
            duration_seconds=1.0,
        )
        mock_pr.return_value = StepExecutionResult(
            step=WorkflowStep.CREATE_PR,
            agent_name="pr_creator",
            success=True,
            output="https://github.com/owner/repo/pull/1",
            duration_seconds=1.0,
        )

        # Execute workflow
        await orchestrator.execute_workflow(
            agent_work_order_id="wo-test",
            workflow_type=AgentWorkflowType.PLAN,
            repository_url="https://github.com/owner/repo",
            sandbox_type=SandboxType.GIT_BRANCH,
            user_request="Test feature request",
        )

    # Verify save_step_history was called after EACH step (7 times) + final save (8 total)
    # OR at minimum, verify it was called MORE than just once at the end
    assert len(save_calls) >= 7, f"Expected at least 7 incremental saves, got {len(save_calls)}"
    # Verify the progression: 1 step, 2 steps, 3 steps, etc.
    assert save_calls[0] == 1, "First save should have 1 step"
    assert save_calls[1] == 2, "Second save should have 2 steps"
    assert save_calls[2] == 3, "Third save should have 3 steps"
    assert save_calls[3] == 4, "Fourth save should have 4 steps"
    assert save_calls[4] == 5, "Fifth save should have 5 steps"
    assert save_calls[5] == 6, "Sixth save should have 6 steps"
    assert save_calls[6] == 7, "Seventh save should have 7 steps"
```
- Save the file
### Add Integration Test for Real-Time Step Visibility
- Still in `python/tests/agent_work_orders/test_workflow_engine.py`
- Add another test function:
```python
@pytest.mark.asyncio
async def test_step_history_visible_during_execution():
    """Test that step history can be retrieved during workflow execution"""
    from src.agent_work_orders.models import StepExecutionResult, StepHistory, WorkflowStep
    from src.agent_work_orders.state_manager.work_order_repository import WorkOrderRepository

    # Create real state repository (in-memory)
    state_repo = WorkOrderRepository()
    # Create empty step history
    step_history = StepHistory(agent_work_order_id="wo-test")

    # Simulate incremental saves during workflow
    # Step 1: Classify
    step_history.steps.append(StepExecutionResult(
        step=WorkflowStep.CLASSIFY,
        agent_name="classifier",
        success=True,
        output="/feature",
        duration_seconds=1.0,
    ))
    await state_repo.save_step_history("wo-test", step_history)

    # Retrieve and verify
    retrieved = await state_repo.get_step_history("wo-test")
    assert retrieved is not None
    assert len(retrieved.steps) == 1
    assert retrieved.steps[0].step == WorkflowStep.CLASSIFY

    # Step 2: Plan
    step_history.steps.append(StepExecutionResult(
        step=WorkflowStep.PLAN,
        agent_name="planner",
        success=True,
        output="Plan created",
        duration_seconds=2.0,
    ))
    await state_repo.save_step_history("wo-test", step_history)

    # Retrieve and verify progression
    retrieved = await state_repo.get_step_history("wo-test")
    assert len(retrieved.steps) == 2
    assert retrieved.steps[1].step == WorkflowStep.PLAN
    # Verify both steps are present
    assert retrieved.steps[0].step == WorkflowStep.CLASSIFY
    assert retrieved.steps[1].step == WorkflowStep.PLAN
```
- Save the file
### Run Unit Tests for Workflow Engine
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_workflow_engine.py::test_orchestrator_saves_step_history_incrementally -v`
- Verify the test passes and confirms incremental saves occur
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_workflow_engine.py::test_step_history_visible_during_execution -v`
- Verify the test passes
- Fix any failures before proceeding
### Run All Workflow Engine Tests
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_workflow_engine.py -v`
- Ensure all existing tests still pass (zero regressions)
- Verify new tests are included in the run
- Fix any failures
### Run Complete Agent Work Orders Test Suite
- Execute: `cd python && uv run pytest tests/agent_work_orders/ -v`
- Ensure all tests across all modules pass
- This validates no regressions were introduced
- Pay special attention to state manager and API tests
- Fix any failures
### Run Type Checking
- Execute: `cd python && uv run mypy src/agent_work_orders/workflow_engine/workflow_orchestrator.py`
- Verify no type errors in the orchestrator
- Execute: `cd python && uv run mypy src/agent_work_orders/`
- Verify no type errors in the entire module
- Fix any type issues
### Run Linting
- Execute: `cd python && uv run ruff check src/agent_work_orders/workflow_engine/workflow_orchestrator.py`
- Verify no linting issues in orchestrator
- Execute: `cd python && uv run ruff check src/agent_work_orders/`
- Verify no linting issues in entire module
- Fix any issues found
### Perform Manual End-to-End Validation
- Start the Agent Work Orders server:
```bash
cd python && uv run uvicorn src.agent_work_orders.main:app --port 8888 &
```
- Wait for startup: `sleep 5`
- Verify health: `curl http://localhost:8888/health | jq`
- Create a test work order:
```bash
WORK_ORDER_ID=$(curl -s -X POST http://localhost:8888/agent-work-orders \
-H "Content-Type: application/json" \
-d '{
"repository_url": "https://github.com/Wirasm/dylan.git",
"sandbox_type": "git_branch",
"workflow_type": "agent_workflow_plan",
"user_request": "Add a test feature for real-time step tracking validation"
}' | jq -r '.agent_work_order_id')
echo "Created work order: $WORK_ORDER_ID"
```
- Immediately start polling for steps (in a loop or manually):
```bash
# Poll every 3 seconds to observe real-time progress
for i in {1..60}; do
echo "=== Poll $i ($(date +%H:%M:%S)) ==="
curl -s http://localhost:8888/agent-work-orders/$WORK_ORDER_ID/steps | jq '.steps | length'
curl -s http://localhost:8888/agent-work-orders/$WORK_ORDER_ID/steps | jq '.steps[-1] | {step: .step, agent: .agent_name, success: .success}'
sleep 3
done
```
- Observe that step count increases incrementally: 0 → 1 → 2 → 3 → 4 → 5 → 6 → 7
- Verify each step appears immediately after completion (not all at once at the end)
- Verify you can see progress in real-time
- Check final status: `curl http://localhost:8888/agent-work-orders/$WORK_ORDER_ID | jq '{status: .status, steps_completed: (.git_commit_count // 0)}'`
- Stop the server: `pkill -f "uvicorn.*8888"`
### Document the Improvement
- Open `PRPs/specs/agent-work-orders-mvp-v2.md` (or relevant spec file)
- Add a note in the Observability or Implementation Notes section:
```markdown
### Real-Time Progress Tracking
Step history is saved incrementally after each workflow step completes, enabling
real-time progress visibility via the `/agent-work-orders/{id}/steps` endpoint.
This allows users to monitor execution as it happens rather than waiting for the
entire workflow to complete.
Implementation: `save_step_history()` is called after each `steps.append()` in
the workflow orchestrator, providing immediate feedback to polling clients.
```
- Save the file
### Run Final Validation Commands
- Execute all validation commands listed in the Validation Commands section below
- Ensure every command executes successfully
- Verify zero regressions across the entire codebase
- Confirm real-time progress tracking works end-to-end
## Testing Strategy
### Unit Tests
**Workflow Orchestrator Tests:**
- Test that `save_step_history()` is called after each workflow step
- Test that step history is saved 7+ times during successful execution (once per step + final save)
- Test that step count increases incrementally (1, 2, 3, 4, 5, 6, 7)
- Test that step history is saved even when workflow fails mid-execution
- Test that each save contains all steps completed up to that point (see the sketch below)
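A minimal pytest-style sketch of these checks, assuming the suite's pytest-asyncio setup and an orchestrator that accepts an injectable state repository; `orchestrator_factory`, the `execute()` entry point, and the positional-argument call shape are illustrative placeholders, not the verified API:
```python
from unittest.mock import AsyncMock

import pytest


@pytest.mark.asyncio
async def test_saves_occur_after_every_step(orchestrator_factory):
    repo = AsyncMock()  # records every save_step_history() await
    orchestrator = orchestrator_factory(state_repository=repo)  # hypothetical fixture

    await orchestrator.execute("wo-123")  # hypothetical entry point

    # At least one save per completed step (7 for the full workflow).
    assert repo.save_step_history.await_count >= 7

    # Each successive save should contain one more step than the last,
    # assuming the history object is the second positional argument.
    step_counts = [
        len(call.args[1].steps) for call in repo.save_step_history.await_args_list
    ]
    assert step_counts[:7] == [1, 2, 3, 4, 5, 6, 7]
```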
**State Repository Tests:**
- Test that `save_step_history()` handles concurrent calls safely (already implemented with `asyncio.Lock`; see the sketch below)
- Test that retrieving step history returns the most recently saved version
- Test that step history can be saved and retrieved multiple times for same work order
- Test that step history overwrites previous version (not appends)
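A hedged sketch of the concurrency check, assuming an async repository with `save_step_history()` and `get_step_history()` methods; the `state_repository` and `make_history` fixtures are hypothetical stand-ins for whatever the real suite provides:
```python
import asyncio

import pytest


@pytest.mark.asyncio
async def test_concurrent_saves_are_serialized(state_repository, make_history):
    # make_history(n) is assumed to build a StepHistory containing n steps.
    # Fire overlapping saves for the same work order; the repository's
    # internal asyncio.Lock should serialize them without corruption.
    await asyncio.gather(
        *(state_repository.save_step_history("wo-123", make_history(n)) for n in range(1, 8))
    )

    saved = await state_repository.get_step_history("wo-123")
    # Saves overwrite rather than append, so exactly one complete version
    # survives, whichever write landed last.
    assert len(saved.steps) in range(1, 8)
```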
### Integration Tests
**End-to-End Workflow Tests:**
- Test that step history can be retrieved via API during workflow execution
- Test that polling `/agent-work-orders/{id}/steps` shows progressive updates
- Test that step history contains correct number of steps at each save point
- Test that step history is accessible immediately after each step completes
- Test that failed steps are visible in step history before workflow terminates
**API Integration Tests:**
- Test GET `/agent-work-orders/{id}/steps` returns empty array before first step (sketched below)
- Test GET `/agent-work-orders/{id}/steps` returns 1 step after classification
- Test GET `/agent-work-orders/{id}/steps` returns N steps after N steps complete
- Test GET `/agent-work-orders/{id}/steps` returns complete history after workflow finishes
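A sketch of the empty-array case using FastAPI's `TestClient`; the app import path comes from the uvicorn command used elsewhere in this spec, while the `unstarted_work_order_id` fixture (a work order created without starting the workflow) is hypothetical:
```python
from fastapi.testclient import TestClient

from src.agent_work_orders.main import app  # app path taken from the uvicorn command above


def test_steps_endpoint_empty_before_first_step(unstarted_work_order_id):
    client = TestClient(app)

    response = client.get(f"/agent-work-orders/{unstarted_work_order_id}/steps")

    assert response.status_code == 200
    # Response shape inferred from this doc's jq queries: {"steps": [...]}
    assert response.json()["steps"] == []
```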
### Edge Cases
**Concurrent Access:**
- Multiple clients polling `/agent-work-orders/{id}/steps` simultaneously
- Step history being saved while another request reads it (handled by `asyncio.Lock`)
- Workflow fails while client is retrieving step history
**Performance:**
- Large step history (7 steps × 100+ lines of output each) saved multiple times
- Multiple work orders executing simultaneously with incremental saves
- High polling frequency (1 second intervals) during workflow execution
**Failure Scenarios:**
- Step history save fails (network/disk error) - workflow should continue (see the sketch after this list)
- Step history is saved but retrieval fails - should return appropriate error
- Workflow interrupted mid-execution - partial step history should be preserved
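One way the save-failure scenario could be handled, sketched with structlog as a stand-in for whatever structured logger the module actually uses; the function and event names are illustrative:
```python
import structlog

logger = structlog.get_logger()


async def _save_step_history_safely(state_repository, work_order_id, history):
    """Persist step history without letting a save failure abort the workflow."""
    try:
        await state_repository.save_step_history(work_order_id, history)
    except Exception:
        # Losing one incremental save only delays visibility by one step;
        # the next successful save carries the full history anyway.
        logger.exception("step_history_save_failed", work_order_id=work_order_id)
```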
## Acceptance Criteria
**Core Functionality:**
- ✅ Step history is saved after each workflow step completes
- ✅ Step history is saved at least 7 times during successful workflow execution (once per step, plus the final save)
- ✅ Each incremental save contains all steps completed up to that point
- ✅ Step history is accessible via API immediately after each step
- ✅ Real-time progress visible when polling `/agent-work-orders/{id}/steps`
**Backward Compatibility:**
- ✅ All existing tests pass without modification
- ✅ API behavior unchanged (same endpoints, same response format)
- ✅ No breaking changes to models or state repository
- ✅ Performance impact negligible (save operations are fast)
**Testing:**
- ✅ New unit test verifies incremental saves occur
- ✅ New integration test verifies step history visibility during execution
- ✅ All existing workflow engine tests pass
- ✅ All agent work orders tests pass
- ✅ Manual end-to-end test confirms real-time progress tracking
**Code Quality:**
- ✅ Type checking passes (mypy)
- ✅ Linting passes (ruff)
- ✅ Code follows existing patterns and conventions
- ✅ Structured logging used for save operations
**Documentation:**
- ✅ Implementation documented in spec file
- ✅ Acceptance criteria met and verified
- ✅ Validation commands executed successfully
## Validation Commands
Execute every command below to validate that the feature works correctly and introduces zero regressions.
```bash
# Unit Tests - Verify incremental saves
cd python && uv run pytest tests/agent_work_orders/test_workflow_engine.py::test_orchestrator_saves_step_history_incrementally -v
cd python && uv run pytest tests/agent_work_orders/test_workflow_engine.py::test_step_history_visible_during_execution -v
# Workflow Engine Tests - Ensure no regressions
cd python && uv run pytest tests/agent_work_orders/test_workflow_engine.py -v
# State Manager Tests - Verify save_step_history works correctly
cd python && uv run pytest tests/agent_work_orders/test_state_manager.py -v
# API Tests - Ensure steps endpoint still works
cd python && uv run pytest tests/agent_work_orders/test_api.py -v
# Complete Agent Work Orders Test Suite
cd python && uv run pytest tests/agent_work_orders/ -v --tb=short
# Type Checking
cd python && uv run mypy src/agent_work_orders/workflow_engine/workflow_orchestrator.py
cd python && uv run mypy src/agent_work_orders/
# Linting
cd python && uv run ruff check src/agent_work_orders/workflow_engine/workflow_orchestrator.py
cd python && uv run ruff check src/agent_work_orders/
# Full Backend Test Suite (zero regressions)
cd python && uv run pytest
# Manual End-to-End Validation
cd python && uv run uvicorn src.agent_work_orders.main:app --port 8888 &
sleep 5
curl http://localhost:8888/health | jq
# Create work order
WORK_ORDER_ID=$(curl -s -X POST http://localhost:8888/agent-work-orders \
-H "Content-Type: application/json" \
-d '{"repository_url":"https://github.com/Wirasm/dylan.git","sandbox_type":"git_branch","workflow_type":"agent_workflow_plan","user_request":"Test real-time progress"}' \
| jq -r '.agent_work_order_id')
echo "Work Order: $WORK_ORDER_ID"
# Poll for real-time progress (observe step count increase: 0->1->2->3->4->5->6->7)
for i in {1..30}; do
STEP_COUNT=$(curl -s http://localhost:8888/agent-work-orders/$WORK_ORDER_ID/steps | jq '.steps | length')
LAST_STEP=$(curl -s http://localhost:8888/agent-work-orders/$WORK_ORDER_ID/steps | jq -r '.steps[-1].step // "none"')
echo "Poll $i: $STEP_COUNT steps completed, last: $LAST_STEP"
sleep 3
done
# Verify final state
curl http://localhost:8888/agent-work-orders/$WORK_ORDER_ID | jq '{status: .status}'
curl http://localhost:8888/agent-work-orders/$WORK_ORDER_ID/steps | jq '.steps | length'
# Cleanup
pkill -f "uvicorn.*8888"
```
## Notes
### Performance Considerations
**Save Operation Performance:**
- `save_step_history()` is a fast in-memory operation (Phase 1 MVP)
- Uses `asyncio.Lock` to prevent race conditions
- No network I/O or disk writes in current implementation
- Future Supabase migration (Phase 2) will add network latency but async execution prevents blocking
**Impact Analysis:**
- Adding 7 incremental saves adds ~7ms total overhead (1ms per save in-memory)
- This is negligible compared to agent execution time (30-60 seconds per step)
- Total workflow time increase: <0.01% (unmeasurable)
- Trade-off: Tiny performance cost for massive observability improvement
### Why This Fix is Critical
**User Experience Impact:**
- **Before**: Black-box execution with 3-5 minute wait, zero feedback
- **After**: Real-time progress updates every 30-60 seconds as steps complete
**Debugging Benefits:**
- Immediately see which step failed without waiting for entire workflow
- Monitor long-running implementation steps for progress
- Identify bottlenecks in workflow execution
**API Efficiency:**
- Clients still poll every 3 seconds, but now get meaningful updates
- Reduces frustrated users refreshing pages or restarting work orders
- Enables progress bars, step indicators, and real-time status UIs
### Implementation Simplicity
This is one of the simplest high-value features to implement:
- **7 lines of code** (one `await save_step_history()` call per step; see the sketch after this list)
- **Zero API changes** (existing endpoint already works)
- **Zero model changes** (StepHistory already supports this pattern)
- **Zero state repository changes** (`save_step_history()` already thread-safe)
- **High impact** (transforms user experience from frustrating to delightful)
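A sketch of that per-step pattern, with an assumed import path for `StepHistory` and illustrative parameter names:
```python
from src.agent_work_orders.models import StepHistory  # assumed import path


async def record_step(state_repository, work_order_id: str, steps: list, result) -> None:
    """Append a completed step and persist immediately (the one-line fix, once per step)."""
    steps.append(result)
    # Previously history was saved only after the final step; saving here makes
    # progress visible to polling clients as soon as each step completes.
    await state_repository.save_step_history(work_order_id, StepHistory(steps=steps))
```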
### Future Enhancements
**Phase 2 - Supabase Persistence:**
- When migrating to Supabase, the same incremental save pattern works
- May want to batch saves (every 2-3 steps) to reduce DB writes (see the sketch below)
- Consider write-through cache for high-frequency polling
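A possible shape for that batching, sketched under the assumption that the repository interface stays the same; the batch size and helper name are illustrative:
```python
from src.agent_work_orders.models import StepHistory  # assumed import path

SAVE_EVERY_N_STEPS = 2  # assumed batch size; tune against observed DB write costs


async def maybe_save(state_repository, work_order_id: str, steps: list, is_final: bool) -> None:
    """Flush history every N steps, and always on the final step."""
    if is_final or len(steps) % SAVE_EVERY_N_STEPS == 0:
        await state_repository.save_step_history(work_order_id, StepHistory(steps=steps))
```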
**Phase 3 - WebSocket Support:**
- Instead of polling, push step updates via WebSocket
- Even better real-time experience with lower latency
- Incremental saves still required as source of truth
**Advanced Observability:**
- Add step timing metrics (time between saves = step duration; sketched below)
- Track which steps consistently take longest
- Alert on unusually slow step execution
- Historical analysis of workflow performance
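A sketch of deriving step durations from saved history; `completed_at` is a hypothetical timestamp field, since the step model shown in this spec only demonstrably carries `step`, `agent_name`, and `success`:
```python
from datetime import datetime, timedelta


def step_durations(steps) -> list[timedelta]:
    """Duration of each step after the first, from consecutive completion times."""
    # `completed_at` is a hypothetical field on each saved step record.
    times: list[datetime] = [s.completed_at for s in steps]
    return [later - earlier for earlier, later in zip(times, times[1:])]
```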
### Testing Philosophy
**Focus on Real-Time Visibility:**
- Primary test: verify saves occur after each step (not just at end)
- Secondary test: verify step count progression (1, 2, 3, 4, 5, 6, 7)
- Integration test: confirm API returns incremental results during execution
- Manual test: observe real-time progress while the workflow runs
**Regression Prevention:**
- All existing tests must pass unchanged
- No API contract changes
- No model changes
- Performance impact measured and confirmed negligible
### Related Documentation
- Agent Work Orders MVP v2 Spec: `PRPs/specs/agent-work-orders-mvp-v2.md`
- Atomic Workflow Execution: `PRPs/specs/atomic-workflow-execution-refactor.md`
- PRD: `PRPs/PRD.md`