Mirror of https://github.com/coleam00/Archon.git (synced 2025-12-23 18:29:18 -05:00)
refactor: simplify workflow to user-selectable 6-command architecture
Simplifies the workflow orchestrator from hardcoded 11-step atomic operations to user-selectable 6-command workflow with context passing.

Core changes:
- WorkflowStep enum: 11 steps → 6 commands (create-branch, planning, execute, commit, create-pr, prp-review)
- workflow_orchestrator.py: 367 lines → 200 lines with command stitching loop
- Remove workflow_type field, add selected_commands parameter
- Simplify agent names from 11 → 6 constants
- Remove test/review phase config flags (now optional commands)

Deletions:
- Remove test_workflow.py, review_workflow.py, workflow_phase_tracker.py
- Remove 32 old command files from .claude/commands
- Remove PRPs/specs and PRD files from version control
- Update .gitignore to exclude specs, features, and validation markdown files

Breaking changes:
- AgentWorkOrder no longer has workflow_type field
- CreateAgentWorkOrderRequest now uses selected_commands instead of workflow_type
- WorkflowStep enum values incompatible with old step history

56 files changed, 625 insertions(+), 15,007 deletions(-)
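To make the described architecture concrete, here is a minimal sketch of a user-selectable command pipeline with context passing. The six enum values come from the commit message above; the function, field names, and the way a command is "run" are hypothetical placeholders, not the actual `workflow_orchestrator.py` code.

```python
# Illustrative sketch of the 6-command design described above, not the real implementation.
from enum import Enum


class WorkflowStep(str, Enum):
    CREATE_BRANCH = "create-branch"
    PLANNING = "planning"
    EXECUTE = "execute"
    COMMIT = "commit"
    CREATE_PR = "create-pr"
    PRP_REVIEW = "prp-review"


def run_selected_commands(selected_commands: list[WorkflowStep], work_order_id: str) -> dict[str, str]:
    """Stitch the user-selected commands together, passing context forward.

    Args:
        selected_commands: Commands chosen by the user, in execution order.
        work_order_id: Identifier threaded through every command.

    Returns:
        Mapping of command value to its (placeholder) output.
    """
    context: dict[str, str] = {"work_order_id": work_order_id}
    outputs: dict[str, str] = {}
    for command in selected_commands:
        # A real orchestrator would invoke the agent here with the matching
        # .claude/commands/agent-work-orders/<command>.md prompt.
        result = f"ran {command.value} with context keys {sorted(context)}"
        outputs[command.value] = result
        context[f"{command.value}_output"] = result  # context passed to the next command
    return outputs
```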
@@ -1,56 +0,0 @@
|
||||
# Agent Workflow: Plan
|
||||
|
||||
You are executing a planning workflow for a GitHub issue or project task.
|
||||
|
||||
## Your Task
|
||||
|
||||
1. Read the GitHub issue description (if provided via issue number)
|
||||
2. Analyze the requirements thoroughly
|
||||
3. Create a detailed implementation plan
|
||||
4. Save the plan to `PRPs/specs/plan-{work_order_id}.md`
|
||||
5. Create a git branch named `feat-wo-{work_order_id}`
|
||||
6. Commit all changes to git with clear commit messages
|
||||
|
||||
## Branch Naming
|
||||
|
||||
Use format: `feat-wo-{work_order_id}`
|
||||
|
||||
Example: `feat-wo-a3c2f1e4`
|
||||
|
||||
## Commit Message Format
|
||||
|
||||
```
|
||||
plan: Create implementation plan for work order
|
||||
|
||||
- Analyzed requirements
|
||||
- Created detailed plan
|
||||
- Documented approach
|
||||
|
||||
Work Order: {work_order_id}
|
||||
```
|
||||
|
||||
## Deliverables
|
||||
|
||||
- Git branch created following naming convention
|
||||
- `PRPs/specs/plan-{work_order_id}.md` file with detailed plan
|
||||
- All changes committed to git
|
||||
- Clear commit messages documenting the work
|
||||
|
||||
## Plan Structure
|
||||
|
||||
Your plan should include:
|
||||
|
||||
1. **Feature Description** - What is being built
|
||||
2. **Problem Statement** - What problem does this solve
|
||||
3. **Solution Statement** - How will we solve it
|
||||
4. **Architecture** - Technical design decisions
|
||||
5. **Implementation Plan** - Step-by-step tasks
|
||||
6. **Testing Strategy** - How to verify it works
|
||||
7. **Acceptance Criteria** - Definition of done
|
||||
|
||||
## Important Notes
|
||||
|
||||
- Always create a new branch for your work
|
||||
- Commit frequently with descriptive messages
|
||||
- Include the work order ID in branch name and commits
|
||||
- Focus on creating a comprehensive, actionable plan
|
||||
@@ -1,97 +0,0 @@
|
||||
# Bug Planning
|
||||
|
||||
Create a new plan to resolve the `Bug` using the exact specified markdown `Plan Format`. Follow the `Instructions` to create the plan and use the `Relevant Files` to focus on the right files.
|
||||
|
||||
## Variables
|
||||
issue_number: $1
|
||||
adw_id: $2
|
||||
issue_json: $3
|
||||
|
||||
## Instructions
|
||||
|
||||
- IMPORTANT: You're writing a plan to resolve a bug based on the `Bug` that will add value to the application.
|
||||
- IMPORTANT: The `Bug` describes the bug that will be resolved but remember we're not resolving the bug, we're creating the plan that will be used to resolve the bug based on the `Plan Format` below.
|
||||
- You're writing a plan to resolve a bug; it should be thorough and precise so we fix the root cause and prevent regressions.
|
||||
- Create the plan in the `specs/` directory with filename: `issue-{issue_number}-adw-{adw_id}-sdlc_planner-{descriptive-name}.md`
|
||||
- Replace `{descriptive-name}` with a short, descriptive name based on the bug (e.g., "fix-login-error", "resolve-timeout", "patch-memory-leak")
|
||||
- Use the plan format below to create the plan.
|
||||
- Research the codebase to understand the bug, reproduce it, and put together a plan to fix it.
|
||||
- IMPORTANT: Replace every <placeholder> in the `Plan Format` with the requested value. Add as much detail as needed to fix the bug.
|
||||
- Use your reasoning model: THINK HARD about the bug, its root cause, and the steps to fix it properly.
|
||||
- IMPORTANT: Be surgical with your bug fix, solve the bug at hand and don't fall off track.
|
||||
- IMPORTANT: We want the minimal number of changes that will fix and address the bug.
|
||||
- Don't use decorators. Keep it simple.
|
||||
- If you need a new library, use `uv add` and be sure to report it in the `Notes` section of the `Plan Format`.
|
||||
- IMPORTANT: If the bug affects the UI or user interactions:
|
||||
- Add a task in the `Step by Step Tasks` section to create a separate E2E test file in `.claude/commands/e2e/test_<descriptive_name>.md` based on examples in that directory
|
||||
- Add E2E test validation to your Validation Commands section
|
||||
- IMPORTANT: When you fill out the `Plan Format: Relevant Files` section, add an instruction to read `.claude/commands/test_e2e.md` and `.claude/commands/e2e/test_basic_query.md` to understand how to create an E2E test file. List your new E2E test file in the `Plan Format: New Files` section.
|
||||
- To be clear, we're not creating a new E2E test file, we're creating a task to create a new E2E test file in the `Plan Format` below
|
||||
- Respect requested files in the `Relevant Files` section.
|
||||
- Start your research by reading the `README.md` file.
|
||||
|
||||
## Relevant Files
|
||||
|
||||
Focus on the following files:
|
||||
- `README.md` - Contains the project overview and instructions.
|
||||
- `app/**` - Contains the codebase client/server.
|
||||
- `scripts/**` - Contains the scripts to start and stop the server + client.
|
||||
- `adws/**` - Contains the AI Developer Workflow (ADW) scripts.
|
||||
|
||||
Ignore all other files in the codebase.
|
||||
|
||||
## Plan Format
|
||||
|
||||
```md
|
||||
# Bug: <bug name>
|
||||
|
||||
## Bug Description
|
||||
<describe the bug in detail, including symptoms and expected vs actual behavior>
|
||||
|
||||
## Problem Statement
|
||||
<clearly define the specific problem that needs to be solved>
|
||||
|
||||
## Solution Statement
|
||||
<describe the proposed solution approach to fix the bug>
|
||||
|
||||
## Steps to Reproduce
|
||||
<list exact steps to reproduce the bug>
|
||||
|
||||
## Root Cause Analysis
|
||||
<analyze and explain the root cause of the bug>
|
||||
|
||||
## Relevant Files
|
||||
Use these files to fix the bug:
|
||||
|
||||
<find and list the files that are relevant to the bug and describe why they are relevant in bullet points. If there are new files that need to be created to fix the bug, list them in an h3 'New Files' section.>
|
||||
|
||||
## Step by Step Tasks
|
||||
IMPORTANT: Execute every step in order, top to bottom.
|
||||
|
||||
<list step by step tasks as h3 headers plus bullet points. use as many h3 headers as needed to fix the bug. Order matters, start with the foundational shared changes required to fix the bug then move on to the specific changes required to fix the bug. Include tests that will validate the bug is fixed with zero regressions.>
|
||||
|
||||
<If the bug affects UI, include a task to create an E2E test file. Your task should look like: "Read `.claude/commands/e2e/test_basic_query.md` and `.claude/commands/e2e/test_complex_query.md` and create a new E2E test file in `.claude/commands/e2e/test_<descriptive_name>.md` that validates the bug is fixed. Be specific with the steps to prove the bug is fixed. We want the minimal set of steps to validate the bug is fixed, and screenshots to prove it if possible.">
|
||||
|
||||
<Your last step should be running the `Validation Commands` to validate the bug is fixed with zero regressions.>
|
||||
|
||||
## Validation Commands
|
||||
Execute every command to validate the bug is fixed with zero regressions.
|
||||
|
||||
<list commands you'll use to validate with 100% confidence the bug is fixed with zero regressions. every command must execute without errors so be specific about what you want to run to validate the bug is fixed with zero regressions. Include commands to reproduce the bug before and after the fix.>
|
||||
|
||||
<If you created an E2E test, include the following validation step: "Read `.claude/commands/test_e2e.md`, then read and execute your new E2E `.claude/commands/e2e/test_<descriptive_name>.md` test file to validate this functionality works.">
|
||||
|
||||
- `cd app/server && uv run pytest` - Run server tests to validate the bug is fixed with zero regressions
|
||||
- `cd app/client && bun tsc --noEmit` - Run frontend type checks to validate the bug is fixed with zero regressions
|
||||
- `cd app/client && bun run build` - Run frontend build to validate the bug is fixed with zero regressions
|
||||
|
||||
## Notes
|
||||
<optionally list any additional notes or context that are relevant to the bug that will be helpful to the developer>
|
||||
```
|
||||
|
||||
## Bug
|
||||
Extract the bug details from the `issue_json` variable (parse the JSON and use the title and body fields).
|
||||
|
||||
## Report
|
||||
- Summarize the work you've just done in a concise bullet point list.
|
||||
- Include the full path to the plan file you created (e.g., `specs/issue-123-adw-abc123-sdlc_planner-fix-login-error.md`)
|
||||
@@ -1,69 +0,0 @@
|
||||
# Chore Planning
|
||||
|
||||
Create a new plan to resolve the `Chore` using the exact specified markdown `Plan Format`. Follow the `Instructions` to create the plan and use the `Relevant Files` to focus on the right files. Follow the `Report` section to properly report the results of your work.
|
||||
|
||||
## Variables
|
||||
issue_number: $1
|
||||
adw_id: $2
|
||||
issue_json: $3
|
||||
|
||||
## Instructions
|
||||
|
||||
- IMPORTANT: You're writing a plan to resolve a chore based on the `Chore` that will add value to the application.
|
||||
- IMPORTANT: The `Chore` describes the chore that will be resolved but remember we're not resolving the chore, we're creating the plan that will be used to resolve the chore based on the `Plan Format` below.
|
||||
- You're writing a plan to resolve a chore; it should be simple, but we need to be thorough and precise so we don't miss anything or waste time with a second round of changes.
|
||||
- Create the plan in the `specs/` directory with filename: `issue-{issue_number}-adw-{adw_id}-sdlc_planner-{descriptive-name}.md`
|
||||
- Replace `{descriptive-name}` with a short, descriptive name based on the chore (e.g., "update-readme", "fix-tests", "refactor-auth")
|
||||
- Use the plan format below to create the plan.
|
||||
- Research the codebase and put together a plan to accomplish the chore.
|
||||
- IMPORTANT: Replace every <placeholder> in the `Plan Format` with the requested value. Add as much detail as needed to accomplish the chore.
|
||||
- Use your reasoning model: THINK HARD about the plan and the steps to accomplish the chore.
|
||||
- Respect requested files in the `Relevant Files` section.
|
||||
- Start your research by reading the `README.md` file.
|
||||
- `adws/*.py` contain astral uv single file python scripts. So if you want to run them use `uv run <script_name>`.
|
||||
- When you finish creating the plan for the chore, follow the `Report` section to properly report the results of your work.
|
||||
|
||||
## Relevant Files
|
||||
|
||||
Focus on the following files:
|
||||
- `README.md` - Contains the project overview and instructions.
|
||||
- `app/**` - Contains the codebase client/server.
|
||||
- `scripts/**` - Contains the scripts to start and stop the server + client.
|
||||
- `adws/**` - Contains the AI Developer Workflow (ADW) scripts.
|
||||
|
||||
Ignore all other files in the codebase.
|
||||
|
||||
## Plan Format
|
||||
|
||||
```md
|
||||
# Chore: <chore name>
|
||||
|
||||
## Chore Description
|
||||
<describe the chore in detail>
|
||||
|
||||
## Relevant Files
|
||||
Use these files to resolve the chore:
|
||||
|
||||
<find and list the files that are relevant to the chore describe why they are relevant in bullet points. If there are new files that need to be created to accomplish the chore, list them in an h3 'New Files' section.>
|
||||
|
||||
## Step by Step Tasks
|
||||
IMPORTANT: Execute every step in order, top to bottom.
|
||||
|
||||
<list step by step tasks as h3 headers plus bullet points. use as many h3 headers as needed to accomplish the chore. Order matters, start with the foundational shared changes required to fix the chore then move on to the specific changes required to fix the chore. Your last step should be running the `Validation Commands` to validate the chore is complete with zero regressions.>
|
||||
|
||||
## Validation Commands
|
||||
Execute every command to validate the chore is complete with zero regressions.
|
||||
|
||||
<list commands you'll use to validate with 100% confidence the chore is complete with zero regressions. every command must execute without errors so be specific about what you want to run to validate the chore is complete with zero regressions. Don't validate with curl commands.>
|
||||
- `cd app/server && uv run pytest` - Run server tests to validate the chore is complete with zero regressions
|
||||
|
||||
## Notes
|
||||
<optionally list any additional notes or context that are relevant to the chore that will be helpful to the developer>
|
||||
```
|
||||
|
||||
## Chore
|
||||
Extract the chore details from the `issue_json` variable (parse the JSON and use the title and body fields).
|
||||
|
||||
## Report
|
||||
- Summarize the work you've just done in a concise bullet point list.
|
||||
- Include the full path to the plan file you created (e.g., `specs/issue-7-adw-abc123-sdlc_planner-update-readme.md`)
|
||||
@@ -1,39 +0,0 @@
|
||||
# ADW Workflow Extraction
|
||||
|
||||
Extract ADW workflow information from the text below and return a JSON response.
|
||||
|
||||
## Instructions
|
||||
|
||||
- Look for ADW workflow commands in the text (e.g., `/adw_plan`, `/adw_test`, `/adw_build`, `/adw_plan_build`, `/adw_plan_build_test`)
|
||||
- Look for ADW IDs (8-character alphanumeric strings, often after "adw_id:" or "ADW ID:" or similar)
|
||||
- Return a JSON object with the extracted information
|
||||
- If no ADW workflow is found, return empty JSON: `{}`
|
||||
|
||||
## Valid ADW Commands
|
||||
|
||||
- `/adw_plan` - Planning only
|
||||
- `/adw_build` - Building only (requires adw_id)
|
||||
- `/adw_test` - Testing only
|
||||
- `/adw_plan_build` - Plan + Build
|
||||
- `/adw_plan_build_test` - Plan + Build + Test
|
||||
|
||||
## Response Format
|
||||
|
||||
Respond ONLY with a JSON object in this format:
|
||||
```json
|
||||
{
|
||||
"adw_slash_command": "/adw_plan",
|
||||
"adw_id": "abc12345"
|
||||
}
|
||||
```
|
||||
|
||||
Fields:
|
||||
- `adw_slash_command`: The ADW command found (include the slash)
|
||||
- `adw_id`: The 8-character ADW ID if found
|
||||
|
||||
If only one field is found, include only that field.
|
||||
If nothing is found, return: `{}`
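As an illustration of the extraction rules above, here is a minimal regex-based sketch. The command list and the 8-character ID format come from this file; the function itself is a hypothetical helper and not part of the ADW scripts.

```python
# Minimal sketch of the extraction described above; not part of the ADW codebase.
import json
import re

# Longest commands first so "/adw_plan_build" matches before "/adw_plan".
VALID_COMMANDS = ["/adw_plan_build_test", "/adw_plan_build", "/adw_plan", "/adw_build", "/adw_test"]


def extract_adw_info(text: str) -> str:
    """Return the JSON response described above: only the fields found, or {}."""
    result: dict[str, str] = {}
    for command in VALID_COMMANDS:
        if command in text:
            result["adw_slash_command"] = command
            break
    id_match = re.search(r"(?:adw[_ ]?id:?\s*)([a-z0-9]{8})\b", text, re.IGNORECASE)
    if id_match:
        result["adw_id"] = id_match.group(1)
    return json.dumps(result)
```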
|
||||
|
||||
## Text to Analyze
|
||||
|
||||
$ARGUMENTS
|
||||
@@ -1,21 +0,0 @@
|
||||
# Github Issue Command Selection
|
||||
|
||||
Based on the `Github Issue` below, follow the `Instructions` to select the appropriate command to execute based on the `Command Mapping`.
|
||||
|
||||
## Instructions
|
||||
|
||||
- Based on the details in the `Github Issue`, select the appropriate command to execute.
|
||||
- IMPORTANT: Respond exclusively with '/' followed by the command to execute based on the `Command Mapping` below.
|
||||
- Use the command mapping to help you decide which command to respond with.
|
||||
- Don't examine the codebase just focus on the `Github Issue` and the `Command Mapping` below to determine the appropriate command to execute.
|
||||
|
||||
## Command Mapping
|
||||
|
||||
- Respond with `/chore` if the issue is a chore.
|
||||
- Respond with `/bug` if the issue is a bug.
|
||||
- Respond with `/feature` if the issue is a feature.
|
||||
- Respond with `0` if the issue isn't any of the above.
|
||||
|
||||
## Github Issue
|
||||
|
||||
$ARGUMENTS
|
||||
@@ -1,33 +1,55 @@
|
||||
# Generate Git Commit
|
||||
# Create Git Commit
|
||||
|
||||
Based on the `Instructions` below, take the `Variables` follow the `Run` section to create a git commit with a properly formatted message. Then follow the `Report` section to report the results of your work.
|
||||
Create an atomic git commit with a properly formatted commit message following best practices, for the uncommitted changes or for the specific files listed below (if specified).
|
||||
|
||||
## Variables
|
||||
Specific files (skip if not specified):
|
||||
|
||||
agent_name: $1
|
||||
issue_class: $2
|
||||
issue: $3
|
||||
- File 1: $1
|
||||
- File 2: $2
|
||||
- File 3: $3
|
||||
- File 4: $4
|
||||
- File 5: $5
|
||||
|
||||
## Instructions
|
||||
|
||||
- Generate a concise commit message in the format: `<agent_name>: <issue_class>: <commit message>`
|
||||
- The `<commit message>` should be:
|
||||
- Present tense (e.g., "add", "fix", "update", not "added", "fixed", "updated")
|
||||
- 50 characters or less
|
||||
- Descriptive of the actual changes made
|
||||
- No period at the end
|
||||
- Examples:
|
||||
- `sdlc_planner: feat: add user authentication module`
|
||||
- `sdlc_implementor: bug: fix login validation error`
|
||||
- `sdlc_planner: chore: update dependencies to latest versions`
|
||||
- Extract context from the issue JSON to make the commit message relevant
|
||||
**Commit Message Format:**
|
||||
|
||||
- Use conventional commits: `<type>: <description>`
|
||||
- Types: `feat`, `fix`, `docs`, `style`, `refactor`, `test`, `chore`
|
||||
- Present tense (e.g., "add", "fix", "update", not "added", "fixed", "updated")
|
||||
- 50 characters or less for the subject line
|
||||
- Lowercase subject line
|
||||
- No period at the end
|
||||
- Be specific and descriptive
|
||||
|
||||
**Examples:**
|
||||
|
||||
- `feat: add web search tool with structured logging`
|
||||
- `fix: resolve type errors in middleware`
|
||||
- `test: add unit tests for config module`
|
||||
- `docs: update CLAUDE.md with testing guidelines`
|
||||
- `refactor: simplify logging configuration`
|
||||
- `chore: update dependencies`
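The mechanical parts of the format rules above (allowed type, subject length, case, trailing period) can be checked automatically. The helper below is a hypothetical sketch for illustration only; it is not part of this command or the repository.

```python
# Hypothetical checker for the subject-line rules above; not part of this repo.
ALLOWED_TYPES = {"feat", "fix", "docs", "style", "refactor", "test", "chore"}


def check_commit_subject(subject: str) -> list[str]:
    """Return a list of rule violations for a conventional-commit subject line."""
    problems: list[str] = []
    type_part, sep, description = subject.partition(": ")
    if not sep or type_part not in ALLOWED_TYPES:
        problems.append("subject must start with '<type>: ' using an allowed type")
    if len(subject) > 50:
        problems.append("subject must be 50 characters or less")
    if description and description[0].isupper():
        problems.append("description should be lowercase")
    if subject.endswith("."):
        problems.append("no period at the end")
    return problems


assert check_commit_subject("feat: add web search tool") == []
assert "no period at the end" in check_commit_subject("fix: resolve type errors.")
```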
|
||||
|
||||
**Atomic Commits:**
|
||||
|
||||
- One logical change per commit
|
||||
- If you've made multiple unrelated changes, consider splitting into separate commits
|
||||
- Commit should be self-contained and not break the build
|
||||
|
||||
**IMPORTANT**
|
||||
|
||||
- NEVER mention Claude Code, Anthropic, Co-Authored-By, or anything similar in the commit messages
|
||||
|
||||
## Run
|
||||
|
||||
1. Run `git diff HEAD` to understand what changes have been made
|
||||
2. Run `git add -A` to stage all changes
|
||||
3. Run `git commit -m "<generated_commit_message>"` to create the commit
|
||||
1. Review changes: `git diff HEAD`
|
||||
2. Check status: `git status`
|
||||
3. Stage changes: `git add -A`
|
||||
4. Create commit: `git commit -m "<type>: <description>"`
|
||||
|
||||
## Report
|
||||
|
||||
Return ONLY the commit message that was used (no other text)
|
||||
- Output the commit message used
|
||||
- Confirm commit was successful with commit hash
|
||||
- List files that were committed
|
||||
|
||||
@@ -1,38 +0,0 @@
|
||||
# E2E Test: Basic Query Execution
|
||||
|
||||
Test basic query functionality in the Natural Language SQL Interface application.
|
||||
|
||||
## User Story
|
||||
|
||||
As a user
|
||||
I want to query my data using natural language
|
||||
So that I can access information without writing SQL
|
||||
|
||||
## Test Steps
|
||||
|
||||
1. Navigate to the `Application URL`
|
||||
2. Take a screenshot of the initial state
|
||||
3. **Verify** the page title is "Natural Language SQL Interface"
|
||||
4. **Verify** core UI elements are present:
|
||||
- Query input textbox
|
||||
- Query button
|
||||
- Upload Data button
|
||||
- Available Tables section
|
||||
|
||||
5. Enter the query: "Show me all users from the users table"
|
||||
6. Take a screenshot of the query input
|
||||
7. Click the Query button
|
||||
8. **Verify** the query results appear
|
||||
9. **Verify** the SQL translation is displayed (should contain "SELECT * FROM users")
|
||||
10. Take a screenshot of the SQL translation
|
||||
11. **Verify** the results table contains data
|
||||
12. Take a screenshot of the results
|
||||
13. Click "Hide" button to close results
|
||||
|
||||
## Success Criteria
|
||||
- Query input accepts text
|
||||
- Query button triggers execution
|
||||
- Results display correctly
|
||||
- SQL translation is shown
|
||||
- Hide button works
|
||||
- 3 screenshots are taken
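For readers who want to see what one automated pass over the steps above could look like, here is a rough sketch assuming a Playwright-style sync API. The real tests are executed by the agent following `.claude/commands/test_e2e.md` (not shown here), and the URL, selectors, and screenshot paths below are assumptions, not the application's actual markup.

```python
# Rough illustration only; the actual E2E flow is driven by test_e2e.md.
# URL, selectors, and screenshot paths are assumptions for this sketch.
from playwright.sync_api import sync_playwright

APP_URL = "http://localhost:5173"  # assumed Application URL

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(APP_URL)
    page.screenshot(path="01_initial_state.png")
    assert page.title() == "Natural Language SQL Interface"

    query_box = page.get_by_role("textbox")            # assumed query input
    query_box.fill("Show me all users from the users table")
    page.screenshot(path="02_query_input.png")
    page.get_by_role("button", name="Query").click()   # assumed button label

    page.get_by_text("SELECT * FROM users").wait_for() # SQL translation check
    page.screenshot(path="03_results.png")
    page.get_by_role("button", name="Hide").click()
    browser.close()
```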
|
||||
@@ -1,33 +0,0 @@
|
||||
# E2E Test: Complex Query with Filtering
|
||||
|
||||
Test complex query capabilities with filtering conditions.
|
||||
|
||||
## User Story
|
||||
|
||||
As a user
|
||||
I want to query data using natural language with complex filtering conditions
|
||||
So that I can retrieve specific subsets of data without needing to write SQL
|
||||
|
||||
## Test Steps
|
||||
|
||||
1. Navigate to the `Application URL`
|
||||
2. Take a screenshot of the initial state
|
||||
3. Clear the query input
|
||||
4. Enter: "Show users older than 30 who live in cities starting with 'S'"
|
||||
5. Take a screenshot of the query input
|
||||
6. Click Query button
|
||||
7. **Verify** results appear with filtered data
|
||||
8. **Verify** the generated SQL contains WHERE clause
|
||||
9. Take a screenshot of the SQL translation
|
||||
10. Count the number of results returned
|
||||
11. Take a screenshot of the filtered results
|
||||
12. Click "Hide" button to close results
|
||||
13. Take a screenshot of the final state
|
||||
|
||||
## Success Criteria
|
||||
- Complex natural language is correctly interpreted
|
||||
- SQL contains appropriate WHERE conditions
|
||||
- Results are properly filtered
|
||||
- No errors occur during execution
|
||||
- Hide button works
|
||||
- 5 screenshots are taken
|
||||
@@ -1,30 +0,0 @@
|
||||
# E2E Test: SQL Injection Protection
|
||||
|
||||
Test the application's protection against SQL injection attacks.
|
||||
|
||||
## User Story
|
||||
|
||||
As a user
|
||||
I want to be protected from SQL injection attacks when using the query interface
|
||||
So that my data remains secure and the database integrity is maintained
|
||||
|
||||
## Test Steps
|
||||
|
||||
1. Navigate to the `Application URL`
|
||||
2. Take a screenshot of the initial state
|
||||
3. Clear the query input
|
||||
4. Enter: "DROP TABLE users;"
|
||||
5. Take a screenshot of the malicious query input
|
||||
6. Click Query button
|
||||
7. **Verify** an error message appears containing "Security error" or similar
|
||||
8. Take a screenshot of the security error
|
||||
9. **Verify** the users table still exists in Available Tables section
|
||||
10. Take a screenshot showing the tables are intact
|
||||
|
||||
## Success Criteria
|
||||
- SQL injection attempt is blocked
|
||||
- Appropriate security error message is displayed
|
||||
- No damage to the database
|
||||
- Tables remain intact
|
||||
- Query input accepts the malicious text
|
||||
- 4 screenshots are taken
|
||||
.claude/commands/agent-work-orders/execute.md (Normal file, 27 lines)
@@ -0,0 +1,27 @@
|
||||
# Execute PRP Plan
|
||||
|
||||
Implement a feature plan from the PRPs directory by following its Step by Step Tasks section.
|
||||
|
||||
## Variables
|
||||
|
||||
Plan file: $ARGUMENTS
|
||||
|
||||
## Instructions
|
||||
|
||||
- Read the entire plan file carefully
|
||||
- Execute **every step** in the "Step by Step Tasks" section in order, top to bottom
|
||||
- Follow the "Testing Strategy" to create proper unit and integration tests
|
||||
- Complete all "Validation Commands" at the end
|
||||
- Ensure all linters pass and all tests pass before finishing
|
||||
- Follow CLAUDE.md guidelines for type safety, logging, and docstrings
|
||||
|
||||
## When done
|
||||
|
||||
- Move the PRP file to the completed directory in PRPs/features/completed
|
||||
|
||||
## Report
|
||||
|
||||
- Summarize completed work in a concise bullet point list
|
||||
- Show files and lines changed: `git diff --stat`
|
||||
- Confirm all validation commands passed
|
||||
- Note any deviations from the plan (if any)
|
||||
@@ -1,120 +0,0 @@
|
||||
# Feature Planning
|
||||
|
||||
Create a new plan in PRPs/specs/\*.md to implement the `Feature` using the exact specified markdown `Plan Format`. Follow the `Instructions` to create the plan and use the `Relevant Files` to focus on the right files.
|
||||
|
||||
## Instructions
|
||||
|
||||
- IMPORTANT: You're writing a plan to implement a net new feature based on the `Feature` that will add value to the application.
|
||||
- IMPORTANT: The `Feature` describes the feature that will be implemented but remember we're not implementing a new feature, we're creating the plan that will be used to implement the feature based on the `Plan Format` below.
|
||||
- Create the plan in the `PRPs/specs/*.md` file. Name it appropriately based on the `Feature`.
|
||||
- Use the `Plan Format` below to create the plan.
|
||||
- Research the codebase to understand existing patterns, architecture, and conventions before planning the feature.
|
||||
- IMPORTANT: Replace every <placeholder> in the `Plan Format` with the requested value. Add as much detail as needed to implement the feature successfully.
|
||||
- Use your reasoning model: THINK HARD about the feature requirements, design, and implementation approach.
|
||||
- Follow existing patterns and conventions in the codebase. Don't reinvent the wheel.
|
||||
- Design for extensibility and maintainability.
|
||||
- If you need a new library, use `uv add` and be sure to report it in the `Notes` section of the `Plan Format`.
|
||||
- Respect requested files in the `Relevant Files` section.
|
||||
- Start your research by reading the `README.md` file.
|
||||
- ultrathink about the research before you create the plan.
|
||||
|
||||
## Relevant Files
|
||||
|
||||
Focus on the following files:
|
||||
|
||||
- `README.md` - Contains the project overview and instructions.
|
||||
- `app/server/**` - Contains the codebase server.
|
||||
- `app/client/**` - Contains the codebase client.
|
||||
- `scripts/**` - Contains the scripts to start and stop the server + client.
|
||||
- `adws/**` - Contains the AI Developer Workflow (ADW) scripts.
|
||||
|
||||
Ignore all other files in the codebase.
|
||||
|
||||
## Plan Format
|
||||
|
||||
```md
|
||||
# Feature: <feature name>
|
||||
|
||||
## Feature Description
|
||||
|
||||
<describe the feature in detail, including its purpose and value to users>
|
||||
|
||||
## User Story
|
||||
|
||||
As a <type of user>
|
||||
I want to <action/goal>
|
||||
So that <benefit/value>
|
||||
|
||||
## Problem Statement
|
||||
|
||||
<clearly define the specific problem or opportunity this feature addresses>
|
||||
|
||||
## Solution Statement
|
||||
|
||||
<describe the proposed solution approach and how it solves the problem>
|
||||
|
||||
## Relevant Files
|
||||
|
||||
Use these files to implement the feature:
|
||||
|
||||
<find and list the files that are relevant to the feature describe why they are relevant in bullet points. If there are new files that need to be created to implement the feature, list them in an h3 'New Files' section.>
|
||||
|
||||
## Implementation Plan
|
||||
|
||||
### Phase 1: Foundation
|
||||
|
||||
<describe the foundational work needed before implementing the main feature>
|
||||
|
||||
### Phase 2: Core Implementation
|
||||
|
||||
<describe the main implementation work for the feature>
|
||||
|
||||
### Phase 3: Integration
|
||||
|
||||
<describe how the feature will integrate with existing functionality>
|
||||
|
||||
## Step by Step Tasks
|
||||
|
||||
IMPORTANT: Execute every step in order, top to bottom.
|
||||
|
||||
<list step by step tasks as h3 headers plus bullet points. use as many h3 headers as needed to implement the feature. Order matters, start with the foundational shared changes required then move on to the specific implementation. Include creating tests throughout the implementation process. Your last step should be running the `Validation Commands` to validate the feature works correctly with zero regressions.>
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
### Unit Tests
|
||||
|
||||
<describe unit tests needed for the feature>
|
||||
|
||||
### Integration Tests
|
||||
|
||||
<describe integration tests needed for the feature>
|
||||
|
||||
### Edge Cases
|
||||
|
||||
<list edge cases that need to be tested>
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
<list specific, measurable criteria that must be met for the feature to be considered complete>
|
||||
|
||||
## Validation Commands
|
||||
|
||||
Execute every command to validate the feature works correctly with zero regressions.
|
||||
|
||||
<list commands you'll use to validate with 100% confidence the feature is implemented correctly with zero regressions. every command must execute without errors so be specific about what you want to run to validate the feature works as expected. Include commands to test the feature end-to-end.>
|
||||
|
||||
- `cd app/server && uv run pytest` - Run server tests to validate the feature works with zero regressions
|
||||
|
||||
## Notes
|
||||
|
||||
<optionally list any additional notes, future considerations, or context that are relevant to the feature that will be helpful to the developer>
|
||||
```
|
||||
|
||||
## Feature
|
||||
|
||||
$ARGUMENTS
|
||||
|
||||
## Report
|
||||
|
||||
- Summarize the work you've just done in a concise bullet point list.
|
||||
- Include a path to the plan you created in the `PRPs/specs/*.md` file.
|
||||
@@ -1,24 +0,0 @@
|
||||
# Find Plan File
|
||||
|
||||
Based on the variables and `Previous Step Output` below, follow the `Instructions` to find the path to the plan file that was just created.
|
||||
|
||||
## Variables
|
||||
issue_number: $1
|
||||
adw_id: $2
|
||||
previous_output: $3
|
||||
|
||||
## Instructions
|
||||
|
||||
- The previous step created a plan file. Find the exact file path.
|
||||
- The plan filename follows the pattern: `issue-{issue_number}-adw-{adw_id}-sdlc_planner-{descriptive-name}.md`
|
||||
- You can use these approaches to find it:
|
||||
- First, try: `ls specs/issue-{issue_number}-adw-{adw_id}-sdlc_planner-*.md`
|
||||
- Check git status for new untracked files matching the pattern
|
||||
- Use `find specs -name "issue-{issue_number}-adw-{adw_id}-sdlc_planner-*.md" -type f`
|
||||
- Parse the previous output which should mention where the plan was saved
|
||||
- Return ONLY the file path (e.g., "specs/issue-7-adw-abc123-sdlc_planner-update-readme.md") or "0" if not found.
|
||||
- Do not include any explanation, just the path or "0" if not found.
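A minimal sketch of the lookup described above, using only the standard library; the helper name is hypothetical, and in practice this step is performed by the agent rather than a script.

```python
# Hypothetical sketch of the plan-file lookup described above.
import glob


def find_plan_file(issue_number: str, adw_id: str) -> str:
    """Return the plan file path, or "0" if no matching file exists."""
    pattern = f"specs/issue-{issue_number}-adw-{adw_id}-sdlc_planner-*.md"
    matches = sorted(glob.glob(pattern))
    return matches[0] if matches else "0"


# Example: find_plan_file("7", "abc123")
# -> "specs/issue-7-adw-abc123-sdlc_planner-update-readme.md" if present, else "0"
```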
|
||||
|
||||
## Previous Step Output
|
||||
|
||||
Use the `previous_output` variable content to help locate the file if it mentions the path.
|
||||
@@ -1,36 +0,0 @@
|
||||
# Generate Git Branch Name
|
||||
|
||||
Based on the `Instructions` below, take the `Variables` follow the `Run` section to generate a concise Git branch name following the specified format. Then follow the `Report` section to report the results of your work.
|
||||
|
||||
## Variables
|
||||
|
||||
issue_class: $1
|
||||
adw_id: $2
|
||||
issue: $3
|
||||
|
||||
## Instructions
|
||||
|
||||
- Generate a branch name in the format: `<issue_class>-issue-<issue_number>-adw-<adw_id>-<concise_name>`
|
||||
- The `<concise_name>` should be:
|
||||
- 3-6 words maximum
|
||||
- All lowercase
|
||||
- Words separated by hyphens
|
||||
- Descriptive of the main task/feature
|
||||
- No special characters except hyphens
|
||||
- Examples:
|
||||
- `feat-issue-123-adw-a1b2c3d4-add-user-auth`
|
||||
- `bug-issue-456-adw-e5f6g7h8-fix-login-error`
|
||||
- `chore-issue-789-adw-i9j0k1l2-update-dependencies`
|
||||
- `test-issue-323-adw-m3n4o5p6-fix-failing-tests`
|
||||
- Extract the issue number, title, and body from the issue JSON
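The naming rules above can be expressed as a small slug builder. The sketch below is illustrative only (the real branch name is generated by the agent), and the helper name is made up.

```python
# Illustrative sketch of the branch-name format above; not part of the ADW scripts.
import re


def build_branch_name(issue_class: str, issue_number: int, adw_id: str, concise_name: str) -> str:
    """Build `<issue_class>-issue-<issue_number>-adw-<adw_id>-<concise_name>`."""
    words = re.sub(r"[^a-z0-9\s-]", "", concise_name.lower()).split()
    slug = "-".join(words[:6])  # 3-6 words, lowercase, hyphen-separated
    return f"{issue_class}-issue-{issue_number}-adw-{adw_id}-{slug}"


assert (
    build_branch_name("feat", 123, "a1b2c3d4", "Add user auth")
    == "feat-issue-123-adw-a1b2c3d4-add-user-auth"
)
```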
|
||||
|
||||
## Run
|
||||
|
||||
Run `git checkout main` to switch to the main branch
|
||||
Run `git pull` to pull the latest changes from the main branch
|
||||
Run `git checkout -b <branch_name>` to create and switch to the new branch
|
||||
|
||||
## Report
|
||||
|
||||
After generating the branch name:
|
||||
Return ONLY the branch name that was created (no other text)
|
||||
@@ -1,16 +0,0 @@
|
||||
# Implement the following plan
|
||||
|
||||
Follow the `Instructions` to implement the `Plan` then `Report` the completed work.
|
||||
|
||||
## Instructions
|
||||
|
||||
- Read the plan, ultrathink about the plan and implement the plan.
|
||||
|
||||
## Plan
|
||||
|
||||
$ARGUMENTS
|
||||
|
||||
## Report
|
||||
|
||||
- Summarize the work you've just done in a concise bullet point list.
|
||||
- Report the files and total lines changed with `git diff --stat`
|
||||
.claude/commands/agent-work-orders/noqa.md (Normal file, 176 lines)
@@ -0,0 +1,176 @@
|
||||
# NOQA Analysis and Resolution
|
||||
|
||||
Find all noqa/type:ignore comments in the codebase, investigate why they exist, and provide recommendations for resolution or justification.
|
||||
|
||||
## Instructions
|
||||
|
||||
**Step 1: Find all NOQA comments**
|
||||
|
||||
- Use Grep tool to find all noqa comments: pattern `noqa|type:\s*ignore`
|
||||
- Use output_mode "content" with line numbers (-n flag)
|
||||
- Search across all Python files (type: "py")
|
||||
- Document total count of noqa comments found
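Step 1 can also be reproduced locally with a short script. The sketch below uses the same pattern (`noqa|type:\s*ignore`) and is only an illustration of the search step, not the analysis itself; the root directory is an assumption.

```python
# Sketch of Step 1: count noqa / type: ignore suppressions in Python files.
import re
from pathlib import Path

SUPPRESSION = re.compile(r"noqa|type:\s*ignore")


def find_suppressions(root: str = "src") -> list[tuple[str, int, str]]:
    """Return (file, line number, line text) for every suppression comment."""
    hits: list[tuple[str, int, str]] = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if SUPPRESSION.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


hits = find_suppressions()
print(f"Total NOQA comments found: {len(hits)}")
```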
|
||||
|
||||
**Step 2: For EACH noqa comment (repeat this process):**
|
||||
|
||||
- Read the file containing the noqa comment with sufficient context (at least 10 lines before and after)
|
||||
- Identify the specific linting rule or type error being suppressed
|
||||
- Understand the code's purpose and why the suppression was added
|
||||
- Investigate if the suppression is still necessary or can be resolved
|
||||
|
||||
**Step 3: Investigation checklist for each noqa:**
|
||||
|
||||
- What specific error/warning is being suppressed? (e.g., `type: ignore[arg-type]`, `noqa: F401`)
|
||||
- Why was the suppression necessary? (legacy code, false positive, legitimate limitation, technical debt)
|
||||
- Can the underlying issue be fixed? (refactor code, update types, improve imports)
|
||||
- What would it take to remove the suppression? (effort estimate, breaking changes, architectural changes)
|
||||
- Is the suppression justified long-term? (external library limitation, Python limitation, intentional design)
|
||||
|
||||
**Step 4: Research solutions:**
|
||||
|
||||
- Check if newer versions of tools (mypy, ruff) handle the case better
|
||||
- Look for alternative code patterns that avoid the suppression
|
||||
- Consider if type stubs or Protocol definitions could help
|
||||
- Evaluate if refactoring would be worthwhile
|
||||
|
||||
## Report Format
|
||||
|
||||
Create a markdown report file (create the reports directory if not created yet): `PRPs/reports/noqa-analysis-{YYYY-MM-DD}.md`
|
||||
|
||||
Use this structure for the report:
|
||||
|
||||
````markdown
|
||||
# NOQA Analysis Report
|
||||
|
||||
**Generated:** {date}
|
||||
**Total NOQA comments found:** {count}
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
- Total suppressions: {count}
|
||||
- Can be removed: {count}
|
||||
- Should remain: {count}
|
||||
- Requires investigation: {count}
|
||||
|
||||
---
|
||||
|
||||
## Detailed Analysis
|
||||
|
||||
### 1. {File path}:{line number}
|
||||
|
||||
**Location:** `{file_path}:{line_number}`
|
||||
|
||||
**Suppression:** `{noqa comment or type: ignore}`
|
||||
|
||||
**Code context:**
|
||||
|
||||
```python
|
||||
{relevant code snippet}
|
||||
```
|
||||
````
|
||||
|
||||
**Why it exists:**
|
||||
{explanation of why the suppression was added}
|
||||
|
||||
**Options to resolve:**
|
||||
|
||||
1. {Option 1: description}
|
||||
- Effort: {Low/Medium/High}
|
||||
- Breaking: {Yes/No}
|
||||
- Impact: {description}
|
||||
|
||||
2. {Option 2: description}
|
||||
- Effort: {Low/Medium/High}
|
||||
- Breaking: {Yes/No}
|
||||
- Impact: {description}
|
||||
|
||||
**Tradeoffs:**
|
||||
|
||||
- {Tradeoff 1}
|
||||
- {Tradeoff 2}
|
||||
|
||||
**Recommendation:** {Remove | Keep | Refactor}
|
||||
{Justification for recommendation}
|
||||
|
||||
---
|
||||
|
||||
{Repeat for each noqa comment}
|
||||
|
||||
````
|
||||
|
||||
## Example Analysis Entry
|
||||
|
||||
```markdown
|
||||
### 1. src/shared/config.py:45
|
||||
|
||||
**Location:** `src/shared/config.py:45`
|
||||
|
||||
**Suppression:** `# type: ignore[assignment]`
|
||||
|
||||
**Code context:**
|
||||
```python
|
||||
@property
|
||||
def openai_api_key(self) -> str:
|
||||
key = os.getenv("OPENAI_API_KEY")
|
||||
if not key:
|
||||
raise ValueError("OPENAI_API_KEY not set")
|
||||
return key # type: ignore[assignment]
|
||||
````
|
||||
|
||||
**Why it exists:**
|
||||
MyPy cannot infer that the ValueError prevents None from being returned, so it thinks the return type could be `str | None`.
|
||||
|
||||
**Options to resolve:**
|
||||
|
||||
1. Use assert to help mypy narrow the type
|
||||
- Effort: Low
|
||||
- Breaking: No
|
||||
- Impact: Cleaner code, removes suppression
|
||||
|
||||
2. Add explicit cast with typing.cast()
|
||||
- Effort: Low
|
||||
- Breaking: No
|
||||
- Impact: More verbose but type-safe
|
||||
|
||||
3. Refactor to use separate validation method
|
||||
- Effort: Medium
|
||||
- Breaking: No
|
||||
- Impact: Better separation of concerns
|
||||
|
||||
**Tradeoffs:**
|
||||
|
||||
- Option 1 (assert) is cleanest but asserts can be disabled with -O flag
|
||||
- Option 2 (cast) is most explicit but adds import and verbosity
|
||||
- Option 3 is most robust but requires more refactoring
|
||||
|
||||
**Recommendation:** Remove (use Option 1)
|
||||
Replace the type:ignore with an assert statement after the if check. This helps mypy understand the control flow while maintaining runtime safety. The assert will never fail in practice since the ValueError is raised first.
|
||||
|
||||
**Implementation:**
|
||||
|
||||
```python
|
||||
@property
|
||||
def openai_api_key(self) -> str:
|
||||
key = os.getenv("OPENAI_API_KEY")
|
||||
if not key:
|
||||
raise ValueError("OPENAI_API_KEY not set")
|
||||
assert key is not None # Help mypy understand control flow
|
||||
return key
|
||||
```
|
||||
|
||||
```
|
||||
|
||||
## Report
|
||||
|
||||
After completing the analysis:
|
||||
|
||||
- Output the path to the generated report file
|
||||
- Summarize findings:
|
||||
- Total suppressions found
|
||||
- How many can be removed immediately (low effort)
|
||||
- How many should remain (justified)
|
||||
- How many need deeper investigation or refactoring
|
||||
- Highlight any quick wins (suppressions that can be removed with minimal effort)
|
||||
```
|
||||
.claude/commands/agent-work-orders/planning.md (Normal file, 176 lines)
@@ -0,0 +1,176 @@
|
||||
# Feature Planning
|
||||
|
||||
Create a new plan to implement the `PRP` using the exact specified markdown `PRP Format`. Follow the `Instructions` to create the plan and use the `Relevant Files` to focus on the right files.
|
||||
|
||||
## Variables
|
||||
|
||||
FEATURE $1 $2
|
||||
|
||||
## Instructions
|
||||
|
||||
- IMPORTANT: You're writing a plan to implement a net new feature based on the `Feature` that will add value to the application.
|
||||
- IMPORTANT: The `Feature` describes the feature that will be implemented but remember we're not implementing a new feature, we're creating the plan that will be used to implement the feature based on the `PRP Format` below.
|
||||
- Create the plan in the `PRPs/features/` directory with filename: `{descriptive-name}.md`
|
||||
- Replace `{descriptive-name}` with a short, descriptive name based on the feature (e.g., "add-auth-system", "implement-search", "create-dashboard")
|
||||
- Use the `PRP Format` below to create the plan.
|
||||
- Deeply research the codebase to understand existing patterns, architecture, and conventions before planning the feature.
|
||||
- If no patterns are established or they are unclear, ask the user for clarification while providing the best recommendations and options
|
||||
- IMPORTANT: Replace every <placeholder> in the `PRP Format` with the requested value. Add as much detail as needed to implement the feature successfully.
|
||||
- Use your reasoning model: THINK HARD about the feature requirements, design, and implementation approach.
|
||||
- Follow existing patterns and conventions in the codebase. Don't reinvent the wheel.
|
||||
- Design for extensibility and maintainability.
|
||||
- Deeply do web research to understand the latest trends and technologies in the field.
|
||||
- Figure out latest best practices and library documentation.
|
||||
- Include links to relevant resources and documentation with anchor tags for easy navigation.
|
||||
- If you need a new library, use `uv add <package>` and report it in the `Notes` section.
|
||||
- Read `CLAUDE.md` for project principles, logging rules, testing requirements, and docstring style.
|
||||
- All code MUST have type annotations (strict mypy enforcement).
|
||||
- Use Google-style docstrings for all functions, classes, and modules.
|
||||
- Every new file in `src/` MUST have a corresponding test file in `tests/`.
|
||||
- Respect requested files in the `Relevant Files` section.
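To show what the requirements above look like together (type annotations, a Google-style docstring, structured logging, and a mirrored test), here is a small hypothetical example. The module path, the structlog-based logger, and the function itself are illustrative assumptions, not existing project code.

```python
# Hypothetical src/tools/slugify/tool.py illustrating the requirements above.
import re

import structlog

logger = structlog.get_logger()


def slugify(title: str, max_words: int = 6) -> str:
    """Convert a title into a lowercase, hyphen-separated slug.

    Args:
        title: Free-form title text.
        max_words: Maximum number of words to keep in the slug.

    Returns:
        A lowercase slug with words separated by hyphens.
    """
    words = re.sub(r"[^a-z0-9\s-]", "", title.lower()).split()[:max_words]
    slug = "-".join(words)
    logger.info("slug_generated", title=title, slug=slug)  # structured logging with context
    return slug
```

A matching `tests/tools/slugify/test_tool.py` would then assert, for example, that `slugify("Add User Auth!") == "add-user-auth"`.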
|
||||
|
||||
## Relevant Files
|
||||
|
||||
Focus on the following files and vertical slice structure:
|
||||
|
||||
**Core Files:**
|
||||
|
||||
- `CLAUDE.md` - Project instructions, logging rules, testing requirements, docstring style
- app/backend core files
- app/frontend core files
|
||||
|
||||
## PRP Format
|
||||
|
||||
```md
|
||||
# Feature: <feature name>
|
||||
|
||||
## Feature Description
|
||||
|
||||
<describe the feature in detail, including its purpose and value to users>
|
||||
|
||||
## User Story
|
||||
|
||||
As a <type of user>
|
||||
I want to <action/goal>
|
||||
So that <benefit/value>
|
||||
|
||||
## Problem Statement
|
||||
|
||||
<clearly define the specific problem or opportunity this feature addresses>
|
||||
|
||||
## Solution Statement
|
||||
|
||||
<describe the proposed solution approach and how it solves the problem>
|
||||
|
||||
## Relevant Files
|
||||
|
||||
Use these files to implement the feature:
|
||||
|
||||
<find and list the files that are relevant to the feature and describe why they are relevant in bullet points. If there are new files that need to be created to implement the feature, list them in an h3 'New Files' section. Include line numbers for the relevant sections.>
|
||||
|
||||
## Relevant Research Documentation
|
||||
|
||||
Use these documentation files and links to help with understanding the technology to use:
|
||||
|
||||
- [Documentation Link 1](https://example.com/doc1)
|
||||
- [Anchor tag]
|
||||
- [Short summary]
|
||||
- [Documentation Link 2](https://example.com/doc2)
|
||||
- [Anchor tag]
|
||||
- [Short summary]
|
||||
|
||||
## Implementation Plan
|
||||
|
||||
### Phase 1: Foundation
|
||||
|
||||
<describe the foundational work needed before implementing the main feature>
|
||||
|
||||
### Phase 2: Core Implementation
|
||||
|
||||
<describe the main implementation work for the feature>
|
||||
|
||||
### Phase 3: Integration
|
||||
|
||||
<describe how the feature will integrate with existing functionality>
|
||||
|
||||
## Step by Step Tasks
|
||||
|
||||
IMPORTANT: Execute every step in order, top to bottom.
|
||||
|
||||
<list step by step tasks as h3 headers plus bullet points. use as many h3 headers as needed to implement the feature. Order matters:
|
||||
|
||||
1. Start with foundational shared changes (schemas, types)
|
||||
2. Implement core functionality with proper logging
|
||||
3. Create corresponding test files (unit tests mirror src/ structure)
|
||||
4. Add integration tests if feature interacts with multiple components
|
||||
5. Verify linters pass: `uv run ruff check src/ && uv run mypy src/`
|
||||
6. Ensure all tests pass: `uv run pytest tests/`
|
||||
7. Your last step should be running the `Validation Commands`>
|
||||
|
||||
<For tool implementations:
|
||||
|
||||
- Define Pydantic schemas in `schemas.py`
|
||||
- Implement tool with structured logging and type hints
|
||||
- Register tool with Pydantic AI agent
|
||||
- Create unit tests in `tests/tools/<name>/test_<module>.py`
|
||||
- Add integration test in `tests/integration/` if needed>
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
See `CLAUDE.md` for complete testing requirements. Every file in `src/` must have a corresponding test file in `tests/`.
|
||||
|
||||
### Unit Tests
|
||||
|
||||
<describe unit tests needed for the feature. Mark with @pytest.mark.unit. Test individual components in isolation.>
|
||||
|
||||
### Integration Tests
|
||||
|
||||
<if the feature interacts with multiple components, describe integration tests needed. Mark with @pytest.mark.integration. Place in tests/integration/ when testing full application stack.>
|
||||
|
||||
### Edge Cases
|
||||
|
||||
<list edge cases that need to be tested>
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
<list specific, measurable criteria that must be met for the feature to be considered complete>
|
||||
|
||||
## Validation Commands
|
||||
|
||||
Execute every command to validate the feature works correctly with zero regressions.
|
||||
|
||||
<list commands you'll use to validate with 100% confidence the feature is implemented correctly with zero regressions. Include, for example (the examples below are for the BE; Biome and TS checks are used for the FE):
|
||||
|
||||
- Linting: `uv run ruff check src/`
|
||||
- Type checking: `uv run mypy src/`
|
||||
- Unit tests: `uv run pytest tests/ -m unit -v`
|
||||
- Integration tests: `uv run pytest tests/ -m integration -v` (if applicable)
|
||||
- Full test suite: `uv run pytest tests/ -v`
|
||||
- Manual API testing if needed (curl commands, test requests)>
|
||||
|
||||
**Required validation commands:**
|
||||
|
||||
- `uv run ruff check src/` - Lint check must pass
|
||||
- `uv run mypy src/` - Type check must pass
|
||||
- `uv run pytest tests/ -v` - All tests must pass with zero regressions
|
||||
|
||||
**Run server and test core endpoints:**
|
||||
|
||||
- Start server: @.claude/start-server
|
||||
- Test endpoints with curl (at minimum: health check, main functionality)
|
||||
- Verify structured logs show proper correlation IDs and context
|
||||
- Stop server after validation
|
||||
|
||||
## Notes
|
||||
|
||||
<optionally list any additional notes, future considerations, or context that are relevant to the feature that will be helpful to the developer>
|
||||
```
|
||||
|
||||
## Feature
|
||||
|
||||
Extract the feature details from the `issue_json` variable (parse the JSON and use the title and body fields).
|
||||
|
||||
## Report
|
||||
|
||||
- Summarize the work you've just done in a concise bullet point list.
|
||||
- Include the full path to the plan file you created (e.g., `PRPs/features/add-auth-system.md`)
|
||||
@@ -1,12 +1,28 @@
|
||||
# Prime
|
||||
|
||||
> Execute the following sections to understand the codebase then summarize your understanding.
|
||||
Execute the following sections to understand the codebase before starting new work, then summarize your understanding.
|
||||
|
||||
## Run
|
||||
|
||||
git ls-files
|
||||
- List all tracked files: `git ls-files`
|
||||
- Show project structure: `tree -I '.venv|__pycache__|*.pyc|.pytest_cache|.mypy_cache|.ruff_cache' -L 3`
|
||||
|
||||
## Read
|
||||
|
||||
README.md
|
||||
please read PRPs/PRD.md and core files in PRPs/specs
|
||||
- `CLAUDE.md` - Core project instructions, principles, logging rules, testing requirements
|
||||
- `python/src/agent_work_orders` - Project overview and setup (if exists)
|
||||
|
||||
- Identify core files in the agent work orders directory to understand what we are working on and its intent
|
||||
|
||||
## Report
|
||||
|
||||
Provide a concise summary of:
|
||||
|
||||
1. **Project Purpose**: What this application does
|
||||
2. **Architecture**: Key patterns (vertical slice, FastAPI + Pydantic AI)
|
||||
3. **Core Principles**: TYPE SAFETY, KISS, YAGNI
|
||||
4. **Tech Stack**: Main dependencies and tools
|
||||
5. **Key Requirements**: Logging, testing, type annotations
|
||||
6. **Current State**: What's implemented
|
||||
|
||||
Keep the summary brief (5-10 bullet points) and focused on what you need to know to contribute effectively.
|
||||
|
||||
.claude/commands/agent-work-orders/prp-review.md (Normal file, 89 lines)
@@ -0,0 +1,89 @@
|
||||
# Code Review
|
||||
|
||||
Review implemented work against a PRP specification to ensure code quality, correctness, and adherence to project standards.
|
||||
|
||||
## Variables
|
||||
|
||||
Plan file: $ARGUMENTS (e.g., `PRPs/features/add-web-search.md`)
|
||||
|
||||
## Instructions
|
||||
|
||||
**Understand the Changes:**
|
||||
|
||||
- Check current branch: `git branch`
|
||||
- Review changes: `git diff origin/main` (or `git diff HEAD` if not on a branch)
|
||||
- Read the PRP plan file to understand requirements
|
||||
|
||||
**Code Quality Review:**
|
||||
|
||||
- **Type Safety**: Verify all functions have type annotations, mypy passes
|
||||
- **Logging**: Check structured logging is used correctly (event names, context, exception handling)
|
||||
- **Docstrings**: Ensure Google-style docstrings on all functions/classes
|
||||
- **Testing**: Verify unit tests exist for all new files, integration tests if needed
|
||||
- **Architecture**: Confirm vertical slice structure is followed
|
||||
- **CLAUDE.md Compliance**: Check adherence to core principles (KISS, YAGNI, TYPE SAFETY)
|
||||
|
||||
**Validation (Ruff for the BE, Biome for the FE):**
|
||||
|
||||
- Run linters: `uv run ruff check src/ && uv run mypy src/`
|
||||
- Run tests: `uv run pytest tests/ -v`
|
||||
- Start server and test endpoints with curl (if applicable)
|
||||
- Verify structured logs show proper correlation IDs and context
|
||||
|
||||
**Issue Severity:**
|
||||
|
||||
- `blocker` - Must fix before merge (breaks build, missing tests, type errors, security issues)
|
||||
- `major` - Should fix (missing logging, incomplete docstrings, poor patterns)
|
||||
- `minor` - Nice to have (style improvements, optimization opportunities)
|
||||
|
||||
## Report
|
||||
|
||||
Return ONLY valid JSON (no markdown, no explanations) and save it to `report-#.json` in the `PRPs/reports` directory (create the directory if it doesn't exist). Output will be parsed with JSON.parse().
|
||||
|
||||
### Output Structure
|
||||
|
||||
```json
|
||||
{
|
||||
"success": "boolean - true if NO BLOCKER issues, false if BLOCKER issues exist",
|
||||
"review_summary": "string - 2-4 sentences: what was built, does it match spec, quality assessment",
|
||||
"review_issues": [
|
||||
{
|
||||
"issue_number": "number - issue index",
|
||||
"file_path": "string - file with the issue (if applicable)",
|
||||
"issue_description": "string - what's wrong",
|
||||
"issue_resolution": "string - how to fix it",
|
||||
"severity": "string - blocker|major|minor"
|
||||
}
|
||||
],
|
||||
"validation_results": {
|
||||
"linting_passed": "boolean",
|
||||
"type_checking_passed": "boolean",
|
||||
"tests_passed": "boolean",
|
||||
"api_endpoints_tested": "boolean - true if endpoints were tested with curl"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Example Success Review
|
||||
|
||||
```json
|
||||
{
|
||||
"success": true,
|
||||
"review_summary": "The web search tool has been implemented with proper type annotations, structured logging, and comprehensive tests. The implementation follows the vertical slice architecture and matches all spec requirements. Code quality is high with proper error handling and documentation.",
|
||||
"review_issues": [
|
||||
{
|
||||
"issue_number": 1,
|
||||
"file_path": "src/tools/web_search/tool.py",
|
||||
"issue_description": "Missing debug log for API response",
|
||||
"issue_resolution": "Add logger.debug with response metadata",
|
||||
"severity": "minor"
|
||||
}
|
||||
],
|
||||
"validation_results": {
|
||||
"linting_passed": true,
|
||||
"type_checking_passed": true,
|
||||
"tests_passed": true,
|
||||
"api_endpoints_tested": true
|
||||
}
|
||||
}
|
||||
```
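Since the report is consumed programmatically (parsed with JSON.parse on the caller's side), a consumer-side model can mirror the structure above. The sketch below uses Pydantic and is an assumption about how an orchestrator might validate the report; it is not the actual parsing code.

```python
# Hypothetical consumer-side models mirroring the review output structure above.
from pydantic import BaseModel


class ReviewIssue(BaseModel):
    issue_number: int
    file_path: str | None = None
    issue_description: str
    issue_resolution: str
    severity: str  # "blocker" | "major" | "minor"


class ValidationResults(BaseModel):
    linting_passed: bool
    type_checking_passed: bool
    tests_passed: bool
    api_endpoints_tested: bool


class ReviewReport(BaseModel):
    success: bool
    review_summary: str
    review_issues: list[ReviewIssue]
    validation_results: ValidationResults


# report = ReviewReport.model_validate_json(raw_json)
# blockers = [i for i in report.review_issues if i.severity == "blocker"]
```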
|
||||
@@ -1,41 +0,0 @@
|
||||
# Create Pull Request
|
||||
|
||||
Based on the `Instructions` below, take the `Variables` follow the `Run` section to create a pull request. Then follow the `Report` section to report the results of your work.
|
||||
|
||||
## Variables
|
||||
|
||||
branch_name: $1
|
||||
issue: $2
|
||||
plan_file: $3
|
||||
adw_id: $4
|
||||
|
||||
## Instructions
|
||||
|
||||
- Generate a pull request title in the format: `<issue_type>: #<issue_number> - <issue_title>`
|
||||
- The PR body should include:
|
||||
- A summary section with the issue context
|
||||
- Link to the implementation `plan_file` if it exists
|
||||
- Reference to the issue (Closes #<issue_number>)
|
||||
- ADW tracking ID
|
||||
- A checklist of what was done
|
||||
- A summary of key changes made
|
||||
- Extract issue number, type, and title from the issue JSON
|
||||
- Examples of PR titles:
|
||||
- `feat: #123 - Add user authentication`
|
||||
- `bug: #456 - Fix login validation error`
|
||||
- `chore: #789 - Update dependencies`
|
||||
- `test: #1011 - Test xyz`
|
||||
- Don't mention Claude Code in the PR body - let the author get credit for this.
|
||||
|
||||
## Run
|
||||
|
||||
1. Run `git diff origin/main...HEAD --stat` to see a summary of changed files
|
||||
2. Run `git log origin/main..HEAD --oneline` to see the commits that will be included
|
||||
3. Run `git diff origin/main...HEAD --name-only` to get a list of changed files
|
||||
4. Run `git push -u origin <branch_name>` to push the branch
|
||||
5. Set GH_TOKEN environment variable from GITHUB_PAT if available, then run `gh pr create --title "<pr_title>" --body "<pr_body>" --base main` to create the PR
|
||||
6. Capture the PR URL from the output
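The Run steps above can also be scripted; a rough sketch is shown below, wrapping the same git/gh commands. The GH_TOKEN handling follows step 5; the function name and error handling are hypothetical illustrations.

```python
# Hypothetical wrapper around the Run steps above; not the actual ADW script.
import os
import subprocess


def create_pull_request(branch_name: str, pr_title: str, pr_body: str) -> str:
    """Push the branch, open a PR with gh, and return the PR URL."""
    env = os.environ.copy()
    if "GITHUB_PAT" in env:
        env["GH_TOKEN"] = env["GITHUB_PAT"]  # step 5: authenticate gh via GITHUB_PAT

    subprocess.run(["git", "push", "-u", "origin", branch_name], check=True)
    result = subprocess.run(
        ["gh", "pr", "create", "--title", pr_title, "--body", pr_body, "--base", "main"],
        check=True,
        capture_output=True,
        text=True,
        env=env,
    )
    return result.stdout.strip()  # gh prints the PR URL on success
```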
|
||||
|
||||
## Report
|
||||
|
||||
Return ONLY the PR URL that was created (no other text)
|
||||
@@ -1,51 +0,0 @@
|
||||
# Resolve Failed E2E Test
|
||||
|
||||
Fix a specific failing E2E test using the provided failure details.
|
||||
|
||||
## Instructions
|
||||
|
||||
1. **Analyze the E2E Test Failure**
|
||||
- Review the JSON data in the `Test Failure Input`, paying attention to:
|
||||
- `test_name`: The name of the failing test
|
||||
- `test_path`: The path to the test file (you will need this for re-execution)
|
||||
- `error`: The specific error that occurred
|
||||
- `screenshots`: Any captured screenshots showing the failure state
|
||||
- Understand what the test is trying to validate from a user interaction perspective
|
||||
|
||||
2. **Understand Test Execution**
|
||||
- Read `.claude/commands/test_e2e.md` to understand how E2E tests are executed
|
||||
- Read the test file specified in the `test_path` field from the JSON
|
||||
- Note the test steps, user story, and success criteria
|
||||
|
||||
3. **Reproduce the Failure**
|
||||
- IMPORTANT: Use the `test_path` from the JSON to re-execute the specific E2E test
|
||||
- Follow the execution pattern from `.claude/commands/test_e2e.md`
|
||||
- Observe the browser behavior and confirm you can reproduce the exact failure
|
||||
- Compare the error you see with the error reported in the JSON
|
||||
|
||||
4. **Fix the Issue**
|
||||
- Based on your reproduction, identify the root cause
|
||||
- Make minimal, targeted changes to resolve only this E2E test failure
|
||||
- Consider common E2E issues:
|
||||
- Element selector changes
|
||||
- Timing issues (elements not ready)
|
||||
- UI layout changes
|
||||
- Application logic modifications
|
||||
- Ensure the fix aligns with the user story and test purpose
|
||||
|
||||
5. **Validate the Fix**
|
||||
- Re-run the same E2E test step by step using the `test_path` to confirm it now passes
|
||||
- IMPORTANT: The test must complete successfully before considering it resolved
|
||||
- Do NOT run other tests or the full test suite
|
||||
- Focus only on fixing this specific E2E test
|
||||
|
||||
## Test Failure Input
|
||||
|
||||
$ARGUMENTS
|
||||
|
||||
## Report
|
||||
|
||||
Provide a concise summary of:
|
||||
- Root cause identified (e.g., missing element, timing issue, incorrect selector)
|
||||
- Specific fix applied
|
||||
- Confirmation that the E2E test now passes after your fix
|
||||
@@ -1,46 +0,0 @@
|
||||
# Resolve Failed Review Issue
|
||||
|
||||
Fix a specific blocker issue identified during the review phase.
|
||||
|
||||
## Arguments
|
||||
|
||||
1. review_issue_json: JSON string containing the review issue to fix
|
||||
|
||||
## Instructions
|
||||
|
||||
1. **Parse Review Issue**
|
||||
- Extract issue_title, issue_description, issue_severity, and affected_files from the JSON
|
||||
- Ensure this is a "blocker" severity issue (tech_debt and skippable are not resolved here)
|
||||
|
||||
2. **Understand the Issue**
|
||||
- Read the issue description carefully
|
||||
- Review the affected files listed
|
||||
- If a spec file was referenced in the original review, re-read relevant sections
|
||||
|
||||
3. **Create Fix Plan**
|
||||
- Determine what changes are needed to resolve the issue
|
||||
- Identify all files that need to be modified
|
||||
- Plan minimal, targeted changes
|
||||
|
||||
4. **Implement the Fix**
|
||||
- Make only the changes necessary to resolve this specific issue
|
||||
- Ensure code quality and consistency
|
||||
- Follow project conventions and patterns
|
||||
- Do not make unrelated changes
|
||||
|
||||
5. **Verify the Fix**
|
||||
- Re-run relevant tests if applicable
|
||||
- Check that the issue is actually resolved
|
||||
- Ensure no new issues were introduced
|
||||
|
||||
## Review Issue Input
|
||||
|
||||
$ARGUMENT_1
|
||||
|
||||
## Report
|
||||
|
||||
Provide a concise summary of:
|
||||
- Root cause of the blocker issue
|
||||
- Specific changes made to resolve it
|
||||
- Files modified
|
||||
- Confirmation that the issue is resolved
|
||||
@@ -1,41 +0,0 @@
|
||||
# Resolve Failed Test
|
||||
|
||||
Fix a specific failing test using the provided failure details.
|
||||
|
||||
## Instructions
|
||||
|
||||
1. **Analyze the Test Failure**
|
||||
- Review the test name, purpose, and error message from the `Test Failure Input`
|
||||
- Understand what the test is trying to validate
|
||||
- Identify the root cause from the error details
|
||||
|
||||
2. **Context Discovery**
|
||||
- Check recent changes: `git diff origin/main --stat --name-only`
|
||||
- If a relevant spec exists in `specs/*.md`, read it to understand requirements
|
||||
- Focus only on files that could impact this specific test
|
||||
|
||||
3. **Reproduce the Failure**
|
||||
- IMPORTANT: Use the `execution_command` provided in the test data
|
||||
- Run it to see the full error output and stack trace
|
||||
- Confirm you can reproduce the exact failure
|
||||
|
||||
4. **Fix the Issue**
|
||||
- Make minimal, targeted changes to resolve only this test failure
|
||||
- Ensure the fix aligns with the test purpose and any spec requirements
|
||||
- Do not modify unrelated code or tests
|
||||
|
||||
5. **Validate the Fix**
|
||||
- Re-run the same `execution_command` to confirm the test now passes
|
||||
- Do NOT run other tests or the full test suite
|
||||
- Focus only on fixing this specific test
|
||||
|
||||
## Test Failure Input
|
||||
|
||||
$ARGUMENTS
|
||||
|
||||
## Report
|
||||
|
||||
Provide a concise summary of:
|
||||
- Root cause identified
|
||||
- Specific fix applied
|
||||
- Confirmation that the test now passes
|
||||
@@ -1,101 +0,0 @@
|
||||
# Review Implementation Against Specification
|
||||
|
||||
Compare the current implementation against the specification file and identify any issues that need to be addressed before creating a pull request.
|
||||
|
||||
## Variables
|
||||
|
||||
REVIEW_TIMEOUT: 10 minutes
|
||||
|
||||
## Arguments
|
||||
|
||||
1. spec_file_path: Path to the specification file (e.g., "PRPs/specs/my-feature.md")
|
||||
2. work_order_id: The work order ID for context
|
||||
|
||||
## Instructions
|
||||
|
||||
1. **Read the Specification**
|
||||
- Read the specification file at `$ARGUMENT_1`
|
||||
- Understand all requirements, acceptance criteria, and deliverables
|
||||
- Note any specific constraints or implementation details
|
||||
|
||||
2. **Analyze Current Implementation**
|
||||
- Review the code changes made in the current branch
|
||||
- Check if all files mentioned in the spec have been created/modified
|
||||
- Verify implementation matches the spec requirements
|
||||
|
||||
3. **Capture Screenshots** (if applicable)
|
||||
- If the feature includes UI components:
|
||||
- Start the application if needed
|
||||
- Take screenshots of key UI flows
|
||||
- Save screenshots to `screenshots/wo-$ARGUMENT_2/` directory
|
||||
- If no UI: skip this step
|
||||
|
||||
4. **Compare Implementation vs Specification**
|
||||
- Identify any missing features or incomplete implementations
|
||||
- Check for deviations from the spec
|
||||
- Verify all acceptance criteria are met
|
||||
- Look for potential bugs or issues
|
||||
|
||||
5. **Categorize Issues by Severity**
|
||||
- **blocker**: Must be fixed before PR (breaks functionality, missing critical features)
|
||||
- **tech_debt**: Should be fixed but can be addressed later
|
||||
- **skippable**: Nice-to-have, documentation improvements, minor polish
|
||||
|
||||
6. **Generate Review Report**
|
||||
- Return ONLY the JSON object as specified below
|
||||
- Do not include any additional text, explanations, or markdown formatting
|
||||
- List all issues found, even if none are blockers
|
||||
|
||||
## Report
|
||||
|
||||
Return ONLY a valid JSON object with the following structure:
|
||||
|
||||
```json
|
||||
{
|
||||
"review_passed": boolean,
|
||||
"review_issues": [
|
||||
{
|
||||
"issue_title": "string",
|
||||
"issue_description": "string",
|
||||
"issue_severity": "blocker|tech_debt|skippable",
|
||||
"affected_files": ["string"],
|
||||
"screenshots": ["string"]
|
||||
}
|
||||
],
|
||||
"screenshots": ["string"]
|
||||
}
|
||||
```
|
||||
|
||||
### Field Descriptions
|
||||
|
||||
- `review_passed`: true if no blocker issues found, false otherwise
|
||||
- `review_issues`: Array of all issues found (blockers, tech_debt, skippable)
|
||||
- `issue_severity`: Must be one of: "blocker", "tech_debt", "skippable"
|
||||
- `affected_files`: List of file paths that need changes to fix this issue
|
||||
- `screenshots`: List of screenshot file paths for this specific issue (if applicable)
|
||||
- `screenshots` (root level): List of all screenshot paths taken during review
|
||||
|
||||
### Example Output
|
||||
|
||||
```json
|
||||
{
|
||||
"review_passed": false,
|
||||
"review_issues": [
|
||||
{
|
||||
"issue_title": "Missing error handling in API endpoint",
|
||||
"issue_description": "The /api/work-orders endpoint doesn't handle invalid repository URLs. The spec requires validation with clear error messages.",
|
||||
"issue_severity": "blocker",
|
||||
"affected_files": ["python/src/agent_work_orders/api/routes.py"],
|
||||
"screenshots": []
|
||||
},
|
||||
{
|
||||
"issue_title": "Incomplete test coverage",
|
||||
"issue_description": "Only 60% test coverage achieved, spec requires >80%",
|
||||
"issue_severity": "tech_debt",
|
||||
"affected_files": ["python/tests/agent_work_orders/"],
|
||||
"screenshots": []
|
||||
}
|
||||
],
|
||||
"screenshots": []
|
||||
}
|
||||
```
|
||||
.claude/commands/agent-work-orders/start-server.md (new file, 33 lines)
@@ -0,0 +1,33 @@
# Start Servers

Start both the FastAPI backend and React frontend development servers with hot reload.

## Run

### Run in the background with bash tool

- Ensure you are in the right PWD
- Use the Bash tool to run the servers in the background so you can read the shell outputs
- IMPORTANT: run `git ls-files` first so you know where directories are located before you start

### Backend Server (FastAPI)

- Navigate to backend: `cd app/backend`
- Start server in background: `uv sync && uv run python run_api.py`
- Wait 2-3 seconds for startup
- Test health endpoint: `curl http://localhost:8000/health`
- Test products endpoint: `curl http://localhost:8000/api/products`

### Frontend Server (Bun + React)

- Navigate to frontend: `cd ../frontend`
- Start server in background: `bun install && bun dev`
- Wait 2-3 seconds for startup
- Frontend should be accessible at `http://localhost:3000`

## Report

- Confirm backend is running on `http://localhost:8000`
- Confirm frontend is running on `http://localhost:3000`
- Show the health check response from backend
- Mention: "Backend logs will show structured JSON logging for all requests"

@@ -1,115 +0,0 @@
|
||||
# Application Validation Test Suite
|
||||
|
||||
Execute comprehensive validation tests for both frontend and backend components, returning results in a standardized JSON format for automated processing.
|
||||
|
||||
## Purpose
|
||||
|
||||
Proactively identify and fix issues in the application before they impact users or developers. By running this comprehensive test suite, you can:
|
||||
- Detect syntax errors, type mismatches, and import failures
|
||||
- Identify broken tests or security vulnerabilities
|
||||
- Verify build processes and dependencies
|
||||
- Ensure the application is in a healthy state
|
||||
|
||||
## Variables
|
||||
|
||||
TEST_COMMAND_TIMEOUT: 5 minutes
|
||||
|
||||
## Instructions
|
||||
|
||||
- Execute each test in the sequence provided below
|
||||
- Capture the result (passed/failed) and any error messages
|
||||
- IMPORTANT: Return ONLY the JSON array with test results
|
||||
- IMPORTANT: Do not include any additional text, explanations, or markdown formatting
|
||||
- We'll immediately run JSON.parse() on the output, so make sure it's valid JSON
|
||||
- If a test passes, omit the error field
|
||||
- If a test fails, include the error message in the error field
|
||||
- Execute all tests even if some fail
|
||||
- Error Handling:
|
||||
- If a command returns non-zero exit code, mark as failed and immediately stop processing tests
|
||||
- Capture stderr output for error field
|
||||
- Timeout commands after `TEST_COMMAND_TIMEOUT`
|
||||
- IMPORTANT: If a test fails, stop processing tests and return the results thus far
|
||||
- Some tests may have dependencies (e.g., server must be stopped for port availability)
|
||||
- API health check is required
|
||||
- Test execution order is important - dependencies should be validated first
|
||||
- All file paths are relative to the project root
|
||||
- Always run `pwd` and `cd` before each test to ensure you're operating in the correct directory for the given test
|
||||
|
||||
## Test Execution Sequence
|
||||
|
||||
### Backend Tests
|
||||
|
||||
1. **Python Syntax Check**
|
||||
- Preparation Command: None
|
||||
- Command: `cd app/server && uv run python -m py_compile server.py main.py core/*.py`
|
||||
- test_name: "python_syntax_check"
|
||||
- test_purpose: "Validates Python syntax by compiling source files to bytecode, catching syntax errors like missing colons, invalid indentation, or malformed statements"
|
||||
|
||||
2. **Backend Code Quality Check**
|
||||
- Preparation Command: None
|
||||
- Command: `cd app/server && uv run ruff check .`
|
||||
- test_name: "backend_linting"
|
||||
- test_purpose: "Validates Python code quality, identifies unused imports, style violations, and potential bugs"
|
||||
|
||||
3. **All Backend Tests**
|
||||
- Preparation Command: None
|
||||
- Command: `cd app/server && uv run pytest tests/ -v --tb=short`
|
||||
- test_name: "all_backend_tests"
|
||||
- test_purpose: "Validates all backend functionality including file processing, SQL security, LLM integration, and API endpoints"
|
||||
|
||||
### Frontend Tests
|
||||
|
||||
4. **TypeScript Type Check**
|
||||
- Preparation Command: None
|
||||
- Command: `cd app/client && bun tsc --noEmit`
|
||||
- test_name: "typescript_check"
|
||||
- test_purpose: "Validates TypeScript type correctness without generating output files, catching type errors, missing imports, and incorrect function signatures"
|
||||
|
||||
5. **Frontend Build**
|
||||
- Preparation Command: None
|
||||
- Command: `cd app/client && bun run build`
|
||||
- test_name: "frontend_build"
|
||||
- test_purpose: "Validates the complete frontend build process including bundling, asset optimization, and production compilation"
|
||||
|
||||
## Report
|
||||
|
||||
- IMPORTANT: Return results exclusively as a JSON array based on the `Output Structure` section below.
|
||||
- Sort the JSON array with failed tests (passed: false) at the top
|
||||
- Include all tests in the output, both passed and failed
|
||||
- The execution_command field should contain the exact command that can be run to reproduce the test
|
||||
- This allows subsequent agents to quickly identify and resolve errors
|
||||
|
||||
### Output Structure
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"test_name": "string",
|
||||
"passed": boolean,
|
||||
"execution_command": "string",
|
||||
"test_purpose": "string",
|
||||
"error": "optional string"
|
||||
},
|
||||
...
|
||||
]
|
||||
```
|
||||
|
||||
### Example Output
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"test_name": "frontend_build",
|
||||
"passed": false,
|
||||
"execution_command": "cd app/client && bun run build",
|
||||
"test_purpose": "Validates TypeScript compilation, module resolution, and production build process for the frontend application",
|
||||
"error": "TS2345: Argument of type 'string' is not assignable to parameter of type 'number'"
|
||||
},
|
||||
{
|
||||
"test_name": "all_backend_tests",
|
||||
"passed": true,
|
||||
"execution_command": "cd app/server && uv run pytest tests/ -v --tb=short",
|
||||
"test_purpose": "Validates all backend functionality including file processing, SQL security, LLM integration, and API endpoints"
|
||||
}
|
||||
]
|
||||
```
|
||||
@@ -1,64 +0,0 @@
|
||||
# E2E Test Runner
|
||||
|
||||
Execute end-to-end (E2E) tests using Playwright browser automation (MCP Server). If any errors occur or assertions fail, mark the test as failed and explain exactly what went wrong.
|
||||
|
||||
## Variables
|
||||
|
||||
adw_id: $1 if provided, otherwise generate a random 8 character hex string
|
||||
agent_name: $2 if provided, otherwise use 'test_e2e'
|
||||
e2e_test_file: $3
|
||||
application_url: $4 if provided, otherwise use http://localhost:5173
|
||||
|
||||
## Instructions
|
||||
|
||||
- Read the `e2e_test_file`
|
||||
- Digest the `User Story` to first understand what we're validating
|
||||
- IMPORTANT: Execute the `Test Steps` detailed in the `e2e_test_file` using Playwright browser automation
|
||||
- Review the `Success Criteria` and if any of them fail, mark the test as failed and explain exactly what went wrong
|
||||
- Review the steps that say '**Verify**...' and if they fail, mark the test as failed and explain exactly what went wrong
|
||||
- Capture screenshots as specified
|
||||
- IMPORTANT: Return results in the format requested by the `Output Format`
|
||||
- Initialize Playwright browser in headed mode for visibility
|
||||
- Use the `application_url`
|
||||
- Allow time for async operations and element visibility
|
||||
- IMPORTANT: After taking each screenshot, save it to `Screenshot Directory` with descriptive names. Use absolute paths to move the files to the `Screenshot Directory` with the correct name.
|
||||
- Capture and report any errors encountered
|
||||
- Ultra think about the `Test Steps` and execute them in order
|
||||
- If you encounter an error, mark the test as failed immediately and explain exactly what went wrong and on what step it occurred. For example: '(Step 1 ❌) Failed to find element with selector "query-input" on page "http://localhost:5173"'
|
||||
- Use `pwd` or equivalent to get the absolute path to the codebase for writing and displaying the correct paths to the screenshots
|
||||
|
||||
## Setup
|
||||
|
||||
- IMPORTANT: Reset the database by running `scripts/reset_db.sh`
|
||||
- IMPORTANT: Make sure the server and client are running on a background process before executing the test steps. Read `scripts/` and `README.md` for more information on how to start, stop and reset the server and client
|
||||
|
||||
|
||||
## Screenshot Directory
|
||||
|
||||
<absolute path to codebase>/agents/<adw_id>/<agent_name>/img/<directory name based on test file name>/*.png
|
||||
|
||||
Each screenshot should be saved with a descriptive name that reflects what is being captured. The directory structure ensures that:
|
||||
- Screenshots are organized by ADW ID (workflow run)
|
||||
- They are stored under the specified agent name (e.g., e2e_test_runner_0, e2e_test_resolver_iter1_0)
|
||||
- Each test has its own subdirectory based on the test file name (e.g., test_basic_query → basic_query/)
|
||||
|
||||
## Report
|
||||
|
||||
- Exclusively return the JSON output as specified in the test file
|
||||
- Capture any unexpected errors
|
||||
- IMPORTANT: Ensure all screenshots are saved in the `Screenshot Directory`
|
||||
|
||||
### Output Format
|
||||
|
||||
```json
|
||||
{
|
||||
"test_name": "Test Name Here",
|
||||
"status": "passed|failed",
|
||||
"screenshots": [
|
||||
"<absolute path to codebase>/agents/<adw_id>/<agent_name>/img/<test name>/01_<descriptive name>.png",
|
||||
"<absolute path to codebase>/agents/<adw_id>/<agent_name>/img/<test name>/02_<descriptive name>.png",
|
||||
"<absolute path to codebase>/agents/<adw_id>/<agent_name>/img/<test name>/03_<descriptive name>.png"
|
||||
],
|
||||
"error": null
|
||||
}
|
||||
```
|
||||
@@ -1,3 +0,0 @@
# List Built-in Tools

List all core, built-in, non-MCP development tools available to you. Display them in bullet format, using TypeScript function syntax with parameters.

PRPs/PRD.md (1780 lines): diff suppressed because it is too large
@@ -1,660 +0,0 @@
|
||||
# Data Models for Agent Work Order System
|
||||
|
||||
**Purpose:** This document defines all data models needed for the agent work order feature in plain English.
|
||||
|
||||
**Philosophy:** Git-first architecture - store minimal state in database, compute everything else from git.
|
||||
|
||||
---
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Core Work Order Models](#core-work-order-models)
|
||||
2. [Workflow & Phase Tracking](#workflow--phase-tracking)
|
||||
3. [Sandbox Models](#sandbox-models)
|
||||
4. [GitHub Integration](#github-integration)
|
||||
5. [Agent Execution](#agent-execution)
|
||||
6. [Logging & Observability](#logging--observability)
|
||||
|
||||
---
|
||||
|
||||
## Core Work Order Models
|
||||
|
||||
### AgentWorkOrderStateMinimal
|
||||
|
||||
**What it is:** The absolute minimum state we persist in database/Supabase.
|
||||
|
||||
**Purpose:** Following git-first philosophy - only store identifiers, query everything else from git.
|
||||
|
||||
**Where stored:**
|
||||
- Phase 1: In-memory Python dictionary
|
||||
- Phase 2+: Supabase database
|
||||
|
||||
**Fields:**
|
||||
|
||||
| Field Name | Type | Required | Description | Example |
|
||||
|------------|------|----------|-------------|---------|
|
||||
| `agent_work_order_id` | string | Yes | Unique identifier for this work order | `"wo-a1b2c3d4"` |
|
||||
| `repository_url` | string | Yes | GitHub repository URL | `"https://github.com/user/repo.git"` |
|
||||
| `sandbox_identifier` | string | Yes | Execution environment identifier | `"git-worktree-wo-a1b2c3d4"` or `"e2b-sb-xyz789"` |
|
||||
| `git_branch_name` | string | No | Git branch created for this work order | `"feat-issue-42-wo-a1b2c3d4"` |
|
||||
| `agent_session_id` | string | No | Claude Code session ID (for resumption) | `"session-xyz789"` |
|
||||
|
||||
**Why `sandbox_identifier` is separate from `git_branch_name`:**
|
||||
- `git_branch_name` = Git concept (what branch the code is on)
|
||||
- `sandbox_identifier` = Execution environment ID (where the agent runs)
|
||||
- Git worktree: `sandbox_identifier = "/Users/user/.worktrees/wo-abc123"` (path to worktree)
|
||||
- E2B: `sandbox_identifier = "e2b-sb-xyz789"` (E2B's sandbox ID)
|
||||
- Dagger: `sandbox_identifier = "dagger-container-abc123"` (container ID)
|
||||
|
||||
**What we DON'T store:** Current phase, commit count, files changed, PR URL, test results, sandbox state (is_active) - all computed from git or sandbox APIs.
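
To make this concrete, here is a minimal sketch of the persisted model. It assumes Pydantic (which this document does not mandate) and simply mirrors the field table above:

```python
from pydantic import BaseModel, Field


class AgentWorkOrderStateMinimal(BaseModel):
    """Minimal persisted state - everything else is computed from git or sandbox APIs."""

    agent_work_order_id: str = Field(..., description="Unique identifier, e.g. 'wo-a1b2c3d4'")
    repository_url: str = Field(..., description="GitHub repository URL")
    sandbox_identifier: str = Field(..., description="Worktree path, E2B sandbox ID, or container ID")
    git_branch_name: str | None = Field(None, description="Branch created for this work order")
    agent_session_id: str | None = Field(None, description="Claude Code session ID for resumption")
```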
|
||||
|
||||
---
|
||||
|
||||
### AgentWorkOrder (Full Model)
|
||||
|
||||
**What it is:** Complete work order model combining database state + computed fields from git/GitHub.
|
||||
|
||||
**Purpose:** Used for API responses and UI display.
|
||||
|
||||
**Fields:**
|
||||
|
||||
**Core Identifiers (from database):**
|
||||
- `agent_work_order_id` - Unique ID
|
||||
- `repository_url` - GitHub repo URL
|
||||
- `sandbox_identifier` - Execution environment ID (e.g., worktree path, E2B sandbox ID)
|
||||
- `git_branch_name` - Branch name (null until created)
|
||||
- `agent_session_id` - Claude session ID (null until started)
|
||||
|
||||
**Metadata (from database):**
|
||||
- `workflow_type` - Which workflow to run (plan/implement/validate/plan_implement/plan_implement_validate)
|
||||
- `sandbox_type` - Execution environment (git_branch/git_worktree/e2b/dagger)
|
||||
- `agent_model_type` - Claude model (sonnet/opus/haiku)
|
||||
- `status` - Current status (pending/initializing/running/completed/failed/cancelled)
|
||||
- `github_issue_number` - Optional issue number
|
||||
- `created_at` - When work order was created
|
||||
- `updated_at` - Last update timestamp
|
||||
- `execution_started_at` - When execution began
|
||||
- `execution_completed_at` - When execution finished
|
||||
- `error_message` - Error if failed
|
||||
- `error_details` - Detailed error info
|
||||
- `created_by_user_id` - User who created it (Phase 2+)
|
||||
|
||||
**Computed Fields (from git/GitHub - NOT in database):**
|
||||
- `current_phase` - Current workflow phase (planning/implementing/validating/completed) - **computed by inspecting git commits**
|
||||
- `github_pull_request_url` - PR URL - **computed from GitHub API**
|
||||
- `github_pull_request_number` - PR number
|
||||
- `git_commit_count` - Number of commits - **computed from `git log --oneline | wc -l`**
|
||||
- `git_files_changed` - Files changed - **computed from `git diff --stat`**
|
||||
- `git_lines_added` - Lines added - **computed from `git diff --stat`**
|
||||
- `git_lines_removed` - Lines removed - **computed from `git diff --stat`**
|
||||
- `latest_git_commit_sha` - Latest commit SHA
|
||||
- `latest_git_commit_message` - Latest commit message
|
||||
|
||||
---
|
||||
|
||||
### CreateAgentWorkOrderRequest
|
||||
|
||||
**What it is:** Request payload to create a new work order.
|
||||
|
||||
**Purpose:** Sent from frontend to backend to initiate work order.
|
||||
|
||||
**Fields:**
|
||||
- `repository_url` - GitHub repo URL to work on
|
||||
- `sandbox_type` - Which sandbox to use (git_branch/git_worktree/e2b/dagger)
|
||||
- `workflow_type` - Which workflow to execute
|
||||
- `agent_model_type` - Which Claude model to use (default: sonnet)
|
||||
- `github_issue_number` - Optional issue to work on
|
||||
- `initial_prompt` - Optional initial prompt to send to agent
|
||||
|
||||
---
|
||||
|
||||
### AgentWorkOrderResponse
|
||||
|
||||
**What it is:** Response after creating or fetching a work order.
|
||||
|
||||
**Purpose:** Returned by API endpoints.
|
||||
|
||||
**Fields:**
|
||||
- `agent_work_order` - Full AgentWorkOrder object
|
||||
- `logs_url` - URL to fetch execution logs
|
||||
|
||||
---
|
||||
|
||||
### ListAgentWorkOrdersRequest
|
||||
|
||||
**What it is:** Request to list work orders with filters.
|
||||
|
||||
**Purpose:** Support filtering and pagination in UI.
|
||||
|
||||
**Fields:**
|
||||
- `status_filter` - Filter by status (array)
|
||||
- `sandbox_type_filter` - Filter by sandbox type (array)
|
||||
- `workflow_type_filter` - Filter by workflow type (array)
|
||||
- `limit` - Results per page (default 50, max 100)
|
||||
- `offset` - Pagination offset
|
||||
- `sort_by` - Field to sort by (default: created_at)
|
||||
- `sort_order` - asc or desc (default: desc)
|
||||
|
||||
---
|
||||
|
||||
### ListAgentWorkOrdersResponse
|
||||
|
||||
**What it is:** Response containing list of work orders.
|
||||
|
||||
**Fields:**
|
||||
- `agent_work_orders` - Array of AgentWorkOrder objects
|
||||
- `total_count` - Total matching work orders
|
||||
- `has_more` - Whether more results available
|
||||
- `offset` - Current offset
|
||||
- `limit` - Current limit
|
||||
|
||||
---
|
||||
|
||||
## Workflow & Phase Tracking
|
||||
|
||||
### WorkflowPhaseHistoryEntry
|
||||
|
||||
**What it is:** Single phase execution record in workflow history.
|
||||
|
||||
**Purpose:** Track timing and commits for each workflow phase.
|
||||
|
||||
**How created:** Computed by analyzing git commits, not stored directly.
|
||||
|
||||
**Fields:**
|
||||
- `phase_name` - Which phase (planning/implementing/validating/completed)
|
||||
- `phase_started_at` - When phase began
|
||||
- `phase_completed_at` - When phase finished (null if still running)
|
||||
- `phase_duration_seconds` - Duration (if completed)
|
||||
- `git_commits_in_phase` - Number of commits during this phase
|
||||
- `git_commit_shas` - Array of commit SHAs from this phase
|
||||
|
||||
**Example:** "Planning phase started at 10:00:00, completed at 10:02:30, duration 150 seconds, 1 commit (abc123)"
|
||||
|
||||
---
|
||||
|
||||
### GitProgressSnapshot
|
||||
|
||||
**What it is:** Point-in-time snapshot of work order progress via git inspection.
|
||||
|
||||
**Purpose:** Polled by frontend every 3 seconds to show progress without streaming.
|
||||
|
||||
**How created:** Backend queries git to compute current state.
|
||||
|
||||
**Fields:**
|
||||
- `agent_work_order_id` - Work order ID
|
||||
- `current_phase` - Current workflow phase (computed from commits)
|
||||
- `git_commit_count` - Total commits on branch
|
||||
- `git_files_changed` - Total files changed
|
||||
- `git_lines_added` - Total lines added
|
||||
- `git_lines_removed` - Total lines removed
|
||||
- `latest_commit_message` - Most recent commit message
|
||||
- `latest_commit_sha` - Most recent commit SHA
|
||||
- `latest_commit_timestamp` - When latest commit was made
|
||||
- `snapshot_timestamp` - When this snapshot was taken
|
||||
- `phase_history` - Array of WorkflowPhaseHistoryEntry objects
|
||||
|
||||
**Example UI usage:** Frontend polls `/api/agent-work-orders/{id}/git-progress` every 3 seconds to update progress bar.
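
A rough sketch of how the backend could compute this snapshot by shelling out to git. The function name, return shape, and exact git invocations are illustrative, not the project's actual code:

```python
import subprocess


def compute_git_progress(worktree_path: str, base_branch: str = "main") -> dict:
    """Compute a progress snapshot purely from git, without persisting any of it."""

    def git(*args: str) -> str:
        return subprocess.run(
            ["git", "-C", worktree_path, *args],
            capture_output=True, text=True, check=True,
        ).stdout.strip()

    # Commits on the work-order branch that are not on the base branch (newest first)
    commit_shas = git("log", f"{base_branch}..HEAD", "--format=%H").splitlines()

    # Aggregate file/line stats for the whole branch diff
    insertions = deletions = files_changed = 0
    for line in git("diff", "--numstat", f"{base_branch}...HEAD").splitlines():
        added, removed, _path = line.split("\t")
        if added != "-":  # binary files report "-" for line counts
            insertions += int(added)
            deletions += int(removed)
        files_changed += 1

    return {
        "git_commit_count": len(commit_shas),
        "git_files_changed": files_changed,
        "git_lines_added": insertions,
        "git_lines_removed": deletions,
        "latest_commit_sha": commit_shas[0] if commit_shas else None,
        "latest_commit_message": git("log", "-1", "--format=%s") if commit_shas else None,
    }
```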
|
||||
|
||||
---
|
||||
|
||||
## Sandbox Models
|
||||
|
||||
### SandboxConfiguration
|
||||
|
||||
**What it is:** Configuration for creating a sandbox instance.
|
||||
|
||||
**Purpose:** Passed to sandbox factory to create appropriate sandbox type.
|
||||
|
||||
**Fields:**
|
||||
|
||||
**Common (all sandbox types):**
|
||||
- `sandbox_type` - Type of sandbox (git_branch/git_worktree/e2b/dagger)
|
||||
- `sandbox_identifier` - Unique ID (usually work order ID)
|
||||
- `repository_url` - Repo to clone
|
||||
- `git_branch_name` - Branch to create/use
|
||||
- `environment_variables` - Env vars to set in sandbox (dict)
|
||||
|
||||
**E2B specific (Phase 2+):**
|
||||
- `e2b_template_id` - E2B template ID
|
||||
- `e2b_timeout_seconds` - Sandbox timeout
|
||||
|
||||
**Dagger specific (Phase 2+):**
|
||||
- `dagger_image_name` - Docker image name
|
||||
- `dagger_container_config` - Additional Dagger config (dict)
|
||||
|
||||
---
|
||||
|
||||
### SandboxState
|
||||
|
||||
**What it is:** Current state of an active sandbox.
|
||||
|
||||
**Purpose:** Query sandbox status, returned by `sandbox.get_current_state()`.
|
||||
|
||||
**Fields:**
|
||||
- `sandbox_identifier` - Sandbox ID
|
||||
- `sandbox_type` - Type of sandbox
|
||||
- `is_active` - Whether sandbox is currently active
|
||||
- `git_branch_name` - Current git branch
|
||||
- `working_directory` - Current working directory in sandbox
|
||||
- `sandbox_created_at` - When sandbox was created
|
||||
- `last_activity_at` - Last activity timestamp
|
||||
- `sandbox_metadata` - Sandbox-specific state (dict) - e.g., E2B sandbox ID, Docker container ID
|
||||
|
||||
---
|
||||
|
||||
### CommandExecutionResult
|
||||
|
||||
**What it is:** Result of executing a command in a sandbox.
|
||||
|
||||
**Purpose:** Returned by `sandbox.execute_command(command)`.
|
||||
|
||||
**Fields:**
|
||||
- `command` - Command that was executed
|
||||
- `exit_code` - Command exit code (0 = success)
|
||||
- `stdout_output` - Standard output
|
||||
- `stderr_output` - Standard error output
|
||||
- `execution_success` - Whether command succeeded (exit_code == 0)
|
||||
- `execution_duration_seconds` - How long command took
|
||||
- `execution_timestamp` - When command was executed
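
As an illustration only, a synchronous sketch of how a sandbox might run a command and fill these fields (the real implementation would be async and sandbox-specific):

```python
import subprocess
import time
from datetime import datetime, timezone


def execute_command(command: str, working_directory: str, timeout_seconds: int = 300) -> dict:
    """Run a shell command and package the outcome as a CommandExecutionResult-shaped dict."""
    started = time.monotonic()
    completed = subprocess.run(
        command,
        shell=True,
        cwd=working_directory,
        capture_output=True,
        text=True,
        timeout=timeout_seconds,  # raises TimeoutExpired if exceeded
    )
    return {
        "command": command,
        "exit_code": completed.returncode,
        "stdout_output": completed.stdout,
        "stderr_output": completed.stderr,
        "execution_success": completed.returncode == 0,
        "execution_duration_seconds": time.monotonic() - started,
        "execution_timestamp": datetime.now(timezone.utc).isoformat(),
    }
```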
|
||||
|
||||
---
|
||||
|
||||
## GitHub Integration
|
||||
|
||||
### GitHubRepository
|
||||
|
||||
**What it is:** GitHub repository information and access status.
|
||||
|
||||
**Purpose:** Store repository metadata after verification.
|
||||
|
||||
**Fields:**
|
||||
- `repository_owner` - Owner username (e.g., "user")
|
||||
- `repository_name` - Repo name (e.g., "repo")
|
||||
- `repository_url` - Full URL (e.g., "https://github.com/user/repo.git")
|
||||
- `repository_clone_url` - Git clone URL
|
||||
- `default_branch` - Default branch name (usually "main")
|
||||
- `is_accessible` - Whether we verified access
|
||||
- `is_private` - Whether repo is private
|
||||
- `access_verified_at` - When access was last verified
|
||||
- `repository_description` - Repo description
|
||||
|
||||
---
|
||||
|
||||
### GitHubRepositoryVerificationRequest
|
||||
|
||||
**What it is:** Request to verify repository access.
|
||||
|
||||
**Purpose:** Frontend asks backend to verify it can access a repo.
|
||||
|
||||
**Fields:**
|
||||
- `repository_url` - Repo URL to verify
|
||||
|
||||
---
|
||||
|
||||
### GitHubRepositoryVerificationResponse
|
||||
|
||||
**What it is:** Response from repository verification.
|
||||
|
||||
**Purpose:** Tell frontend whether repo is accessible.
|
||||
|
||||
**Fields:**
|
||||
- `repository` - GitHubRepository object with details
|
||||
- `verification_success` - Whether verification succeeded
|
||||
- `error_message` - Error message if failed
|
||||
- `error_details` - Detailed error info (dict)
|
||||
|
||||
---
|
||||
|
||||
### GitHubPullRequest
|
||||
|
||||
**What it is:** Pull request model.
|
||||
|
||||
**Purpose:** Represent a created PR.
|
||||
|
||||
**Fields:**
|
||||
- `pull_request_number` - PR number
|
||||
- `pull_request_title` - PR title
|
||||
- `pull_request_body` - PR description
|
||||
- `pull_request_url` - PR URL
|
||||
- `pull_request_state` - State (open/closed/merged)
|
||||
- `pull_request_head_branch` - Source branch
|
||||
- `pull_request_base_branch` - Target branch
|
||||
- `pull_request_author` - GitHub user who created PR
|
||||
- `pull_request_created_at` - When created
|
||||
- `pull_request_updated_at` - When last updated
|
||||
- `pull_request_merged_at` - When merged (if applicable)
|
||||
- `pull_request_is_draft` - Whether it's a draft PR
|
||||
|
||||
---
|
||||
|
||||
### CreateGitHubPullRequestRequest
|
||||
|
||||
**What it is:** Request to create a pull request.
|
||||
|
||||
**Purpose:** Backend creates PR after work order completes.
|
||||
|
||||
**Fields:**
|
||||
- `repository_owner` - Repo owner
|
||||
- `repository_name` - Repo name
|
||||
- `pull_request_title` - PR title
|
||||
- `pull_request_body` - PR description
|
||||
- `pull_request_head_branch` - Source branch (work order branch)
|
||||
- `pull_request_base_branch` - Target branch (usually "main")
|
||||
- `pull_request_is_draft` - Create as draft (default: false)
|
||||
|
||||
---
|
||||
|
||||
### GitHubIssue
|
||||
|
||||
**What it is:** GitHub issue model.
|
||||
|
||||
**Purpose:** Link work orders to GitHub issues.
|
||||
|
||||
**Fields:**
|
||||
- `issue_number` - Issue number
|
||||
- `issue_title` - Issue title
|
||||
- `issue_body` - Issue description
|
||||
- `issue_state` - State (open/closed)
|
||||
- `issue_author` - User who created issue
|
||||
- `issue_assignees` - Assigned users (array)
|
||||
- `issue_labels` - Labels (array)
|
||||
- `issue_created_at` - When created
|
||||
- `issue_updated_at` - When last updated
|
||||
- `issue_closed_at` - When closed
|
||||
- `issue_url` - Issue URL
|
||||
|
||||
---
|
||||
|
||||
## Agent Execution
|
||||
|
||||
### AgentCommandDefinition
|
||||
|
||||
**What it is:** Represents a Claude Code slash command loaded from `.claude/commands/*.md`.
|
||||
|
||||
**Purpose:** Catalog available commands for workflows.
|
||||
|
||||
**Fields:**
|
||||
- `command_name` - Command name (e.g., "/agent_workflow_plan")
|
||||
- `command_file_path` - Path to .md file
|
||||
- `command_description` - Description from file
|
||||
- `command_arguments` - Expected arguments (array)
|
||||
- `command_content` - Full file content
|
||||
|
||||
**How loaded:** Scan `.claude/commands/` directory at startup, parse markdown files.
|
||||
|
||||
---
|
||||
|
||||
### AgentCommandBuildRequest
|
||||
|
||||
**What it is:** Request to build a Claude Code CLI command string.
|
||||
|
||||
**Purpose:** Convert high-level command to actual CLI string.
|
||||
|
||||
**Fields:**
|
||||
- `command_name` - Command to execute (e.g., "/plan")
|
||||
- `command_arguments` - Arguments (array)
|
||||
- `agent_model_type` - Claude model (sonnet/opus/haiku)
|
||||
- `output_format` - CLI output format (text/json/stream-json)
|
||||
- `dangerously_skip_permissions` - Skip permission prompts (required for automation)
|
||||
- `working_directory` - Directory to run in
|
||||
- `timeout_seconds` - Command timeout (default 300, max 3600)
|
||||
|
||||
---
|
||||
|
||||
### AgentCommandBuildResult
|
||||
|
||||
**What it is:** Built CLI command ready to execute.
|
||||
|
||||
**Purpose:** Actual command string to run via subprocess.
|
||||
|
||||
**Fields:**
|
||||
- `cli_command_string` - Complete CLI command (e.g., `"claude -p '/plan Issue #42' --model sonnet --output-format stream-json"`)
|
||||
- `working_directory` - Directory to run in
|
||||
- `timeout_seconds` - Timeout value
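
For illustration, a small sketch of assembling the CLI string from a build request. The `--model` and `--output-format` flags follow the example string above; the `--dangerously-skip-permissions` flag and the builder itself are assumptions, not the project's actual code:

```python
import shlex


def build_cli_command(
    command_name: str,
    command_arguments: list[str],
    agent_model_type: str = "sonnet",
    output_format: str = "stream-json",
    dangerously_skip_permissions: bool = True,
) -> str:
    """Assemble a Claude Code CLI invocation like the example above."""
    prompt = " ".join([command_name, *command_arguments])
    parts = [
        "claude", "-p", shlex.quote(prompt),
        "--model", agent_model_type,
        "--output-format", output_format,
    ]
    if dangerously_skip_permissions:
        parts.append("--dangerously-skip-permissions")
    return " ".join(parts)


# build_cli_command("/plan", ["Issue #42"]) ->
#   claude -p '/plan Issue #42' --model sonnet --output-format stream-json --dangerously-skip-permissions
```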
|
||||
|
||||
---
|
||||
|
||||
### AgentCommandExecutionRequest
|
||||
|
||||
**What it is:** High-level request to execute an agent command.
|
||||
|
||||
**Purpose:** Frontend or orchestrator requests command execution.
|
||||
|
||||
**Fields:**
|
||||
- `agent_work_order_id` - Work order this is for
|
||||
- `command_name` - Command to execute
|
||||
- `command_arguments` - Arguments (array)
|
||||
- `agent_model_type` - Model to use
|
||||
- `working_directory` - Execution directory
|
||||
|
||||
---
|
||||
|
||||
### AgentCommandExecutionResult
|
||||
|
||||
**What it is:** Result of executing a Claude Code command.
|
||||
|
||||
**Purpose:** Capture stdout/stderr, parse session ID, track timing.
|
||||
|
||||
**Fields:**
|
||||
|
||||
**Execution metadata:**
|
||||
- `command_name` - Command executed
|
||||
- `command_arguments` - Arguments used
|
||||
- `execution_success` - Whether succeeded
|
||||
- `exit_code` - Exit code
|
||||
- `execution_duration_seconds` - How long it took
|
||||
- `execution_started_at` - Start time
|
||||
- `execution_completed_at` - End time
|
||||
- `agent_work_order_id` - Work order ID
|
||||
|
||||
**Output:**
|
||||
- `stdout_output` - Standard output (may be JSONL)
|
||||
- `stderr_output` - Standard error
|
||||
- `agent_session_id` - Claude session ID (parsed from output)
|
||||
|
||||
**Parsed results (from JSONL output):**
|
||||
- `parsed_result_text` - Result text extracted from JSONL
|
||||
- `parsed_result_is_error` - Whether result indicates error
|
||||
- `parsed_result_total_cost_usd` - Total cost
|
||||
- `parsed_result_duration_ms` - Duration from result message
|
||||
|
||||
**Example JSONL parsing:** Last line of stdout contains result message with session_id, cost, duration.
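
A sketch of that parsing step, assuming the final JSONL line is a JSON object. The key names used here (`session_id`, `result`, `is_error`, `total_cost_usd`, `duration_ms`) are assumptions based on the field names above, not a confirmed schema:

```python
import json


def parse_final_result(stdout_output: str) -> dict:
    """Parse the final result message from stream-json output (last non-empty line)."""
    lines = [line for line in stdout_output.splitlines() if line.strip()]
    if not lines:
        return {}
    result = json.loads(lines[-1])
    return {
        "agent_session_id": result.get("session_id"),
        "parsed_result_text": result.get("result"),
        "parsed_result_is_error": bool(result.get("is_error", False)),
        "parsed_result_total_cost_usd": result.get("total_cost_usd"),
        "parsed_result_duration_ms": result.get("duration_ms"),
    }
```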
|
||||
|
||||
---
|
||||
|
||||
### SendAgentPromptRequest
|
||||
|
||||
**What it is:** Request to send interactive prompt to running agent.
|
||||
|
||||
**Purpose:** Allow user to interact with agent mid-execution.
|
||||
|
||||
**Fields:**
|
||||
- `agent_work_order_id` - Active work order
|
||||
- `prompt_text` - Prompt to send (e.g., "Now implement the auth module")
|
||||
- `continue_session` - Continue existing session vs start new (default: true)
|
||||
|
||||
---
|
||||
|
||||
### SendAgentPromptResponse
|
||||
|
||||
**What it is:** Response after sending prompt.
|
||||
|
||||
**Purpose:** Confirm prompt was accepted.
|
||||
|
||||
**Fields:**
|
||||
- `agent_work_order_id` - Work order ID
|
||||
- `prompt_accepted` - Whether prompt was accepted and queued
|
||||
- `execution_started` - Whether execution has started
|
||||
- `message` - Status message
|
||||
- `error_message` - Error if rejected
|
||||
|
||||
---
|
||||
|
||||
## Logging & Observability
|
||||
|
||||
### AgentExecutionLogEntry
|
||||
|
||||
**What it is:** Single structured log entry from work order execution.
|
||||
|
||||
**Purpose:** Observability - track everything that happens during execution.
|
||||
|
||||
**Fields:**
|
||||
- `log_entry_id` - Unique log ID
|
||||
- `agent_work_order_id` - Work order this belongs to
|
||||
- `log_timestamp` - When log was created
|
||||
- `log_level` - Level (debug/info/warning/error/critical)
|
||||
- `event_name` - Structured event name (e.g., "agent_command_started", "git_commit_created")
|
||||
- `log_message` - Human-readable message
|
||||
- `log_context` - Additional context data (dict)
|
||||
|
||||
**Storage:**
|
||||
- Phase 1: Console output (pretty-print in dev)
|
||||
- Phase 2+: JSONL files + Supabase table
|
||||
|
||||
**Example log events:**
|
||||
```
|
||||
event_name: "agent_work_order_created"
|
||||
event_name: "git_branch_created"
|
||||
event_name: "agent_command_started"
|
||||
event_name: "agent_command_completed"
|
||||
event_name: "workflow_phase_started"
|
||||
event_name: "workflow_phase_completed"
|
||||
event_name: "git_commit_created"
|
||||
event_name: "github_pull_request_created"
|
||||
```
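
Emitting one of these events with bound context might look like the following sketch (structlog itself is an assumption, based on the structured logging style shown elsewhere in this document):

```python
import structlog

logger = structlog.get_logger()


def log_commit_created(agent_work_order_id: str, commit_sha: str, commit_message: str) -> None:
    """Emit a structured 'git_commit_created' event with the work order ID bound as context."""
    bound = logger.bind(agent_work_order_id=agent_work_order_id)
    bound.info(
        "git_commit_created",
        commit_sha=commit_sha,
        commit_message=commit_message,
    )
```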
|
||||
|
||||
---
|
||||
|
||||
### ListAgentExecutionLogsRequest
|
||||
|
||||
**What it is:** Request to fetch execution logs.
|
||||
|
||||
**Purpose:** UI can display logs for debugging.
|
||||
|
||||
**Fields:**
|
||||
- `agent_work_order_id` - Work order to get logs for
|
||||
- `log_level_filter` - Filter by levels (array)
|
||||
- `event_name_filter` - Filter by event names (array)
|
||||
- `limit` - Results per page (default 100, max 1000)
|
||||
- `offset` - Pagination offset
|
||||
|
||||
---
|
||||
|
||||
### ListAgentExecutionLogsResponse
|
||||
|
||||
**What it is:** Response containing log entries.
|
||||
|
||||
**Fields:**
|
||||
- `agent_work_order_id` - Work order ID
|
||||
- `log_entries` - Array of AgentExecutionLogEntry objects
|
||||
- `total_count` - Total log entries
|
||||
- `has_more` - Whether more available
|
||||
|
||||
---
|
||||
|
||||
## Enums (Type Definitions)
|
||||
|
||||
### AgentWorkOrderStatus
|
||||
|
||||
**What it is:** Possible work order statuses.
|
||||
|
||||
**Values:**
|
||||
- `pending` - Created, waiting to start
|
||||
- `initializing` - Setting up sandbox
|
||||
- `running` - Currently executing
|
||||
- `completed` - Finished successfully
|
||||
- `failed` - Execution failed
|
||||
- `cancelled` - User cancelled (Phase 2+)
|
||||
- `paused` - Paused by user (Phase 3+)
|
||||
|
||||
---
|
||||
|
||||
### AgentWorkflowType
|
||||
|
||||
**What it is:** Supported workflow types.
|
||||
|
||||
**Values:**
|
||||
- `agent_workflow_plan` - Planning only
|
||||
- `agent_workflow_implement` - Implementation only
|
||||
- `agent_workflow_validate` - Validation/testing only
|
||||
- `agent_workflow_plan_implement` - Plan + Implement
|
||||
- `agent_workflow_plan_implement_validate` - Full workflow
|
||||
- `agent_workflow_custom` - User-defined (Phase 3+)
|
||||
|
||||
---
|
||||
|
||||
### AgentWorkflowPhase
|
||||
|
||||
**What it is:** Workflow execution phases (computed from git, not stored).
|
||||
|
||||
**Values:**
|
||||
- `initializing` - Setting up environment
|
||||
- `planning` - Creating implementation plan
|
||||
- `implementing` - Writing code
|
||||
- `validating` - Running tests/validation
|
||||
- `completed` - All phases done
|
||||
|
||||
**How detected:** By analyzing commit messages in git log.
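
A sketch of what that detection could look like. The commit-message prefixes used here are assumptions for illustration only, since this document does not define the exact convention:

```python
def detect_current_phase(commit_messages: list[str]) -> str:
    """Infer the workflow phase from commit messages (newest first).

    The prefix-to-phase mapping is illustrative; the real mapping would come
    from the workflow's actual commit message format.
    """
    if not commit_messages:
        return "initializing"
    latest = commit_messages[0].lower()
    if latest.startswith("test:") or "validate" in latest:
        return "validating"
    if latest.startswith(("feat:", "fix:", "chore:")):
        return "implementing"
    if latest.startswith("plan:"):
        return "planning"
    return "implementing"
```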
|
||||
|
||||
---
|
||||
|
||||
### SandboxType
|
||||
|
||||
**What it is:** Available sandbox environments.
|
||||
|
||||
**Values:**
|
||||
- `git_branch` - Isolated git branch (Phase 1)
|
||||
- `git_worktree` - Git worktree (Phase 1) - better for parallel work orders
|
||||
- `e2b` - E2B cloud sandbox (Phase 2+) - primary cloud target
|
||||
- `dagger` - Dagger container (Phase 2+) - primary container target
|
||||
- `local_docker` - Local Docker (Phase 3+)
|
||||
|
||||
---
|
||||
|
||||
### AgentModelType
|
||||
|
||||
**What it is:** Claude model options.
|
||||
|
||||
**Values:**
|
||||
- `sonnet` - Claude 3.5 Sonnet (balanced, default)
|
||||
- `opus` - Claude 3 Opus (highest quality)
|
||||
- `haiku` - Claude 3.5 Haiku (fastest)
|
||||
|
||||
---
|
||||
|
||||
## Summary: What Gets Stored vs Computed
|
||||
|
||||
### Stored in Database (Minimal State)
|
||||
|
||||
**5 core fields:**
|
||||
1. `agent_work_order_id` - Unique ID
|
||||
2. `repository_url` - Repo URL
|
||||
3. `sandbox_identifier` - Execution environment ID (worktree path, E2B sandbox ID, etc.)
|
||||
4. `git_branch_name` - Branch name
|
||||
5. `agent_session_id` - Claude session
|
||||
|
||||
**Metadata (for queries/filters):**
|
||||
- `workflow_type`, `sandbox_type`, `agent_model_type`
|
||||
- `status`, `github_issue_number`
|
||||
- `created_at`, `updated_at`, `execution_started_at`, `execution_completed_at`
|
||||
- `error_message`, `error_details`
|
||||
- `created_by_user_id` (Phase 2+)
|
||||
|
||||
### Computed from Git/Sandbox APIs (NOT in database)
|
||||
|
||||
**Everything else:**
|
||||
- `current_phase` → Analyze git commits
|
||||
- `git_commit_count` → `git log --oneline | wc -l`
|
||||
- `git_files_changed` → `git diff --stat`
|
||||
- `git_lines_added/removed` → `git diff --stat`
|
||||
- `latest_commit_sha/message` → `git log -1`
|
||||
- `phase_history` → Analyze commit timestamps and messages
|
||||
- `github_pull_request_url` → Query GitHub API
|
||||
- `sandbox_state` (is_active, etc.) → Query sandbox API or check filesystem
|
||||
- Test results → Read committed test_results.json file
|
||||
|
||||
**This is the key insight:** Git is our database for work progress, and sandbox APIs tell us execution state. We only store the identifiers needed to find the right sandbox and git branch.
|
||||
|
||||
---
|
||||
|
||||
**End of Data Models Document**
|
||||
@@ -1,643 +0,0 @@
|
||||
# Feature: Add User Request Field to Agent Work Orders
|
||||
|
||||
## Feature Description
|
||||
|
||||
Add a required `user_request` field to the Agent Work Orders API to enable users to provide custom prompts describing the work they want done. This field will be the primary input to the classification and planning workflow, replacing the current dependency on GitHub issue numbers. The system will intelligently parse the user request to extract GitHub issue references if present, or use the request content directly for classification and planning.
|
||||
|
||||
## User Story
|
||||
|
||||
As a developer using the Agent Work Orders system
|
||||
I want to provide a natural language description of the work I need done
|
||||
So that the AI agents can understand my requirements and create an appropriate implementation plan without requiring a GitHub issue
|
||||
|
||||
## Problem Statement
|
||||
|
||||
Currently, the `CreateAgentWorkOrderRequest` API only accepts a `github_issue_number` parameter, with no way to provide a custom user request. This causes several critical issues:
|
||||
|
||||
1. **Empty Context**: When a work order is created, the `issue_json` passed to the classifier is empty (`{}`), causing agents to lack context
|
||||
2. **GitHub Dependency**: Users must create a GitHub issue before using the system, adding unnecessary friction
|
||||
3. **Limited Flexibility**: Users cannot describe ad-hoc tasks or provide additional context beyond what's in a GitHub issue
|
||||
4. **Broken Classification**: The classifier receives empty input and makes arbitrary classifications without understanding the actual work needed
|
||||
5. **Failed Planning**: Planners cannot create meaningful plans without understanding what the user wants
|
||||
|
||||
**Current Flow (Broken):**
|
||||
```
|
||||
API Request → {github_issue_number: "1"}
|
||||
↓
|
||||
Workflow: github_issue_json = None → defaults to "{}"
|
||||
↓
|
||||
Classifier receives: "{}" (empty)
|
||||
↓
|
||||
Planner receives: "/feature" but no context about what feature to build
|
||||
```
|
||||
|
||||
## Solution Statement
|
||||
|
||||
Add a required `user_request` field to `CreateAgentWorkOrderRequest` that accepts natural language descriptions of the work to be done. The workflow will:
|
||||
|
||||
1. **Accept User Requests**: Users provide a clear description like "Add login authentication with JWT tokens" or "Fix the bug where users can't save their profile" or "Implement GitHub issue #42"
|
||||
2. **Classify Based on Content**: The classifier receives the full user request and classifies it as feature/bug/chore based on the actual content
|
||||
3. **Optionally Fetch GitHub Issues**: If the user mentions a GitHub issue (e.g., "implement issue #42"), the system fetches the issue details and merges them with the user request
|
||||
4. **Provide Full Context**: All workflow steps receive the complete user request and any fetched issue data, enabling meaningful planning and implementation
|
||||
|
||||
**Intended Flow (Fixed):**
|
||||
```
|
||||
API Request → {user_request: "Add login feature with JWT authentication"}
|
||||
↓
|
||||
Classifier receives: "Add login feature with JWT authentication"
|
||||
↓
|
||||
Classifier returns: "/feature" (based on actual content)
|
||||
↓
|
||||
IF user request mentions "issue #N" or "GitHub issue N":
|
||||
→ Fetch issue details from GitHub
|
||||
→ Merge with user request
|
||||
ELSE:
|
||||
→ Use user request as-is
|
||||
↓
|
||||
Planner receives: Full context about what to build
|
||||
↓
|
||||
Planner creates: Detailed implementation plan based on user request
|
||||
```
|
||||
|
||||
## Relevant Files
|
||||
|
||||
Use these files to implement the feature:
|
||||
|
||||
**Core Models** - Add user_request field
|
||||
- `python/src/agent_work_orders/models.py`:100-107 - `CreateAgentWorkOrderRequest` needs `user_request: str` field added
|
||||
|
||||
**API Routes** - Pass user request to workflow
|
||||
- `python/src/agent_work_orders/api/routes.py`:54-124 - `create_agent_work_order()` needs to pass `user_request` to orchestrator
|
||||
|
||||
**Workflow Orchestrator** - Accept and process user request
|
||||
- `python/src/agent_work_orders/workflow_engine/workflow_orchestrator.py`:48-56 - `execute_workflow()` signature needs `user_request` parameter
|
||||
- `python/src/agent_work_orders/workflow_engine/workflow_orchestrator.py`:96-103 - Classification step needs to receive `user_request` instead of empty JSON
|
||||
|
||||
**GitHub Client** - Add method to fetch issue details
|
||||
- `python/src/agent_work_orders/github_integration/github_client.py` - Add `get_issue()` method to fetch issue by number
|
||||
|
||||
**Workflow Operations** - Update classification to use user request
|
||||
- `python/src/agent_work_orders/workflow_engine/workflow_operations.py`:26-79 - `classify_issue()` may need parameter name updates for clarity
|
||||
|
||||
**Tests** - Update and add test coverage
|
||||
- `python/tests/agent_work_orders/test_api.py` - Update all API tests to include `user_request` field
|
||||
- `python/tests/agent_work_orders/test_models.py` - Add tests for `user_request` field validation
|
||||
- `python/tests/agent_work_orders/test_github_integration.py` - Add tests for `get_issue()` method
|
||||
- `python/tests/agent_work_orders/test_workflow_operations.py` - Update mocks to use `user_request` content
|
||||
|
||||
### New Files
|
||||
|
||||
No new files needed - all changes are modifications to existing files.
|
||||
|
||||
## Implementation Plan
|
||||
|
||||
### Phase 1: Foundation - Model and API Updates
|
||||
|
||||
Add the `user_request` field to the request model and update the API to accept it. This is backward-compatible if we keep `github_issue_number` optional.
|
||||
|
||||
### Phase 2: Core Implementation - Workflow Integration
|
||||
|
||||
Update the workflow orchestrator to receive and use the user request for classification and planning. Add logic to detect and fetch GitHub issues if mentioned.
|
||||
|
||||
### Phase 3: Integration - GitHub Issue Fetching
|
||||
|
||||
Add capability to fetch GitHub issue details when referenced in the user request, and merge that context with the user's description.
|
||||
|
||||
## Step by Step Tasks
|
||||
|
||||
IMPORTANT: Execute every step in order, top to bottom.
|
||||
|
||||
### Add user_request Field to CreateAgentWorkOrderRequest Model
|
||||
|
||||
- Open `python/src/agent_work_orders/models.py`
|
||||
- Locate the `CreateAgentWorkOrderRequest` class (line 100)
|
||||
- Add new required field after `workflow_type`:
|
||||
```python
|
||||
user_request: str = Field(..., description="User's description of the work to be done")
|
||||
```
|
||||
- Update the docstring to explain that `user_request` is the primary input
|
||||
- Make `github_issue_number` truly optional (it already is, but update docs to clarify it's only needed for reference)
|
||||
- Save the file
|
||||
|
||||
### Add get_issue() Method to GitHubClient
|
||||
|
||||
- Open `python/src/agent_work_orders/github_integration/github_client.py`
|
||||
- Add new method after `get_repository_info()`:
|
||||
```python
|
||||
async def get_issue(self, repository_url: str, issue_number: str) -> dict:
|
||||
"""Get GitHub issue details
|
||||
|
||||
Args:
|
||||
repository_url: GitHub repository URL
|
||||
issue_number: Issue number
|
||||
|
||||
Returns:
|
||||
Issue details as JSON dict
|
||||
|
||||
Raises:
|
||||
GitHubOperationError: If unable to fetch issue
|
||||
"""
|
||||
self._logger.info("github_issue_fetch_started", repository_url=repository_url, issue_number=issue_number)
|
||||
|
||||
try:
|
||||
owner, repo = self._parse_repository_url(repository_url)
|
||||
repo_path = f"{owner}/{repo}"
|
||||
|
||||
process = await asyncio.create_subprocess_exec(
|
||||
self.gh_cli_path,
|
||||
"issue",
|
||||
"view",
|
||||
issue_number,
|
||||
"--repo",
|
||||
repo_path,
|
||||
"--json",
|
||||
"number,title,body,state,url",
|
||||
stdout=asyncio.subprocess.PIPE,
|
||||
stderr=asyncio.subprocess.PIPE,
|
||||
)
|
||||
|
||||
stdout, stderr = await asyncio.wait_for(process.communicate(), timeout=30)
|
||||
|
||||
if process.returncode != 0:
|
||||
error = stderr.decode() if stderr else "Unknown error"
|
||||
raise GitHubOperationError(f"Failed to fetch issue: {error}")
|
||||
|
||||
issue_data = json.loads(stdout.decode())
|
||||
self._logger.info("github_issue_fetched", issue_number=issue_number)
|
||||
return issue_data
|
||||
|
||||
except Exception as e:
|
||||
self._logger.error("github_issue_fetch_failed", error=str(e), exc_info=True)
|
||||
raise GitHubOperationError(f"Failed to fetch GitHub issue: {e}") from e
|
||||
```
|
||||
- Save the file
|
||||
|
||||
### Update execute_workflow() Signature
|
||||
|
||||
- Open `python/src/agent_work_orders/workflow_engine/workflow_orchestrator.py`
|
||||
- Locate the `execute_workflow()` method (line 48)
|
||||
- Add `user_request` parameter after `sandbox_type`:
|
||||
```python
|
||||
async def execute_workflow(
|
||||
self,
|
||||
agent_work_order_id: str,
|
||||
workflow_type: AgentWorkflowType,
|
||||
repository_url: str,
|
||||
sandbox_type: SandboxType,
|
||||
user_request: str, # NEW: Add this parameter
|
||||
github_issue_number: str | None = None,
|
||||
github_issue_json: str | None = None,
|
||||
) -> None:
|
||||
```
|
||||
- Update the docstring to include `user_request` parameter documentation
|
||||
- Save the file
|
||||
|
||||
### Add Logic to Parse GitHub Issue References from User Request
|
||||
|
||||
- Still in `python/src/agent_work_orders/workflow_engine/workflow_orchestrator.py`
|
||||
- After line 87 (after updating status to RUNNING), add logic to detect GitHub issues:
|
||||
```python
|
||||
# Parse GitHub issue from user request if mentioned
|
||||
import re
|
||||
issue_match = re.search(r'(?:issue|#)\s*#?(\d+)', user_request, re.IGNORECASE)
|
||||
if issue_match and not github_issue_number:
|
||||
github_issue_number = issue_match.group(1)
|
||||
bound_logger.info("github_issue_detected_in_request", issue_number=github_issue_number)
|
||||
|
||||
# Fetch GitHub issue if number provided
|
||||
if github_issue_number and not github_issue_json:
|
||||
try:
|
||||
issue_data = await self.github_client.get_issue(repository_url, github_issue_number)
|
||||
github_issue_json = json.dumps(issue_data)
|
||||
bound_logger.info("github_issue_fetched", issue_number=github_issue_number)
|
||||
except Exception as e:
|
||||
bound_logger.warning("github_issue_fetch_failed", error=str(e))
|
||||
# Continue without issue data - use user_request only
|
||||
|
||||
# Prepare classification input: merge user request with issue data if available
|
||||
classification_input = user_request
|
||||
if github_issue_json:
|
||||
issue_data = json.loads(github_issue_json)
|
||||
classification_input = f"User Request: {user_request}\n\nGitHub Issue Details:\nTitle: {issue_data.get('title', '')}\nBody: {issue_data.get('body', '')}"
|
||||
```
|
||||
- Add `import json` at the top of the file if not already present
|
||||
- Update the classify_issue call (line 97-103) to use `classification_input`:
|
||||
```python
|
||||
classify_result = await workflow_operations.classify_issue(
|
||||
self.agent_executor,
|
||||
self.command_loader,
|
||||
classification_input, # Use classification_input instead of github_issue_json or "{}"
|
||||
agent_work_order_id,
|
||||
sandbox.working_dir,
|
||||
)
|
||||
```
|
||||
- Save the file
|
||||
|
||||
### Update API Route to Pass user_request
|
||||
|
||||
- Open `python/src/agent_work_orders/api/routes.py`
|
||||
- Locate `create_agent_work_order()` function (line 54)
|
||||
- Update the `orchestrator.execute_workflow()` call (line 101-109) to include `user_request`:
|
||||
```python
|
||||
asyncio.create_task(
|
||||
orchestrator.execute_workflow(
|
||||
agent_work_order_id=agent_work_order_id,
|
||||
workflow_type=request.workflow_type,
|
||||
repository_url=request.repository_url,
|
||||
sandbox_type=request.sandbox_type,
|
||||
user_request=request.user_request, # NEW: Add this line
|
||||
github_issue_number=request.github_issue_number,
|
||||
)
|
||||
)
|
||||
```
|
||||
- Save the file
|
||||
|
||||
### Update Model Tests for user_request Field
|
||||
|
||||
- Open `python/tests/agent_work_orders/test_models.py`
|
||||
- Find or add test for `CreateAgentWorkOrderRequest`:
|
||||
```python
|
||||
def test_create_agent_work_order_request_with_user_request():
|
||||
"""Test CreateAgentWorkOrderRequest with user_request field"""
|
||||
request = CreateAgentWorkOrderRequest(
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
workflow_type=AgentWorkflowType.PLAN,
|
||||
user_request="Add user authentication with JWT tokens",
|
||||
)
|
||||
|
||||
assert request.user_request == "Add user authentication with JWT tokens"
|
||||
assert request.repository_url == "https://github.com/owner/repo"
|
||||
assert request.github_issue_number is None
|
||||
|
||||
def test_create_agent_work_order_request_with_github_issue():
|
||||
"""Test CreateAgentWorkOrderRequest with both user_request and issue number"""
|
||||
request = CreateAgentWorkOrderRequest(
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
workflow_type=AgentWorkflowType.PLAN,
|
||||
user_request="Implement the feature described in issue #42",
|
||||
github_issue_number="42",
|
||||
)
|
||||
|
||||
assert request.user_request == "Implement the feature described in issue #42"
|
||||
assert request.github_issue_number == "42"
|
||||
```
|
||||
- Save the file
|
||||
|
||||
### Add GitHub Client Tests for get_issue()
|
||||
|
||||
- Open `python/tests/agent_work_orders/test_github_integration.py`
|
||||
- Add new test function:
|
||||
```python
|
||||
@pytest.mark.asyncio
|
||||
async def test_get_issue_success():
|
||||
"""Test successful GitHub issue fetch"""
|
||||
client = GitHubClient()
|
||||
|
||||
# Mock subprocess
|
||||
mock_process = MagicMock()
|
||||
mock_process.returncode = 0
|
||||
issue_json = json.dumps({
|
||||
"number": 42,
|
||||
"title": "Add login feature",
|
||||
"body": "Users need to log in with email and password",
|
||||
"state": "open",
|
||||
"url": "https://github.com/owner/repo/issues/42"
|
||||
})
|
||||
mock_process.communicate = AsyncMock(return_value=(issue_json.encode(), b""))
|
||||
|
||||
with patch("asyncio.create_subprocess_exec", return_value=mock_process):
|
||||
issue_data = await client.get_issue("https://github.com/owner/repo", "42")
|
||||
|
||||
assert issue_data["number"] == 42
|
||||
assert issue_data["title"] == "Add login feature"
|
||||
assert issue_data["state"] == "open"
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_get_issue_failure():
|
||||
"""Test failed GitHub issue fetch"""
|
||||
client = GitHubClient()
|
||||
|
||||
# Mock subprocess
|
||||
mock_process = MagicMock()
|
||||
mock_process.returncode = 1
|
||||
mock_process.communicate = AsyncMock(return_value=(b"", b"Issue not found"))
|
||||
|
||||
with patch("asyncio.create_subprocess_exec", return_value=mock_process):
|
||||
with pytest.raises(GitHubOperationError, match="Failed to fetch issue"):
|
||||
await client.get_issue("https://github.com/owner/repo", "999")
|
||||
```
|
||||
- Add the necessary imports at the top (`json`, `AsyncMock`) if not already present
|
||||
- Save the file
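The tests above assume a `get_issue()` method on `GitHubClient` that shells out to the `gh` CLI. A minimal sketch of what such a helper might look like, assuming the client uses `asyncio` subprocesses and a `GitHubOperationError` exception as the tests imply (illustrative only, not the actual implementation; in the real client this would be a method):

```python
import asyncio
import json


async def get_issue(repository_url: str, issue_number: str) -> dict:
    """Fetch a GitHub issue as a dict via the gh CLI (sketch only)."""
    # Depending on the gh version, repository_url may need converting to OWNER/REPO form
    process = await asyncio.create_subprocess_exec(
        "gh", "issue", "view", issue_number,
        "--repo", repository_url,
        "--json", "number,title,body,state,url",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await process.communicate()
    if process.returncode != 0:
        # GitHubOperationError is assumed to already exist in the client module
        raise GitHubOperationError(f"Failed to fetch issue {issue_number}: {stderr.decode()}")
    return json.loads(stdout.decode())
```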
|
||||
|
||||
### Update API Tests to Include user_request
|
||||
|
||||
- Open `python/tests/agent_work_orders/test_api.py`
|
||||
- Find all tests that create work orders and add `user_request` field
|
||||
- Update `test_create_agent_work_order()`:
|
||||
```python
|
||||
response = client.post(
|
||||
"/agent-work-orders",
|
||||
json={
|
||||
"repository_url": "https://github.com/owner/repo",
|
||||
"sandbox_type": "git_branch",
|
||||
"workflow_type": "agent_workflow_plan",
|
||||
"user_request": "Add user authentication feature", # ADD THIS
|
||||
"github_issue_number": "42",
|
||||
},
|
||||
)
|
||||
```
|
||||
- Update `test_create_agent_work_order_without_issue()`:
|
||||
```python
|
||||
response = client.post(
|
||||
"/agent-work-orders",
|
||||
json={
|
||||
"repository_url": "https://github.com/owner/repo",
|
||||
"sandbox_type": "git_branch",
|
||||
"workflow_type": "agent_workflow_plan",
|
||||
"user_request": "Fix the login bug where users can't sign in", # ADD THIS
|
||||
},
|
||||
)
|
||||
```
|
||||
- Update any other test cases that create work orders
|
||||
- Save the file
|
||||
|
||||
### Update Workflow Operations Tests
|
||||
|
||||
- Open `python/tests/agent_work_orders/test_workflow_operations.py`
|
||||
- Update `test_classify_issue_success()` to use meaningful user request:
|
||||
```python
|
||||
result = await workflow_operations.classify_issue(
|
||||
mock_executor,
|
||||
mock_loader,
|
||||
"Add user authentication with JWT tokens and refresh token support", # Meaningful request
|
||||
"wo-test",
|
||||
"/tmp/working",
|
||||
)
|
||||
```
|
||||
- Update other test cases to use meaningful user requests instead of empty JSON
|
||||
- Save the file
|
||||
|
||||
### Run Model Unit Tests
|
||||
|
||||
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_models.py -v`
|
||||
- Verify new `user_request` tests pass
|
||||
- Fix any failures
|
||||
|
||||
### Run GitHub Client Tests
|
||||
|
||||
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_github_integration.py -v`
|
||||
- Verify `get_issue()` tests pass
|
||||
- Fix any failures
|
||||
|
||||
### Run API Tests
|
||||
|
||||
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_api.py -v`
|
||||
- Verify all API tests pass with `user_request` field
|
||||
- Fix any failures
|
||||
|
||||
### Run All Agent Work Orders Tests
|
||||
|
||||
- Execute: `cd python && uv run pytest tests/agent_work_orders/ -v`
|
||||
- Target: 100% of tests pass
|
||||
- Fix any failures
|
||||
|
||||
### Run Type Checking
|
||||
|
||||
- Execute: `cd python && uv run mypy src/agent_work_orders/`
|
||||
- Verify no type errors
|
||||
- Fix any issues
|
||||
|
||||
### Run Linting
|
||||
|
||||
- Execute: `cd python && uv run ruff check src/agent_work_orders/`
|
||||
- Verify no linting issues
|
||||
- Fix any issues
|
||||
|
||||
### Manual End-to-End Test
|
||||
|
||||
- Start server: `cd python && uv run uvicorn src.agent_work_orders.main:app --port 8888 &`
|
||||
- Wait: `sleep 5`
|
||||
- Test with user request only:
|
||||
```bash
|
||||
curl -X POST http://localhost:8888/agent-work-orders \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"repository_url": "https://github.com/Wirasm/dylan.git",
|
||||
"sandbox_type": "git_branch",
|
||||
"workflow_type": "agent_workflow_plan",
|
||||
"user_request": "Add a new feature for user profile management with avatar upload"
|
||||
}' | jq
|
||||
```
|
||||
- Get work order ID from response
|
||||
- Wait: `sleep 30`
|
||||
- Check status: `curl http://localhost:8888/agent-work-orders/{WORK_ORDER_ID} | jq`
|
||||
- Check steps: `curl http://localhost:8888/agent-work-orders/{WORK_ORDER_ID}/steps | jq`
|
||||
- Verify:
|
||||
- Classifier received full user request (not empty JSON)
|
||||
- Classifier returned appropriate classification
|
||||
- Planner received the user request context
|
||||
- Workflow progressed normally
|
||||
- Test with GitHub issue reference:
|
||||
```bash
|
||||
curl -X POST http://localhost:8888/agent-work-orders \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"repository_url": "https://github.com/Wirasm/dylan.git",
|
||||
"sandbox_type": "git_branch",
|
||||
"workflow_type": "agent_workflow_plan",
|
||||
"user_request": "Implement the feature described in GitHub issue #1"
|
||||
}' | jq
|
||||
```
|
||||
- Verify:
|
||||
- System detected issue reference
|
||||
- Issue details were fetched
|
||||
- Both user request and issue context passed to agents
|
||||
- Stop server: `pkill -f "uvicorn.*8888"`
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
### Unit Tests
|
||||
|
||||
**Model Tests:**
|
||||
- Test `user_request` field accepts string values
|
||||
- Test `user_request` field is required (validation fails if missing)
|
||||
- Test `github_issue_number` remains optional
|
||||
- Test model serialization with all fields
|
||||
|
||||
**GitHub Client Tests:**
|
||||
- Test `get_issue()` with valid issue number
|
||||
- Test `get_issue()` with invalid issue number
|
||||
- Test `get_issue()` with network timeout
|
||||
- Test `get_issue()` returns correct JSON structure
|
||||
|
||||
**Workflow Orchestrator Tests:**
|
||||
- Test GitHub issue regex detection from user request
|
||||
- Test fetching GitHub issue when detected
|
||||
- Test fallback to user request only if issue fetch fails
|
||||
- Test classification input merges user request with issue data
|
||||
|
||||
### Integration Tests
|
||||
|
||||
**Full Workflow Tests:**
|
||||
- Test complete workflow with user request only (no GitHub issue)
|
||||
- Test complete workflow with explicit GitHub issue number
|
||||
- Test complete workflow with GitHub issue mentioned in user request
|
||||
- Test workflow handles GitHub API failures gracefully
|
||||
|
||||
**API Integration Tests:**
|
||||
- Test POST /agent-work-orders with user_request field
|
||||
- Test POST /agent-work-orders validates user_request is required
|
||||
- Test POST /agent-work-orders accepts both user_request and github_issue_number
|
||||
|
||||
### Edge Cases
|
||||
|
||||
**User Request Parsing:**
|
||||
- User request mentions "issue #42"
|
||||
- User request mentions "GitHub issue 42"
|
||||
- User request mentions "issue#42" (no space)
|
||||
- User request contains multiple issue references (use first one)
|
||||
- User request doesn't mention any issues
|
||||
- Very long user requests (>10KB)
|
||||
- Empty user request (should fail validation)
|
||||
|
||||
**GitHub Issue Handling:**
|
||||
- Issue number provided but fetch fails
|
||||
- Issue exists but is closed
|
||||
- Issue exists but has no body
|
||||
- Issue number is invalid (non-numeric)
|
||||
- Repository doesn't have issues enabled
|
||||
|
||||
**Backward Compatibility:**
|
||||
- Existing tests must still pass (with user_request added)
|
||||
- API accepts requests without github_issue_number
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
**Core Functionality:**
|
||||
- ✅ `user_request` field added to `CreateAgentWorkOrderRequest` model
|
||||
- ✅ `user_request` field is required and validated
|
||||
- ✅ `github_issue_number` field remains optional
|
||||
- ✅ API accepts and passes `user_request` to workflow
|
||||
- ✅ Workflow uses `user_request` for classification (not empty JSON)
|
||||
- ✅ GitHub issue references auto-detected from user request
|
||||
- ✅ `get_issue()` method fetches GitHub issue details via gh CLI
|
||||
- ✅ Classification input merges user request with issue data when available
|
||||
|
||||
**Test Coverage:**
|
||||
- ✅ All existing tests pass with zero regressions
|
||||
- ✅ New model tests for `user_request` field
|
||||
- ✅ New GitHub client tests for `get_issue()` method
|
||||
- ✅ Updated API tests include `user_request` field
|
||||
- ✅ Updated workflow tests use meaningful user requests
|
||||
|
||||
**Code Quality:**
|
||||
- ✅ Type checking passes (mypy)
|
||||
- ✅ Linting passes (ruff)
|
||||
- ✅ Code follows existing patterns
|
||||
- ✅ Comprehensive docstrings
|
||||
|
||||
**End-to-End Validation:**
|
||||
- ✅ User can create work order with custom request (no GitHub issue)
|
||||
- ✅ Classifier receives full user request context
|
||||
- ✅ Planner receives full user request context
|
||||
- ✅ Workflow progresses successfully with user request
|
||||
- ✅ System detects GitHub issue references in user request
|
||||
- ✅ System fetches and merges GitHub issue data when detected
|
||||
- ✅ Workflow handles missing GitHub issues gracefully
|
||||
|
||||
## Validation Commands
|
||||
|
||||
Execute every command to validate the feature works correctly with zero regressions.
|
||||
|
||||
```bash
|
||||
# Unit Tests
|
||||
cd python && uv run pytest tests/agent_work_orders/test_models.py -v
|
||||
cd python && uv run pytest tests/agent_work_orders/test_github_integration.py -v
|
||||
cd python && uv run pytest tests/agent_work_orders/test_api.py -v
|
||||
cd python && uv run pytest tests/agent_work_orders/test_workflow_operations.py -v
|
||||
|
||||
# Full Test Suite
|
||||
cd python && uv run pytest tests/agent_work_orders/ -v --tb=short
|
||||
cd python && uv run pytest tests/agent_work_orders/ --cov=src/agent_work_orders --cov-report=term-missing
|
||||
cd python && uv run pytest # All backend tests
|
||||
|
||||
# Quality Checks
|
||||
cd python && uv run mypy src/agent_work_orders/
|
||||
cd python && uv run ruff check src/agent_work_orders/
|
||||
|
||||
# End-to-End Test
|
||||
cd python && uv run uvicorn src.agent_work_orders.main:app --port 8888 &
|
||||
sleep 5
|
||||
curl http://localhost:8888/health | jq
|
||||
|
||||
# Test 1: User request only (no GitHub issue)
|
||||
WORK_ORDER=$(curl -X POST http://localhost:8888/agent-work-orders \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"repository_url":"https://github.com/Wirasm/dylan.git","sandbox_type":"git_branch","workflow_type":"agent_workflow_plan","user_request":"Add user profile management with avatar upload functionality"}' \
|
||||
| jq -r '.agent_work_order_id')
|
||||
|
||||
echo "Work Order 1: $WORK_ORDER"
|
||||
sleep 30
|
||||
|
||||
# Verify classifier received user request
|
||||
curl http://localhost:8888/agent-work-orders/$WORK_ORDER/steps | jq '.steps[] | {step, success, output}'
|
||||
|
||||
# Test 2: User request with GitHub issue reference
|
||||
WORK_ORDER2=$(curl -X POST http://localhost:8888/agent-work-orders \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"repository_url":"https://github.com/Wirasm/dylan.git","sandbox_type":"git_branch","workflow_type":"agent_workflow_plan","user_request":"Implement the feature described in GitHub issue #1"}' \
|
||||
| jq -r '.agent_work_order_id')
|
||||
|
||||
echo "Work Order 2: $WORK_ORDER2"
|
||||
sleep 30
|
||||
|
||||
# Verify issue was fetched and merged with user request
|
||||
curl http://localhost:8888/agent-work-orders/$WORK_ORDER2/steps | jq '.steps[] | {step, success, output}'
|
||||
|
||||
# Cleanup
|
||||
pkill -f "uvicorn.*8888"
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
**Design Decisions:**
|
||||
- `user_request` is required because it's the primary input to the system
|
||||
- `github_issue_number` remains optional for backward compatibility and explicit issue references
|
||||
- GitHub issue auto-detection uses regex to find common patterns ("issue #42", "GitHub issue 42")
|
||||
- If both explicit `github_issue_number` and detected issue exist, explicit takes precedence
|
||||
- If GitHub issue fetch fails, workflow continues with user request only (resilient design)
|
||||
- Classification input merges user request with issue data to provide maximum context
|
||||
|
||||
**Why This Fixes the Problem:**
|
||||
```
|
||||
BEFORE:
|
||||
- No way to provide custom user requests
|
||||
- issue_json = "{}" (empty)
|
||||
- Classifier has no context
|
||||
- Planner has no context
|
||||
- Workflow fails or produces irrelevant output
|
||||
|
||||
AFTER:
|
||||
- user_request field provides clear description
|
||||
- issue_json populated from user request + optional GitHub issue
|
||||
- Classifier receives: "Add user authentication with JWT tokens"
|
||||
- Planner receives: Full context about what to build
|
||||
- Workflow succeeds with meaningful output
|
||||
```
|
||||
|
||||
**GitHub Issue Detection Examples:**
|
||||
- "Implement issue #42" → Detects issue 42
|
||||
- "Fix GitHub issue 123" → Detects issue 123
|
||||
- "Resolve issue#456 in the API" → Detects issue 456
|
||||
- "Add login feature" → No issue detected, uses request as-is
|
||||
|
||||
**Future Enhancements:**
|
||||
- Support multiple GitHub issue references
|
||||
- Support GitHub PR references
|
||||
- Add user_request to work order state for historical tracking
|
||||
- Support Jira, Linear, or other issue tracker references
|
||||
- Add user_request validation (min/max length, profanity filter)
|
||||
- Support rich text formatting in user requests
|
||||
- Add example user requests in API documentation
|
||||
@@ -1,946 +0,0 @@
|
||||
# Feature: Compositional Workflow Architecture with Worktree Isolation, Test Resolution, and Review Resolution
|
||||
|
||||
## Feature Description
|
||||
|
||||
Transform the agent-work-orders system from a centralized orchestrator pattern to a compositional, script-based architecture that enables parallel execution through git worktrees, automatic test failure resolution with retry logic, and a comprehensive review phase with blocker-issue patching. This architecture change enables running 15+ work orders simultaneously in isolated worktrees with deterministic port allocation, while maintaining complete SDLC coverage from planning through testing and review.
|
||||
|
||||
The system will support:
|
||||
|
||||
- **Worktree-based isolation**: Each work order runs in its own git worktree under `trees/<work_order_id>/` instead of temporary clones
|
||||
- **Port allocation**: Deterministic backend (9100-9114) and frontend (9200-9214) port assignment based on work order ID
|
||||
- **Test phase with resolution**: Automatic retry loop (max 4 attempts) that resolves failed tests using AI-powered fixes
|
||||
- **Review phase with resolution**: Captures screenshots, compares implementation vs spec, categorizes issues (blocker/tech_debt/skippable), and automatically patches blocker issues (max 3 attempts)
|
||||
- **File-based state**: Simple JSON state management (`adw_state.json`) instead of in-memory repository
|
||||
- **Compositional scripts**: Independent workflow scripts (plan, build, test, review, doc, ship) that can be run separately or together
|
||||
|
||||
## User Story
|
||||
|
||||
As a developer managing multiple concurrent features
|
||||
I want to run multiple agent work orders in parallel with isolated environments
|
||||
So that I can scale development velocity without conflicts or resource contention, while ensuring all code passes tests and review before deployment
|
||||
|
||||
## Problem Statement
|
||||
|
||||
The current agent-work-orders architecture has several critical limitations:
|
||||
|
||||
1. **No Parallelization**: GitBranchSandbox creates temporary clones that get cleaned up, preventing safe parallel execution of multiple work orders
|
||||
2. **No Test Coverage**: Missing test workflow step - implementations are committed and PR'd without validation
|
||||
3. **No Automated Test Resolution**: When tests fail, there's no retry/fix mechanism to automatically resolve failures
|
||||
4. **No Review Phase**: No automated review of implementation against specifications with screenshot capture and blocker detection
|
||||
5. **Centralized Orchestration**: Monolithic orchestrator makes it difficult to run individual phases (e.g., just test, just review) independently
|
||||
6. **In-Memory State**: State management in WorkOrderRepository is not persistent across service restarts
|
||||
7. **No Port Management**: No system for allocating unique ports for parallel instances
|
||||
|
||||
These limitations prevent scaling development workflows and ensuring code quality before PRs are created.
|
||||
|
||||
## Solution Statement
|
||||
|
||||
Implement a compositional workflow architecture inspired by the ADW (AI Developer Workflow) pattern with the following components (read the examples in `PRPs/examples/*` before implementing):
|
||||
|
||||
1. **GitWorktreeSandbox**: Replace GitBranchSandbox with worktree-based isolation that shares the same repo but has independent working directories
|
||||
2. **Port Allocation System**: Deterministic port assignment (backend: 9100-9114, frontend: 9200-9214) based on work order ID hash
|
||||
3. **File-Based State Management**: JSON state files for persistence and debugging
|
||||
4. **Test Workflow Module**: New `test_workflow.py` with automatic resolution and retry logic (4 attempts)
|
||||
5. **Review Workflow Module**: New `review_workflow.py` with screenshot capture, spec comparison, and blocker patching (3 attempts)
|
||||
6. **Compositional Scripts**: Independent workflow operations that can be composed or run individually
|
||||
7. **Enhanced WorkflowStep Enum**: Add TEST, RESOLVE_TEST, REVIEW, RESOLVE_REVIEW steps
|
||||
8. **Resolution Commands**: New Claude commands `/resolve_failed_test` and `/resolve_failed_review` for AI-powered fixes
|
||||
|
||||
## Relevant Files
|
||||
|
||||
### Core Workflow Files
|
||||
|
||||
- `python/src/agent_work_orders/workflow_engine/workflow_orchestrator.py` - Main orchestrator that needs refactoring for compositional approach
|
||||
- Currently: Monolithic execute_workflow with sequential steps
|
||||
- Needs: Modular workflow composition with test/review phases
|
||||
|
||||
- `python/src/agent_work_orders/workflow_engine/workflow_operations.py` - Atomic workflow operations
|
||||
- Currently: classify_issue, build_plan, implement_plan, create_commit, create_pull_request
|
||||
- Needs: Add test_workflow, review_workflow, resolve_test, resolve_review operations
|
||||
|
||||
- `python/src/agent_work_orders/models.py` - Data models including WorkflowStep enum
|
||||
- Currently: WorkflowStep has CLASSIFY, PLAN, IMPLEMENT, COMMIT, REVIEW, TEST, CREATE_PR
|
||||
- Needs: Add RESOLVE_TEST, RESOLVE_REVIEW steps
|
||||
|
||||
### Sandbox Management Files
|
||||
|
||||
- `python/src/agent_work_orders/sandbox_manager/git_branch_sandbox.py` - Current temp clone implementation
|
||||
- Problem: Creates temp dirs, no parallelization support
|
||||
- Will be replaced by: GitWorktreeSandbox
|
||||
|
||||
- `python/src/agent_work_orders/sandbox_manager/sandbox_factory.py` - Factory for creating sandboxes
|
||||
- Needs: Add GitWorktreeSandbox creation logic
|
||||
|
||||
- `python/src/agent_work_orders/sandbox_manager/sandbox_protocol.py` - Sandbox interface
|
||||
- May need: Port allocation methods
|
||||
|
||||
### State Management Files
|
||||
|
||||
- `python/src/agent_work_orders/state_manager/work_order_repository.py` - Current in-memory state
|
||||
- Currently: In-memory dictionary with async methods
|
||||
- Needs: File-based JSON persistence option
|
||||
|
||||
- `python/src/agent_work_orders/config.py` - Configuration
|
||||
- Needs: Port range configuration, worktree base directory
|
||||
|
||||
### Command Files
|
||||
|
||||
- `python/.claude/commands/agent-work-orders/test.md` - Currently just a hello world test
|
||||
- Needs: Comprehensive test suite runner that returns JSON with failed tests
|
||||
|
||||
- `python/.claude/commands/agent-work-orders/implementor.md` - Implementation command
|
||||
- May need: Context about test requirements
|
||||
|
||||
### New Files
|
||||
|
||||
#### Worktree Management
|
||||
|
||||
- `python/src/agent_work_orders/sandbox_manager/git_worktree_sandbox.py` - New worktree-based sandbox
|
||||
- `python/src/agent_work_orders/utils/worktree_operations.py` - Worktree CRUD operations
|
||||
- `python/src/agent_work_orders/utils/port_allocation.py` - Port management utilities
|
||||
|
||||
#### Test Workflow
|
||||
|
||||
- `python/src/agent_work_orders/workflow_engine/test_workflow.py` - Test execution with resolution
|
||||
- `python/.claude/commands/agent-work-orders/test_runner.md` - Run test suite, return JSON
|
||||
- `python/.claude/commands/agent-work-orders/resolve_failed_test.md` - Fix failed test given JSON
|
||||
|
||||
#### Review Workflow
|
||||
|
||||
- `python/src/agent_work_orders/workflow_engine/review_workflow.py` - Review with screenshot capture
|
||||
- `python/.claude/commands/agent-work-orders/review_runner.md` - Run review against spec
|
||||
- `python/.claude/commands/agent-work-orders/resolve_failed_review.md` - Patch blocker issues
|
||||
- `python/.claude/commands/agent-work-orders/create_patch_plan.md` - Generate patch plan for issue
|
||||
|
||||
#### State Management
|
||||
|
||||
- `python/src/agent_work_orders/state_manager/file_state_repository.py` - JSON file-based state
|
||||
- `python/src/agent_work_orders/models/workflow_state.py` - State data models
|
||||
|
||||
#### Documentation
|
||||
|
||||
- `docs/compositional-workflows.md` - Architecture documentation
|
||||
- `docs/worktree-management.md` - Worktree operations guide
|
||||
- `docs/test-resolution.md` - Test workflow documentation
|
||||
- `docs/review-resolution.md` - Review workflow documentation
|
||||
|
||||
## Implementation Plan
|
||||
|
||||
### Phase 1: Foundation - Worktree Isolation and Port Allocation
|
||||
|
||||
Establish the core infrastructure for parallel execution through git worktrees and deterministic port allocation. This phase creates the foundation for all subsequent phases.
|
||||
|
||||
**Key Deliverables**:
|
||||
|
||||
- GitWorktreeSandbox implementation
|
||||
- Port allocation system
|
||||
- Worktree management utilities
|
||||
- `.ports.env` file generation
|
||||
- Updated sandbox factory
|
||||
|
||||
### Phase 2: File-Based State Management
|
||||
|
||||
Replace in-memory state repository with file-based JSON persistence for durability and debuggability across service restarts.
|
||||
|
||||
**Key Deliverables**:
|
||||
|
||||
- FileStateRepository implementation
|
||||
- WorkflowState models
|
||||
- State migration utilities
|
||||
- JSON serialization/deserialization
|
||||
- Backward compatibility layer
|
||||
|
||||
### Phase 3: Test Workflow with Resolution
|
||||
|
||||
Implement comprehensive test execution with automatic failure resolution and retry logic.
|
||||
|
||||
**Key Deliverables**:
|
||||
|
||||
- test_workflow.py module
|
||||
- test_runner.md command (returns JSON array of test results)
|
||||
- resolve_failed_test.md command (takes test JSON, fixes issue)
|
||||
- Retry loop (max 4 attempts)
|
||||
- Test result parsing and formatting
|
||||
- Integration with orchestrator
|
||||
|
||||
### Phase 4: Review Workflow with Resolution
|
||||
|
||||
Add review phase with screenshot capture, spec comparison, and automatic blocker patching.
|
||||
|
||||
**Key Deliverables**:
|
||||
|
||||
- review_workflow.py module
|
||||
- review_runner.md command (compares implementation vs spec)
|
||||
- resolve_failed_review.md command (patches blocker issues)
|
||||
- Screenshot capture integration
|
||||
- Issue severity categorization (blocker/tech_debt/skippable)
|
||||
- Retry loop (max 3 attempts)
|
||||
- R2 upload integration (optional)
|
||||
|
||||
### Phase 5: Compositional Refactoring
|
||||
|
||||
Refactor the centralized orchestrator into composable workflow scripts that can be run independently.
|
||||
|
||||
**Key Deliverables**:
|
||||
|
||||
- Modular workflow composition
|
||||
- Independent script execution
|
||||
- Workflow step dependencies
|
||||
- Enhanced error handling
|
||||
- Workflow resumption support
|
||||
|
||||
## Step by Step Tasks
|
||||
|
||||
### Step 1: Create Worktree Sandbox Implementation
|
||||
|
||||
Create the core GitWorktreeSandbox class that manages git worktrees for isolated execution.
|
||||
|
||||
- Create `python/src/agent_work_orders/sandbox_manager/git_worktree_sandbox.py`
|
||||
- Implement `GitWorktreeSandbox` class with:
|
||||
- `__init__(repository_url, sandbox_identifier)` - Initialize with worktree path calculation
|
||||
- `setup()` - Create worktree under `trees/<sandbox_identifier>/` from origin/main
|
||||
- `cleanup()` - Remove worktree using `git worktree remove`
|
||||
- `execute_command(command, timeout)` - Execute commands in worktree context
|
||||
- `get_git_branch_name()` - Query current branch in worktree
|
||||
- Handle existing worktree detection and validation
|
||||
- Add logging for all worktree operations
|
||||
- Write unit tests for GitWorktreeSandbox in `python/tests/agent_work_orders/sandbox_manager/test_git_worktree_sandbox.py`
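A rough sketch of the `setup()` flow described above; the helper below is illustrative, and the paths, branch handling, and use of `asyncio` subprocesses are assumptions rather than the final class design:

```python
import asyncio
from pathlib import Path


async def setup_worktree(repo_root: str, sandbox_identifier: str, branch_name: str) -> Path:
    """Create an isolated worktree under trees/<sandbox_identifier>/ (sketch)."""
    worktree_path = Path(repo_root) / "trees" / sandbox_identifier
    if worktree_path.exists():
        # Reuse an existing worktree rather than failing outright
        return worktree_path

    # Fetch first so the new branch is based on the latest origin/main
    for args in (
        ["git", "fetch", "origin"],
        ["git", "worktree", "add", "-b", branch_name, str(worktree_path), "origin/main"],
    ):
        process = await asyncio.create_subprocess_exec(*args, cwd=repo_root)
        if await process.wait() != 0:
            raise RuntimeError(f"Command failed: {' '.join(args)}")
    return worktree_path
```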
|
||||
|
||||
### Step 2: Implement Port Allocation System
|
||||
|
||||
Create deterministic port allocation based on work order ID to enable parallel instances.
|
||||
|
||||
- Create `python/src/agent_work_orders/utils/port_allocation.py`
|
||||
- Implement functions:
|
||||
- `get_ports_for_work_order(work_order_id) -> Tuple[int, int]` - Calculate ports from ID hash (backend: 9100-9114, frontend: 9200-9214)
|
||||
- `is_port_available(port: int) -> bool` - Check if port is bindable
|
||||
- `find_next_available_ports(work_order_id, max_attempts=15) -> Tuple[int, int]` - Find available ports with offset
|
||||
- `create_ports_env_file(worktree_path, backend_port, frontend_port)` - Generate `.ports.env` file
|
||||
- Add port range configuration to `python/src/agent_work_orders/config.py`
|
||||
- Write unit tests for port allocation in `python/tests/agent_work_orders/utils/test_port_allocation.py`
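A minimal sketch of the deterministic mapping; the hash choice and the availability check below are assumptions, and the real module should read the port ranges from `config.py` rather than hard-coding them:

```python
import hashlib
import socket


def get_ports_for_work_order(work_order_id: str) -> tuple[int, int]:
    """Map a work order ID to a (backend, frontend) port pair deterministically."""
    # A stable hash keeps the mapping consistent across processes and restarts
    digest = hashlib.sha256(work_order_id.encode()).hexdigest()
    offset = int(digest, 16) % 15  # 15 slots: 9100-9114 / 9200-9214
    return 9100 + offset, 9200 + offset


def is_port_available(port: int) -> bool:
    """Return True if nothing is currently listening on the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        return sock.connect_ex(("127.0.0.1", port)) != 0
```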
|
||||
|
||||
### Step 3: Create Worktree Management Utilities
|
||||
|
||||
Build helper utilities for worktree CRUD operations.
|
||||
|
||||
- Create `python/src/agent_work_orders/utils/worktree_operations.py`
|
||||
- Implement functions:
|
||||
- `create_worktree(work_order_id, branch_name, logger) -> Tuple[str, Optional[str]]` - Create worktree and return path or error
|
||||
- `validate_worktree(work_order_id, state) -> Tuple[bool, Optional[str]]` - Three-way validation (state, filesystem, git)
|
||||
- `get_worktree_path(work_order_id) -> str` - Calculate absolute worktree path
|
||||
- `remove_worktree(work_order_id, logger) -> Tuple[bool, Optional[str]]` - Clean up worktree
|
||||
- `setup_worktree_environment(worktree_path, backend_port, frontend_port, logger)` - Create .ports.env
|
||||
- Handle git fetch operations before worktree creation
|
||||
- Add comprehensive error handling and logging
|
||||
- Write unit tests for worktree operations in `python/tests/agent_work_orders/utils/test_worktree_operations.py`
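For example, `setup_worktree_environment()` might write the `.ports.env` file along these lines (the variable names inside the file are assumptions; only the file name comes from this plan):

```python
from pathlib import Path


def setup_worktree_environment(worktree_path: str, backend_port: int, frontend_port: int) -> Path:
    """Write the per-worktree port assignments to .ports.env (sketch)."""
    env_file = Path(worktree_path) / ".ports.env"
    env_file.write_text(
        f"BACKEND_PORT={backend_port}\n"
        f"FRONTEND_PORT={frontend_port}\n"
    )
    return env_file
```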
|
||||
|
||||
### Step 4: Update Sandbox Factory
|
||||
|
||||
Modify the sandbox factory to support creating GitWorktreeSandbox instances.
|
||||
|
||||
- Update `python/src/agent_work_orders/sandbox_manager/sandbox_factory.py`
|
||||
- Add GIT_WORKTREE case to `create_sandbox()` method
|
||||
- Integrate port allocation during sandbox creation
|
||||
- Pass port configuration to GitWorktreeSandbox
|
||||
- Update the `SandboxType` enum in `models.py` to promote `GIT_WORKTREE` from a placeholder to a fully supported type
|
||||
- Write integration tests for sandbox factory with worktrees
|
||||
|
||||
### Step 5: Implement File-Based State Repository
|
||||
|
||||
Create file-based state management for persistence and debugging.
|
||||
|
||||
- Create `python/src/agent_work_orders/state_manager/file_state_repository.py`
|
||||
- Implement `FileStateRepository` class:
|
||||
- `__init__(state_directory: str)` - Initialize with state directory path
|
||||
- `save_state(work_order_id, state_data)` - Write JSON to `<state_dir>/<work_order_id>.json`
|
||||
- `load_state(work_order_id) -> Optional[dict]` - Read JSON from file
|
||||
- `list_states() -> List[str]` - List all work order IDs with state files
|
||||
- `delete_state(work_order_id)` - Remove state file
|
||||
- `update_status(work_order_id, status, **kwargs)` - Update specific fields
|
||||
- `save_step_history(work_order_id, step_history)` - Persist step history
|
||||
- Add state directory configuration to config.py
|
||||
- Create state models in `python/src/agent_work_orders/models/workflow_state.py`
|
||||
- Write unit tests for file state repository
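A compact sketch of the file-backed repository; locking and atomic writes are omitted, and the real class will likely be async like the existing in-memory repository:

```python
import json
from pathlib import Path
from typing import Optional


class FileStateRepository:
    """Persist work order state as one JSON file per work order (sketch)."""

    def __init__(self, state_directory: str) -> None:
        self.state_dir = Path(state_directory)
        self.state_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, work_order_id: str) -> Path:
        return self.state_dir / f"{work_order_id}.json"

    def save_state(self, work_order_id: str, state_data: dict) -> None:
        self._path(work_order_id).write_text(json.dumps(state_data, indent=2))

    def load_state(self, work_order_id: str) -> Optional[dict]:
        path = self._path(work_order_id)
        return json.loads(path.read_text()) if path.exists() else None

    def update_status(self, work_order_id: str, status: str, **kwargs) -> None:
        state = self.load_state(work_order_id) or {}
        state.update({"status": status, **kwargs})
        self.save_state(work_order_id, state)
```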
|
||||
|
||||
### Step 6: Update WorkflowStep Enum
|
||||
|
||||
Add new workflow steps for test and review resolution.
|
||||
|
||||
- Update `python/src/agent_work_orders/models.py`
|
||||
- Add to WorkflowStep enum:
|
||||
- `RESOLVE_TEST = "resolve_test"` - Test failure resolution step
|
||||
- `RESOLVE_REVIEW = "resolve_review"` - Review issue resolution step
|
||||
- Update `StepHistory.get_current_step()` to include new steps in sequence:
|
||||
- Updated sequence: CLASSIFY → PLAN → FIND_PLAN → GENERATE_BRANCH → IMPLEMENT → COMMIT → TEST → RESOLVE_TEST (if needed) → REVIEW → RESOLVE_REVIEW (if needed) → CREATE_PR
|
||||
- Write unit tests for updated step sequence logic
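The enum addition itself is small; a sketch of just the new members (the `str`-backed enum base mirrors what the existing models appear to use, which is an assumption here):

```python
from enum import Enum


class WorkflowStep(str, Enum):
    # ... existing members (CLASSIFY, PLAN, IMPLEMENT, COMMIT, TEST, REVIEW, CREATE_PR, ...)
    RESOLVE_TEST = "resolve_test"      # fix failing tests before re-running the suite
    RESOLVE_REVIEW = "resolve_review"  # patch blocker issues found during review
```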
|
||||
|
||||
### Step 7: Create Test Runner Command
|
||||
|
||||
Build Claude command to execute test suite and return structured JSON results.
|
||||
|
||||
- Update `python/.claude/commands/agent-work-orders/test_runner.md`
|
||||
- Command should:
|
||||
- Execute backend tests: `cd python && uv run pytest tests/ -v --tb=short`
|
||||
- Execute frontend tests: `cd archon-ui-main && npm test`
|
||||
- Parse test results from output
|
||||
- Return JSON array with structure:
|
||||
```json
|
||||
[
|
||||
{
|
||||
"test_name": "string",
|
||||
"test_file": "string",
|
||||
"passed": boolean,
|
||||
"error": "optional string",
|
||||
"execution_command": "string"
|
||||
}
|
||||
]
|
||||
```
|
||||
- Include test purpose and reproduction command
|
||||
- Sort failed tests first
|
||||
- Handle timeout and command errors gracefully
|
||||
- Test the command manually with sample repositories
|
||||
|
||||
### Step 8: Create Resolve Failed Test Command
|
||||
|
||||
Build Claude command to analyze and fix failed tests given test JSON.
|
||||
|
||||
- Create `python/.claude/commands/agent-work-orders/resolve_failed_test.md`
|
||||
- Command takes single argument: test result JSON object
|
||||
- Command should:
|
||||
- Parse test failure information
|
||||
- Analyze root cause of failure
|
||||
- Read relevant test file and code under test
|
||||
- Implement fix (code change or test update)
|
||||
- Re-run the specific failed test to verify fix
|
||||
- Report success/failure
|
||||
- Include examples of common test failure patterns
|
||||
- Add constraints (don't skip tests, maintain test coverage)
|
||||
- Test the command with sample failed test JSONs
|
||||
|
||||
### Step 9: Implement Test Workflow Module
|
||||
|
||||
Create the test workflow module with automatic resolution and retry logic.
|
||||
|
||||
- Create `python/src/agent_work_orders/workflow_engine/test_workflow.py`
|
||||
- Implement functions:
|
||||
- `run_tests(executor, command_loader, work_order_id, working_dir) -> StepExecutionResult` - Execute test suite
|
||||
- `parse_test_results(output, logger) -> Tuple[List[TestResult], int, int]` - Parse JSON output
|
||||
- `resolve_failed_test(executor, command_loader, test_json, work_order_id, working_dir) -> StepExecutionResult` - Fix single test
|
||||
- `run_tests_with_resolution(executor, command_loader, work_order_id, working_dir, max_attempts=4) -> Tuple[List[TestResult], int, int]` - Main retry loop
|
||||
- Implement retry logic:
|
||||
- Run tests, check for failures
|
||||
- If failures exist and attempts < max_attempts: resolve each failed test
|
||||
- Re-run tests after resolution
|
||||
- Stop if all tests pass or max attempts reached
|
||||
- Add TestResult model to models.py
|
||||
- Write comprehensive unit tests for test workflow
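The retry loop can be summarised roughly as follows; `run_tests`, `parse_test_results`, and `resolve_failed_test` are the sibling functions listed above, while the module-level `logger` and the pydantic-style `model_dump_json()` on `TestResult` are assumptions:

```python
async def run_tests_with_resolution(
    executor, command_loader, work_order_id: str, working_dir: str, max_attempts: int = 4
):
    """Run the suite, resolving failures between attempts (sketch)."""
    for attempt in range(1, max_attempts + 1):
        result = await run_tests(executor, command_loader, work_order_id, working_dir)
        tests, passed, failed = parse_test_results(result.output, logger)
        if failed == 0:
            return tests, passed, failed

        if attempt == max_attempts:
            break  # final run still failing; let the caller fail the step

        # Ask the resolver to fix each failing test before the next full run
        for test in tests:
            if not test.passed:
                await resolve_failed_test(
                    executor, command_loader, test.model_dump_json(), work_order_id, working_dir
                )
    return tests, passed, failed
```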
|
||||
|
||||
### Step 10: Add Test Workflow Operation
|
||||
|
||||
Create atomic operation for test execution in workflow_operations.py.
|
||||
|
||||
- Update `python/src/agent_work_orders/workflow_engine/workflow_operations.py`
|
||||
- Add function:
|
||||
```python
|
||||
async def execute_tests(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
work_order_id: str,
|
||||
working_dir: str,
|
||||
) -> StepExecutionResult
|
||||
```
|
||||
- Function should:
|
||||
- Call `run_tests_with_resolution()` from test_workflow.py
|
||||
- Return StepExecutionResult with test summary
|
||||
- Include pass/fail counts in output
|
||||
- Log detailed test results
|
||||
- Add TESTER constant to agent_names.py
|
||||
- Write unit tests for execute_tests operation
|
||||
|
||||
### Step 11: Integrate Test Phase in Orchestrator
|
||||
|
||||
Add test phase to workflow orchestrator between COMMIT and CREATE_PR steps.
|
||||
|
||||
- Update `python/src/agent_work_orders/workflow_engine/workflow_orchestrator.py`
|
||||
- After commit step (line ~236), add:
|
||||
|
||||
```python
|
||||
# Step 7: Run tests with resolution
|
||||
test_result = await workflow_operations.execute_tests(
|
||||
self.agent_executor,
|
||||
self.command_loader,
|
||||
agent_work_order_id,
|
||||
sandbox.working_dir,
|
||||
)
|
||||
step_history.steps.append(test_result)
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
|
||||
if not test_result.success:
|
||||
raise WorkflowExecutionError(f"Tests failed: {test_result.error_message}")
|
||||
|
||||
bound_logger.info("step_completed", step="test")
|
||||
```
|
||||
|
||||
- Update step numbering (PR creation becomes step 8)
|
||||
- Add test failure handling strategy
|
||||
- Write integration tests for full workflow with test phase
|
||||
|
||||
### Step 12: Create Review Runner Command
|
||||
|
||||
Build Claude command to review implementation against spec with screenshot capture.
|
||||
|
||||
- Create `python/.claude/commands/agent-work-orders/review_runner.md`
|
||||
- Command takes arguments: spec_file_path, work_order_id
|
||||
- Command should:
|
||||
- Read specification from spec_file_path
|
||||
- Analyze implementation in codebase
|
||||
- Start application (if UI component)
|
||||
- Capture screenshots of key UI flows
|
||||
- Compare implementation against spec requirements
|
||||
- Categorize issues by severity: "blocker" | "tech_debt" | "skippable"
|
||||
- Return JSON with structure:
|
||||
```json
|
||||
{
|
||||
"review_passed": boolean,
|
||||
"review_issues": [
|
||||
{
|
||||
"issue_title": "string",
|
||||
"issue_description": "string",
|
||||
"issue_severity": "blocker|tech_debt|skippable",
|
||||
"affected_files": ["string"],
|
||||
"screenshots": ["string"]
|
||||
}
|
||||
],
|
||||
"screenshots": ["string"]
|
||||
}
|
||||
```
|
||||
- Include review criteria and severity definitions
|
||||
- Test command with sample specifications
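The JSON shape above maps naturally onto two small models (a sketch assuming pydantic-style models, consistent with the rest of `models.py`; field names follow the JSON keys):

```python
from typing import Literal

from pydantic import BaseModel, Field


class ReviewIssue(BaseModel):
    issue_title: str
    issue_description: str
    issue_severity: Literal["blocker", "tech_debt", "skippable"]
    affected_files: list[str] = Field(default_factory=list)
    screenshots: list[str] = Field(default_factory=list)


class ReviewResult(BaseModel):
    review_passed: bool
    review_issues: list[ReviewIssue] = Field(default_factory=list)
    screenshots: list[str] = Field(default_factory=list)

    @property
    def blocker_issues(self) -> list[ReviewIssue]:
        return [i for i in self.review_issues if i.issue_severity == "blocker"]
```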
|
||||
|
||||
### Step 13: Create Resolve Failed Review Command
|
||||
|
||||
Build Claude command to patch blocker issues from review.
|
||||
|
||||
- Create `python/.claude/commands/agent-work-orders/resolve_failed_review.md`
|
||||
- Command takes single argument: review issue JSON object
|
||||
- Command should:
|
||||
- Parse review issue details
|
||||
- Create patch plan addressing the issue
|
||||
- Implement the patch (code changes)
|
||||
- Verify patch resolves the issue
|
||||
- Report success/failure
|
||||
- Include constraints (only fix blocker issues, maintain functionality)
|
||||
- Add examples of common review issue patterns
|
||||
- Test command with sample review issues
|
||||
|
||||
### Step 14: Implement Review Workflow Module
|
||||
|
||||
Create the review workflow module with automatic blocker patching.
|
||||
|
||||
- Create `python/src/agent_work_orders/workflow_engine/review_workflow.py`
|
||||
- Implement functions:
|
||||
- `run_review(executor, command_loader, spec_file, work_order_id, working_dir) -> ReviewResult` - Execute review
|
||||
- `parse_review_results(output, logger) -> ReviewResult` - Parse JSON output
|
||||
- `resolve_review_issue(executor, command_loader, issue_json, work_order_id, working_dir) -> StepExecutionResult` - Patch single issue
|
||||
- `run_review_with_resolution(executor, command_loader, spec_file, work_order_id, working_dir, max_attempts=3) -> ReviewResult` - Main retry loop
|
||||
- Implement retry logic:
|
||||
- Run review, check for blocker issues
|
||||
- If blockers exist and attempts < max_attempts: resolve each blocker
|
||||
- Re-run review after patching
|
||||
- Stop if no blockers or max attempts reached
|
||||
- Allow tech_debt and skippable issues to pass
|
||||
- Add ReviewResult and ReviewIssue models to models.py
|
||||
- Write comprehensive unit tests for review workflow
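Mirroring the test loop, the review resolution loop might look like this (a sketch; `run_review` and `resolve_review_issue` are the sibling functions listed above, and only blocker issues trigger another attempt):

```python
async def run_review_with_resolution(
    executor, command_loader, spec_file: str, work_order_id: str, working_dir: str, max_attempts: int = 3
):
    """Review the implementation, patching blockers between attempts (sketch)."""
    review = None
    for attempt in range(1, max_attempts + 1):
        review = await run_review(executor, command_loader, spec_file, work_order_id, working_dir)
        blockers = [i for i in review.review_issues if i.issue_severity == "blocker"]
        if not blockers:
            return review  # tech_debt and skippable issues are allowed to pass

        if attempt == max_attempts:
            break  # still blocked after the final attempt; caller fails the step

        # Patch each blocker before re-running the review
        for issue in blockers:
            await resolve_review_issue(
                executor, command_loader, issue.model_dump_json(), work_order_id, working_dir
            )
    return review
```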
|
||||
|
||||
### Step 15: Add Review Workflow Operation
|
||||
|
||||
Create atomic operation for review execution in workflow_operations.py.
|
||||
|
||||
- Update `python/src/agent_work_orders/workflow_engine/workflow_operations.py`
|
||||
- Add function:
|
||||
```python
|
||||
async def execute_review(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
spec_file: str,
|
||||
work_order_id: str,
|
||||
working_dir: str,
|
||||
) -> StepExecutionResult
|
||||
```
|
||||
- Function should:
|
||||
- Call `run_review_with_resolution()` from review_workflow.py
|
||||
- Return StepExecutionResult with review summary
|
||||
- Include blocker count in output
|
||||
- Log detailed review results
|
||||
- Add REVIEWER constant to agent_names.py
|
||||
- Write unit tests for execute_review operation
|
||||
|
||||
### Step 16: Integrate Review Phase in Orchestrator
|
||||
|
||||
Add review phase to workflow orchestrator between TEST and CREATE_PR steps.
|
||||
|
||||
- Update `python/src/agent_work_orders/workflow_engine/workflow_orchestrator.py`
|
||||
- After test step, add:
|
||||
|
||||
```python
|
||||
# Step 8: Run review with resolution
|
||||
review_result = await workflow_operations.execute_review(
|
||||
self.agent_executor,
|
||||
self.command_loader,
|
||||
plan_file or "",
|
||||
agent_work_order_id,
|
||||
sandbox.working_dir,
|
||||
)
|
||||
step_history.steps.append(review_result)
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
|
||||
if not review_result.success:
|
||||
raise WorkflowExecutionError(f"Review failed: {review_result.error_message}")
|
||||
|
||||
bound_logger.info("step_completed", step="review")
|
||||
```
|
||||
|
||||
- Update step numbering (PR creation becomes step 9)
|
||||
- Add review failure handling strategy
|
||||
- Write integration tests for full workflow with review phase
|
||||
|
||||
### Step 17: Refactor Orchestrator for Composition
|
||||
|
||||
Refactor workflow orchestrator to support modular composition.
|
||||
|
||||
- Update `python/src/agent_work_orders/workflow_engine/workflow_orchestrator.py`
|
||||
- Extract workflow phases into separate methods:
|
||||
- `_execute_planning_phase()` - classify → plan → find_plan → generate_branch
|
||||
- `_execute_implementation_phase()` - implement → commit
|
||||
- `_execute_testing_phase()` - test → resolve_test (if needed)
|
||||
- `_execute_review_phase()` - review → resolve_review (if needed)
|
||||
- `_execute_deployment_phase()` - create_pr
|
||||
- Update `execute_workflow()` to compose phases:
|
||||
```python
|
||||
await self._execute_planning_phase(...)
|
||||
await self._execute_implementation_phase(...)
|
||||
await self._execute_testing_phase(...)
|
||||
await self._execute_review_phase(...)
|
||||
await self._execute_deployment_phase(...)
|
||||
```
|
||||
- Add phase-level error handling and recovery
|
||||
- Support skipping phases via configuration
|
||||
- Write unit tests for each phase method
|
||||
|
||||
### Step 18: Add Configuration for New Features
|
||||
|
||||
Add configuration options for worktrees, ports, and new workflow phases.
|
||||
|
||||
- Update `python/src/agent_work_orders/config.py`
|
||||
- Add configuration:
|
||||
|
||||
```python
|
||||
# Worktree configuration
|
||||
WORKTREE_BASE_DIR: str = os.getenv("WORKTREE_BASE_DIR", "trees")
|
||||
|
||||
# Port allocation
|
||||
BACKEND_PORT_RANGE_START: int = int(os.getenv("BACKEND_PORT_START", "9100"))
|
||||
BACKEND_PORT_RANGE_END: int = int(os.getenv("BACKEND_PORT_END", "9114"))
|
||||
FRONTEND_PORT_RANGE_START: int = int(os.getenv("FRONTEND_PORT_START", "9200"))
|
||||
FRONTEND_PORT_RANGE_END: int = int(os.getenv("FRONTEND_PORT_END", "9214"))
|
||||
|
||||
# Test workflow
|
||||
MAX_TEST_RETRY_ATTEMPTS: int = int(os.getenv("MAX_TEST_RETRY_ATTEMPTS", "4"))
|
||||
ENABLE_TEST_PHASE: bool = os.getenv("ENABLE_TEST_PHASE", "true").lower() == "true"
|
||||
|
||||
# Review workflow
|
||||
MAX_REVIEW_RETRY_ATTEMPTS: int = int(os.getenv("MAX_REVIEW_RETRY_ATTEMPTS", "3"))
|
||||
ENABLE_REVIEW_PHASE: bool = os.getenv("ENABLE_REVIEW_PHASE", "true").lower() == "true"
|
||||
ENABLE_SCREENSHOT_CAPTURE: bool = os.getenv("ENABLE_SCREENSHOT_CAPTURE", "true").lower() == "true"
|
||||
|
||||
# State management
|
||||
STATE_STORAGE_TYPE: str = os.getenv("STATE_STORAGE_TYPE", "memory") # "memory" or "file"
|
||||
FILE_STATE_DIRECTORY: str = os.getenv("FILE_STATE_DIRECTORY", "agent-work-orders-state")
|
||||
```
|
||||
|
||||
- Update `.env.example` with new configuration options
|
||||
- Document configuration in README
|
||||
|
||||
### Step 19: Create Documentation
|
||||
|
||||
Document the new compositional architecture and workflows.
|
||||
|
||||
- Create `docs/compositional-workflows.md`:
|
||||
- Architecture overview
|
||||
- Compositional design principles
|
||||
- Phase composition examples
|
||||
- Error handling and recovery
|
||||
- Configuration guide
|
||||
|
||||
- Create `docs/worktree-management.md`:
|
||||
- Worktree vs temporary clone comparison
|
||||
- Parallelization capabilities
|
||||
- Port allocation system
|
||||
- Cleanup and maintenance
|
||||
|
||||
- Create `docs/test-resolution.md`:
|
||||
- Test workflow overview
|
||||
- Retry logic explanation
|
||||
- Test resolution examples
|
||||
- Troubleshooting failed tests
|
||||
|
||||
- Create `docs/review-resolution.md`:
|
||||
- Review workflow overview
|
||||
- Screenshot capture setup
|
||||
- Issue severity definitions
|
||||
- Blocker patching process
|
||||
- R2 upload configuration
|
||||
|
||||
### Step 20: Run Validation Commands
|
||||
|
||||
Execute all validation commands to ensure the feature works correctly with zero regressions.
|
||||
|
||||
- Run backend tests: `cd python && uv run pytest tests/agent_work_orders/ -v`
|
||||
- Run backend linting: `cd python && uv run ruff check src/agent_work_orders/`
|
||||
- Run type checking: `cd python && uv run mypy src/agent_work_orders/`
|
||||
- Test worktree creation manually:
|
||||
```bash
|
||||
cd python
|
||||
python -c "
|
||||
from src.agent_work_orders.utils.worktree_operations import create_worktree
|
||||
from src.agent_work_orders.utils.structured_logger import get_logger
|
||||
logger = get_logger('test')
|
||||
path, err = create_worktree('test-wo-123', 'test-branch', logger)
|
||||
print(f'Path: {path}, Error: {err}')
|
||||
"
|
||||
```
|
||||
- Test port allocation:
|
||||
```bash
|
||||
cd python
|
||||
python -c "
|
||||
from src.agent_work_orders.utils.port_allocation import get_ports_for_work_order
|
||||
backend, frontend = get_ports_for_work_order('test-wo-123')
|
||||
print(f'Backend: {backend}, Frontend: {frontend}')
|
||||
"
|
||||
```
|
||||
- Create test work order with new workflow:
|
||||
```bash
|
||||
curl -X POST http://localhost:8181/agent-work-orders \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"repository_url": "https://github.com/your-test-repo",
|
||||
"sandbox_type": "git_worktree",
|
||||
"workflow_type": "agent_workflow_plan",
|
||||
"user_request": "Add a new feature with tests"
|
||||
}'
|
||||
```
|
||||
- Verify worktree created under `trees/<work_order_id>/`
|
||||
- Verify `.ports.env` created in worktree
|
||||
- Monitor workflow execution through all phases
|
||||
- Verify test phase runs and resolves failures
|
||||
- Verify review phase runs and patches blockers
|
||||
- Verify PR created successfully
|
||||
- Clean up test worktrees: `git worktree prune`
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
### Unit Tests
|
||||
|
||||
**Worktree Management**:
|
||||
|
||||
- Test worktree creation with valid repository
|
||||
- Test worktree creation with invalid branch
|
||||
- Test worktree validation (three-way check)
|
||||
- Test worktree cleanup
|
||||
- Test handling of existing worktrees
|
||||
|
||||
**Port Allocation**:
|
||||
|
||||
- Test deterministic port assignment from work order ID
|
||||
- Test port availability checking
|
||||
- Test finding next available ports with collision
|
||||
- Test port range boundaries (9100-9114, 9200-9214)
|
||||
- Test `.ports.env` file generation
|
||||
|
||||
**Test Workflow**:
|
||||
|
||||
- Test parsing valid test result JSON
|
||||
- Test parsing malformed test result JSON
|
||||
- Test retry loop with all tests passing
|
||||
- Test retry loop with some tests failing then passing
|
||||
- Test retry loop reaching max attempts
|
||||
- Test individual test resolution
|
||||
|
||||
**Review Workflow**:
|
||||
|
||||
- Test parsing valid review result JSON
|
||||
- Test parsing malformed review result JSON
|
||||
- Test retry loop with no blocker issues
|
||||
- Test retry loop with blockers then resolved
|
||||
- Test retry loop reaching max attempts
|
||||
- Test issue severity filtering
|
||||
|
||||
**State Management**:
|
||||
|
||||
- Test saving state to JSON file
|
||||
- Test loading state from JSON file
|
||||
- Test updating specific state fields
|
||||
- Test handling missing state files
|
||||
- Test concurrent state access
|
||||
|
||||
### Integration Tests
|
||||
|
||||
**End-to-End Workflow**:
|
||||
|
||||
- Test complete workflow with worktree sandbox: classify → plan → implement → commit → test → review → PR
|
||||
- Test test phase with intentional test failure and resolution
|
||||
- Test review phase with intentional blocker issue and patching
|
||||
- Test parallel execution of multiple work orders with different ports
|
||||
- Test workflow resumption after failure
|
||||
- Test cleanup of worktrees after completion
|
||||
|
||||
**Sandbox Integration**:
|
||||
|
||||
- Test command execution in worktree context
|
||||
- Test git operations in worktree
|
||||
- Test branch creation in worktree
|
||||
- Test worktree isolation (parallel instances don't interfere)
|
||||
|
||||
**State Persistence**:
|
||||
|
||||
- Test state survives service restart (file-based)
|
||||
- Test state migration from memory to file
|
||||
- Test state corruption recovery
|
||||
|
||||
### Edge Cases
|
||||
|
||||
**Worktree Edge Cases**:
|
||||
|
||||
- Worktree already exists (should reuse or fail gracefully)
|
||||
- Git repository unreachable (should fail setup)
|
||||
- Insufficient disk space for worktree (should fail with clear error)
|
||||
- Worktree removal fails (should log error and continue)
|
||||
- Maximum worktrees reached (15 concurrent) - should queue or fail
|
||||
|
||||
**Port Allocation Edge Cases**:
|
||||
|
||||
- All ports in range occupied (should fail with error)
|
||||
- Port becomes occupied between allocation and use (should retry)
|
||||
- Invalid port range in configuration (should fail validation)
|
||||
|
||||
**Test Workflow Edge Cases**:
|
||||
|
||||
- Test command times out (should mark as failed)
|
||||
- Test command returns invalid JSON (should fail gracefully)
|
||||
- All tests fail and none can be resolved (should fail after max attempts)
|
||||
- Test resolution introduces new failures (should continue with retry loop)
|
||||
|
||||
**Review Workflow Edge Cases**:
|
||||
|
||||
- Review command crashes (should fail gracefully)
|
||||
- Screenshot capture fails (should continue review without screenshots)
|
||||
- Review finds only skippable issues (should pass)
|
||||
- Blocker patch introduces new blocker (should continue with retry loop)
|
||||
- Spec file not found (should fail with clear error)
|
||||
|
||||
**State Management Edge Cases**:
|
||||
|
||||
- State file corrupted (should fail with recovery suggestion)
|
||||
- State directory not writable (should fail with permission error)
|
||||
- Concurrent access to same state file (should handle with locking or fail safely)
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- [ ] GitWorktreeSandbox successfully creates and manages worktrees under `trees/<work_order_id>/`
|
||||
- [ ] Port allocation deterministically assigns unique ports (backend: 9100-9114, frontend: 9200-9214) based on work order ID
|
||||
- [ ] Multiple work orders (at least 3) can run in parallel without port or filesystem conflicts
|
||||
- [ ] `.ports.env` file is created in each worktree with correct port configuration
|
||||
- [ ] Test workflow successfully runs test suite and returns structured JSON results
|
||||
- [ ] Test workflow automatically resolves failed tests up to 4 attempts
|
||||
- [ ] Test workflow stops retrying when all tests pass
|
||||
- [ ] Review workflow successfully reviews implementation against spec
|
||||
- [ ] Review workflow captures screenshots (when enabled)
|
||||
- [ ] Review workflow categorizes issues by severity (blocker/tech_debt/skippable)
|
||||
- [ ] Review workflow automatically patches blocker issues up to 3 attempts
|
||||
- [ ] Review workflow allows tech_debt and skippable issues to pass
|
||||
- [ ] WorkflowStep enum includes TEST, RESOLVE_TEST, REVIEW, RESOLVE_REVIEW steps
|
||||
- [ ] Workflow orchestrator executes all phases: planning → implementation → testing → review → deployment
|
||||
- [ ] File-based state repository persists state to JSON files
|
||||
- [ ] State survives service restarts when using file-based storage
|
||||
- [ ] Configuration supports enabling/disabling test and review phases
|
||||
- [ ] All existing tests pass with zero regressions
|
||||
- [ ] New unit tests achieve >80% code coverage for new modules
|
||||
- [ ] Integration tests verify end-to-end workflow with parallel execution
|
||||
- [ ] Documentation covers compositional architecture, worktrees, test resolution, and review resolution
|
||||
- [ ] Cleanup of worktrees works correctly (git worktree remove + prune)
|
||||
- [ ] Error messages are clear and actionable for all failure scenarios
|
||||
|
||||
## Validation Commands
|
||||
|
||||
Execute every command to validate the feature works correctly with zero regressions.
|
||||
|
||||
### Backend Tests
|
||||
|
||||
- `cd python && uv run pytest tests/agent_work_orders/ -v --tb=short` - Run all agent work orders tests
|
||||
- `cd python && uv run pytest tests/agent_work_orders/sandbox_manager/ -v` - Test sandbox management
|
||||
- `cd python && uv run pytest tests/agent_work_orders/workflow_engine/ -v` - Test workflow engine
|
||||
- `cd python && uv run pytest tests/agent_work_orders/utils/ -v` - Test utilities
|
||||
|
||||
### Code Quality
|
||||
|
||||
- `cd python && uv run ruff check src/agent_work_orders/` - Check code quality
|
||||
- `cd python && uv run mypy src/agent_work_orders/` - Type checking
|
||||
|
||||
### Manual Worktree Testing
|
||||
|
||||
```bash
|
||||
# Test worktree creation
|
||||
cd python
|
||||
python -c "
|
||||
from src.agent_work_orders.utils.worktree_operations import create_worktree, validate_worktree, remove_worktree
|
||||
from src.agent_work_orders.utils.structured_logger import get_logger
|
||||
logger = get_logger('test')
|
||||
|
||||
# Create worktree
|
||||
path, err = create_worktree('test-wo-123', 'test-branch', logger)
|
||||
print(f'Created worktree at: {path}')
|
||||
assert err is None, f'Error: {err}'
|
||||
|
||||
# Validate worktree
|
||||
from src.agent_work_orders.state_manager.file_state_repository import FileStateRepository
|
||||
state_repo = FileStateRepository('test-state')
|
||||
state_data = {'worktree_path': path}
|
||||
valid, err = validate_worktree('test-wo-123', state_data)
|
||||
assert valid, f'Validation failed: {err}'
|
||||
|
||||
# Remove worktree
|
||||
success, err = remove_worktree('test-wo-123', logger)
|
||||
assert success, f'Removal failed: {err}'
|
||||
print('Worktree lifecycle test passed!')
|
||||
"
|
||||
```
|
||||
|
||||
### Manual Port Allocation Testing
|
||||
|
||||
```bash
|
||||
cd python
|
||||
python -c "
|
||||
from src.agent_work_orders.utils.port_allocation import get_ports_for_work_order, find_next_available_ports, is_port_available
|
||||
backend, frontend = get_ports_for_work_order('test-wo-123')
|
||||
print(f'Ports for test-wo-123: Backend={backend}, Frontend={frontend}')
|
||||
assert 9100 <= backend <= 9114, f'Backend port out of range: {backend}'
|
||||
assert 9200 <= frontend <= 9214, f'Frontend port out of range: {frontend}'
|
||||
|
||||
# Test availability check
|
||||
available = is_port_available(backend)
|
||||
print(f'Backend port {backend} available: {available}')
|
||||
|
||||
# Test finding next available
|
||||
next_backend, next_frontend = find_next_available_ports('test-wo-456')
|
||||
print(f'Next available ports: Backend={next_backend}, Frontend={next_frontend}')
|
||||
print('Port allocation test passed!')
|
||||
"
|
||||
```
|
||||
|
||||
### Integration Testing
|
||||
|
||||
```bash
|
||||
# Start agent work orders service
|
||||
docker compose up -d archon-server
|
||||
|
||||
# Create work order with worktree sandbox
|
||||
curl -X POST http://localhost:8181/agent-work-orders \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"repository_url": "https://github.com/coleam00/archon",
|
||||
"sandbox_type": "git_worktree",
|
||||
"workflow_type": "agent_workflow_plan",
|
||||
"user_request": "Fix issue #123"
|
||||
}'
|
||||
|
||||
# Verify worktree created
|
||||
ls -la trees/
|
||||
|
||||
# Monitor workflow progress
|
||||
watch -n 2 'curl -s http://localhost:8181/agent-work-orders | jq'
|
||||
|
||||
# Verify .ports.env in worktree
|
||||
cat trees/<work_order_id>/.ports.env
|
||||
|
||||
# After completion, verify cleanup
|
||||
git worktree list
|
||||
```
|
||||
|
||||
### Parallel Execution Testing

```bash
# Create 3 work orders simultaneously
for i in 1 2 3; do
  curl -X POST http://localhost:8181/agent-work-orders \
    -H "Content-Type: application/json" \
    -d "{
      \"repository_url\": \"https://github.com/coleam00/archon\",
      \"sandbox_type\": \"git_worktree\",
      \"workflow_type\": \"agent_workflow_plan\",
      \"user_request\": \"Parallel test $i\"
    }" &
done
wait

# Verify all worktrees exist
ls -la trees/

# Verify different ports allocated
for dir in trees/*/; do
  echo "Worktree: $dir"
  cat "$dir/.ports.env"
  echo "---"
done
```

## Notes

### Architecture Decision: Compositional vs Centralized

This feature implements Option B (compositional refactoring) because:

1. **Scalability**: Compositional design enables running individual phases (e.g., just test or just review) without the full workflow
2. **Debugging**: Independent scripts are easier to test and debug in isolation
3. **Flexibility**: Users can compose custom workflows (e.g., skip review for simple PRs)
4. **Maintainability**: Smaller, focused modules are easier to maintain than a monolithic orchestrator
5. **Parallelization**: The worktree-based approach inherently supports compositional execution

### Performance Considerations

- **Worktree Creation**: Worktrees are roughly 2-3x faster than clones because they share the same .git directory
- **Port Allocation**: Hash-based allocation is deterministic but may collide; the fallback linear search adds minimal overhead (a minimal sketch follows this list)
- **Retry Loops**: Test (4 attempts) and review (3 attempts) retry limits prevent infinite loops while allowing reasonable resolution attempts
- **State I/O**: File-based state adds disk I/O but enables persistence; consider an eventual move to a database for high-volume deployments
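
To make the allocation strategy concrete, here is a minimal sketch of a deterministic, hash-based allocator with a linear-search fallback. The function names and the 9100-9114 / 9200-9214 ranges mirror the manual test above; the hashing scheme, slot count, and module layout are assumptions, not the repository's actual implementation.

```python
# Sketch only: deterministic port allocation with a linear-probe fallback.
import hashlib
import socket

BACKEND_BASE, FRONTEND_BASE, SLOTS = 9100, 9200, 15


def is_port_available(port: int) -> bool:
    """Return True if nothing is currently listening on the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        return sock.connect_ex(("127.0.0.1", port)) != 0


def get_ports_for_work_order(work_order_id: str) -> tuple[int, int]:
    """Deterministically map a work order ID to a (backend, frontend) port pair."""
    slot = int(hashlib.sha256(work_order_id.encode()).hexdigest(), 16) % SLOTS
    return BACKEND_BASE + slot, FRONTEND_BASE + slot


def find_next_available_ports(work_order_id: str) -> tuple[int, int]:
    """Start at the hashed slot and probe linearly until both ports are free."""
    start = get_ports_for_work_order(work_order_id)[0] - BACKEND_BASE
    for offset in range(SLOTS):
        slot = (start + offset) % SLOTS
        backend, frontend = BACKEND_BASE + slot, FRONTEND_BASE + slot
        if is_port_available(backend) and is_port_available(frontend):
            return backend, frontend
    raise RuntimeError("No free port slots for work order sandboxes")
```

Because the slot is derived from the work order ID, re-running the same work order lands on the same ports unless they are taken, which keeps `.ports.env` stable across retries.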
### Future Enhancements

1. **Database State**: Replace file-based state with PostgreSQL/Supabase for better concurrent access and querying
2. **WebSocket Updates**: Stream test/review progress to the UI in real time
3. **Screenshot Upload**: Integrate R2/S3 for screenshot storage and PR comments with images
4. **Workflow Resumption**: Support resuming failed workflows from the last successful step
5. **Custom Workflows**: Allow users to define custom workflow compositions via config
6. **Metrics**: Add OpenTelemetry instrumentation for workflow performance monitoring
7. **E2E Testing**: Add Playwright/Cypress integration for UI-focused review
8. **Distributed Execution**: Support running work orders across multiple machines

### Migration Path

For existing deployments:

1. **Backward Compatibility**: Keep GitBranchSandbox working alongside GitWorktreeSandbox
2. **Gradual Migration**: Default to GIT_BRANCH, opt in to GIT_WORKTREE via configuration
3. **State Migration**: Provide a utility to migrate in-memory state to file-based state
4. **Cleanup**: Add a command to clean up old temporary clones: `rm -rf /tmp/agent-work-orders/*`

### Dependencies

New dependencies to add via `uv add`:

- (None required - uses existing git, pytest, claude CLI)

### Related Issues/PRs

- #XXX - Original agent-work-orders MVP implementation
- #XXX - Worktree isolation discussion
- #XXX - Test phase feature request
- #XXX - Review automation proposal

@@ -1,365 +0,0 @@
|
||||
# Feature: Fix Claude CLI Integration for Agent Work Orders
|
||||
|
||||
## Feature Description
|
||||
|
||||
Fix the Claude CLI integration in the Agent Work Orders system to properly execute agent workflows using the Claude Code CLI. The current implementation is missing the required `--verbose` flag and lacks other important CLI configuration options for reliable, automated agent execution.
|
||||
|
||||
The system currently fails with error: `"Error: When using --print, --output-format=stream-json requires --verbose"` because the CLI command builder is incomplete. This feature will add all necessary CLI flags, improve error handling, and ensure robust integration with Claude Code CLI for automated agent workflows.
|
||||
|
||||
## User Story
|
||||
|
||||
As a developer using the Agent Work Orders system
|
||||
I want the system to properly execute Claude CLI commands with all required flags
|
||||
So that agent workflows complete successfully and I can automate development tasks reliably
|
||||
|
||||
## Problem Statement
|
||||
|
||||
The current CLI integration has several issues:
|
||||
|
||||
1. **Missing `--verbose` flag**: When using `--print` with `--output-format=stream-json`, the `--verbose` flag is required by Claude Code CLI but not included in the command
|
||||
2. **No turn limits**: Workflows can run indefinitely without a safety mechanism to limit agentic turns
|
||||
3. **No permission handling**: Interactive permission prompts block automated workflows
|
||||
4. **Incomplete configuration**: Missing flags for model selection, working directories, and other important options
|
||||
5. **Test misalignment**: Tests were written expecting `-f` flag pattern but implementation uses stdin, causing confusion
|
||||
6. **Limited error context**: Error messages don't provide enough information for debugging CLI failures
|
||||
|
||||
These issues prevent agent work orders from executing successfully and make the system unusable in its current state.
|
||||
|
||||
## Solution Statement
|
||||
|
||||
Implement a complete CLI integration by:
|
||||
|
||||
1. **Add missing `--verbose` flag** to enable stream-json output format
|
||||
2. **Add safety limits** with `--max-turns` to prevent runaway executions
|
||||
3. **Enable automation** with `--dangerously-skip-permissions` for non-interactive operation
|
||||
4. **Add configuration options** for working directories and model selection
|
||||
5. **Update tests** to match the stdin-based implementation pattern
|
||||
6. **Improve error handling** with better error messages and validation
|
||||
7. **Add configuration** for customizable CLI flags via environment variables
|
||||
|
||||
The solution maintains the existing architecture while fixing the CLI command builder and adding proper configuration management.
|
||||
|
||||
## Relevant Files
|
||||
|
||||
**Core Implementation Files:**
|
||||
- `python/src/agent_work_orders/agent_executor/agent_cli_executor.py` (lines 24-58) - CLI command builder that needs fixing
|
||||
- Currently missing `--verbose` flag
|
||||
- Needs additional flags for safety and automation
|
||||
- Error handling could be improved
|
||||
|
||||
**Configuration:**
|
||||
- `python/src/agent_work_orders/config.py` (lines 17-30) - Configuration management
|
||||
- Needs new configuration options for CLI flags
|
||||
- Should support environment variable overrides
|
||||
|
||||
**Tests:**
|
||||
- `python/tests/agent_work_orders/test_agent_executor.py` (lines 10-44) - Unit tests for CLI executor
|
||||
- Tests expect `-f` flag pattern but implementation uses stdin
|
||||
- Need to update tests to match current implementation
|
||||
- Add tests for new CLI flags
|
||||
|
||||
**Workflow Integration:**
|
||||
- `python/src/agent_work_orders/workflow_engine/workflow_orchestrator.py` (lines 98-104) - Calls CLI executor
|
||||
- Verify integration works with updated CLI command
|
||||
- Ensure proper error propagation
|
||||
|
||||
**Documentation:**
|
||||
- `PRPs/ai_docs/cc_cli_ref.md` - Claude CLI reference documentation
|
||||
- Contains complete flag reference
|
||||
- Guides implementation
|
||||
|
||||
### New Files
|
||||
|
||||
None - this is a fix to existing implementation.
|
||||
|
||||
## Implementation Plan
|
||||
|
||||
### Phase 1: Foundation - Fix Core CLI Command Builder
|
||||
|
||||
Add the missing `--verbose` flag and implement basic safety flags to make the CLI integration functional. This unblocks agent workflow execution.
|
||||
|
||||
**Changes:**
|
||||
- Add `--verbose` flag to command builder (required for stream-json)
|
||||
- Add `--max-turns` flag with default limit (safety)
|
||||
- Add `--dangerously-skip-permissions` flag (automation)
|
||||
- Update configuration with new options
|
||||
|
||||
### Phase 2: Enhanced Configuration
|
||||
|
||||
Add comprehensive configuration management for CLI flags, allowing operators to customize behavior via environment variables or config files.
|
||||
|
||||
**Changes:**
|
||||
- Add configuration options for all CLI flags
|
||||
- Support environment variable overrides
|
||||
- Add validation for configuration values
|
||||
- Document configuration options
|
||||
|
||||
### Phase 3: Testing and Validation
|
||||
|
||||
Update tests to match the current stdin-based implementation and add comprehensive test coverage for new CLI flags.
|
||||
|
||||
**Changes:**
|
||||
- Fix existing tests to match stdin pattern
|
||||
- Add tests for new CLI flags
|
||||
- Add integration tests for full workflow execution
|
||||
- Add error handling tests
|
||||
|
||||
## Step by Step Tasks
|
||||
|
||||
### Fix CLI Command Builder

- Read the current implementation in `python/src/agent_work_orders/agent_executor/agent_cli_executor.py`
- Update the `build_command` method to include the `--verbose` flag after `--output-format stream-json`
- Add `--max-turns` flag with configurable value (default: 20)
- Add `--dangerously-skip-permissions` flag for automation
- Ensure command parts are joined correctly with proper spacing
- Update the docstring to document all flags being added
- Verify the command string format matches CLI expectations (a hedged sketch of the resulting builder follows this list)
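
As a point of reference, a minimal sketch of what the updated builder could look like, assuming the existing stdin-based pattern described in this plan (the command file is read into a prompt and piped to the CLI); flag values are the defaults named above, and the function signature is illustrative.

```python
# Sketch only: command builder with the flags this plan requires.
from pathlib import Path


def build_command(
    command_file_path: str,
    max_turns: int = 20,
    skip_permissions: bool = True,
) -> tuple[str, str]:
    """Return the Claude CLI command string and the prompt text to send on stdin."""
    prompt_text = Path(command_file_path).read_text(encoding="utf-8")

    parts = [
        "claude",
        "--print",
        "--output-format", "stream-json",
        "--verbose",                      # required when streaming JSON with --print
        "--max-turns", str(max_turns),    # safety limit on agentic turns
    ]
    if skip_permissions:
        parts.append("--dangerously-skip-permissions")  # non-interactive automation

    return " ".join(parts), prompt_text
```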
### Add Configuration Options

- Read `python/src/agent_work_orders/config.py`
- Add `CLAUDE_CLI_MAX_TURNS` config option (default: 20)
- Add `CLAUDE_CLI_SKIP_PERMISSIONS` config option (default: True for automation)
- Add `CLAUDE_CLI_VERBOSE` config option (default: True, required for stream-json)
- Add docstrings explaining each configuration option
- Ensure all config options support environment variable overrides (see the sketch after this list)
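
One possible shape for these options, assuming a simple environment-variable-backed config module; the variable names are the ones listed above, while the parsing helpers and the validation bounds are illustrative.

```python
# Sketch only: environment-variable overrides for the new CLI options.
import os
from dataclasses import dataclass


def _env_bool(name: str, default: bool) -> bool:
    """Parse a boolean environment variable, falling back to the default."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class ClaudeCLIConfig:
    max_turns: int
    skip_permissions: bool
    verbose: bool


def load_claude_cli_config() -> ClaudeCLIConfig:
    max_turns = int(os.getenv("CLAUDE_CLI_MAX_TURNS", "20"))
    if not 1 <= max_turns <= 100:  # illustrative bounds; the plan recommends 10-50
        raise ValueError(f"CLAUDE_CLI_MAX_TURNS out of range: {max_turns}")
    return ClaudeCLIConfig(
        max_turns=max_turns,
        skip_permissions=_env_bool("CLAUDE_CLI_SKIP_PERMISSIONS", True),
        verbose=_env_bool("CLAUDE_CLI_VERBOSE", True),  # required for stream-json
    )
```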
### Update CLI Executor to Use Config
|
||||
|
||||
- Update `agent_cli_executor.py` to read configuration values
|
||||
- Pass configuration to `build_command` method
|
||||
- Make flags configurable rather than hardcoded
|
||||
- Add parameter documentation for new options
|
||||
- Maintain backward compatibility with existing code
|
||||
|
||||
### Improve Error Handling

- Add validation for command file path existence before reading
- Add better error messages when CLI execution fails
- Include the full command in error logs (without sensitive data)
- Add timeout context to error messages
- Log CLI stdout/stderr even on success for debugging (a small validation sketch follows this list)
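
For the first bullet, a minimal sketch of failing fast on a missing command file; the exception name and error text are hypothetical, introduced here only to illustrate the kind of actionable message this plan asks for.

```python
# Sketch only: validate the command file before invoking the CLI.
from pathlib import Path


class CommandFileNotFoundError(FileNotFoundError):
    """Raised when a workflow command file is missing."""


def read_command_file(command_file_path: str) -> str:
    """Read the command file, raising a clear error when it does not exist."""
    path = Path(command_file_path)
    if not path.is_file():
        raise CommandFileNotFoundError(
            f"Command file not found: {path} (is .claude/commands available in this environment?)"
        )
    return path.read_text(encoding="utf-8")
```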
### Update Unit Tests
|
||||
|
||||
- Read `python/tests/agent_work_orders/test_agent_executor.py`
|
||||
- Update `test_build_command` to verify `--verbose` flag is included
|
||||
- Update `test_build_command` to verify `--max-turns` flag is included
|
||||
- Update `test_build_command` to verify `--dangerously-skip-permissions` flag is included
|
||||
- Remove or update tests expecting `-f` flag pattern (no longer used)
|
||||
- Update test assertions to match stdin-based implementation
|
||||
- Add test for command with all flags enabled
|
||||
- Add test for command with custom max-turns value
|
||||
|
||||
### Add Integration Tests
|
||||
|
||||
- Create new test `test_build_command_with_config` that verifies configuration is used
|
||||
- Create test `test_execute_with_valid_command_file` that mocks file reading
|
||||
- Create test `test_execute_with_missing_command_file` that verifies error handling
|
||||
- Create test `test_cli_flags_in_correct_order` to ensure proper flag ordering
|
||||
- Verify all tests pass with `cd python && uv run pytest tests/agent_work_orders/test_agent_executor.py -v`
|
||||
|
||||
### Test End-to-End Workflow
|
||||
|
||||
- Start the agent work orders server with `cd python && uv run uvicorn src.agent_work_orders.main:app --host 0.0.0.0 --port 8888`
|
||||
- Create a test work order via curl: `curl -X POST http://localhost:8888/agent-work-orders -H "Content-Type: application/json" -d '{"repository_url": "https://github.com/anthropics/claude-code", "sandbox_type": "git_branch", "workflow_type": "agent_workflow_plan", "github_issue_number": "123"}'`
|
||||
- Monitor server logs to verify the CLI command includes all required flags
|
||||
- Verify the error message no longer appears: "Error: When using --print, --output-format=stream-json requires --verbose"
|
||||
- Check that workflow executes successfully or fails with a different (expected) error
|
||||
- Verify session ID extraction works from CLI output
|
||||
|
||||
### Update Documentation
|
||||
|
||||
- Update inline code comments in `agent_cli_executor.py` explaining why each flag is needed
|
||||
- Add comments documenting the Claude CLI requirements
|
||||
- Reference the CLI documentation file `PRPs/ai_docs/cc_cli_ref.md` in code comments
|
||||
- Ensure configuration options are documented with examples
|
||||
|
||||
### Run Validation Commands
|
||||
|
||||
Execute all validation commands listed in the Validation Commands section to ensure zero regressions and complete functionality.
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
### Unit Tests
|
||||
|
||||
**CLI Command Builder Tests:**
|
||||
- Verify `--verbose` flag is present in built command
|
||||
- Verify `--max-turns` flag is present with correct value
|
||||
- Verify `--dangerously-skip-permissions` flag is present
|
||||
- Verify flags are in correct order (order may matter for CLI parsing)
|
||||
- Verify command parts are properly space-separated
|
||||
- Verify prompt text is correctly prepared for stdin
|
||||
|
||||
**Configuration Tests:**
|
||||
- Verify default configuration values are correct
|
||||
- Verify environment variables override defaults
|
||||
- Verify configuration validation works for invalid values
|
||||
|
||||
**Error Handling Tests:**
|
||||
- Test with non-existent command file path
|
||||
- Test with invalid configuration values
|
||||
- Test with CLI execution failures
|
||||
- Test with timeout scenarios
|
||||
|
||||
### Integration Tests
|
||||
|
||||
**Full Workflow Tests:**
|
||||
- Test creating work order triggers CLI execution
|
||||
- Test CLI command includes all required flags
|
||||
- Test session ID extraction from CLI output
|
||||
- Test error propagation from CLI to API response
|
||||
|
||||
**Sandbox Integration:**
|
||||
- Test CLI executes in correct working directory
|
||||
- Test prompt text is passed via stdin correctly
|
||||
- Test output parsing works with actual CLI format
|
||||
|
||||
### Edge Cases
|
||||
|
||||
**Command Building:**
|
||||
- Empty args list
|
||||
- Very long prompt text (test stdin limits)
|
||||
- Special characters in args
|
||||
- Non-existent command file path
|
||||
- Command file with no content
|
||||
|
||||
**Configuration:**
|
||||
- Max turns = 0 (should error or use sensible minimum)
|
||||
- Max turns = 1000 (should cap at reasonable maximum)
|
||||
- Invalid boolean values for skip_permissions
|
||||
- Missing environment variables (should use defaults)
|
||||
|
||||
**CLI Execution:**
|
||||
- CLI command times out
|
||||
- CLI command exits with non-zero code
|
||||
- CLI output contains no session ID
|
||||
- CLI output is malformed JSON
|
||||
- Claude CLI not installed or not in PATH
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
**CLI Integration:**
|
||||
- ✅ Agent work orders execute without "requires --verbose" error
|
||||
- ✅ CLI command includes `--verbose` flag
|
||||
- ✅ CLI command includes `--max-turns` flag with configurable value
|
||||
- ✅ CLI command includes `--dangerously-skip-permissions` flag
|
||||
- ✅ Configuration options support environment variable overrides
|
||||
- ✅ Error messages include helpful context for debugging
|
||||
|
||||
**Testing:**
|
||||
- ✅ All existing unit tests pass
|
||||
- ✅ New tests verify CLI flags are included
|
||||
- ✅ Integration test verifies end-to-end workflow
|
||||
- ✅ Test coverage for error handling scenarios
|
||||
|
||||
**Functionality:**
|
||||
- ✅ Work orders can be created via API
|
||||
- ✅ Background workflow execution starts
|
||||
- ✅ CLI command executes with proper flags
|
||||
- ✅ Session ID is extracted from CLI output
|
||||
- ✅ Errors are properly logged and returned to API
|
||||
|
||||
**Documentation:**
|
||||
- ✅ Code comments explain CLI requirements
|
||||
- ✅ Configuration options are documented
|
||||
- ✅ Error messages are clear and actionable
|
||||
|
||||
## Validation Commands
|
||||
|
||||
Execute every command to validate the feature works correctly with zero regressions.
|
||||
|
||||
```bash
|
||||
# Run all agent work orders tests
|
||||
cd python && uv run pytest tests/agent_work_orders/ -v
|
||||
|
||||
# Run specific CLI executor tests
|
||||
cd python && uv run pytest tests/agent_work_orders/test_agent_executor.py -v
|
||||
|
||||
# Run type checking
|
||||
cd python && uv run mypy src/agent_work_orders/agent_executor/
|
||||
|
||||
# Run linting
|
||||
cd python && uv run ruff check src/agent_work_orders/agent_executor/
|
||||
cd python && uv run ruff check src/agent_work_orders/config.py
|
||||
|
||||
# Start server and test end-to-end
|
||||
cd python && uv run uvicorn src.agent_work_orders.main:app --host 0.0.0.0 --port 8888 &
|
||||
sleep 3
|
||||
|
||||
# Test health endpoint
|
||||
curl -s http://localhost:8888/health | jq .
|
||||
|
||||
# Create test work order
|
||||
curl -s -X POST http://localhost:8888/agent-work-orders \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"repository_url": "https://github.com/anthropics/claude-code",
|
||||
"sandbox_type": "git_branch",
|
||||
"workflow_type": "agent_workflow_plan",
|
||||
"github_issue_number": "123"
|
||||
}' | jq .
|
||||
|
||||
# Wait for background execution to start
|
||||
sleep 5
|
||||
|
||||
# Check work order status
|
||||
curl -s http://localhost:8888/agent-work-orders | jq '.[] | {id: .agent_work_order_id, status: .status, error: .error_message}'
|
||||
|
||||
# Verify logs show proper CLI command with all flags (check server stdout)
|
||||
# Should see: claude --print --output-format stream-json --verbose --max-turns 20 --dangerously-skip-permissions
|
||||
|
||||
# Stop server
|
||||
pkill -f "uvicorn src.agent_work_orders.main:app"
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
### CLI Flag Requirements
|
||||
|
||||
Based on `PRPs/ai_docs/cc_cli_ref.md`:
|
||||
- `--verbose` is **required** when using `--print` with `--output-format=stream-json`
|
||||
- `--max-turns` should be set to prevent runaway executions (recommended: 10-50)
|
||||
- `--dangerously-skip-permissions` is needed for non-interactive automation
|
||||
- Flag order may matter - follow the order shown in documentation examples
|
||||
|
||||
### Configuration Philosophy
|
||||
|
||||
- Default values should enable successful automation
|
||||
- Environment variables allow per-deployment customization
|
||||
- Configuration should fail fast with clear errors
|
||||
- Document all configuration with examples
|
||||
|
||||
### Future Enhancements (Out of Scope for This Feature)
|
||||
|
||||
- Add support for `--add-dir` flag for multi-directory workspaces
|
||||
- Add support for `--agents` flag for custom subagents
|
||||
- Add support for `--model` flag for model selection
|
||||
- Add retry logic with exponential backoff for transient failures
|
||||
- Add metrics/telemetry for CLI execution success rates
|
||||
- Add support for resuming failed workflows with `--resume` flag
|
||||
|
||||
### Testing Notes
|
||||
|
||||
- Tests must not require actual Claude CLI installation
|
||||
- Mock subprocess execution for unit tests
|
||||
- Integration tests can assume Claude CLI is available
|
||||
- Consider adding e2e tests that use a mock CLI script
|
||||
- Validate session ID extraction with real CLI output examples
|
||||
|
||||
### Debugging Tips
|
||||
|
||||
When CLI execution fails:
|
||||
1. Check server logs for full command string
|
||||
2. Verify command file exists at expected path
|
||||
3. Test CLI command manually in terminal
|
||||
4. Check Claude CLI version (may have breaking changes)
|
||||
5. Verify working directory has correct permissions
|
||||
6. Check for prompt text issues (encoding, length)
|
||||
|
||||
### Related Documentation
|
||||
|
||||
- Claude Code CLI Reference: `PRPs/ai_docs/cc_cli_ref.md`
|
||||
- Agent Work Orders PRD: `PRPs/specs/agent-work-orders-mvp-v2.md`
|
||||
- SDK Documentation: https://docs.claude.com/claude-code/sdk
|
||||
@@ -1,742 +0,0 @@
|
||||
# Feature: Fix JSONL Result Extraction and Argument Passing
|
||||
|
||||
## Feature Description
|
||||
|
||||
Fix critical integration issues between Agent Work Orders system and Claude CLI that prevent workflow execution from completing successfully. The system currently fails to extract the actual result text from Claude CLI's JSONL output stream and doesn't properly pass arguments to command files using the $ARGUMENTS placeholder pattern.
|
||||
|
||||
These fixes enable the atomic workflow execution pattern to work end-to-end by ensuring clean data flow between workflow steps.
|
||||
|
||||
## User Story
|
||||
|
||||
As a developer using the Agent Work Orders system
|
||||
I want workflows to execute successfully end-to-end
|
||||
So that I can automate development tasks via GitHub issues without manual intervention
|
||||
|
||||
## Problem Statement
|
||||
|
||||
The first real-world test of the atomic workflow execution system (work order wo-18d08ae8, repository: https://github.com/Wirasm/dylan.git, issue #1) revealed two critical failures that prevent workflow completion:
|
||||
|
||||
**Problem 1: JSONL Result Not Extracted**
|
||||
- `workflow_operations.py` uses `result.stdout.strip()` to get agent output
|
||||
- `result.stdout` contains the entire JSONL stream (multiple lines of JSON messages)
|
||||
- The actual agent result is in the "result" field of the final JSONL message with `type:"result"`
|
||||
- Consequence: Downstream steps receive JSONL garbage instead of clean output
|
||||
|
||||
**Observed Example:**
```python
# What we're currently doing (WRONG):
issue_class = result.stdout.strip()
# Gets: '{"type":"session_started","session_id":"..."}\n{"type":"result","result":"/feature","is_error":false}'

# What we should do (CORRECT):
issue_class = result.result_text.strip()
# Gets: "/feature"
```

**Problem 2: $ARGUMENTS Placeholder Not Replaced**
|
||||
- Command files use `$ARGUMENTS` placeholder for dynamic content (ADW pattern)
|
||||
- `AgentCLIExecutor.build_command()` appends args to prompt but doesn't replace placeholder
|
||||
- Claude CLI receives literal "$ARGUMENTS" text instead of actual issue JSON
|
||||
- Consequence: Agents cannot access input data needed to perform their task
|
||||
|
||||
**Observed Failure:**
|
||||
```
|
||||
Step 1 (Classifier): ✅ Executed BUT ❌ Wrong Output
|
||||
- Agent response: "I need to see the GitHub issue content. The $ARGUMENTS placeholder shows {}"
|
||||
- Output: Full JSONL stream instead of "/feature", "/bug", or "/chore"
|
||||
- Session ID: 06f225c7-bcd8-436c-8738-9fa744c8eee6
|
||||
|
||||
Step 2 (Planner): ❌ Failed Immediately
|
||||
- Received JSONL as issue_class: {"type":"result"...}
|
||||
- Error: "Unknown issue class: {JSONL output...}"
|
||||
- Workflow halted - cannot proceed without clean classification
|
||||
```
|
||||
|
||||
## Solution Statement
|
||||
|
||||
Implement two critical fixes to enable proper Claude CLI integration:
|
||||
|
||||
**Fix 1: Extract result_text from JSONL Output**
|
||||
- Add `result_text` field to `CommandExecutionResult` model
|
||||
- Extract the "result" field value from JSONL's final result message in `AgentCLIExecutor`
|
||||
- Update all `workflow_operations.py` functions to use `result.result_text` instead of `result.stdout`
|
||||
- Preserve `stdout` for debugging (contains full JSONL stream)
|
||||
|
||||
**Fix 2: Replace $ARGUMENTS and Positional Placeholders**
|
||||
- Modify `AgentCLIExecutor.build_command()` to replace `$ARGUMENTS` with actual arguments
|
||||
- Support both `$ARGUMENTS` (all args) and `$1`, `$2`, `$3` (positional args)
|
||||
- Pre-process command file content before passing to Claude CLI
|
||||
- Remove old code that appended "Arguments: ..." to end of prompt
|
||||
|
||||
This enables atomic workflows to execute correctly with clean data flow between steps.
|
||||
|
||||
## Relevant Files
|
||||
|
||||
Use these files to implement the feature:
|
||||
|
||||
**Core Models** - Add result extraction field
|
||||
- `python/src/agent_work_orders/models.py`:180-190 - CommandExecutionResult model needs result_text field to store extracted result
|
||||
|
||||
**Agent Executor** - Implement JSONL parsing and argument replacement
|
||||
- `python/src/agent_work_orders/agent_executor/agent_cli_executor.py`:25-88 - build_command() needs $ARGUMENTS replacement logic (line 61-62 currently just appends args)
|
||||
- `python/src/agent_work_orders/agent_executor/agent_cli_executor.py`:90-236 - execute_async() needs result_text extraction (around line 170-175)
|
||||
- `python/src/agent_work_orders/agent_executor/agent_cli_executor.py`:337-363 - _extract_result_message() already extracts result dict, need to get "result" field value
|
||||
|
||||
**Workflow Operations** - Use extracted result_text instead of stdout
|
||||
- `python/src/agent_work_orders/workflow_engine/workflow_operations.py`:26-79 - classify_issue() line 51 uses `result.stdout.strip()`
|
||||
- `python/src/agent_work_orders/workflow_engine/workflow_operations.py`:82-155 - build_plan() line 133 uses `result.stdout`
|
||||
- `python/src/agent_work_orders/workflow_engine/workflow_operations.py`:158-213 - find_plan_file() line 185 uses `result.stdout`
|
||||
- `python/src/agent_work_orders/workflow_engine/workflow_operations.py`:216-267 - implement_plan() line 245 uses `result.stdout`
|
||||
- `python/src/agent_work_orders/workflow_engine/workflow_operations.py`:270-326 - generate_branch() line 299 uses `result.stdout`
|
||||
- `python/src/agent_work_orders/workflow_engine/workflow_operations.py`:329-385 - create_commit() line 358 uses `result.stdout`
|
||||
- `python/src/agent_work_orders/workflow_engine/workflow_operations.py`:388-444 - create_pull_request() line 417 uses `result.stdout`
|
||||
|
||||
**Tests** - Update and add test coverage
|
||||
- `python/tests/agent_work_orders/test_models.py` - Add tests for CommandExecutionResult with result_text field
|
||||
- `python/tests/agent_work_orders/test_agent_executor.py` - Add tests for result extraction and argument replacement
|
||||
- `python/tests/agent_work_orders/test_workflow_operations.py`:1-398 - Update ALL mocks to include result_text field (currently missing)
|
||||
|
||||
**Command Files** - Examples using $ARGUMENTS that need to work
|
||||
- `.claude/commands/agent-work-orders/classify_issue.md`:19-21 - Uses `$ARGUMENTS` placeholder
|
||||
- `.claude/commands/agent-work-orders/feature.md` - Uses `$ARGUMENTS` placeholder
|
||||
- `.claude/commands/agent-work-orders/bug.md` - Uses positional `$1`, `$2`, `$3`
|
||||
|
||||
### New Files
|
||||
|
||||
No new files needed - all changes are modifications to existing files.
|
||||
|
||||
## Implementation Plan
|
||||
|
||||
### Phase 1: Foundation - Model Enhancement
|
||||
|
||||
Add the result_text field to CommandExecutionResult so we can store the extracted result value separately from the raw JSONL stdout. This is a backward-compatible change.
|
||||
|
||||
### Phase 2: Core Implementation - Result Extraction
|
||||
|
||||
Implement the logic to parse JSONL output and extract the "result" field value into result_text during command execution in AgentCLIExecutor.
|
||||
|
||||
### Phase 3: Core Implementation - Argument Replacement
|
||||
|
||||
Implement placeholder replacement logic in build_command() to support $ARGUMENTS and $1, $2, $3 patterns in command files.
|
||||
|
||||
### Phase 4: Integration - Update Workflow Operations
|
||||
|
||||
Update all 7 workflow operation functions to use result_text instead of stdout for cleaner data flow between atomic steps.
|
||||
|
||||
### Phase 5: Testing and Validation
|
||||
|
||||
Comprehensive test coverage for both fixes and end-to-end validation with actual workflow execution.
|
||||
|
||||
## Step by Step Tasks
|
||||
|
||||
IMPORTANT: Execute every step in order, top to bottom.
|
||||
|
||||
### Add result_text Field to CommandExecutionResult Model
|
||||
|
||||
- Open `python/src/agent_work_orders/models.py`
|
||||
- Locate the `CommandExecutionResult` class (line 180)
|
||||
- Add new optional field after stdout:
|
||||
```python
|
||||
result_text: str | None = None
|
||||
```
|
||||
- Add inline comment above the field: `# Extracted result text from JSONL "result" field (if available)`
|
||||
- Verify the model definition is complete and properly formatted
|
||||
- Save the file
|
||||
|
||||
### Implement Result Text Extraction in execute_async()
|
||||
|
||||
- Open `python/src/agent_work_orders/agent_executor/agent_cli_executor.py`
|
||||
- Locate the `execute_async()` method
|
||||
- Find the section around line 170-175 where `_extract_result_message()` is called
|
||||
- After line 173 `result_message = self._extract_result_message(stdout_text)`, add:
|
||||
```python
|
||||
# Extract result text from JSONL result message
|
||||
result_text: str | None = None
|
||||
if result_message and "result" in result_message:
|
||||
result_value = result_message.get("result")
|
||||
# Convert result to string (handles both str and other types)
|
||||
result_text = str(result_value) if result_value is not None else None
|
||||
else:
|
||||
result_text = None
|
||||
```
|
||||
- Update the `CommandExecutionResult` instantiation (around line 191) to include the new field:
|
||||
```python
|
||||
result = CommandExecutionResult(
|
||||
success=success,
|
||||
stdout=stdout_text,
|
||||
result_text=result_text, # NEW: Add this line
|
||||
stderr=stderr_text,
|
||||
exit_code=process.returncode or 0,
|
||||
session_id=session_id,
|
||||
error_message=error_message,
|
||||
duration_seconds=duration,
|
||||
)
|
||||
```
|
||||
- Add debug logging after extraction (before the result object is created):
|
||||
```python
|
||||
if result_text:
|
||||
self._logger.debug(
|
||||
"result_text_extracted",
|
||||
result_text_preview=result_text[:100] if len(result_text) > 100 else result_text,
|
||||
work_order_id=work_order_id
|
||||
)
|
||||
```
|
||||
- Save the file
|
||||
|
||||
### Implement $ARGUMENTS Placeholder Replacement in build_command()
|
||||
|
||||
- Still in `python/src/agent_work_orders/agent_executor/agent_cli_executor.py`
|
||||
- Locate the `build_command()` method (line 25-88)
|
||||
- Find the section around line 60-62 that handles arguments
|
||||
- Replace the current args handling code:
|
||||
```python
|
||||
# OLD CODE TO REMOVE:
|
||||
# if args:
|
||||
# prompt_text += f"\n\nArguments: {', '.join(args)}"
|
||||
|
||||
# NEW CODE:
|
||||
# Replace argument placeholders in prompt text
|
||||
if args:
|
||||
# Replace $ARGUMENTS with first arg (or all args joined if multiple)
|
||||
prompt_text = prompt_text.replace("$ARGUMENTS", args[0] if len(args) == 1 else ", ".join(args))
|
||||
|
||||
# Replace positional placeholders ($1, $2, $3, etc.)
|
||||
for i, arg in enumerate(args, start=1):
|
||||
prompt_text = prompt_text.replace(f"${i}", arg)
|
||||
```
|
||||
- Save the file
|
||||
|
||||
### Update classify_issue() to Use result_text
|
||||
|
||||
- Open `python/src/agent_work_orders/workflow_engine/workflow_operations.py`
|
||||
- Locate the `classify_issue()` function (starts at line 26)
|
||||
- Find line 50-51 that extracts issue_class
|
||||
- Replace with:
|
||||
```python
|
||||
# OLD: if result.success and result.stdout:
|
||||
# issue_class = result.stdout.strip()
|
||||
|
||||
# NEW: Use result_text which contains the extracted result
|
||||
if result.success and result.result_text:
|
||||
issue_class = result.result_text.strip()
|
||||
```
|
||||
- Verify the rest of the function logic remains unchanged
|
||||
- Save the file
|
||||
|
||||
### Update build_plan() to Use result_text
|
||||
|
||||
- Still in `python/src/agent_work_orders/workflow_engine/workflow_operations.py`
|
||||
- Locate the `build_plan()` function (starts at line 82)
|
||||
- Find line 133 in the success case
|
||||
- Replace `output=result.stdout or ""` with:
|
||||
```python
|
||||
output=result.result_text or result.stdout or ""
|
||||
```
|
||||
- Note: We use fallback to stdout for backward compatibility during transition
|
||||
- Save the file
|
||||
|
||||
### Update find_plan_file() to Use result_text
|
||||
|
||||
- Still in `python/src/agent_work_orders/workflow_engine/workflow_operations.py`
|
||||
- Locate the `find_plan_file()` function (starts at line 158)
|
||||
- Find line 185 that checks stdout
|
||||
- Replace with:
|
||||
```python
|
||||
# OLD: if result.success and result.stdout and result.stdout.strip() != "0":
|
||||
# plan_file_path = result.stdout.strip()
|
||||
|
||||
# NEW: Use result_text
|
||||
if result.success and result.result_text and result.result_text.strip() != "0":
|
||||
plan_file_path = result.result_text.strip()
|
||||
```
|
||||
- Save the file
|
||||
|
||||
### Update implement_plan() to Use result_text
|
||||
|
||||
- Still in `python/src/agent_work_orders/workflow_engine/workflow_operations.py`
|
||||
- Locate the `implement_plan()` function (starts at line 216)
|
||||
- Find line 245 in the success case
|
||||
- Replace `output=result.stdout or ""` with:
|
||||
```python
|
||||
output=result.result_text or result.stdout or ""
|
||||
```
|
||||
- Save the file
|
||||
|
||||
### Update generate_branch() to Use result_text
|
||||
|
||||
- Still in `python/src/agent_work_orders/workflow_engine/workflow_operations.py`
|
||||
- Locate the `generate_branch()` function (starts at line 270)
|
||||
- Find line 298-299 that extracts branch_name
|
||||
- Replace with:
|
||||
```python
|
||||
# OLD: if result.success and result.stdout:
|
||||
# branch_name = result.stdout.strip()
|
||||
|
||||
# NEW: Use result_text
|
||||
if result.success and result.result_text:
|
||||
branch_name = result.result_text.strip()
|
||||
```
|
||||
- Save the file
|
||||
|
||||
### Update create_commit() to Use result_text
|
||||
|
||||
- Still in `python/src/agent_work_orders/workflow_engine/workflow_operations.py`
|
||||
- Locate the `create_commit()` function (starts at line 329)
|
||||
- Find line 357-358 that extracts commit_message
|
||||
- Replace with:
|
||||
```python
|
||||
# OLD: if result.success and result.stdout:
|
||||
# commit_message = result.stdout.strip()
|
||||
|
||||
# NEW: Use result_text
|
||||
if result.success and result.result_text:
|
||||
commit_message = result.result_text.strip()
|
||||
```
|
||||
- Save the file
|
||||
|
||||
### Update create_pull_request() to Use result_text
|
||||
|
||||
- Still in `python/src/agent_work_orders/workflow_engine/workflow_operations.py`
|
||||
- Locate the `create_pull_request()` function (starts at line 388)
|
||||
- Find line 416-417 that extracts pr_url
|
||||
- Replace with:
|
||||
```python
|
||||
# OLD: if result.success and result.stdout:
|
||||
# pr_url = result.stdout.strip()
|
||||
|
||||
# NEW: Use result_text
|
||||
if result.success and result.result_text:
|
||||
pr_url = result.result_text.strip()
|
||||
```
|
||||
- Save the file
|
||||
- Verify all 7 workflow operations now use result_text
|
||||
|
||||
### Add Model Tests for result_text Field
|
||||
|
||||
- Open `python/tests/agent_work_orders/test_models.py`
|
||||
- Add new test function at the end of the file:
|
||||
```python
|
||||
def test_command_execution_result_with_result_text():
|
||||
"""Test CommandExecutionResult includes result_text field"""
|
||||
result = CommandExecutionResult(
|
||||
success=True,
|
||||
stdout='{"type":"result","result":"/feature"}',
|
||||
result_text="/feature",
|
||||
stderr=None,
|
||||
exit_code=0,
|
||||
session_id="session-123",
|
||||
)
|
||||
assert result.result_text == "/feature"
|
||||
assert result.stdout == '{"type":"result","result":"/feature"}'
|
||||
assert result.success is True
|
||||
|
||||
def test_command_execution_result_without_result_text():
|
||||
"""Test CommandExecutionResult works without result_text (backward compatibility)"""
|
||||
result = CommandExecutionResult(
|
||||
success=True,
|
||||
stdout="raw output",
|
||||
stderr=None,
|
||||
exit_code=0,
|
||||
)
|
||||
assert result.result_text is None
|
||||
assert result.stdout == "raw output"
|
||||
```
|
||||
- Save the file
|
||||
|
||||
### Add Agent Executor Tests for Result Extraction
|
||||
|
||||
- Open `python/tests/agent_work_orders/test_agent_executor.py`
|
||||
- Add new test function:
|
||||
```python
|
||||
@pytest.mark.asyncio
|
||||
async def test_execute_async_extracts_result_text():
|
||||
"""Test that result text is extracted from JSONL output"""
|
||||
executor = AgentCLIExecutor()
|
||||
|
||||
# Mock subprocess that returns JSONL with result
|
||||
jsonl_output = '{"type":"session_started","session_id":"test-123"}\n{"type":"result","result":"/feature","is_error":false}'
|
||||
|
||||
with patch("asyncio.create_subprocess_shell") as mock_subprocess:
|
||||
mock_process = AsyncMock()
|
||||
mock_process.communicate = AsyncMock(return_value=(jsonl_output.encode(), b""))
|
||||
mock_process.returncode = 0
|
||||
mock_subprocess.return_value = mock_process
|
||||
|
||||
result = await executor.execute_async(
|
||||
"claude --print",
|
||||
"/tmp/test",
|
||||
prompt_text="test prompt",
|
||||
work_order_id="wo-test"
|
||||
)
|
||||
|
||||
assert result.success is True
|
||||
assert result.result_text == "/feature"
|
||||
assert result.session_id == "test-123"
|
||||
assert '{"type":"result"' in result.stdout
|
||||
```
|
||||
- Save the file
|
||||
|
||||
### Add Agent Executor Tests for Argument Replacement
|
||||
|
||||
- Still in `python/tests/agent_work_orders/test_agent_executor.py`
|
||||
- Add new test functions:
|
||||
```python
|
||||
def test_build_command_replaces_arguments_placeholder():
|
||||
"""Test that $ARGUMENTS placeholder is replaced with actual arguments"""
|
||||
executor = AgentCLIExecutor()
|
||||
|
||||
# Create temp command file with $ARGUMENTS
|
||||
import tempfile
|
||||
with tempfile.NamedTemporaryFile(mode='w', suffix='.md', delete=False) as f:
|
||||
f.write("Classify this issue:\\n\\n$ARGUMENTS")
|
||||
temp_file = f.name
|
||||
|
||||
try:
|
||||
command, prompt = executor.build_command(
|
||||
temp_file,
|
||||
args=['{"title": "Add feature", "body": "description"}']
|
||||
)
|
||||
|
||||
assert "$ARGUMENTS" not in prompt
|
||||
assert '{"title": "Add feature"' in prompt
|
||||
assert "Classify this issue:" in prompt
|
||||
finally:
|
||||
import os
|
||||
os.unlink(temp_file)
|
||||
|
||||
def test_build_command_replaces_positional_arguments():
|
||||
"""Test that $1, $2, $3 are replaced with positional arguments"""
|
||||
executor = AgentCLIExecutor()
|
||||
|
||||
import tempfile
|
||||
with tempfile.NamedTemporaryFile(mode='w', suffix='.md', delete=False) as f:
|
||||
f.write("Issue: $1\\nWorkOrder: $2\\nData: $3")
|
||||
temp_file = f.name
|
||||
|
||||
try:
|
||||
command, prompt = executor.build_command(
|
||||
temp_file,
|
||||
args=["42", "wo-test", '{"title":"Test"}']
|
||||
)
|
||||
|
||||
assert "$1" not in prompt
|
||||
assert "$2" not in prompt
|
||||
assert "$3" not in prompt
|
||||
assert "Issue: 42" in prompt
|
||||
assert "WorkOrder: wo-test" in prompt
|
||||
assert 'Data: {"title":"Test"}' in prompt
|
||||
finally:
|
||||
import os
|
||||
os.unlink(temp_file)
|
||||
```
|
||||
- Save the file
|
||||
|
||||
### Update All Workflow Operations Test Mocks
|
||||
|
||||
- Open `python/tests/agent_work_orders/test_workflow_operations.py`
|
||||
- Find every `CommandExecutionResult` mock and add `result_text` field
|
||||
- Update test_classify_issue_success (line 27-34):
|
||||
```python
|
||||
mock_executor.execute_async = AsyncMock(
|
||||
return_value=CommandExecutionResult(
|
||||
success=True,
|
||||
stdout='{"type":"result","result":"/feature"}',
|
||||
result_text="/feature", # ADD THIS
|
||||
stderr=None,
|
||||
exit_code=0,
|
||||
session_id="session-123",
|
||||
)
|
||||
)
|
||||
```
|
||||
- Repeat for all other test functions:
|
||||
- test_build_plan_feature_success (line 93-100) - add `result_text="Plan created successfully"`
|
||||
- test_build_plan_bug_success (line 128-135) - add `result_text="Bug plan created"`
|
||||
- test_find_plan_file_success (line 180-187) - add `result_text="specs/issue-42-wo-test-planner-feature.md"`
|
||||
- test_find_plan_file_not_found (line 213-220) - add `result_text="0"`
|
||||
- test_implement_plan_success (line 243-250) - add `result_text="Implementation completed"`
|
||||
- test_generate_branch_success (line 274-281) - add `result_text="feat-issue-42-wo-test-add-feature"`
|
||||
- test_create_commit_success (line 307-314) - add `result_text="implementor: feat: add user authentication"`
|
||||
- test_create_pull_request_success (line 339-346) - add `result_text="https://github.com/owner/repo/pull/123"`
|
||||
- Save the file
|
||||
|
||||
### Run Model Unit Tests
|
||||
|
||||
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_models.py::test_command_execution_result_with_result_text -v`
|
||||
- Verify test passes
|
||||
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_models.py::test_command_execution_result_without_result_text -v`
|
||||
- Verify test passes
|
||||
|
||||
### Run Agent Executor Unit Tests
|
||||
|
||||
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_agent_executor.py::test_execute_async_extracts_result_text -v`
|
||||
- Verify result extraction test passes
|
||||
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_agent_executor.py::test_build_command_replaces_arguments_placeholder -v`
|
||||
- Verify $ARGUMENTS replacement test passes
|
||||
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_agent_executor.py::test_build_command_replaces_positional_arguments -v`
|
||||
- Verify positional argument test passes
|
||||
|
||||
### Run Workflow Operations Unit Tests
|
||||
|
||||
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_workflow_operations.py -v`
|
||||
- Verify all 9+ tests pass with updated mocks
|
||||
- Check for any assertion failures related to result_text
|
||||
|
||||
### Run Full Test Suite
|
||||
|
||||
- Execute: `cd python && uv run pytest tests/agent_work_orders/ -v`
|
||||
- Target: 100% of tests pass
|
||||
- If any tests fail, fix them immediately before proceeding
|
||||
- Execute: `cd python && uv run pytest tests/agent_work_orders/ --cov=src/agent_work_orders --cov-report=term-missing`
|
||||
- Verify >80% coverage for modified files
|
||||
|
||||
### Run Type Checking
|
||||
|
||||
- Execute: `cd python && uv run mypy src/agent_work_orders/models.py`
|
||||
- Verify no type errors in models
|
||||
- Execute: `cd python && uv run mypy src/agent_work_orders/agent_executor/agent_cli_executor.py`
|
||||
- Verify no type errors in executor
|
||||
- Execute: `cd python && uv run mypy src/agent_work_orders/workflow_engine/workflow_operations.py`
|
||||
- Verify no type errors in workflow operations
|
||||
|
||||
### Run Linting
|
||||
|
||||
- Execute: `cd python && uv run ruff check src/agent_work_orders/models.py`
|
||||
- Execute: `cd python && uv run ruff check src/agent_work_orders/agent_executor/agent_cli_executor.py`
|
||||
- Execute: `cd python && uv run ruff check src/agent_work_orders/workflow_engine/workflow_operations.py`
|
||||
- Fix any linting issues if found
|
||||
|
||||
### Run End-to-End Integration Test
|
||||
|
||||
- Start server: `cd python && uv run uvicorn src.agent_work_orders.main:app --port 8888 &`
|
||||
- Wait for startup: `sleep 5`
|
||||
- Test health: `curl http://localhost:8888/health`
|
||||
- Create work order:
|
||||
```bash
|
||||
WORK_ORDER_ID=$(curl -X POST http://localhost:8888/agent-work-orders \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"repository_url": "https://github.com/Wirasm/dylan.git",
|
||||
"sandbox_type": "git_branch",
|
||||
"workflow_type": "agent_workflow_plan",
|
||||
"github_issue_number": "1"
|
||||
}' | jq -r '.agent_work_order_id')
|
||||
echo "Work Order ID: $WORK_ORDER_ID"
|
||||
```
|
||||
- Monitor: `sleep 30`
|
||||
- Check status: `curl http://localhost:8888/agent-work-orders/$WORK_ORDER_ID | jq`
|
||||
- Check steps: `curl http://localhost:8888/agent-work-orders/$WORK_ORDER_ID/steps | jq '.steps[] | {step: .step, agent: .agent_name, success: .success, output: .output[:50]}'`
|
||||
- Verify:
|
||||
- Classifier step shows `output: "/feature"` (NOT JSONL)
|
||||
- Planner step succeeded (received clean classification)
|
||||
- All subsequent steps executed
|
||||
- Final status is "completed" or shows specific error
|
||||
- Inspect logs: `ls -la /tmp/agent-work-orders/*/`
|
||||
- Check artifacts: `cat /tmp/agent-work-orders/$WORK_ORDER_ID/outputs/*.jsonl | grep '"result"'`
|
||||
- Stop server: `pkill -f "uvicorn.*8888"`
|
||||
|
||||
### Validation Commands
|
||||
|
||||
Execute every command to validate the feature works correctly with zero regressions.
|
||||
|
||||
- `cd python && uv run pytest tests/agent_work_orders/test_models.py -v` - Verify model tests pass
|
||||
- `cd python && uv run pytest tests/agent_work_orders/test_agent_executor.py -v` - Verify executor tests pass
|
||||
- `cd python && uv run pytest tests/agent_work_orders/test_workflow_operations.py -v` - Verify workflow operations tests pass
|
||||
- `cd python && uv run pytest tests/agent_work_orders/ -v` - All agent work orders tests
|
||||
- `cd python && uv run pytest` - Entire backend test suite (zero regressions)
|
||||
- `cd python && uv run mypy src/agent_work_orders/` - Type check all modified code
|
||||
- `cd python && uv run ruff check src/agent_work_orders/` - Lint all modified code
|
||||
- End-to-end test: Start server and create work order as documented above
|
||||
- Verify classifier returns clean "/feature" not JSONL
|
||||
- Verify planner receives correct classification
|
||||
- Verify workflow completes successfully
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
### Unit Tests
|
||||
|
||||
**CommandExecutionResult Model**
|
||||
- Test result_text field accepts string values
|
||||
- Test result_text field accepts None (optional)
|
||||
- Test model serialization with result_text
|
||||
- Test backward compatibility (result_text=None works)
|
||||
|
||||
**AgentCLIExecutor Result Extraction**
|
||||
- Test extraction from valid JSONL with result field
|
||||
- Test extraction when result is string
|
||||
- Test extraction when result is number (should stringify)
|
||||
- Test extraction when result is object (should stringify)
|
||||
- Test no extraction when JSONL has no result message
|
||||
- Test no extraction when result message missing "result" field
|
||||
- Test handles malformed JSONL gracefully
|
||||
|
||||
**AgentCLIExecutor Argument Replacement**
|
||||
- Test $ARGUMENTS with single argument
|
||||
- Test $ARGUMENTS with multiple arguments
|
||||
- Test $1, $2, $3 positional replacement
|
||||
- Test mixed placeholders in one file
|
||||
- Test no replacement when args is None
|
||||
- Test no replacement when args is empty
|
||||
- Test command without placeholders
|
||||
|
||||
**Workflow Operations**
|
||||
- Test each operation uses result_text
|
||||
- Test each operation handles None result_text
|
||||
- Test fallback to stdout works
|
||||
- Test clean output flows to next step
|
||||
|
||||
### Integration Tests
|
||||
|
||||
**Complete Workflow**
|
||||
- Test full workflow with real JSONL parsing
|
||||
- Test classifier → planner data flow
|
||||
- Test each step receives clean input
|
||||
- Test step history contains result_text values
|
||||
- Test error handling when result_text is None
|
||||
|
||||
**Error Scenarios**
|
||||
- Test malformed JSONL output
|
||||
- Test missing result field in JSONL
|
||||
- Test agent returns error in result
|
||||
- Test $ARGUMENTS not in command file (should still work)
|
||||
|
||||
### Edge Cases
|
||||
|
||||
**JSONL Parsing**
|
||||
- Result message not last in stream
|
||||
- Multiple result messages
|
||||
- Result with is_error:true
|
||||
- Result value is null
|
||||
- Result value is boolean true/false
|
||||
- Result value is large object
|
||||
- Result value contains newlines
|
||||
|
||||
**Argument Replacement**
|
||||
- $ARGUMENTS appears multiple times
|
||||
- Positional args exceed provided args count
|
||||
- Args contain special characters
|
||||
- Args contain literal $ character
|
||||
- Very long arguments (>10KB)
|
||||
- Empty string arguments
|
||||
|
||||
**Backward Compatibility**
|
||||
- Old commands without placeholders
|
||||
- Workflow handles result_text=None gracefully
|
||||
- stdout still accessible for debugging
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
**Core Functionality:**
|
||||
- ✅ CommandExecutionResult model has result_text field
|
||||
- ✅ result_text extracted from JSONL "result" field
|
||||
- ✅ $ARGUMENTS placeholder replaced with arguments
|
||||
- ✅ $1, $2, $3 positional placeholders replaced
|
||||
- ✅ All 7 workflow operations use result_text
|
||||
- ✅ stdout preserved for debugging (backward compatible)
|
||||
|
||||
**Test Results:**
|
||||
- ✅ All existing tests pass (zero regressions)
|
||||
- ✅ New model tests pass
|
||||
- ✅ New executor tests pass
|
||||
- ✅ Updated workflow operations tests pass
|
||||
- ✅ >80% test coverage for modified files
|
||||
|
||||
**Code Quality:**
|
||||
- ✅ Type checking passes with no errors
|
||||
- ✅ Linting passes with no warnings
|
||||
- ✅ Code follows existing patterns
|
||||
- ✅ Docstrings updated where needed
|
||||
|
||||
**End-to-End:**
|
||||
- ✅ Classifier returns clean output: "/feature", "/bug", or "/chore"
|
||||
- ✅ Planner receives correct issue class (not JSONL)
|
||||
- ✅ All workflow steps execute successfully
|
||||
- ✅ Step history shows clean result_text values
|
||||
- ✅ Logs show result extraction working
|
||||
- ✅ Complete workflow creates PR
|
||||
|
||||
## Validation Commands
|
||||
|
||||
```bash
|
||||
# Unit Tests
|
||||
cd python && uv run pytest tests/agent_work_orders/test_models.py -v
|
||||
cd python && uv run pytest tests/agent_work_orders/test_agent_executor.py -v
|
||||
cd python && uv run pytest tests/agent_work_orders/test_workflow_operations.py -v
|
||||
|
||||
# Full Suite
|
||||
cd python && uv run pytest tests/agent_work_orders/ -v --tb=short
|
||||
cd python && uv run pytest tests/agent_work_orders/ --cov=src/agent_work_orders --cov-report=term-missing
|
||||
cd python && uv run pytest # All backend tests
|
||||
|
||||
# Quality Checks
|
||||
cd python && uv run mypy src/agent_work_orders/
|
||||
cd python && uv run ruff check src/agent_work_orders/
|
||||
|
||||
# Integration Test
|
||||
cd python && uv run uvicorn src.agent_work_orders.main:app --port 8888 &
|
||||
sleep 5
|
||||
curl http://localhost:8888/health | jq
|
||||
|
||||
# Create test work order
|
||||
WORK_ORDER=$(curl -X POST http://localhost:8888/agent-work-orders \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"repository_url":"https://github.com/Wirasm/dylan.git","sandbox_type":"git_branch","workflow_type":"agent_workflow_plan","github_issue_number":"1"}' \
|
||||
| jq -r '.agent_work_order_id')
|
||||
|
||||
echo "Work Order: $WORK_ORDER"
|
||||
sleep 20
|
||||
|
||||
# Check execution
|
||||
curl http://localhost:8888/agent-work-orders/$WORK_ORDER | jq
|
||||
curl http://localhost:8888/agent-work-orders/$WORK_ORDER/steps | jq '.steps[] | {step, agent_name, success, output}'
|
||||
|
||||
# Verify logs
|
||||
ls /tmp/agent-work-orders/*/outputs/
|
||||
cat /tmp/agent-work-orders/*/outputs/*.jsonl | grep '"result"'
|
||||
|
||||
# Cleanup
|
||||
pkill -f "uvicorn.*8888"
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
**Design Decisions:**
|
||||
- Preserve `stdout` containing raw JSONL for debugging
|
||||
- `result_text` is the new preferred field for clean output
|
||||
- Fallback to `stdout` in some workflow operations (defensive)
|
||||
- Support both `$ARGUMENTS` and `$1, $2, $3` for flexibility
|
||||
- Backward compatible - optional fields, graceful fallbacks
|
||||
|
||||
**Why This Fixes the Issue:**
|
||||
```
|
||||
Before Fix:
|
||||
Classifier stdout: '{"type":"result","result":"/feature","is_error":false}'
|
||||
Planner receives: '{"type":"result","result":"/feature","is_error":false}' ❌
|
||||
Error: "Unknown issue class: {JSONL...}"
|
||||
|
||||
After Fix:
|
||||
Classifier stdout: '{"type":"result","result":"/feature","is_error":false}'
|
||||
Classifier result_text: "/feature"
|
||||
Planner receives: "/feature" ✅
|
||||
Success: Clean classification flows to next step
|
||||
```
|
||||
|
||||
**Claude CLI JSONL Format:**
```json
{"type":"session_started","session_id":"abc-123"}
{"type":"text","text":"I'm analyzing..."}
{"type":"result","result":"/feature","is_error":false}
```

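A minimal sketch of how the final result message can be pulled out of such a stream; the standalone function below is illustrative and is not the repository's `_extract_result_message()` implementation, though it follows the same idea of keeping the last `type:"result"` message.

```python
# Sketch only: scan the JSONL stream, keep the last "result" message,
# and expose its "result" value as result_text.
import json


def extract_result_text(stdout_text: str) -> str | None:
    result_message: dict | None = None
    for line in stdout_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate partial or malformed lines
        if isinstance(message, dict) and message.get("type") == "result":
            result_message = message
    if result_message is None or "result" not in result_message:
        return None
    value = result_message.get("result")
    return str(value) if value is not None else None


# Example with the stream shown above:
stream = (
    '{"type":"session_started","session_id":"abc-123"}\n'
    '{"type":"text","text":"I\'m analyzing..."}\n'
    '{"type":"result","result":"/feature","is_error":false}'
)
assert extract_result_text(stream) == "/feature"
```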
**Future Improvements:**
|
||||
- Add result_json field for structured data
|
||||
- Support more placeholder patterns (${ISSUE_NUMBER}, etc.)
|
||||
- Validate command files have required placeholders
|
||||
- Add metrics for result_text extraction success rate
|
||||
- Consider streaming result extraction for long-running agents
|
||||
|
||||
**Migration Path:**
|
||||
1. Add result_text field (backward compatible)
|
||||
2. Extract in executor (backward compatible)
|
||||
3. Update workflow operations (backward compatible - fallback)
|
||||
4. Deploy and validate
|
||||
5. Future: Remove stdout usage entirely
|
||||
@@ -1,724 +0,0 @@
|
||||
# Feature: Incremental Step History Tracking for Real-Time Workflow Observability
|
||||
|
||||
## Feature Description
|
||||
|
||||
Enable real-time progress visibility for Agent Work Orders by saving step history incrementally after each workflow step completes, rather than waiting until the end. This critical observability fix allows users to monitor workflow execution in real-time via the `/agent-work-orders/{id}/steps` API endpoint, providing immediate feedback on which steps have completed, which are in progress, and which have failed.
|
||||
|
||||
Currently, step history is only saved at two points: when the entire workflow completes successfully (line 260 in orchestrator) or when the workflow fails with an exception (line 269). This means users polling the steps endpoint see zero progress information until the workflow reaches one of these terminal states, creating a black-box execution experience that can last several minutes.
|
||||
|
||||
## User Story
|
||||
|
||||
As a developer using the Agent Work Orders system
|
||||
I want to see real-time progress as each workflow step completes
|
||||
So that I can monitor execution, debug failures quickly, and understand what the system is doing without waiting for the entire workflow to finish
|
||||
|
||||
## Problem Statement
|
||||
|
||||
The current implementation has a critical observability gap that prevents real-time progress tracking:
|
||||
|
||||
**Root Cause:**
|
||||
- Step history is initialized at workflow start: `step_history = StepHistory(agent_work_order_id=agent_work_order_id)` (line 82)
|
||||
- After each step executes, results are appended: `step_history.steps.append(result)` (lines 130, 150, 166, 186, 205, 224, 241)
|
||||
- **BUT** step history is only saved to state at:
|
||||
- Line 260: `await self.state_repository.save_step_history(...)` - After ALL 7 steps complete successfully
|
||||
- Line 269: `await self.state_repository.save_step_history(...)` - In exception handler when workflow fails
|
||||
|
||||
**Impact:**
|
||||
1. **Zero Real-Time Visibility**: Users polling `/agent-work-orders/{id}/steps` see an empty array until workflow completes or fails
|
||||
2. **Poor Debugging Experience**: Cannot see which step failed until the entire workflow terminates
|
||||
3. **Uncertain Progress**: Long-running workflows (3-5 minutes) appear frozen with no progress indication
|
||||
4. **Wasted API Calls**: Clients poll repeatedly but get no new information until terminal state
|
||||
5. **Bad User Experience**: Cannot show meaningful progress bars, step indicators, or real-time status updates in UI
|
||||
|
||||
**Example Scenario:**
|
||||
```
|
||||
User creates work order → Polls /steps endpoint every 3 seconds
|
||||
0s: [] (empty)
|
||||
3s: [] (empty)
|
||||
6s: [] (empty)
|
||||
... workflow running ...
|
||||
120s: [] (empty)
|
||||
123s: [] (empty)
|
||||
... workflow running ...
|
||||
180s: [all 7 steps] (suddenly all appear at once)
|
||||
```
|
||||
|
||||
This creates a frustrating experience where users have no insight into what's happening for minutes at a time.
|
||||
|
||||
## Solution Statement
|
||||
|
||||
Implement incremental step history persistence by adding a single `await self.state_repository.save_step_history()` call immediately after each step result is appended to the history. This simple change enables real-time progress tracking with minimal code modification and zero performance impact.
|
||||
|
||||
**Implementation:**
|
||||
- After each `step_history.steps.append(result)` call, immediately save: `await self.state_repository.save_step_history(agent_work_order_id, step_history)` (see the sketch below)
|
||||
- Apply this pattern consistently across all 7 workflow steps
|
||||
- Preserve existing end-of-workflow and error-handler saves for robustness
|
||||
- No changes needed to API, models, or state repository (already supports incremental saves)
|
||||
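The seven additions are the same idiom repeated; a hypothetical helper (not in the codebase, the spec adds the two lines inline instead) shows the shape:

```python
async def record_step(state_repository, agent_work_order_id: str, step_history, result) -> None:
    """Hypothetical helper: append a completed step and persist the snapshot immediately.

    The spec keeps these two lines inline after each step rather than extracting a helper,
    but the effect is identical: every finished step becomes visible through
    GET /agent-work-orders/{id}/steps right away.
    """
    step_history.steps.append(result)
    await state_repository.save_step_history(agent_work_order_id, step_history)
```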
|
||||
**Result:**
|
||||
```
|
||||
User creates work order → Polls /steps endpoint every 3 seconds
|
||||
0s: [] (empty - workflow starting)
|
||||
3s: [{classify step}] (classification complete!)
|
||||
10s: [{classify}, {plan}] (planning complete!)
|
||||
20s: [{classify}, {plan}, {find_plan}] (plan file found!)
|
||||
... progress visible at each step ...
|
||||
180s: [all 7 steps] (complete with full history)
|
||||
```
|
||||
|
||||
This provides immediate feedback, enables meaningful progress UIs, and dramatically improves the developer experience.
|
||||
|
||||
## Relevant Files
|
||||
|
||||
Use these files to implement the feature:
|
||||
|
||||
**Core Implementation:**
|
||||
- `python/src/agent_work_orders/workflow_engine/workflow_orchestrator.py` (lines 122-269)
|
||||
- Main orchestration logic where step history is managed
|
||||
- Currently appends to step_history but doesn't save incrementally
|
||||
- Need to add `save_step_history()` calls after each step completion (7 locations)
|
||||
- Lines to modify: 130, 150, 166, 186, 205, 224, 241 (add save call after each append)
|
||||
|
||||
**State Management (No Changes Needed):**
|
||||
- `python/src/agent_work_orders/state_manager/work_order_repository.py` (lines 147-163)
|
||||
- Already implements `save_step_history()` method with proper locking (illustrated in the sketch below)
|
||||
- Thread-safe with asyncio.Lock for concurrent access
|
||||
- Logs each save operation for observability
|
||||
- Works perfectly for incremental saves - no modifications required
|
||||
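For orientation, the behaviour described above boils down to something like this minimal sketch, assuming the in-memory Phase 1 store (the real class has more methods and structured logging):

```python
import asyncio
from typing import Any

class InMemoryStepHistoryStore:
    """Sketch only: lock-guarded, in-memory snapshot store keyed by work order id."""

    def __init__(self) -> None:
        self._histories: dict[str, Any] = {}
        self._lock = asyncio.Lock()

    async def save_step_history(self, agent_work_order_id: str, step_history: Any) -> None:
        # Each call overwrites the previous snapshot for that work order
        async with self._lock:
            self._histories[agent_work_order_id] = step_history

    async def get_step_history(self, agent_work_order_id: str) -> Any | None:
        async with self._lock:
            return self._histories.get(agent_work_order_id)
```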
|
||||
**API Layer (No Changes Needed):**
|
||||
- `python/src/agent_work_orders/api/routes.py` (lines 220-240)
|
||||
- Already implements `GET /agent-work-orders/{id}/steps` endpoint
|
||||
- Returns step history from state repository
|
||||
- Will automatically return incremental results once orchestrator saves them
|
||||
|
||||
**Models (No Changes Needed):**
|
||||
- `python/src/agent_work_orders/models.py` (lines 213-246)
|
||||
- `StepHistory` model is immutable-friendly (each save creates full snapshot)
|
||||
- `StepExecutionResult` captures all step details
|
||||
- Models already support incremental history updates
|
||||
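The field names below are inferred from how the tests later in this plan construct these models; the authoritative definitions live in `models.py` and may carry extra fields:

```python
from enum import Enum
from pydantic import BaseModel, Field

class WorkflowStep(str, Enum):  # subset shown; the real enum has all seven step values
    CLASSIFY = "classify"
    PLAN = "plan"

class StepExecutionResult(BaseModel):  # one entry per completed workflow step
    step: WorkflowStep
    agent_name: str
    success: bool
    output: str
    duration_seconds: float

class StepHistory(BaseModel):  # full snapshot saved after every step
    agent_work_order_id: str
    steps: list[StepExecutionResult] = Field(default_factory=list)
```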
|
||||
### New Files
|
||||
|
||||
No new files needed - this is a simple enhancement to existing workflow orchestrator.
|
||||
|
||||
## Implementation Plan
|
||||
|
||||
### Phase 1: Foundation - Add Incremental Saves After Each Step
|
||||
|
||||
Add `save_step_history()` calls immediately after each step result is appended to enable real-time progress tracking. This is the core fix.
|
||||
|
||||
### Phase 2: Testing - Verify Real-Time Updates
|
||||
|
||||
Create comprehensive tests to verify step history is saved incrementally and accessible via API throughout workflow execution.
|
||||
|
||||
### Phase 3: Validation - End-to-End Testing
|
||||
|
||||
Validate with real workflow execution that step history appears incrementally when polling the steps endpoint.
|
||||
|
||||
## Step by Step Tasks
|
||||
|
||||
IMPORTANT: Execute every step in order, top to bottom.
|
||||
|
||||
### Read Current Implementation
|
||||
|
||||
- Open `python/src/agent_work_orders/workflow_engine/workflow_orchestrator.py`
|
||||
- Review the workflow execution flow from lines 122-269
|
||||
- Identify all 7 locations where `step_history.steps.append()` is called
|
||||
- Understand the pattern: append result → log completion → (currently missing: save history)
|
||||
- Note that `save_step_history()` already exists in state_repository and is thread-safe
|
||||
|
||||
### Add Incremental Save After Classify Step
|
||||
|
||||
- Locate line 130: `step_history.steps.append(classify_result)`
|
||||
- Immediately after line 130, add:
|
||||
```python
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
```
|
||||
- This enables visibility of classification result in real-time
|
||||
- Save the file
|
||||
|
||||
### Add Incremental Save After Plan Step
|
||||
|
||||
- Locate line 150: `step_history.steps.append(plan_result)`
|
||||
- Immediately after line 150, add:
|
||||
```python
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
```
|
||||
- This enables visibility of planning result in real-time
|
||||
- Save the file
|
||||
|
||||
### Add Incremental Save After Find Plan Step
|
||||
|
||||
- Locate line 166: `step_history.steps.append(plan_finder_result)`
|
||||
- Immediately after line 166, add:
|
||||
```python
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
```
|
||||
- This enables visibility of plan file discovery in real-time
|
||||
- Save the file
|
||||
|
||||
### Add Incremental Save After Branch Generation Step
|
||||
|
||||
- Locate line 186: `step_history.steps.append(branch_result)`
|
||||
- Immediately after line 186, add:
|
||||
```python
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
```
|
||||
- This enables visibility of branch creation in real-time
|
||||
- Save the file
|
||||
|
||||
### Add Incremental Save After Implementation Step
|
||||
|
||||
- Locate line 205: `step_history.steps.append(implement_result)`
|
||||
- Immediately after line 205, add:
|
||||
```python
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
```
|
||||
- This enables visibility of implementation progress in real-time
|
||||
- This is especially important as implementation can take 1-2 minutes
|
||||
- Save the file
|
||||
|
||||
### Add Incremental Save After Commit Step
|
||||
|
||||
- Locate line 224: `step_history.steps.append(commit_result)`
|
||||
- Immediately after line 224, add:
|
||||
```python
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
```
|
||||
- This enables visibility of commit creation in real-time
|
||||
- Save the file
|
||||
|
||||
### Add Incremental Save After PR Creation Step
|
||||
|
||||
- Locate line 241: `step_history.steps.append(pr_result)`
|
||||
- Immediately after line 241, add:
|
||||
```python
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
```
|
||||
- This enables visibility of PR creation result in real-time
|
||||
- Save the file
|
||||
- Verify all 7 locations now have incremental saves
|
||||
|
||||
### Add Comprehensive Unit Test for Incremental Saves
|
||||
|
||||
- Open `python/tests/agent_work_orders/test_workflow_engine.py`
|
||||
- Add new test function at the end of file:
|
||||
```python
|
||||
@pytest.mark.asyncio
|
||||
async def test_orchestrator_saves_step_history_incrementally():
|
||||
"""Test that step history is saved after each step, not just at the end"""
|
||||
from src.agent_work_orders.models import (
|
||||
CommandExecutionResult,
|
||||
StepExecutionResult,
|
||||
WorkflowStep,
|
||||
)
|
||||
from src.agent_work_orders.workflow_engine.agent_names import CLASSIFIER
|
||||
|
||||
# Create mocks
|
||||
mock_executor = MagicMock()
|
||||
mock_sandbox_factory = MagicMock()
|
||||
mock_github_client = MagicMock()
|
||||
mock_phase_tracker = MagicMock()
|
||||
mock_command_loader = MagicMock()
|
||||
mock_state_repository = MagicMock()
|
||||
|
||||
# Track save_step_history calls
|
||||
save_calls = []
|
||||
async def track_save(wo_id, history):
|
||||
save_calls.append(len(history.steps))
|
||||
|
||||
mock_state_repository.save_step_history = AsyncMock(side_effect=track_save)
|
||||
mock_state_repository.update_status = AsyncMock()
|
||||
mock_state_repository.update_git_branch = AsyncMock()
|
||||
|
||||
# Mock sandbox
|
||||
mock_sandbox = MagicMock()
|
||||
mock_sandbox.working_dir = "/tmp/test"
|
||||
mock_sandbox.setup = AsyncMock()
|
||||
mock_sandbox.cleanup = AsyncMock()
|
||||
mock_sandbox_factory.create_sandbox = MagicMock(return_value=mock_sandbox)
|
||||
|
||||
# Mock GitHub client
|
||||
mock_github_client.get_issue = AsyncMock(return_value={
|
||||
"title": "Test Issue",
|
||||
"body": "Test body"
|
||||
})
|
||||
|
||||
# Create orchestrator
|
||||
orchestrator = WorkflowOrchestrator(
|
||||
agent_executor=mock_executor,
|
||||
sandbox_factory=mock_sandbox_factory,
|
||||
github_client=mock_github_client,
|
||||
phase_tracker=mock_phase_tracker,
|
||||
command_loader=mock_command_loader,
|
||||
state_repository=mock_state_repository,
|
||||
)
|
||||
|
||||
# Mock workflow operations to return success for all steps
|
||||
with patch("src.agent_work_orders.workflow_engine.workflow_operations.classify_issue") as mock_classify:
|
||||
with patch("src.agent_work_orders.workflow_engine.workflow_operations.build_plan") as mock_plan:
|
||||
with patch("src.agent_work_orders.workflow_engine.workflow_operations.find_plan_file") as mock_find:
|
||||
with patch("src.agent_work_orders.workflow_engine.workflow_operations.generate_branch") as mock_branch:
|
||||
with patch("src.agent_work_orders.workflow_engine.workflow_operations.implement_plan") as mock_implement:
|
||||
with patch("src.agent_work_orders.workflow_engine.workflow_operations.create_commit") as mock_commit:
|
||||
with patch("src.agent_work_orders.workflow_engine.workflow_operations.create_pull_request") as mock_pr:
|
||||
|
||||
# Mock successful results for each step
|
||||
mock_classify.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.CLASSIFY,
|
||||
agent_name=CLASSIFIER,
|
||||
success=True,
|
||||
output="/feature",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
|
||||
mock_plan.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.PLAN,
|
||||
agent_name="planner",
|
||||
success=True,
|
||||
output="Plan created",
|
||||
duration_seconds=2.0,
|
||||
)
|
||||
|
||||
mock_find.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.FIND_PLAN,
|
||||
agent_name="plan_finder",
|
||||
success=True,
|
||||
output="specs/plan.md",
|
||||
duration_seconds=0.5,
|
||||
)
|
||||
|
||||
mock_branch.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.GENERATE_BRANCH,
|
||||
agent_name="branch_generator",
|
||||
success=True,
|
||||
output="feat-issue-1-wo-test",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
|
||||
mock_implement.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.IMPLEMENT,
|
||||
agent_name="implementor",
|
||||
success=True,
|
||||
output="Implementation complete",
|
||||
duration_seconds=5.0,
|
||||
)
|
||||
|
||||
mock_commit.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.COMMIT,
|
||||
agent_name="committer",
|
||||
success=True,
|
||||
output="Commit created",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
|
||||
mock_pr.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.CREATE_PR,
|
||||
agent_name="pr_creator",
|
||||
success=True,
|
||||
output="https://github.com/owner/repo/pull/1",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
|
||||
# Execute workflow
|
||||
await orchestrator.execute_workflow(
|
||||
agent_work_order_id="wo-test",
|
||||
workflow_type=AgentWorkflowType.PLAN,
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
user_request="Test feature request",
|
||||
)
|
||||
|
||||
# Verify save_step_history was called after EACH step (7 times) + final save (8 total)
|
||||
# OR at minimum, verify it was called MORE than just once at the end
|
||||
assert len(save_calls) >= 7, f"Expected at least 7 incremental saves, got {len(save_calls)}"
|
||||
|
||||
# Verify the progression: 1 step, 2 steps, 3 steps, etc.
|
||||
assert save_calls[0] == 1, "First save should have 1 step"
|
||||
assert save_calls[1] == 2, "Second save should have 2 steps"
|
||||
assert save_calls[2] == 3, "Third save should have 3 steps"
|
||||
assert save_calls[3] == 4, "Fourth save should have 4 steps"
|
||||
assert save_calls[4] == 5, "Fifth save should have 5 steps"
|
||||
assert save_calls[5] == 6, "Sixth save should have 6 steps"
|
||||
assert save_calls[6] == 7, "Seventh save should have 7 steps"
|
||||
```
|
||||
- Save the file
|
||||
|
||||
### Add Integration Test for Real-Time Step Visibility
|
||||
|
||||
- Still in `python/tests/agent_work_orders/test_workflow_engine.py`
|
||||
- Add another test function:
|
||||
```python
|
||||
@pytest.mark.asyncio
|
||||
async def test_step_history_visible_during_execution():
|
||||
"""Test that step history can be retrieved during workflow execution"""
|
||||
from src.agent_work_orders.models import StepHistory
|
||||
|
||||
# Create real state repository (in-memory)
|
||||
from src.agent_work_orders.state_manager.work_order_repository import WorkOrderRepository
|
||||
state_repo = WorkOrderRepository()
|
||||
|
||||
# Create empty step history
|
||||
step_history = StepHistory(agent_work_order_id="wo-test")
|
||||
|
||||
# Simulate incremental saves during workflow
|
||||
from src.agent_work_orders.models import StepExecutionResult, WorkflowStep
|
||||
|
||||
# Step 1: Classify
|
||||
step_history.steps.append(StepExecutionResult(
|
||||
step=WorkflowStep.CLASSIFY,
|
||||
agent_name="classifier",
|
||||
success=True,
|
||||
output="/feature",
|
||||
duration_seconds=1.0,
|
||||
))
|
||||
await state_repo.save_step_history("wo-test", step_history)
|
||||
|
||||
# Retrieve and verify
|
||||
retrieved = await state_repo.get_step_history("wo-test")
|
||||
assert retrieved is not None
|
||||
assert len(retrieved.steps) == 1
|
||||
assert retrieved.steps[0].step == WorkflowStep.CLASSIFY
|
||||
|
||||
# Step 2: Plan
|
||||
step_history.steps.append(StepExecutionResult(
|
||||
step=WorkflowStep.PLAN,
|
||||
agent_name="planner",
|
||||
success=True,
|
||||
output="Plan created",
|
||||
duration_seconds=2.0,
|
||||
))
|
||||
await state_repo.save_step_history("wo-test", step_history)
|
||||
|
||||
# Retrieve and verify progression
|
||||
retrieved = await state_repo.get_step_history("wo-test")
|
||||
assert len(retrieved.steps) == 2
|
||||
assert retrieved.steps[1].step == WorkflowStep.PLAN
|
||||
|
||||
# Verify both steps are present
|
||||
assert retrieved.steps[0].step == WorkflowStep.CLASSIFY
|
||||
assert retrieved.steps[1].step == WorkflowStep.PLAN
|
||||
```
|
||||
- Save the file
|
||||
|
||||
### Run Unit Tests for Workflow Engine
|
||||
|
||||
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_workflow_engine.py::test_orchestrator_saves_step_history_incrementally -v`
|
||||
- Verify the test passes and confirms incremental saves occur
|
||||
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_workflow_engine.py::test_step_history_visible_during_execution -v`
|
||||
- Verify the test passes
|
||||
- Fix any failures before proceeding
|
||||
|
||||
### Run All Workflow Engine Tests
|
||||
|
||||
- Execute: `cd python && uv run pytest tests/agent_work_orders/test_workflow_engine.py -v`
|
||||
- Ensure all existing tests still pass (zero regressions)
|
||||
- Verify new tests are included in the run
|
||||
- Fix any failures
|
||||
|
||||
### Run Complete Agent Work Orders Test Suite
|
||||
|
||||
- Execute: `cd python && uv run pytest tests/agent_work_orders/ -v`
|
||||
- Ensure all tests across all modules pass
|
||||
- This validates no regressions were introduced
|
||||
- Pay special attention to state manager and API tests
|
||||
- Fix any failures
|
||||
|
||||
### Run Type Checking
|
||||
|
||||
- Execute: `cd python && uv run mypy src/agent_work_orders/workflow_engine/workflow_orchestrator.py`
|
||||
- Verify no type errors in the orchestrator
|
||||
- Execute: `cd python && uv run mypy src/agent_work_orders/`
|
||||
- Verify no type errors in the entire module
|
||||
- Fix any type issues
|
||||
|
||||
### Run Linting
|
||||
|
||||
- Execute: `cd python && uv run ruff check src/agent_work_orders/workflow_engine/workflow_orchestrator.py`
|
||||
- Verify no linting issues in orchestrator
|
||||
- Execute: `cd python && uv run ruff check src/agent_work_orders/`
|
||||
- Verify no linting issues in entire module
|
||||
- Fix any issues found
|
||||
|
||||
### Perform Manual End-to-End Validation
|
||||
|
||||
- Start the Agent Work Orders server:
|
||||
```bash
|
||||
cd python && uv run uvicorn src.agent_work_orders.main:app --port 8888 &
|
||||
```
|
||||
- Wait for startup: `sleep 5`
|
||||
- Verify health: `curl http://localhost:8888/health | jq`
|
||||
- Create a test work order:
|
||||
```bash
|
||||
WORK_ORDER_ID=$(curl -s -X POST http://localhost:8888/agent-work-orders \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{
|
||||
"repository_url": "https://github.com/Wirasm/dylan.git",
|
||||
"sandbox_type": "git_branch",
|
||||
"workflow_type": "agent_workflow_plan",
|
||||
"user_request": "Add a test feature for real-time step tracking validation"
|
||||
}' | jq -r '.agent_work_order_id')
|
||||
echo "Created work order: $WORK_ORDER_ID"
|
||||
```
|
||||
- Immediately start polling for steps (in a loop or manually):
|
||||
```bash
|
||||
# Poll every 3 seconds to observe real-time progress
|
||||
for i in {1..60}; do
|
||||
echo "=== Poll $i ($(date +%H:%M:%S)) ==="
|
||||
curl -s http://localhost:8888/agent-work-orders/$WORK_ORDER_ID/steps | jq '.steps | length'
|
||||
curl -s http://localhost:8888/agent-work-orders/$WORK_ORDER_ID/steps | jq '.steps[-1] | {step: .step, agent: .agent_name, success: .success}'
|
||||
sleep 3
|
||||
done
|
||||
```
|
||||
- Observe that step count increases incrementally: 0 → 1 → 2 → 3 → 4 → 5 → 6 → 7
|
||||
- Verify each step appears immediately after completion (not all at once at the end)
|
||||
- Verify you can see progress in real-time
|
||||
- Check final status: `curl http://localhost:8888/agent-work-orders/$WORK_ORDER_ID | jq '{status: .status, steps_completed: (.git_commit_count // 0)}'`
|
||||
- Stop the server: `pkill -f "uvicorn.*8888"`
|
||||
|
||||
### Document the Improvement
|
||||
|
||||
- Open `PRPs/specs/agent-work-orders-mvp-v2.md` (or relevant spec file)
|
||||
- Add a note in the Observability or Implementation Notes section:
|
||||
```markdown
|
||||
### Real-Time Progress Tracking
|
||||
|
||||
Step history is saved incrementally after each workflow step completes, enabling
|
||||
real-time progress visibility via the `/agent-work-orders/{id}/steps` endpoint.
|
||||
This allows users to monitor execution as it happens rather than waiting for the
|
||||
entire workflow to complete.
|
||||
|
||||
Implementation: `save_step_history()` is called after each `steps.append()` in
|
||||
the workflow orchestrator, providing immediate feedback to polling clients.
|
||||
```
|
||||
- Save the file
|
||||
|
||||
### Run Final Validation Commands
|
||||
|
||||
- Execute all validation commands listed in the Validation Commands section below
|
||||
- Ensure every command executes successfully
|
||||
- Verify zero regressions across the entire codebase
|
||||
- Confirm real-time progress tracking works end-to-end
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
### Unit Tests
|
||||
|
||||
**Workflow Orchestrator Tests:**
|
||||
- Test that `save_step_history()` is called after each workflow step
|
||||
- Test that step history is saved 7+ times during successful execution (once per step + final save)
|
||||
- Test that step count increases incrementally (1, 2, 3, 4, 5, 6, 7)
|
||||
- Test that step history is saved even when workflow fails mid-execution
|
||||
- Test that each save contains all steps completed up to that point
|
||||
|
||||
**State Repository Tests:**
|
||||
- Test that `save_step_history()` handles concurrent calls safely (already implemented with asyncio.Lock)
|
||||
- Test that retrieving step history returns the most recently saved version
|
||||
- Test that step history can be saved and retrieved multiple times for same work order
|
||||
- Test that step history overwrites previous version (not appends)
|
||||
|
||||
### Integration Tests
|
||||
|
||||
**End-to-End Workflow Tests:**
|
||||
- Test that step history can be retrieved via API during workflow execution
|
||||
- Test that polling `/agent-work-orders/{id}/steps` shows progressive updates
|
||||
- Test that step history contains correct number of steps at each save point
|
||||
- Test that step history is accessible immediately after each step completes
|
||||
- Test that failed steps are visible in step history before workflow terminates
|
||||
|
||||
**API Integration Tests:**
|
||||
- Test GET `/agent-work-orders/{id}/steps` returns empty array before first step
|
||||
- Test GET `/agent-work-orders/{id}/steps` returns 1 step after classification
|
||||
- Test GET `/agent-work-orders/{id}/steps` returns N steps after N steps complete
|
||||
- Test GET `/agent-work-orders/{id}/steps` returns complete history after workflow finishes
|
||||
|
||||
### Edge Cases
|
||||
|
||||
**Concurrent Access:**
|
||||
- Multiple clients polling `/agent-work-orders/{id}/steps` simultaneously
|
||||
- Step history being saved while another request reads it (handled by asyncio.Lock)
|
||||
- Workflow fails while client is retrieving step history
|
||||
|
||||
**Performance:**
|
||||
- Large step history (7 steps * 100+ lines each) saved multiple times
|
||||
- Multiple work orders executing simultaneously with incremental saves
|
||||
- High polling frequency (1 second intervals) during workflow execution
|
||||
|
||||
**Failure Scenarios:**
|
||||
- Step history save fails (network/disk error) - workflow should continue
|
||||
- Step history is saved but retrieval fails - should return appropriate error
|
||||
- Workflow interrupted mid-execution - partial step history should be preserved
|
||||
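For the first failure scenario above, a guarded save along these lines would keep the workflow moving; this is a sketch of the intent, not code mandated by this plan:

```python
import logging

logger = logging.getLogger(__name__)

async def save_step_history_safely(state_repository, agent_work_order_id: str, step_history) -> None:
    """Best-effort persistence: a failed save is logged but never aborts the workflow."""
    try:
        await state_repository.save_step_history(agent_work_order_id, step_history)
    except Exception:
        logger.exception("Failed to save step history for %s; continuing workflow", agent_work_order_id)
```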
|
||||
## Acceptance Criteria
|
||||
|
||||
**Core Functionality:**
|
||||
- ✅ Step history is saved after each workflow step completes
|
||||
- ✅ Step history is saved 7 times during successful workflow execution (once per step)
|
||||
- ✅ Each incremental save contains all steps completed up to that point
|
||||
- ✅ Step history is accessible via API immediately after each step
|
||||
- ✅ Real-time progress visible when polling `/agent-work-orders/{id}/steps`
|
||||
|
||||
**Backward Compatibility:**
|
||||
- ✅ All existing tests pass without modification
|
||||
- ✅ API behavior unchanged (same endpoints, same response format)
|
||||
- ✅ No breaking changes to models or state repository
|
||||
- ✅ Performance impact negligible (save operations are fast)
|
||||
|
||||
**Testing:**
|
||||
- ✅ New unit test verifies incremental saves occur
|
||||
- ✅ New integration test verifies step history visibility during execution
|
||||
- ✅ All existing workflow engine tests pass
|
||||
- ✅ All agent work orders tests pass
|
||||
- ✅ Manual end-to-end test confirms real-time progress tracking
|
||||
|
||||
**Code Quality:**
|
||||
- ✅ Type checking passes (mypy)
|
||||
- ✅ Linting passes (ruff)
|
||||
- ✅ Code follows existing patterns and conventions
|
||||
- ✅ Structured logging used for save operations
|
||||
|
||||
**Documentation:**
|
||||
- ✅ Implementation documented in spec file
|
||||
- ✅ Acceptance criteria met and verified
|
||||
- ✅ Validation commands executed successfully
|
||||
|
||||
## Validation Commands
|
||||
|
||||
Execute every command to validate the feature works correctly with zero regressions.
|
||||
|
||||
```bash
|
||||
# Unit Tests - Verify incremental saves
|
||||
cd python && uv run pytest tests/agent_work_orders/test_workflow_engine.py::test_orchestrator_saves_step_history_incrementally -v
|
||||
cd python && uv run pytest tests/agent_work_orders/test_workflow_engine.py::test_step_history_visible_during_execution -v
|
||||
|
||||
# Workflow Engine Tests - Ensure no regressions
|
||||
cd python && uv run pytest tests/agent_work_orders/test_workflow_engine.py -v
|
||||
|
||||
# State Manager Tests - Verify save_step_history works correctly
|
||||
cd python && uv run pytest tests/agent_work_orders/test_state_manager.py -v
|
||||
|
||||
# API Tests - Ensure steps endpoint still works
|
||||
cd python && uv run pytest tests/agent_work_orders/test_api.py -v
|
||||
|
||||
# Complete Agent Work Orders Test Suite
|
||||
cd python && uv run pytest tests/agent_work_orders/ -v --tb=short
|
||||
|
||||
# Type Checking
|
||||
cd python && uv run mypy src/agent_work_orders/workflow_engine/workflow_orchestrator.py
|
||||
cd python && uv run mypy src/agent_work_orders/
|
||||
|
||||
# Linting
|
||||
cd python && uv run ruff check src/agent_work_orders/workflow_engine/workflow_orchestrator.py
|
||||
cd python && uv run ruff check src/agent_work_orders/
|
||||
|
||||
# Full Backend Test Suite (zero regressions)
|
||||
cd python && uv run pytest
|
||||
|
||||
# Manual End-to-End Validation
|
||||
cd python && uv run uvicorn src.agent_work_orders.main:app --port 8888 &
|
||||
sleep 5
|
||||
curl http://localhost:8888/health | jq
|
||||
|
||||
# Create work order
|
||||
WORK_ORDER_ID=$(curl -s -X POST http://localhost:8888/agent-work-orders \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"repository_url":"https://github.com/Wirasm/dylan.git","sandbox_type":"git_branch","workflow_type":"agent_workflow_plan","user_request":"Test real-time progress"}' \
|
||||
| jq -r '.agent_work_order_id')
|
||||
|
||||
echo "Work Order: $WORK_ORDER_ID"
|
||||
|
||||
# Poll for real-time progress (observe step count increase: 0->1->2->3->4->5->6->7)
|
||||
for i in {1..30}; do
|
||||
STEP_COUNT=$(curl -s http://localhost:8888/agent-work-orders/$WORK_ORDER_ID/steps | jq '.steps | length')
|
||||
LAST_STEP=$(curl -s http://localhost:8888/agent-work-orders/$WORK_ORDER_ID/steps | jq -r '.steps[-1].step // "none"')
|
||||
echo "Poll $i: $STEP_COUNT steps completed, last: $LAST_STEP"
|
||||
sleep 3
|
||||
done
|
||||
|
||||
# Verify final state
|
||||
curl http://localhost:8888/agent-work-orders/$WORK_ORDER_ID | jq '{status: .status}'
|
||||
curl http://localhost:8888/agent-work-orders/$WORK_ORDER_ID/steps | jq '.steps | length'
|
||||
|
||||
# Cleanup
|
||||
pkill -f "uvicorn.*8888"
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
### Performance Considerations
|
||||
|
||||
**Save Operation Performance:**
|
||||
- `save_step_history()` is a fast in-memory operation (Phase 1 MVP)
|
||||
- Uses asyncio.Lock to prevent race conditions
|
||||
- No network I/O or disk writes in current implementation
|
||||
- Future Supabase migration (Phase 2) will add network latency but async execution prevents blocking
|
||||
|
||||
**Impact Analysis:**
|
||||
- Adding 7 incremental saves adds ~7ms total overhead (1ms per save in-memory)
|
||||
- This is negligible compared to agent execution time (30-60 seconds per step)
|
||||
- Total workflow time increase: <0.01% (unmeasurable)
|
||||
- Trade-off: Tiny performance cost for massive observability improvement
|
||||
|
||||
### Why This Fix is Critical
|
||||
|
||||
**User Experience Impact:**
|
||||
- **Before**: Black-box execution with 3-5 minute wait, zero feedback
|
||||
- **After**: Real-time progress updates every 30-60 seconds as steps complete
|
||||
|
||||
**Debugging Benefits:**
|
||||
- Immediately see which step failed without waiting for entire workflow
|
||||
- Monitor long-running implementation steps for progress
|
||||
- Identify bottlenecks in workflow execution
|
||||
|
||||
**API Efficiency:**
|
||||
- Clients still poll every 3 seconds, but now get meaningful updates
|
||||
- Fewer frustrated users refreshing pages or restarting work orders
|
||||
- Enables progress bars, step indicators, and real-time status UIs
|
||||
|
||||
### Implementation Simplicity
|
||||
|
||||
This is one of the simplest high-value features to implement:
|
||||
- **7 lines of code** (one `await save_step_history()` call per step)
|
||||
- **Zero API changes** (existing endpoint already works)
|
||||
- **Zero model changes** (StepHistory already supports this pattern)
|
||||
- **Zero state repository changes** (save_step_history() already thread-safe)
|
||||
- **High impact** (transforms user experience from frustrating to delightful)
|
||||
|
||||
### Future Enhancements
|
||||
|
||||
**Phase 2 - Supabase Persistence:**
|
||||
- When migrating to Supabase, the same incremental save pattern works
|
||||
- May want to batch saves (every 2-3 steps) to reduce DB writes
|
||||
- Consider write-through cache for high-frequency polling
|
||||
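If batching does become necessary under Supabase, it could look roughly like the sketch below; the class and its parameters are invented for illustration:

```python
class BatchedHistorySaver:
    """Illustrative only: persist the history every `batch_size` steps, or when forced."""

    def __init__(self, state_repository, batch_size: int = 2) -> None:
        self._repo = state_repository
        self._batch_size = batch_size
        self._pending = 0

    async def record(self, agent_work_order_id: str, step_history, result, *, force: bool = False) -> None:
        step_history.steps.append(result)
        self._pending += 1
        if force or self._pending >= self._batch_size:
            await self._repo.save_step_history(agent_work_order_id, step_history)
            self._pending = 0
```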
|
||||
**Phase 3 - WebSocket Support:**
|
||||
- Instead of polling, push step updates via WebSocket
|
||||
- Even better real-time experience with lower latency
|
||||
- Incremental saves still required as source of truth
|
||||
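A rough shape for that idea, assuming FastAPI (which the service already runs under uvicorn); the route, check interval, and repository wiring are all assumptions for illustration:

```python
import asyncio
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.encoders import jsonable_encoder

app = FastAPI()

@app.websocket("/agent-work-orders/{agent_work_order_id}/steps/ws")  # hypothetical route
async def stream_steps(websocket: WebSocket, agent_work_order_id: str) -> None:
    """Push newly completed steps to the client instead of having it poll."""
    await websocket.accept()
    repo = websocket.app.state.state_repository  # assumes the repo is attached at startup
    sent = 0
    try:
        while True:
            history = await repo.get_step_history(agent_work_order_id)
            steps = history.steps if history else []
            for step in steps[sent:]:
                await websocket.send_json(jsonable_encoder(step))
            sent = len(steps)
            await asyncio.sleep(1)  # server-side check interval; tune as needed
    except WebSocketDisconnect:
        pass
```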
|
||||
**Advanced Observability:**
|
||||
- Add step timing metrics (time between saves = step duration)
|
||||
- Track which steps consistently take longest
|
||||
- Alert on unusually slow step execution
|
||||
- Historical analysis of workflow performance
|
||||
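Because each `StepExecutionResult` already records `duration_seconds`, basic timing metrics fall out of the saved history directly; a small sketch:

```python
def step_durations(step_history) -> dict:
    """Map each recorded step to how long it took, using duration_seconds."""
    return {result.step: result.duration_seconds for result in step_history.steps}

def slowest_step(step_history):
    """Return the (step, duration) pair that took longest, or None for an empty history."""
    durations = step_durations(step_history)
    if not durations:
        return None
    return max(durations.items(), key=lambda item: item[1])
```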
|
||||
### Testing Philosophy
|
||||
|
||||
**Focus on Real-Time Visibility:**
|
||||
- Primary test: verify saves occur after each step (not just at end)
|
||||
- Secondary test: verify step count progression (1, 2, 3, 4, 5, 6, 7)
|
||||
- Integration test: confirm API returns incremental results during execution
|
||||
- Manual test: observe real progress while workflow runs
|
||||
|
||||
**Regression Prevention:**
|
||||
- All existing tests must pass unchanged
|
||||
- No API contract changes
|
||||
- No model changes
|
||||
- Performance impact negligible and measured
|
||||
|
||||
### Related Documentation
|
||||
|
||||
- Agent Work Orders MVP v2 Spec: `PRPs/specs/agent-work-orders-mvp-v2.md`
|
||||
- Atomic Workflow Execution: `PRPs/specs/atomic-workflow-execution-refactor.md`
|
||||
- PRD: `PRPs/PRD.md`
|
||||
@@ -1,26 +0,0 @@
|
||||
# Generate Git Branch
|
||||
|
||||
Create a git branch following the standard naming convention.
|
||||
|
||||
## Variables
|
||||
issue_class: $1
|
||||
issue_number: $2
|
||||
work_order_id: $3
|
||||
issue_json: $4
|
||||
|
||||
## Instructions
|
||||
|
||||
- Generate branch name: `<class>-issue-<num>-wo-<id>-<desc>`
|
||||
- <class>: bug, feat, or chore (remove slash from issue_class)
|
||||
- <desc>: 3-6 words, lowercase, hyphens
|
||||
- Extract issue details from issue_json
|
||||
|
||||
## Run
|
||||
|
||||
1. `git checkout main`
|
||||
2. `git pull`
|
||||
3. `git checkout -b <branch_name>`
|
||||
|
||||
## Output
|
||||
|
||||
Return ONLY the branch name created
|
||||
@@ -1,36 +0,0 @@
|
||||
# Issue Classification
|
||||
|
||||
Classify the GitHub issue into the appropriate category.
|
||||
|
||||
## Instructions
|
||||
|
||||
- Read the issue title and body carefully
|
||||
- Determine if this is a bug, feature, or chore
|
||||
- Respond ONLY with one of: /bug, /feature, /chore
|
||||
- If unclear, default to /feature
|
||||
|
||||
## Classification Rules
|
||||
|
||||
**Bug**: Fixing broken functionality
|
||||
- Issue describes something not working as expected
|
||||
- Error messages, crashes, incorrect behavior
|
||||
- Keywords: "error", "broken", "not working", "fails"
|
||||
|
||||
**Feature**: New functionality or enhancement
|
||||
- Issue requests new capability
|
||||
- Adds value to users
|
||||
- Keywords: "add", "implement", "support", "enable"
|
||||
|
||||
**Chore**: Maintenance, refactoring, documentation
|
||||
- No user-facing changes
|
||||
- Code cleanup, dependency updates, docs
|
||||
- Keywords: "refactor", "update", "clean", "docs"
|
||||
|
||||
## Input
|
||||
|
||||
GitHub Issue JSON:
|
||||
$ARGUMENTS
|
||||
|
||||
## Output
|
||||
|
||||
Return ONLY one of: /bug, /feature, /chore
|
||||
81
python/.claude/commands/agent-work-orders/commit.md
Normal file
@@ -0,0 +1,81 @@
|
||||
# Create Git Commit
|
||||
|
||||
Create an atomic git commit with a properly formatted commit message following best practices for the uncommitted changes, or for these specific files if specified.
|
||||
|
||||
Specific files (skip if not specified):
|
||||
|
||||
- File 1: $1
|
||||
- File 2: $2
|
||||
- File 3: $3
|
||||
- File 4: $4
|
||||
- File 5: $5
|
||||
|
||||
## Instructions
|
||||
|
||||
**Commit Message Format:**
|
||||
|
||||
- Use conventional commits: `<type>: <description>`
|
||||
- Types: `feat`, `fix`, `docs`, `style`, `refactor`, `test`, `chore`
|
||||
- Present tense (e.g., "add", "fix", "update", not "added", "fixed", "updated")
|
||||
- 50 characters or less for the subject line
|
||||
- Lowercase subject line
|
||||
- No period at the end
|
||||
- Be specific and descriptive
|
||||
|
||||
**Examples:**
|
||||
|
||||
- `feat: add web search tool with structured logging`
|
||||
- `fix: resolve type errors in middleware`
|
||||
- `test: add unit tests for config module`
|
||||
- `docs: update CLAUDE.md with testing guidelines`
|
||||
- `refactor: simplify logging configuration`
|
||||
- `chore: update dependencies`
|
||||
|
||||
**Atomic Commits:**
|
||||
|
||||
- One logical change per commit
|
||||
- If you've made multiple unrelated changes, consider splitting into separate commits
|
||||
- Commit should be self-contained and not break the build
|
||||
|
||||
**IMPORTANT**
|
||||
|
||||
- NEVER mention Claude Code, Anthropic, co-authored-by, or anything similar in the commit messages
|
||||
|
||||
## Run
|
||||
|
||||
1. Review changes: `git diff HEAD`
|
||||
2. Check status: `git status`
|
||||
3. Stage changes: `git add -A`
|
||||
4. Create commit: `git commit -m "<type>: <description>"`
|
||||
5. Push to remote: `git push -u origin $(git branch --show-current)`
|
||||
6. Verify push: `git log origin/$(git branch --show-current) -1 --oneline`
|
||||
|
||||
## Report
|
||||
|
||||
Output in this format (plain text, no markdown):
|
||||
|
||||
Commit: <commit-hash>
|
||||
Branch: <branch-name>
|
||||
Message: <commit-message>
|
||||
Pushed: Yes (or No if push failed)
|
||||
Files: <number> files changed
|
||||
|
||||
Then list the files:
|
||||
- <file1>
|
||||
- <file2>
|
||||
- ...
|
||||
|
||||
**Example:**
|
||||
```
|
||||
Commit: a3c2f1e
|
||||
Branch: feat/add-user-auth
|
||||
Message: feat: add user authentication system
|
||||
Pushed: Yes
|
||||
Files: 5 files changed
|
||||
|
||||
- src/auth/login.py
|
||||
- src/auth/middleware.py
|
||||
- tests/auth/test_login.py
|
||||
- CLAUDE.md
|
||||
- requirements.txt
|
||||
```
|
||||
@@ -1,26 +0,0 @@
|
||||
# Create Git Commit
|
||||
|
||||
Create a git commit with proper formatting.
|
||||
|
||||
## Variables
|
||||
agent_name: $1
|
||||
issue_class: $2
|
||||
issue_json: $3
|
||||
|
||||
## Instructions
|
||||
|
||||
- Format: `<agent>: <class>: <message>`
|
||||
- Message: Present tense, 50 chars max, descriptive
|
||||
- Examples:
|
||||
- `planner: feat: add user authentication`
|
||||
- `implementor: bug: fix login validation`
|
||||
|
||||
## Run
|
||||
|
||||
1. `git diff HEAD` - Review changes
|
||||
2. `git add -A` - Stage all
|
||||
3. `git commit -m "<message>"`
|
||||
|
||||
## Output
|
||||
|
||||
Return ONLY the commit message used
|
||||
104
python/.claude/commands/agent-work-orders/create-branch.md
Normal file
@@ -0,0 +1,104 @@
|
||||
# Create Git Branch
|
||||
|
||||
Generate a conventional branch name based on user request and create a new git branch.
|
||||
|
||||
## Variables
|
||||
|
||||
User request: $1
|
||||
|
||||
## Instructions
|
||||
|
||||
**Step 1: Check Current Branch**
|
||||
|
||||
- Check current branch: `git branch --show-current`
|
||||
- Check if on main/master:
|
||||
```bash
|
||||
CURRENT_BRANCH=$(git branch --show-current)
|
||||
if [[ "$CURRENT_BRANCH" != "main" && "$CURRENT_BRANCH" != "master" ]]; then
|
||||
echo "Warning: Currently on branch '$CURRENT_BRANCH', not main/master"
|
||||
echo "Proceeding with branch creation from current branch"
|
||||
fi
|
||||
```
|
||||
- Note: We proceed regardless, but log the warning
|
||||
|
||||
**Step 2: Generate Branch Name**
|
||||
|
||||
Use conventional branch naming:
|
||||
|
||||
**Prefixes:**
|
||||
- `feat/` - New feature or enhancement
|
||||
- `fix/` - Bug fix
|
||||
- `chore/` - Maintenance tasks (dependencies, configs, etc.)
|
||||
- `docs/` - Documentation only changes
|
||||
- `refactor/` - Code refactoring (no functionality change)
|
||||
- `test/` - Adding or updating tests
|
||||
- `perf/` - Performance improvements
|
||||
|
||||
**Naming Rules:**
|
||||
- Use kebab-case (lowercase with hyphens)
|
||||
- Be descriptive but concise (max 50 characters)
|
||||
- Remove special characters except hyphens
|
||||
- No spaces, use hyphens instead
|
||||
|
||||
**Examples:**
|
||||
- "Add user authentication system" → `feat/add-user-auth`
|
||||
- "Fix login redirect bug" → `fix/login-redirect`
|
||||
- "Update README documentation" → `docs/update-readme`
|
||||
- "Refactor database queries" → `refactor/database-queries`
|
||||
- "Add unit tests for API" → `test/api-unit-tests`
|
||||
|
||||
**Branch Name Generation Logic:**
|
||||
1. Analyze user request to determine type (feature/fix/chore/docs/refactor/test/perf)
|
||||
2. Extract key action and subject
|
||||
3. Convert to kebab-case
|
||||
4. Truncate if needed to keep under 50 chars
|
||||
5. Validate name is descriptive and follows conventions
|
||||
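A rough shell equivalent of that logic, for orientation only; in practice you pick the prefix from the request type yourself rather than hard-coding it:

```bash
REQUEST="Add user authentication system"
PREFIX="feat"   # chosen from the request type (feat/fix/chore/docs/refactor/test/perf)
SLUG=$(echo "$REQUEST" | tr '[:upper:]' '[:lower:]' | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//' | cut -c1-40)
BRANCH_NAME="$PREFIX/$SLUG"
echo "$BRANCH_NAME"   # -> feat/add-user-authentication-system
```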
|
||||
**Step 3: Check Branch Exists**
|
||||
|
||||
- Check if branch name already exists:
|
||||
```bash
|
||||
if git show-ref --verify --quiet refs/heads/<branch-name>; then
|
||||
echo "Branch <branch-name> already exists"
|
||||
# Append version suffix
|
||||
COUNTER=2
|
||||
while git show-ref --verify --quiet refs/heads/<branch-name>-v$COUNTER; do
|
||||
COUNTER=$((COUNTER + 1))
|
||||
done
|
||||
BRANCH_NAME="<branch-name>-v$COUNTER"
|
||||
fi
|
||||
```
|
||||
- If exists, append `-v2`, `-v3`, etc. until unique
|
||||
|
||||
**Step 4: Create and Checkout Branch**
|
||||
|
||||
- Create and checkout new branch: `git checkout -b <branch-name>`
|
||||
- Verify creation: `git branch --show-current`
|
||||
- Ensure output matches expected branch name
|
||||
|
||||
**Step 5: Verify Branch State**
|
||||
|
||||
- Confirm branch created: `git branch --list <branch-name>`
|
||||
- Confirm currently on branch: `[ "$(git branch --show-current)" = "<branch-name>" ]`
|
||||
- Check remote tracking: `git rev-parse --abbrev-ref --symbolic-full-name @{u} 2>/dev/null || echo "No upstream set"`
|
||||
|
||||
**Important Notes:**
|
||||
|
||||
- NEVER mention Claude Code, Anthropic, AI, or co-authoring in any output
|
||||
- Branch should be created locally only (no push yet)
|
||||
- Branch will be pushed later by commit.md command
|
||||
- If user request is unclear, prefer `feat/` prefix as default
|
||||
|
||||
## Report
|
||||
|
||||
Output ONLY the branch name (no markdown, no explanations, no quotes):
|
||||
|
||||
<branch-name>
|
||||
|
||||
**Example outputs:**
|
||||
```
|
||||
feat/add-user-auth
|
||||
fix/login-redirect-issue
|
||||
docs/update-api-documentation
|
||||
refactor/simplify-middleware
|
||||
```
|
||||
201
python/.claude/commands/agent-work-orders/create-pr.md
Normal file
@@ -0,0 +1,201 @@
|
||||
# Create GitHub Pull Request
|
||||
|
||||
Create a GitHub pull request for the current branch with auto-generated description.
|
||||
|
||||
## Variables
|
||||
|
||||
- Branch name: $1
|
||||
- PRP file path: $2 (optional - may be empty)
|
||||
|
||||
## Instructions
|
||||
|
||||
**Prerequisites Check:**
|
||||
|
||||
1. Verify gh CLI is authenticated:
|
||||
```bash
|
||||
gh auth status || {
|
||||
echo "Error: gh CLI not authenticated. Run: gh auth login"
|
||||
exit 1
|
||||
}
|
||||
```
|
||||
|
||||
2. Verify we're in a git repository:
|
||||
```bash
|
||||
git rev-parse --git-dir >/dev/null 2>&1 || {
|
||||
echo "Error: Not in a git repository"
|
||||
exit 1
|
||||
}
|
||||
```
|
||||
|
||||
3. Verify changes are pushed to remote:
|
||||
```bash
|
||||
BRANCH=$(git branch --show-current)
|
||||
git rev-parse --verify origin/$BRANCH >/dev/null 2>&1 || {
|
||||
echo "Error: Branch '$BRANCH' not pushed to remote. Run: git push -u origin $BRANCH"
|
||||
exit 1
|
||||
}
|
||||
```
|
||||
|
||||
**Step 1: Gather Information**
|
||||
|
||||
1. Get current branch name:
|
||||
```bash
|
||||
BRANCH=$(git branch --show-current)
|
||||
```
|
||||
|
||||
2. Get default base branch (usually main or master):
|
||||
```bash
|
||||
BASE=$(git remote show origin | grep 'HEAD branch' | cut -d' ' -f5)
|
||||
# Fallback to main if detection fails
|
||||
[ -z "$BASE" ] && BASE="main"
|
||||
```
|
||||
|
||||
3. Get repository info:
|
||||
```bash
|
||||
REPO=$(gh repo view --json nameWithOwner -q .nameWithOwner)
|
||||
```
|
||||
|
||||
**Step 2: Generate PR Title**
|
||||
|
||||
Convert branch name to conventional commit format:
|
||||
|
||||
**Rules:**
|
||||
- `feat/add-user-auth` → `feat: add user authentication`
|
||||
- `fix/login-bug` → `fix: resolve login bug`
|
||||
- `docs/update-readme` → `docs: update readme`
|
||||
- Keep the text after the prefix lowercase, as in the examples above
|
||||
- Remove hyphens, replace with spaces
|
||||
- Keep concise (under 72 characters)
|
||||
|
||||
**Step 3: Find PR Template**
|
||||
|
||||
Look for PR template in these locations (in order):
|
||||
|
||||
1. `.github/pull_request_template.md`
|
||||
2. `.github/PULL_REQUEST_TEMPLATE.md`
|
||||
3. `.github/PULL_REQUEST_TEMPLATE/pull_request_template.md`
|
||||
4. `docs/pull_request_template.md`
|
||||
|
||||
```bash
|
||||
PR_TEMPLATE=""
|
||||
if [ -f ".github/pull_request_template.md" ]; then
|
||||
PR_TEMPLATE=".github/pull_request_template.md"
|
||||
elif [ -f ".github/PULL_REQUEST_TEMPLATE.md" ]; then
|
||||
PR_TEMPLATE=".github/PULL_REQUEST_TEMPLATE.md"
|
||||
elif [ -f ".github/PULL_REQUEST_TEMPLATE/pull_request_template.md" ]; then
|
||||
PR_TEMPLATE=".github/PULL_REQUEST_TEMPLATE/pull_request_template.md"
|
||||
elif [ -f "docs/pull_request_template.md" ]; then
|
||||
PR_TEMPLATE="docs/pull_request_template.md"
|
||||
fi
|
||||
```
|
||||
|
||||
**Step 4: Generate PR Body**
|
||||
|
||||
**If PR template exists:**
|
||||
- Read template content
|
||||
- Fill in placeholders if present
|
||||
- If PRP file provided: Extract summary and insert into template
|
||||
|
||||
**If no PR template (use default):**
|
||||
|
||||
```markdown
|
||||
## Summary
|
||||
[Brief description of what this PR does]
|
||||
|
||||
## Changes
|
||||
[Bullet list of key changes from git log]
|
||||
|
||||
## Implementation Details
|
||||
[Reference PRP file if provided, otherwise summarize commits]
|
||||
|
||||
## Testing
|
||||
- [ ] All existing tests pass
|
||||
- [ ] New tests added (if applicable)
|
||||
- [ ] Manual testing completed
|
||||
|
||||
## Related Issues
|
||||
Closes #[issue number if applicable]
|
||||
```
|
||||
|
||||
**Auto-fill logic:**
|
||||
|
||||
1. **Summary section:**
|
||||
- If PRP file exists: Extract "Feature Description" section
|
||||
- Otherwise: Use first commit message body
|
||||
- Fallback: Summarize changes from `git diff --stat`
|
||||
|
||||
2. **Changes section:**
|
||||
- Get commit messages: `git log $BASE..$BRANCH --pretty=format:"- %s"`
|
||||
- List modified files: `git diff --name-only $BASE...$BRANCH`
|
||||
- Format as bullet points
|
||||
|
||||
3. **Implementation Details:**
|
||||
- If PRP file exists: Link to it with `See: $PRP_FILE_PATH`
|
||||
- Extract key technical details from PRP "Solution Statement"
|
||||
- Otherwise: Summarize from commit messages
|
||||
|
||||
4. **Testing section:**
|
||||
- Check if new test files were added: `git diff --name-only $BASE...$BRANCH | grep test`
|
||||
- Auto-check test boxes if tests exist
|
||||
- Include validation results from execute.md if available
|
||||
|
||||
**Step 5: Create Pull Request**
|
||||
|
||||
```bash
|
||||
gh pr create \
|
||||
--title "$PR_TITLE" \
|
||||
--body "$PR_BODY" \
|
||||
--base "$BASE" \
|
||||
--head "$BRANCH" \
|
||||
--web
|
||||
```
|
||||
|
||||
**Flags:**
|
||||
- `--web`: Open PR in browser after creation
|
||||
- If `--web` not desired, remove it
|
||||
|
||||
**Step 6: Capture PR URL**
|
||||
|
||||
```bash
|
||||
PR_URL=$(gh pr view --json url -q .url)
|
||||
```
|
||||
|
||||
**Step 7: Link to Issues (if applicable)**
|
||||
|
||||
If PRP file or commits mention issue numbers (#123), link them:
|
||||
|
||||
```bash
|
||||
# Extract issue numbers from commits
|
||||
ISSUES=$(git log $BASE..$BRANCH --pretty=format:"%s %b" | grep -oP '#\K\d+' | sort -u)
|
||||
|
||||
# Link issues to PR
|
||||
for ISSUE in $ISSUES; do
|
||||
gh pr comment $PR_URL --body "Relates to #$ISSUE"
|
||||
done
|
||||
```
|
||||
|
||||
**Important Notes:**
|
||||
|
||||
- NEVER mention Claude Code, Anthropic, AI, or co-authoring in PR
|
||||
- PR title and body should be professional and clear
|
||||
- Include all relevant context for reviewers
|
||||
- Link to PRP file in repo if available
|
||||
- Auto-check completed checkboxes in template
|
||||
|
||||
## Report
|
||||
|
||||
Output ONLY the PR URL (no markdown, no explanations, no quotes):
|
||||
|
||||
https://github.com/owner/repo/pull/123
|
||||
|
||||
**Example output:**
|
||||
```
|
||||
https://github.com/coleam00/archon/pull/456
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
If PR creation fails:
|
||||
- Check if PR already exists for branch: `gh pr list --head $BRANCH`
|
||||
- If exists: Return existing PR URL
|
||||
- If other error: Output error message with context
|
||||
27
python/.claude/commands/agent-work-orders/execute.md
Normal file
@@ -0,0 +1,27 @@
|
||||
# Execute PRP Plan
|
||||
|
||||
Implement a feature plan from the PRPs directory by following its Step by Step Tasks section.
|
||||
|
||||
## Variables
|
||||
|
||||
Plan file: $ARGUMENTS
|
||||
|
||||
## Instructions
|
||||
|
||||
- Read the entire plan file carefully
|
||||
- Execute **every step** in the "Step by Step Tasks" section in order, top to bottom
|
||||
- Follow the "Testing Strategy" to create proper unit and integration tests
|
||||
- Complete all "Validation Commands" at the end
|
||||
- Ensure all linters pass and all tests pass before finishing
|
||||
- Follow CLAUDE.md guidelines for type safety, logging, and docstrings
|
||||
|
||||
## When done
|
||||
|
||||
- Move the PRP file to the completed directory in PRPs/features/completed
|
||||
|
||||
## Report
|
||||
|
||||
- Summarize completed work in a concise bullet point list
|
||||
- Show files and lines changed: `git diff --stat`
|
||||
- Confirm all validation commands passed
|
||||
- Note any deviations from the plan (if any)
|
||||
@@ -1,21 +0,0 @@
|
||||
# Implementation
|
||||
|
||||
Implement the plan from the specified plan file.
|
||||
|
||||
## Variables
|
||||
plan_file: $1
|
||||
|
||||
## Instructions
|
||||
|
||||
- Read the plan file carefully
|
||||
- Execute every step in order
|
||||
- Follow existing code patterns and conventions
|
||||
- Create/modify files as specified in the plan
|
||||
- Run validation commands from the plan
|
||||
- Do NOT create git commits or branches (separate steps)
|
||||
|
||||
## Output
|
||||
|
||||
- Summarize work completed
|
||||
- List files changed
|
||||
- Report test results if any
|
||||
176
python/.claude/commands/agent-work-orders/noqa.md
Normal file
@@ -0,0 +1,176 @@
|
||||
# NOQA Analysis and Resolution
|
||||
|
||||
Find all noqa/type:ignore comments in the codebase, investigate why they exist, and provide recommendations for resolution or justification.
|
||||
|
||||
## Instructions
|
||||
|
||||
**Step 1: Find all NOQA comments**
|
||||
|
||||
- Use Grep tool to find all noqa comments: pattern `noqa|type:\s*ignore`
|
||||
- Use output_mode "content" with line numbers (-n flag)
|
||||
- Search across all Python files (type: "py")
|
||||
- Document total count of noqa comments found
|
||||
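A plain shell equivalent of that search, useful for cross-checking the Grep tool's count:

```bash
grep -rnE 'noqa|type:[[:space:]]*ignore' --include='*.py' . | wc -l   # total count
grep -rnE 'noqa|type:[[:space:]]*ignore' --include='*.py' .           # each location with line number
```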
|
||||
**Step 2: For EACH noqa comment (repeat this process):**
|
||||
|
||||
- Read the file containing the noqa comment with sufficient context (at least 10 lines before and after)
|
||||
- Identify the specific linting rule or type error being suppressed
|
||||
- Understand the code's purpose and why the suppression was added
|
||||
- Investigate if the suppression is still necessary or can be resolved
|
||||
|
||||
**Step 3: Investigation checklist for each noqa:**
|
||||
|
||||
- What specific error/warning is being suppressed? (e.g., `type: ignore[arg-type]`, `noqa: F401`)
|
||||
- Why was the suppression necessary? (legacy code, false positive, legitimate limitation, technical debt)
|
||||
- Can the underlying issue be fixed? (refactor code, update types, improve imports)
|
||||
- What would it take to remove the suppression? (effort estimate, breaking changes, architectural changes)
|
||||
- Is the suppression justified long-term? (external library limitation, Python limitation, intentional design)
|
||||
|
||||
**Step 4: Research solutions:**
|
||||
|
||||
- Check if newer versions of tools (mypy, ruff) handle the case better
|
||||
- Look for alternative code patterns that avoid the suppression
|
||||
- Consider if type stubs or Protocol definitions could help
|
||||
- Evaluate if refactoring would be worthwhile
|
||||
|
||||
## Report Format
|
||||
|
||||
Create a markdown report file (create the reports directory if not created yet): `PRPs/reports/noqa-analysis-{YYYY-MM-DD}.md`
|
||||
|
||||
Use this structure for the report:
|
||||
|
||||
````markdown
|
||||
# NOQA Analysis Report
|
||||
|
||||
**Generated:** {date}
|
||||
**Total NOQA comments found:** {count}
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
- Total suppressions: {count}
|
||||
- Can be removed: {count}
|
||||
- Should remain: {count}
|
||||
- Requires investigation: {count}
|
||||
|
||||
---
|
||||
|
||||
## Detailed Analysis
|
||||
|
||||
### 1. {File path}:{line number}
|
||||
|
||||
**Location:** `{file_path}:{line_number}`
|
||||
|
||||
**Suppression:** `{noqa comment or type: ignore}`
|
||||
|
||||
**Code context:**
|
||||
|
||||
```python
|
||||
{relevant code snippet}
|
||||
```
|
||||
````
|
||||
|
||||
**Why it exists:**
|
||||
{explanation of why the suppression was added}
|
||||
|
||||
**Options to resolve:**
|
||||
|
||||
1. {Option 1: description}
|
||||
- Effort: {Low/Medium/High}
|
||||
- Breaking: {Yes/No}
|
||||
- Impact: {description}
|
||||
|
||||
2. {Option 2: description}
|
||||
- Effort: {Low/Medium/High}
|
||||
- Breaking: {Yes/No}
|
||||
- Impact: {description}
|
||||
|
||||
**Tradeoffs:**
|
||||
|
||||
- {Tradeoff 1}
|
||||
- {Tradeoff 2}
|
||||
|
||||
**Recommendation:** {Remove | Keep | Refactor}
|
||||
{Justification for recommendation}
|
||||
|
||||
---
|
||||
|
||||
{Repeat for each noqa comment}
|
||||
|
||||
````
|
||||
|
||||
## Example Analysis Entry
|
||||
|
||||
```markdown
|
||||
### 1. src/shared/config.py:45
|
||||
|
||||
**Location:** `src/shared/config.py:45`
|
||||
|
||||
**Suppression:** `# type: ignore[assignment]`
|
||||
|
||||
**Code context:**
|
||||
```python
|
||||
@property
|
||||
def openai_api_key(self) -> str:
|
||||
key = os.getenv("OPENAI_API_KEY")
|
||||
if not key:
|
||||
raise ValueError("OPENAI_API_KEY not set")
|
||||
return key # type: ignore[assignment]
|
||||
````
|
||||
|
||||
**Why it exists:**
|
||||
MyPy cannot infer that the ValueError prevents None from being returned, so it thinks the return type could be `str | None`.
|
||||
|
||||
**Options to resolve:**
|
||||
|
||||
1. Use assert to help mypy narrow the type
|
||||
- Effort: Low
|
||||
- Breaking: No
|
||||
- Impact: Cleaner code, removes suppression
|
||||
|
||||
2. Add explicit cast with typing.cast()
|
||||
- Effort: Low
|
||||
- Breaking: No
|
||||
- Impact: More verbose but type-safe
|
||||
|
||||
3. Refactor to use separate validation method
|
||||
- Effort: Medium
|
||||
- Breaking: No
|
||||
- Impact: Better separation of concerns
|
||||
|
||||
**Tradeoffs:**
|
||||
|
||||
- Option 1 (assert) is cleanest but asserts can be disabled with -O flag
|
||||
- Option 2 (cast) is most explicit but adds import and verbosity
|
||||
- Option 3 is most robust but requires more refactoring
|
||||
|
||||
**Recommendation:** Remove (use Option 1)
|
||||
Replace the type:ignore with an assert statement after the if check. This helps mypy understand the control flow while maintaining runtime safety. The assert will never fail in practice since the ValueError is raised first.
|
||||
|
||||
**Implementation:**
|
||||
|
||||
```python
|
||||
@property
|
||||
def openai_api_key(self) -> str:
|
||||
key = os.getenv("OPENAI_API_KEY")
|
||||
if not key:
|
||||
raise ValueError("OPENAI_API_KEY not set")
|
||||
assert key is not None # Help mypy understand control flow
|
||||
return key
|
||||
```
|
||||
|
||||
```
|
||||
|
||||
## Report
|
||||
|
||||
After completing the analysis:
|
||||
|
||||
- Output the path to the generated report file
|
||||
- Summarize findings:
|
||||
- Total suppressions found
|
||||
- How many can be removed immediately (low effort)
|
||||
- How many should remain (justified)
|
||||
- How many need deeper investigation or refactoring
|
||||
- Highlight any quick wins (suppressions that can be removed with minimal effort)
|
||||
```
|
||||
@@ -1,23 +0,0 @@
|
||||
# Find Plan File
|
||||
|
||||
Locate the plan file created in the previous step.
|
||||
|
||||
## Variables
|
||||
issue_number: $1
|
||||
work_order_id: $2
|
||||
previous_output: $3
|
||||
|
||||
## Instructions
|
||||
|
||||
- The previous step created a plan file
|
||||
- Find the exact file path
|
||||
- Pattern: `specs/issue-{issue_number}-wo-{work_order_id}-planner-*.md`
|
||||
- Try these approaches:
|
||||
1. Parse previous_output for file path mention
|
||||
2. Run: `ls specs/issue-{issue_number}-wo-{work_order_id}-planner-*.md`
|
||||
3. Run: `find specs -name "issue-{issue_number}-wo-{work_order_id}-planner-*.md"`
|
||||
|
||||
## Output
|
||||
|
||||
Return ONLY the file path (e.g., "specs/issue-7-wo-abc123-planner-fix-auth.md")
|
||||
Return "0" if not found
|
||||
@@ -1,71 +0,0 @@
|
||||
# Bug Planning
|
||||
|
||||
Create a new plan to resolve the Bug using the exact specified markdown Plan Format.
|
||||
|
||||
## Variables
|
||||
issue_number: $1
|
||||
work_order_id: $2
|
||||
issue_json: $3
|
||||
|
||||
## Instructions
|
||||
|
||||
- IMPORTANT: You're writing a plan to resolve a bug that will add value to the application.
|
||||
- IMPORTANT: The Bug describes the bug that will be resolved but we're not resolving it, we're creating the plan.
|
||||
- You're writing a plan to resolve a bug, it should be thorough and precise so we fix the root cause and prevent regressions.
|
||||
- Create the plan in the `specs/` directory with filename: `issue-{issue_number}-wo-{work_order_id}-planner-{descriptive-name}.md`
|
||||
- Replace `{descriptive-name}` with a short name based on the bug (e.g., "fix-login-error", "resolve-timeout")
|
||||
- Use the plan format below to create the plan.
|
||||
- Research the codebase to understand the bug, reproduce it, and put together a plan to fix it.
|
||||
- IMPORTANT: Replace every <placeholder> in the Plan Format with the requested value.
|
||||
- Use your reasoning model: THINK HARD about the bug, its root cause, and the steps to fix it properly.
|
||||
- IMPORTANT: Be surgical with your bug fix, solve the bug at hand and don't fall off track.
|
||||
- IMPORTANT: We want the minimal number of changes that will fix and address the bug.
|
||||
- If you need a new library, use `uv add` and report it in the Notes section.
|
||||
- Start your research by reading the README.md file.
|
||||
|
||||
## Plan Format
|
||||
|
||||
```md
|
||||
# Bug: <bug name>
|
||||
|
||||
## Bug Description
|
||||
<describe the bug in detail, including symptoms and expected vs actual behavior>
|
||||
|
||||
## Problem Statement
|
||||
<clearly define the specific problem that needs to be solved>
|
||||
|
||||
## Solution Statement
|
||||
<describe the proposed solution approach to fix the bug>
|
||||
|
||||
## Steps to Reproduce
|
||||
<list exact steps to reproduce the bug>
|
||||
|
||||
## Root Cause Analysis
|
||||
<analyze and explain the root cause of the bug>
|
||||
|
||||
## Relevant Files
|
||||
Use these files to fix the bug:
|
||||
|
||||
<find and list the files relevant to the bug with bullet points describing why. If new files need to be created, list them in an h3 'New Files' section.>
|
||||
|
||||
## Step by Step Tasks
|
||||
IMPORTANT: Execute every step in order, top to bottom.
|
||||
|
||||
<list step by step tasks as h3 headers plus bullet points. Order matters, start with foundational shared changes then move on to specific changes. Include tests that will validate the bug is fixed. Your last step should be running the Validation Commands.>
|
||||
|
||||
## Validation Commands
|
||||
Execute every command to validate the bug is fixed with zero regressions.
|
||||
|
||||
<list commands you'll use to validate with 100% confidence the bug is fixed. Every command must execute without errors. Include commands to reproduce the bug before and after the fix.>
|
||||
|
||||
## Notes
|
||||
<optionally list any additional notes or context relevant to the bug>
|
||||
```
|
||||
|
||||
## Bug
|
||||
|
||||
Extract the bug details from the `issue_json` variable (parse the JSON and use the title and body fields).
|
||||
|
||||
## Report
|
||||
- Summarize the work you've just done in a concise bullet point list.
|
||||
- Include the full path to the plan file you created (e.g., `specs/issue-123-wo-abc123-planner-fix-login-error.md`)
|
||||
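For illustration only, a small sketch of the JSON handling the prompt describes ("parse the JSON and use the title and body fields"); no fields beyond `title` and `body` are assumed:

```python
import json


def extract_issue(issue_json: str) -> tuple[str, str]:
    """Return the issue title and body passed in via the issue_json variable."""
    data = json.loads(issue_json)
    return data["title"], data.get("body", "")
```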
@@ -1,56 +0,0 @@
|
||||
# Chore Planning
|
||||
|
||||
Create a new plan to resolve the Chore using the exact specified markdown Plan Format.
|
||||
|
||||
## Variables
|
||||
issue_number: $1
|
||||
work_order_id: $2
|
||||
issue_json: $3
|
||||
|
||||
## Instructions
|
||||
|
||||
- IMPORTANT: You're writing a plan to resolve a chore that will add value to the application.
|
||||
- IMPORTANT: The Chore describes the chore that will be resolved but we're not resolving it, we're creating the plan.
|
||||
- You're writing a plan to resolve a chore, it should be simple but thorough and precise so we don't miss anything.
|
||||
- Create the plan in the `specs/` directory with filename: `issue-{issue_number}-wo-{work_order_id}-planner-{descriptive-name}.md`
|
||||
- Replace `{descriptive-name}` with a short name based on the chore (e.g., "update-readme", "fix-tests")
|
||||
- Use the plan format below to create the plan.
|
||||
- Research the codebase and put together a plan to accomplish the chore.
|
||||
- IMPORTANT: Replace every <placeholder> in the Plan Format with the requested value.
|
||||
- Use your reasoning model: THINK HARD about the plan and the steps to accomplish the chore.
|
||||
- Start your research by reading the README.md file.
|
||||
|
||||
## Plan Format
|
||||
|
||||
```md
|
||||
# Chore: <chore name>
|
||||
|
||||
## Chore Description
|
||||
<describe the chore in detail>
|
||||
|
||||
## Relevant Files
|
||||
Use these files to resolve the chore:
|
||||
|
||||
<find and list the files relevant to the chore with bullet points describing why. If new files need to be created, list them in an h3 'New Files' section.>
|
||||
|
||||
## Step by Step Tasks
|
||||
IMPORTANT: Execute every step in order, top to bottom.
|
||||
|
||||
<list step by step tasks as h3 headers plus bullet points. Order matters, start with foundational shared changes then move on to specific changes. Your last step should be running the Validation Commands.>
|
||||
|
||||
## Validation Commands
|
||||
Execute every command to validate the chore is complete with zero regressions.
|
||||
|
||||
<list commands you'll use to validate with 100% confidence the chore is complete. Every command must execute without errors.>
|
||||
|
||||
## Notes
|
||||
<optionally list any additional notes or context relevant to the chore>
|
||||
```
|
||||
|
||||
## Chore
|
||||
|
||||
Extract the chore details from the `issue_json` variable (parse the JSON and use the title and body fields).
|
||||
|
||||
## Report
|
||||
- Summarize the work you've just done in a concise bullet point list.
|
||||
- Include the full path to the plan file you created (e.g., `specs/issue-7-wo-abc123-planner-update-readme.md`)
|
||||
@@ -1,111 +0,0 @@
|
||||
# Feature Planning
|
||||
|
||||
Create a new plan in specs/*.md to implement the Feature using the exact specified markdown Plan Format.
|
||||
|
||||
## Variables
|
||||
issue_number: $1
|
||||
work_order_id: $2
|
||||
issue_json: $3
|
||||
|
||||
## Instructions
|
||||
|
||||
- IMPORTANT: You're writing a plan to implement a net new feature that will add value to the application.
|
||||
- IMPORTANT: The Feature describes the feature that will be implemented but remember we're not implementing it, we're creating the plan.
|
||||
- Create the plan in the `specs/` directory with filename: `issue-{issue_number}-wo-{work_order_id}-planner-{descriptive-name}.md`
|
||||
- Replace `{descriptive-name}` with a short name based on the feature (e.g., "add-auth", "api-endpoints")
|
||||
- Use the Plan Format below to create the plan.
|
||||
- Research the codebase to understand existing patterns, architecture, and conventions before planning.
|
||||
- IMPORTANT: Replace every <placeholder> in the Plan Format with the requested value.
|
||||
- Use your reasoning model: THINK HARD about the feature requirements, design, and implementation approach.
|
||||
- Follow existing patterns and conventions in the codebase.
|
||||
- Design for extensibility and maintainability.
|
||||
- If you need a new library, use `uv add` and report it in the Notes section.
|
||||
- Start your research by reading the README.md file.
|
||||
- ultrathink about the research before you create the plan.
|
||||
|
||||
## Plan Format
|
||||
|
||||
```md
|
||||
# Feature: <feature name>
|
||||
|
||||
## Feature Description
|
||||
|
||||
<describe the feature in detail, including its purpose and value to users>
|
||||
|
||||
## User Story
|
||||
|
||||
As a <type of user>
|
||||
I want to <action/goal>
|
||||
So that <benefit/value>
|
||||
|
||||
## Problem Statement
|
||||
|
||||
<clearly define the specific problem or opportunity this feature addresses>
|
||||
|
||||
## Solution Statement
|
||||
|
||||
<describe the proposed solution approach and how it solves the problem>
|
||||
|
||||
## Relevant Files
|
||||
|
||||
Use these files to implement the feature:
|
||||
|
||||
<find and list the files relevant to the feature with bullet points describing why. If new files need to be created, list them in an h3 'New Files' section.>
|
||||
|
||||
## Implementation Plan
|
||||
|
||||
### Phase 1: Foundation
|
||||
|
||||
<describe the foundational work needed before implementing the main feature>
|
||||
|
||||
### Phase 2: Core Implementation
|
||||
|
||||
<describe the main implementation work for the feature>
|
||||
|
||||
### Phase 3: Integration
|
||||
|
||||
<describe how the feature will integrate with existing functionality>
|
||||
|
||||
## Step by Step Tasks
|
||||
|
||||
IMPORTANT: Execute every step in order, top to bottom.
|
||||
|
||||
<list step by step tasks as h3 headers plus bullet points. Order matters, start with foundational shared changes required then move on to specific implementation. Include creating tests throughout. Your last step should be running the Validation Commands.>
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
### Unit Tests
|
||||
|
||||
<describe unit tests needed for the feature>
|
||||
|
||||
### Integration Tests
|
||||
|
||||
<describe integration tests needed for the feature>
|
||||
|
||||
### Edge Cases
|
||||
|
||||
<list edge cases that need to be tested>
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
<list specific, measurable criteria that must be met for the feature to be considered complete>
|
||||
|
||||
## Validation Commands
|
||||
|
||||
Execute every command to validate the feature works correctly with zero regressions.
|
||||
|
||||
<list commands you'll use to validate with 100% confidence the feature is implemented correctly. Every command must execute without errors.>
|
||||
|
||||
## Notes
|
||||
|
||||
<optionally list any additional notes, future considerations, or context relevant to the feature>
|
||||
```
|
||||
|
||||
## Feature
|
||||
|
||||
Extract the feature details from the `issue_json` variable (parse the JSON and use the title and body fields).
|
||||
|
||||
## Report
|
||||
|
||||
- Summarize the work you've just done in a concise bullet point list.
|
||||
- Include the full path to the plan file you created (e.g., `specs/issue-123-wo-abc123-planner-add-auth.md`)
|
||||
176
python/.claude/commands/agent-work-orders/planning.md
Normal file
@@ -0,0 +1,176 @@
|
||||
# Feature Planning
|
||||
|
||||
Create a new plan to implement the `PRP` using the exact specified markdown `PRP Format`. Follow the `Instructions` to create the plan and use the `Relevant Files` to focus on the right files.
|
||||
|
||||
## Variables
|
||||
|
||||
FEATURE $1 $2
|
||||
|
||||
## Instructions
|
||||
|
||||
- IMPORTANT: You're writing a plan to implement a net new feature based on the `Feature` that will add value to the application.
|
||||
- IMPORTANT: The `Feature` describes the feature that will be implemented but remember we're not implementing a new feature, we're creating the plan that will be used to implement the feature based on the `PRP Format` below.
|
||||
- Create the plan in the `PRPs/features/` directory with filename: `{descriptive-name}.md`
|
||||
- Replace `{descriptive-name}` with a short, descriptive name based on the feature (e.g., "add-auth-system", "implement-search", "create-dashboard")
|
||||
- Use the `PRP Format` below to create the plan.
|
||||
- Deeply research the codebase to understand existing patterns, architecture, and conventions before planning the feature.
|
||||
- If no patterns are established or they are unclear, ask the user for clarification while providing your best recommendations and options
|
||||
- IMPORTANT: Replace every <placeholder> in the `PRP Format` with the requested value. Add as much detail as needed to implement the feature successfully.
|
||||
- Use your reasoning model: THINK HARD about the feature requirements, design, and implementation approach.
|
||||
- Follow existing patterns and conventions in the codebase. Don't reinvent the wheel.
|
||||
- Design for extensibility and maintainability.
|
||||
- Deeply do web research to understand the latest trends and technologies in the field.
|
||||
- Figure out latest best practices and library documentation.
|
||||
- Include links to relevant resources and documentation with anchor tags for easy navigation.
|
||||
- If you need a new library, use `uv add <package>` and report it in the `Notes` section.
|
||||
- Read `CLAUDE.md` for project principles, logging rules, testing requirements, and docstring style.
|
||||
- All code MUST have type annotations (strict mypy enforcement).
|
||||
- Use Google-style docstrings for all functions, classes, and modules.
|
||||
- Every new file in `src/` MUST have a corresponding test file in `tests/`.
|
||||
- Respect requested files in the `Relevant Files` section.
|
||||
|
||||
## Relevant Files
|
||||
|
||||
Focus on the following files and vertical slice structure:
|
||||
|
||||
**Core Files:**
|
||||
|
||||
- `CLAUDE.md` - Project instructions, logging rules, testing requirements, docstring style
|
||||
- `app/backend` core files
|
||||
- `app/frontend` core files
|
||||
|
||||
## PRP Format
|
||||
|
||||
```md
|
||||
# Feature: <feature name>
|
||||
|
||||
## Feature Description
|
||||
|
||||
<describe the feature in detail, including its purpose and value to users>
|
||||
|
||||
## User Story
|
||||
|
||||
As a <type of user>
|
||||
I want to <action/goal>
|
||||
So that <benefit/value>
|
||||
|
||||
## Problem Statement
|
||||
|
||||
<clearly define the specific problem or opportunity this feature addresses>
|
||||
|
||||
## Solution Statement
|
||||
|
||||
<describe the proposed solution approach and how it solves the problem>
|
||||
|
||||
## Relevant Files
|
||||
|
||||
Use these files to implement the feature:
|
||||
|
||||
<find and list the files that are relevant to the feature and describe why they are relevant in bullet points. If there are new files that need to be created to implement the feature, list them in an h3 'New Files' section. Include line numbers for the relevant sections>
|
||||
|
||||
## Relevant Research Documentation
|
||||
|
||||
Use these documentation files and links to help with understanding the technology to use:
|
||||
|
||||
- [Documentation Link 1](https://example.com/doc1)
|
||||
- [Anchor tag]
|
||||
- [Short summary]
|
||||
- [Documentation Link 2](https://example.com/doc2)
|
||||
- [Anchor tag]
|
||||
- [Short summary]
|
||||
|
||||
## Implementation Plan
|
||||
|
||||
### Phase 1: Foundation
|
||||
|
||||
<describe the foundational work needed before implementing the main feature>
|
||||
|
||||
### Phase 2: Core Implementation
|
||||
|
||||
<describe the main implementation work for the feature>
|
||||
|
||||
### Phase 3: Integration
|
||||
|
||||
<describe how the feature will integrate with existing functionality>
|
||||
|
||||
## Step by Step Tasks
|
||||
|
||||
IMPORTANT: Execute every step in order, top to bottom.
|
||||
|
||||
<list step by step tasks as h3 headers plus bullet points. use as many h3 headers as needed to implement the feature. Order matters:
|
||||
|
||||
1. Start with foundational shared changes (schemas, types)
|
||||
2. Implement core functionality with proper logging
|
||||
3. Create corresponding test files (unit tests mirror src/ structure)
|
||||
4. Add integration tests if feature interacts with multiple components
|
||||
5. Verify linters pass: `uv run ruff check src/ && uv run mypy src/`
|
||||
6. Ensure all tests pass: `uv run pytest tests/`
|
||||
7. Your last step should be running the `Validation Commands`>
|
||||
|
||||
<For tool implementations:
|
||||
|
||||
- Define Pydantic schemas in `schemas.py`
|
||||
- Implement tool with structured logging and type hints
|
||||
- Register tool with Pydantic AI agent
|
||||
- Create unit tests in `tests/tools/<name>/test_<module>.py`
|
||||
- Add integration test in `tests/integration/` if needed>
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
See `CLAUDE.md` for complete testing requirements. Every file in `src/` must have a corresponding test file in `tests/`.
|
||||
|
||||
### Unit Tests
|
||||
|
||||
<describe unit tests needed for the feature. Mark with @pytest.mark.unit. Test individual components in isolation.>
|
||||
|
||||
### Integration Tests
|
||||
|
||||
<if the feature interacts with multiple components, describe integration tests needed. Mark with @pytest.mark.integration. Place in tests/integration/ when testing full application stack.>
|
||||
|
||||
### Edge Cases
|
||||
|
||||
<list edge cases that need to be tested>
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
<list specific, measurable criteria that must be met for the feature to be considered complete>
|
||||
|
||||
## Validation Commands
|
||||
|
||||
Execute every command to validate the feature works correctly with zero regressions.
|
||||
|
||||
<list commands you'll use to validate with 100% confidence the feature is implemented correctly with zero regressions. Include, for example (backend commands shown here; Biome and TypeScript checks are used for the FE):
|
||||
|
||||
- Linting: `uv run ruff check src/`
|
||||
- Type checking: `uv run mypy src/`
|
||||
- Unit tests: `uv run pytest tests/ -m unit -v`
|
||||
- Integration tests: `uv run pytest tests/ -m integration -v` (if applicable)
|
||||
- Full test suite: `uv run pytest tests/ -v`
|
||||
- Manual API testing if needed (curl commands, test requests)>
|
||||
|
||||
**Required validation commands:**
|
||||
|
||||
- `uv run ruff check src/` - Lint check must pass
|
||||
- `uv run mypy src/` - Type check must pass
|
||||
- `uv run pytest tests/ -v` - All tests must pass with zero regressions
|
||||
|
||||
**Run server and test core endpoints:**
|
||||
|
||||
- Start server: @.claude/start-server
|
||||
- Test endpoints with curl (at minimum: health check, main functionality)
|
||||
- Verify structured logs show proper correlation IDs and context
|
||||
- Stop server after validation
|
||||
|
||||
## Notes
|
||||
|
||||
<optionally list any additional notes, future considerations, or context that are relevant to the feature that will be helpful to the developer>
|
||||
```
|
||||
|
||||
## Feature
|
||||
|
||||
Extract the feature details from the `issue_json` variable (parse the JSON and use the title and body fields).
|
||||
|
||||
## Report
|
||||
|
||||
- Summarize the work you've just done in a concise bullet point list.
|
||||
- Include the full path to the plan file you created (e.g., `PRPs/features/add-auth-system.md`)
|
||||
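As a hedged sketch (not part of the command itself), a downstream step could recover the PRP path from the planner's report using the `PRPs/features/{descriptive-name}.md` convention stated above:

```python
import re


def parse_plan_path(report_text: str) -> str | None:
    """Best-effort extraction of the PRP file path from the planner's report text."""
    match = re.search(r"PRPs/features/[\w\-]+\.md", report_text)
    return match.group(0) if match else None
```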
@@ -1,27 +0,0 @@
|
||||
# Create Pull Request
|
||||
|
||||
Create a GitHub pull request for the changes.
|
||||
|
||||
## Variables
|
||||
branch_name: $1
|
||||
issue_json: $2
|
||||
plan_file: $3
|
||||
work_order_id: $4
|
||||
|
||||
## Instructions
|
||||
|
||||
- Title format: `<type>: #<num> - <title>`
|
||||
- Body includes:
|
||||
- Summary from issue
|
||||
- Link to plan_file
|
||||
- Closes #<number>
|
||||
- Work Order: {work_order_id}
|
||||
|
||||
## Run
|
||||
|
||||
1. `git push -u origin <branch_name>`
|
||||
2. `gh pr create --title "<title>" --body "<body>" --base main`
|
||||
|
||||
## Output
|
||||
|
||||
Return ONLY the PR URL
|
||||
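A minimal sketch, assuming only the fields listed in the command, of how the PR body could be assembled before calling `gh pr create`:

```python
def build_pr_body(issue_number: str, issue_summary: str, plan_file: str, work_order_id: str) -> str:
    """Assemble the PR body from the pieces the command requires."""
    return (
        f"{issue_summary}\n\n"
        f"Plan: {plan_file}\n\n"
        f"Closes #{issue_number}\n"
        f"Work Order: {work_order_id}\n"
    )
```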
28
python/.claude/commands/agent-work-orders/prime.md
Normal file
@@ -0,0 +1,28 @@
|
||||
# Prime
|
||||
|
||||
Execute the following sections to understand the codebase before starting new work, then summarize your understanding.
|
||||
|
||||
## Run
|
||||
|
||||
- List all tracked files: `git ls-files`
|
||||
- Show project structure: `tree -I '.venv|__pycache__|*.pyc|.pytest_cache|.mypy_cache|.ruff_cache' -L 3`
|
||||
|
||||
## Read
|
||||
|
||||
- `CLAUDE.md` - Core project instructions, principles, logging rules, testing requirements
|
||||
- `python/src/agent_work_orders` - Project overview and setup (if it exists)
|
||||
|
||||
- Identify core files in the agent work orders directory to understand what we are working on and its intent
|
||||
|
||||
## Report
|
||||
|
||||
Provide a concise summary of:
|
||||
|
||||
1. **Project Purpose**: What this application does
|
||||
2. **Architecture**: Key patterns (vertical slice, FastAPI + Pydantic AI)
|
||||
3. **Core Principles**: TYPE SAFETY, KISS, YAGNI
|
||||
4. **Tech Stack**: Main dependencies and tools
|
||||
5. **Key Requirements**: Logging, testing, type annotations
|
||||
6. **Current State**: What's implemented
|
||||
|
||||
Keep the summary brief (5-10 bullet points) and focused on what you need to know to contribute effectively.
|
||||
89
python/.claude/commands/agent-work-orders/prp-review.md
Normal file
@@ -0,0 +1,89 @@
|
||||
# Code Review
|
||||
|
||||
Review implemented work against a PRP specification to ensure code quality, correctness, and adherence to project standards.
|
||||
|
||||
## Variables
|
||||
|
||||
Plan file: $ARGUMENTS (e.g., `PRPs/features/add-web-search.md`)
|
||||
|
||||
## Instructions
|
||||
|
||||
**Understand the Changes:**
|
||||
|
||||
- Check current branch: `git branch`
|
||||
- Review changes: `git diff origin/main` (or `git diff HEAD` if not on a branch)
|
||||
- Read the PRP plan file to understand requirements
|
||||
|
||||
**Code Quality Review:**
|
||||
|
||||
- **Type Safety**: Verify all functions have type annotations, mypy passes
|
||||
- **Logging**: Check structured logging is used correctly (event names, context, exception handling)
|
||||
- **Docstrings**: Ensure Google-style docstrings on all functions/classes
|
||||
- **Testing**: Verify unit tests exist for all new files, integration tests if needed
|
||||
- **Architecture**: Confirm vertical slice structure is followed
|
||||
- **CLAUDE.md Compliance**: Check adherence to core principles (KISS, YAGNI, TYPE SAFETY)
|
||||
|
||||
**Validation (Ruff for the BE, Biome for the FE):**
|
||||
|
||||
- Run linters: `uv run ruff check src/ && uv run mypy src/`
|
||||
- Run tests: `uv run pytest tests/ -v`
|
||||
- Start server and test endpoints with curl (if applicable)
|
||||
- Verify structured logs show proper correlation IDs and context
|
||||
|
||||
**Issue Severity:**
|
||||
|
||||
- `blocker` - Must fix before merge (breaks build, missing tests, type errors, security issues)
|
||||
- `major` - Should fix (missing logging, incomplete docstrings, poor patterns)
|
||||
- `minor` - Nice to have (style improvements, optimization opportunities)
|
||||
|
||||
## Report
|
||||
|
||||
Return ONLY valid JSON (no markdown, no explanations). Save it to `report-#.json` in the `prps/reports` directory, creating the directory if it doesn't exist. Output will be parsed with JSON.parse().
|
||||
|
||||
### Output Structure
|
||||
|
||||
```json
{
  "success": "boolean - true if NO BLOCKER issues, false if BLOCKER issues exist",
  "review_summary": "string - 2-4 sentences: what was built, does it match spec, quality assessment",
  "review_issues": [
    {
      "issue_number": "number - issue index",
      "file_path": "string - file with the issue (if applicable)",
      "issue_description": "string - what's wrong",
      "issue_resolution": "string - how to fix it",
      "severity": "string - blocker|major|minor"
    }
  ],
  "validation_results": {
    "linting_passed": "boolean",
    "type_checking_passed": "boolean",
    "tests_passed": "boolean",
    "api_endpoints_tested": "boolean - true if endpoints were tested with curl"
  }
}
```
|
||||
|
||||
## Example Success Review
|
||||
|
||||
```json
{
  "success": true,
  "review_summary": "The web search tool has been implemented with proper type annotations, structured logging, and comprehensive tests. The implementation follows the vertical slice architecture and matches all spec requirements. Code quality is high with proper error handling and documentation.",
  "review_issues": [
    {
      "issue_number": 1,
      "file_path": "src/tools/web_search/tool.py",
      "issue_description": "Missing debug log for API response",
      "issue_resolution": "Add logger.debug with response metadata",
      "severity": "minor"
    }
  ],
  "validation_results": {
    "linting_passed": true,
    "type_checking_passed": true,
    "tests_passed": true,
    "api_endpoints_tested": true
  }
}
```
|
||||
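Since the report is meant to be machine-parsed (JSON.parse), here is a hedged Python sketch of the consuming side; the keys mirror the Output Structure above and nothing else is assumed:

```python
import json


def review_has_blockers(report_json: str) -> bool:
    """Return True when any review issue carries blocker severity."""
    data = json.loads(report_json)
    issues = data.get("review_issues", [])
    return any(issue.get("severity") == "blocker" for issue in issues)
```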
33
python/.claude/commands/agent-work-orders/start-server.md
Normal file
@@ -0,0 +1,33 @@
|
||||
# Start Servers
|
||||
|
||||
Start both the FastAPI backend and React frontend development servers with hot reload.
|
||||
|
||||
## Run
|
||||
|
||||
### Run in the background with bash tool
|
||||
|
||||
- Ensure you are in the right PWD
|
||||
- Use the Bash tool to run the servers in the background so you can read the shell outputs
|
||||
- IMPORTANT: run `git ls-files` first so you know where directories are located before you start
|
||||
|
||||
### Backend Server (FastAPI)
|
||||
|
||||
- Navigate to backend: `cd app/backend`
|
||||
- Start server in background: `uv sync && uv run python run_api.py`
|
||||
- Wait 2-3 seconds for startup
|
||||
- Test health endpoint: `curl http://localhost:8000/health`
|
||||
- Test products endpoint: `curl http://localhost:8000/api/products`
|
||||
|
||||
### Frontend Server (Bun + React)
|
||||
|
||||
- Navigate to frontend: `cd ../app/frontend`
|
||||
- Start server in background: `bun install && bun dev`
|
||||
- Wait 2-3 seconds for startup
|
||||
- Frontend should be accessible at `http://localhost:3000`
|
||||
|
||||
## Report
|
||||
|
||||
- Confirm backend is running on `http://localhost:8000`
|
||||
- Confirm frontend is running on `http://localhost:3000`
|
||||
- Show the health check response from backend
|
||||
- Mention: "Backend logs will show structured JSON logging for all requests"
|
||||
@@ -1,7 +0,0 @@
|
||||
# Test Command
|
||||
|
||||
This is a test command for verifying the CLI integration.
|
||||
|
||||
## Instructions
|
||||
|
||||
Echo "Hello from agent work orders test"
|
||||
@@ -29,7 +29,6 @@ from ..state_manager.repository_factory import create_repository
|
||||
from ..utils.id_generator import generate_work_order_id
|
||||
from ..utils.structured_logger import get_logger
|
||||
from ..workflow_engine.workflow_orchestrator import WorkflowOrchestrator
|
||||
from ..workflow_engine.workflow_phase_tracker import WorkflowPhaseTracker
|
||||
|
||||
logger = get_logger(__name__)
|
||||
router = APIRouter()
|
||||
@@ -39,13 +38,11 @@ state_repository = create_repository()
|
||||
agent_executor = AgentCLIExecutor()
|
||||
sandbox_factory = SandboxFactory()
|
||||
github_client = GitHubClient()
|
||||
phase_tracker = WorkflowPhaseTracker()
|
||||
command_loader = ClaudeCommandLoader()
|
||||
orchestrator = WorkflowOrchestrator(
|
||||
agent_executor=agent_executor,
|
||||
sandbox_factory=sandbox_factory,
|
||||
github_client=github_client,
|
||||
phase_tracker=phase_tracker,
|
||||
command_loader=command_loader,
|
||||
state_repository=state_repository,
|
||||
)
|
||||
@@ -62,8 +59,8 @@ async def create_agent_work_order(
|
||||
logger.info(
|
||||
"agent_work_order_creation_started",
|
||||
repository_url=request.repository_url,
|
||||
workflow_type=request.workflow_type.value,
|
||||
sandbox_type=request.sandbox_type.value,
|
||||
selected_commands=request.selected_commands,
|
||||
)
|
||||
|
||||
try:
|
||||
@@ -81,7 +78,6 @@ async def create_agent_work_order(
|
||||
|
||||
# Create metadata
|
||||
metadata = {
|
||||
"workflow_type": request.workflow_type,
|
||||
"sandbox_type": request.sandbox_type,
|
||||
"github_issue_number": request.github_issue_number,
|
||||
"status": AgentWorkOrderStatus.PENDING,
|
||||
@@ -101,10 +97,10 @@ async def create_agent_work_order(
|
||||
asyncio.create_task(
|
||||
orchestrator.execute_workflow(
|
||||
agent_work_order_id=agent_work_order_id,
|
||||
workflow_type=request.workflow_type,
|
||||
repository_url=request.repository_url,
|
||||
sandbox_type=request.sandbox_type,
|
||||
user_request=request.user_request,
|
||||
selected_commands=request.selected_commands,
|
||||
github_issue_number=request.github_issue_number,
|
||||
)
|
||||
)
|
||||
@@ -144,7 +140,6 @@ async def get_agent_work_order(agent_work_order_id: str) -> AgentWorkOrder:
|
||||
sandbox_identifier=state.sandbox_identifier,
|
||||
git_branch_name=state.git_branch_name,
|
||||
agent_session_id=state.agent_session_id,
|
||||
workflow_type=metadata["workflow_type"],
|
||||
sandbox_type=metadata["sandbox_type"],
|
||||
github_issue_number=metadata["github_issue_number"],
|
||||
status=metadata["status"],
|
||||
@@ -194,7 +189,6 @@ async def list_agent_work_orders(
|
||||
sandbox_identifier=state.sandbox_identifier,
|
||||
git_branch_name=state.git_branch_name,
|
||||
agent_session_id=state.agent_session_id,
|
||||
workflow_type=metadata["workflow_type"],
|
||||
sandbox_type=metadata["sandbox_type"],
|
||||
github_issue_number=metadata["github_issue_number"],
|
||||
status=metadata["status"],
|
||||
|
||||
@@ -58,15 +58,6 @@ class AgentWorkOrdersConfig:
|
||||
FRONTEND_PORT_RANGE_START: int = int(os.getenv("FRONTEND_PORT_START", "9200"))
|
||||
FRONTEND_PORT_RANGE_END: int = int(os.getenv("FRONTEND_PORT_END", "9214"))
|
||||
|
||||
# Test workflow configuration
|
||||
MAX_TEST_RETRY_ATTEMPTS: int = int(os.getenv("MAX_TEST_RETRY_ATTEMPTS", "4"))
|
||||
ENABLE_TEST_PHASE: bool = os.getenv("ENABLE_TEST_PHASE", "true").lower() == "true"
|
||||
|
||||
# Review workflow configuration
|
||||
MAX_REVIEW_RETRY_ATTEMPTS: int = int(os.getenv("MAX_REVIEW_RETRY_ATTEMPTS", "3"))
|
||||
ENABLE_REVIEW_PHASE: bool = os.getenv("ENABLE_REVIEW_PHASE", "true").lower() == "true"
|
||||
ENABLE_SCREENSHOT_CAPTURE: bool = os.getenv("ENABLE_SCREENSHOT_CAPTURE", "true").lower() == "true"
|
||||
|
||||
# State management configuration
|
||||
STATE_STORAGE_TYPE: str = os.getenv("STATE_STORAGE_TYPE", "memory") # "memory" or "file"
|
||||
FILE_STATE_DIRECTORY: str = os.getenv("FILE_STATE_DIRECTORY", "agent-work-orders-state")
|
||||
|
||||
@@ -6,7 +6,7 @@ All models follow exact naming from the PRD specification.
|
||||
from datetime import datetime
|
||||
from enum import Enum
|
||||
|
||||
from pydantic import BaseModel, Field
|
||||
from pydantic import BaseModel, Field, field_validator
|
||||
|
||||
|
||||
class AgentWorkOrderStatus(str, Enum):
|
||||
@@ -41,19 +41,14 @@ class AgentWorkflowPhase(str, Enum):
|
||||
|
||||
|
||||
class WorkflowStep(str, Enum):
|
||||
"""Individual workflow execution steps"""
|
||||
"""User-selectable workflow commands"""
|
||||
|
||||
CLASSIFY = "classify"
|
||||
PLAN = "plan"
|
||||
FIND_PLAN = "find_plan"
|
||||
IMPLEMENT = "implement"
|
||||
GENERATE_BRANCH = "generate_branch"
|
||||
CREATE_BRANCH = "create-branch"
|
||||
PLANNING = "planning"
|
||||
EXECUTE = "execute"
|
||||
COMMIT = "commit"
|
||||
TEST = "test"
|
||||
RESOLVE_TEST = "resolve_test"
|
||||
REVIEW = "review"
|
||||
RESOLVE_REVIEW = "resolve_review"
|
||||
CREATE_PR = "create_pr"
|
||||
CREATE_PR = "create-pr"
|
||||
REVIEW = "prp-review"
|
||||
|
||||
|
||||
class AgentWorkOrderState(BaseModel):
|
||||
@@ -84,7 +79,6 @@ class AgentWorkOrder(BaseModel):
|
||||
agent_session_id: str | None = None
|
||||
|
||||
# Metadata fields
|
||||
workflow_type: AgentWorkflowType
|
||||
sandbox_type: SandboxType
|
||||
github_issue_number: str | None = None
|
||||
status: AgentWorkOrderStatus
|
||||
@@ -109,10 +103,23 @@ class CreateAgentWorkOrderRequest(BaseModel):
|
||||
|
||||
repository_url: str = Field(..., description="Git repository URL")
|
||||
sandbox_type: SandboxType = Field(..., description="Sandbox environment type")
|
||||
workflow_type: AgentWorkflowType = Field(..., description="Workflow to execute")
|
||||
user_request: str = Field(..., description="User's description of the work to be done")
|
||||
selected_commands: list[str] = Field(
|
||||
default=["create-branch", "planning", "execute", "commit", "create-pr"],
|
||||
description="Commands to run in sequence"
|
||||
)
|
||||
github_issue_number: str | None = Field(None, description="Optional explicit GitHub issue number for reference")
|
||||
|
||||
@field_validator('selected_commands')
|
||||
@classmethod
|
||||
def validate_commands(cls, v: list[str]) -> list[str]:
|
||||
"""Validate that all commands are valid WorkflowStep values"""
|
||||
valid = {step.value for step in WorkflowStep}
|
||||
for cmd in v:
|
||||
if cmd not in valid:
|
||||
raise ValueError(f"Invalid command: {cmd}. Must be one of {valid}")
|
||||
return v
|
||||
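For illustration (outside the diff), the membership check performed by `validate_commands` can be exercised directly; this is a sketch, not code from the repository, and the import path is assumed from the file layout shown in this commit:

```python
from src.agent_work_orders.models import WorkflowStep  # import path assumed

valid = {step.value for step in WorkflowStep}
assert "create-branch" in valid   # accepted by validate_commands
assert "deploy" not in valid      # validate_commands would raise ValueError for this
```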
|
||||
|
||||
class AgentWorkOrderResponse(BaseModel):
|
||||
"""Response after creating an agent work order"""
|
||||
@@ -219,23 +226,19 @@ class StepHistory(BaseModel):
|
||||
steps: list[StepExecutionResult] = []
|
||||
|
||||
def get_current_step(self) -> WorkflowStep | None:
|
||||
"""Get the current/next step to execute"""
|
||||
"""Get next step to execute"""
|
||||
if not self.steps:
|
||||
return WorkflowStep.CLASSIFY
|
||||
return WorkflowStep.CREATE_BRANCH
|
||||
|
||||
last_step = self.steps[-1]
|
||||
if not last_step.success:
|
||||
return last_step.step
|
||||
return last_step.step # Retry failed step
|
||||
|
||||
step_sequence = [
|
||||
WorkflowStep.CLASSIFY,
|
||||
WorkflowStep.PLAN,
|
||||
WorkflowStep.FIND_PLAN,
|
||||
WorkflowStep.GENERATE_BRANCH,
|
||||
WorkflowStep.IMPLEMENT,
|
||||
WorkflowStep.CREATE_BRANCH,
|
||||
WorkflowStep.PLANNING,
|
||||
WorkflowStep.EXECUTE,
|
||||
WorkflowStep.COMMIT,
|
||||
WorkflowStep.TEST,
|
||||
WorkflowStep.REVIEW,
|
||||
WorkflowStep.CREATE_PR,
|
||||
]
|
||||
|
||||
@@ -246,7 +249,7 @@ class StepHistory(BaseModel):
|
||||
except ValueError:
|
||||
pass
|
||||
|
||||
return None
|
||||
return None # All steps complete
|
||||
|
||||
|
||||
class CommandNotFoundError(Exception):
|
||||
|
||||
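The retry-or-advance rule in `get_current_step` boils down to the following standalone sketch (illustrative only, using plain strings instead of the Pydantic models):

```python
def next_step(sequence: list[str], completed: list[tuple[str, bool]]) -> str | None:
    """Return the step to run next: retry a failed step, otherwise advance, or None when done."""
    if not completed:
        return sequence[0]
    last_step, last_ok = completed[-1]
    if not last_ok:
        return last_step  # retry the failed step
    idx = sequence.index(last_step)
    return sequence[idx + 1] if idx + 1 < len(sequence) else None


# Example with the default user-selectable commands from this change:
commands = ["create-branch", "planning", "execute", "commit", "create-pr"]
assert next_step(commands, []) == "create-branch"
assert next_step(commands, [("planning", False)]) == "planning"  # failed -> retry
assert next_step(commands, [("create-pr", True)]) is None        # all done
```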
@@ -1,30 +1,12 @@
|
||||
"""Agent Name Constants
|
||||
|
||||
Defines standard agent names following the workflow phases:
|
||||
- Discovery: Understanding the task
|
||||
- Plan: Creating implementation strategy
|
||||
- Implement: Executing the plan
|
||||
- Validate: Ensuring quality
|
||||
Defines standard agent names for user-selectable workflow commands.
|
||||
"""
|
||||
|
||||
# Discovery Phase
|
||||
CLASSIFIER = "classifier" # Classifies issue type
|
||||
|
||||
# Plan Phase
|
||||
PLANNER = "planner" # Creates plans
|
||||
PLAN_FINDER = "plan_finder" # Locates plan files
|
||||
|
||||
# Implement Phase
|
||||
IMPLEMENTOR = "implementor" # Implements changes
|
||||
|
||||
# Validate Phase
|
||||
CODE_REVIEWER = "code_reviewer" # Reviews code quality
|
||||
TESTER = "tester" # Runs tests
|
||||
REVIEWER = "reviewer" # Reviews against spec
|
||||
|
||||
# Git Operations (support all phases)
|
||||
BRANCH_GENERATOR = "branch_generator" # Creates branches
|
||||
COMMITTER = "committer" # Creates commits
|
||||
|
||||
# PR Operations (completion)
|
||||
PR_CREATOR = "pr_creator" # Creates pull requests
|
||||
# Command execution agents
|
||||
BRANCH_CREATOR = "BranchCreator"
|
||||
PLANNER = "Planner"
|
||||
IMPLEMENTOR = "Implementor"
|
||||
COMMITTER = "Committer"
|
||||
PR_CREATOR = "PrCreator"
|
||||
REVIEWER = "Reviewer"
|
||||
|
||||
@@ -1,308 +0,0 @@
|
||||
"""Review Workflow with Automatic Blocker Resolution
|
||||
|
||||
Reviews implementation against spec and automatically resolves blocker issues with retry logic (max 3 attempts).
|
||||
"""
|
||||
|
||||
import json
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
from ..agent_executor.agent_cli_executor import AgentCLIExecutor
|
||||
from ..command_loader.claude_command_loader import ClaudeCommandLoader
|
||||
from ..models import StepExecutionResult, WorkflowStep
|
||||
from ..utils.structured_logger import get_logger
|
||||
from .agent_names import REVIEWER
|
||||
|
||||
if TYPE_CHECKING:
|
||||
import structlog
|
||||
|
||||
logger = get_logger(__name__)
|
||||
|
||||
|
||||
class ReviewIssue:
|
||||
"""Represents a single review issue"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
issue_title: str,
|
||||
issue_description: str,
|
||||
issue_severity: str,
|
||||
affected_files: list[str],
|
||||
screenshots: list[str] | None = None,
|
||||
):
|
||||
self.issue_title = issue_title
|
||||
self.issue_description = issue_description
|
||||
self.issue_severity = issue_severity
|
||||
self.affected_files = affected_files
|
||||
self.screenshots = screenshots or []
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
"""Convert to dictionary for JSON serialization"""
|
||||
return {
|
||||
"issue_title": self.issue_title,
|
||||
"issue_description": self.issue_description,
|
||||
"issue_severity": self.issue_severity,
|
||||
"affected_files": self.affected_files,
|
||||
"screenshots": self.screenshots,
|
||||
}
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, data: dict) -> "ReviewIssue":
|
||||
"""Create ReviewIssue from dictionary"""
|
||||
return cls(
|
||||
issue_title=data["issue_title"],
|
||||
issue_description=data["issue_description"],
|
||||
issue_severity=data["issue_severity"],
|
||||
affected_files=data["affected_files"],
|
||||
screenshots=data.get("screenshots", []),
|
||||
)
|
||||
|
||||
|
||||
class ReviewResult:
|
||||
"""Represents review execution result"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
review_passed: bool,
|
||||
review_issues: list[ReviewIssue],
|
||||
screenshots: list[str] | None = None,
|
||||
):
|
||||
self.review_passed = review_passed
|
||||
self.review_issues = review_issues
|
||||
self.screenshots = screenshots or []
|
||||
|
||||
def get_blocker_count(self) -> int:
|
||||
"""Get count of blocker issues"""
|
||||
return sum(1 for issue in self.review_issues if issue.issue_severity == "blocker")
|
||||
|
||||
def get_blocker_issues(self) -> list[ReviewIssue]:
|
||||
"""Get list of blocker issues"""
|
||||
return [issue for issue in self.review_issues if issue.issue_severity == "blocker"]
|
||||
|
||||
|
||||
async def run_review(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
spec_file: str,
|
||||
work_order_id: str,
|
||||
working_dir: str,
|
||||
bound_logger: "structlog.stdlib.BoundLogger",
|
||||
) -> ReviewResult:
|
||||
"""Execute review against specification
|
||||
|
||||
Args:
|
||||
executor: Agent CLI executor
|
||||
command_loader: Command loader
|
||||
spec_file: Path to specification file
|
||||
work_order_id: Work order ID
|
||||
working_dir: Working directory
|
||||
bound_logger: Logger instance
|
||||
|
||||
Returns:
|
||||
ReviewResult with issues found
|
||||
"""
|
||||
bound_logger.info("review_execution_started", spec_file=spec_file)
|
||||
|
||||
# Execute review command
|
||||
result = await executor.execute_command(
|
||||
command_name="review_runner",
|
||||
arguments=[spec_file, work_order_id],
|
||||
working_directory=working_dir,
|
||||
logger=bound_logger,
|
||||
)
|
||||
|
||||
if not result.success:
|
||||
bound_logger.error("review_execution_failed", error=result.error_message)
|
||||
# Return empty review result indicating failure
|
||||
return ReviewResult(review_passed=False, review_issues=[])
|
||||
|
||||
# Parse review results from output
|
||||
return parse_review_results(result.result_text or result.stdout or "", bound_logger)
|
||||
|
||||
|
||||
def parse_review_results(
|
||||
output: str, logger: "structlog.stdlib.BoundLogger"
|
||||
) -> ReviewResult:
|
||||
"""Parse review results from JSON output
|
||||
|
||||
Args:
|
||||
output: Command output (should be JSON object)
|
||||
logger: Logger instance
|
||||
|
||||
Returns:
|
||||
ReviewResult
|
||||
"""
|
||||
try:
|
||||
# Try to parse as JSON
|
||||
data = json.loads(output)
|
||||
|
||||
if not isinstance(data, dict):
|
||||
logger.error("review_results_invalid_format", error="Expected JSON object")
|
||||
return ReviewResult(review_passed=False, review_issues=[])
|
||||
|
||||
review_issues = [
|
||||
ReviewIssue.from_dict(issue) for issue in data.get("review_issues", [])
|
||||
]
|
||||
review_passed = data.get("review_passed", False)
|
||||
screenshots = data.get("screenshots", [])
|
||||
|
||||
blocker_count = sum(1 for issue in review_issues if issue.issue_severity == "blocker")
|
||||
|
||||
logger.info(
|
||||
"review_results_parsed",
|
||||
review_passed=review_passed,
|
||||
total_issues=len(review_issues),
|
||||
blockers=blocker_count,
|
||||
)
|
||||
|
||||
return ReviewResult(
|
||||
review_passed=review_passed,
|
||||
review_issues=review_issues,
|
||||
screenshots=screenshots,
|
||||
)
|
||||
|
||||
except json.JSONDecodeError as e:
|
||||
logger.error("review_results_parse_failed", error=str(e), output_preview=output[:500])
|
||||
return ReviewResult(review_passed=False, review_issues=[])
|
||||
|
||||
|
||||
async def resolve_review_issue(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
review_issue: ReviewIssue,
|
||||
work_order_id: str,
|
||||
working_dir: str,
|
||||
bound_logger: "structlog.stdlib.BoundLogger",
|
||||
) -> StepExecutionResult:
|
||||
"""Resolve a single blocker review issue
|
||||
|
||||
Args:
|
||||
executor: Agent CLI executor
|
||||
command_loader: Command loader
|
||||
review_issue: Review issue to resolve
|
||||
work_order_id: Work order ID
|
||||
working_dir: Working directory
|
||||
bound_logger: Logger instance
|
||||
|
||||
Returns:
|
||||
StepExecutionResult with resolution outcome
|
||||
"""
|
||||
bound_logger.info(
|
||||
"review_issue_resolution_started",
|
||||
issue_title=review_issue.issue_title,
|
||||
severity=review_issue.issue_severity,
|
||||
)
|
||||
|
||||
# Convert review issue to JSON for passing to resolve command
|
||||
issue_json = json.dumps(review_issue.to_dict())
|
||||
|
||||
# Execute resolve_failed_review command
|
||||
result = await executor.execute_command(
|
||||
command_name="resolve_failed_review",
|
||||
arguments=[issue_json],
|
||||
working_directory=working_dir,
|
||||
logger=bound_logger,
|
||||
)
|
||||
|
||||
if not result.success:
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.RESOLVE_REVIEW,
|
||||
agent_name=REVIEWER,
|
||||
success=False,
|
||||
output=result.result_text or result.stdout,
|
||||
error_message=f"Review issue resolution failed: {result.error_message}",
|
||||
duration_seconds=result.duration_seconds or 0,
|
||||
session_id=result.session_id,
|
||||
)
|
||||
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.RESOLVE_REVIEW,
|
||||
agent_name=REVIEWER,
|
||||
success=True,
|
||||
output=f"Resolved review issue: {review_issue.issue_title}",
|
||||
error_message=None,
|
||||
duration_seconds=result.duration_seconds or 0,
|
||||
session_id=result.session_id,
|
||||
)
|
||||
|
||||
|
||||
async def run_review_with_resolution(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
spec_file: str,
|
||||
work_order_id: str,
|
||||
working_dir: str,
|
||||
bound_logger: "structlog.stdlib.BoundLogger",
|
||||
max_attempts: int = 3,
|
||||
) -> ReviewResult:
|
||||
"""Run review with automatic blocker resolution and retry logic
|
||||
|
||||
Tech debt and skippable issues are allowed to pass. Only blockers prevent completion.
|
||||
|
||||
Args:
|
||||
executor: Agent CLI executor
|
||||
command_loader: Command loader
|
||||
spec_file: Path to specification file
|
||||
work_order_id: Work order ID
|
||||
working_dir: Working directory
|
||||
bound_logger: Logger instance
|
||||
max_attempts: Maximum retry attempts (default 3)
|
||||
|
||||
Returns:
|
||||
Final ReviewResult
|
||||
"""
|
||||
bound_logger.info("review_workflow_started", max_attempts=max_attempts)
|
||||
|
||||
for attempt in range(1, max_attempts + 1):
|
||||
bound_logger.info("review_attempt_started", attempt=attempt)
|
||||
|
||||
# Run review
|
||||
review_result = await run_review(
|
||||
executor, command_loader, spec_file, work_order_id, working_dir, bound_logger
|
||||
)
|
||||
|
||||
blocker_count = review_result.get_blocker_count()
|
||||
|
||||
if blocker_count == 0:
|
||||
# No blockers, review passes (tech_debt and skippable are acceptable)
|
||||
bound_logger.info(
|
||||
"review_workflow_completed",
|
||||
attempt=attempt,
|
||||
outcome="no_blockers",
|
||||
total_issues=len(review_result.review_issues),
|
||||
)
|
||||
return review_result
|
||||
|
||||
if attempt >= max_attempts:
|
||||
# Max attempts reached
|
||||
bound_logger.warning(
|
||||
"review_workflow_max_attempts_reached",
|
||||
attempt=attempt,
|
||||
blocker_count=blocker_count,
|
||||
)
|
||||
return review_result
|
||||
|
||||
# Resolve each blocker issue
|
||||
blocker_issues = review_result.get_blocker_issues()
|
||||
bound_logger.info(
|
||||
"review_issue_resolution_batch_started",
|
||||
blocker_count=len(blocker_issues),
|
||||
)
|
||||
|
||||
for blocker_issue in blocker_issues:
|
||||
resolution_result = await resolve_review_issue(
|
||||
executor,
|
||||
command_loader,
|
||||
blocker_issue,
|
||||
work_order_id,
|
||||
working_dir,
|
||||
bound_logger,
|
||||
)
|
||||
|
||||
if not resolution_result.success:
|
||||
bound_logger.warning(
|
||||
"review_issue_resolution_failed",
|
||||
issue_title=blocker_issue.issue_title,
|
||||
)
|
||||
|
||||
# Should not reach here, but return last result if we do
|
||||
return review_result
|
||||
@@ -1,311 +0,0 @@
|
||||
"""Test Workflow with Automatic Resolution
|
||||
|
||||
Executes test suite and automatically resolves failures with retry logic (max 4 attempts).
|
||||
"""
|
||||
|
||||
import json
|
||||
from typing import TYPE_CHECKING
|
||||
|
||||
from ..agent_executor.agent_cli_executor import AgentCLIExecutor
|
||||
from ..command_loader.claude_command_loader import ClaudeCommandLoader
|
||||
from ..models import StepExecutionResult, WorkflowStep
|
||||
from ..utils.structured_logger import get_logger
|
||||
from .agent_names import TESTER
|
||||
|
||||
if TYPE_CHECKING:
|
||||
import structlog
|
||||
|
||||
logger = get_logger(__name__)
|
||||
|
||||
|
||||
class TestResult:
|
||||
"""Represents a single test result"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
test_name: str,
|
||||
passed: bool,
|
||||
execution_command: str,
|
||||
test_purpose: str,
|
||||
error: str | None = None,
|
||||
):
|
||||
self.test_name = test_name
|
||||
self.passed = passed
|
||||
self.execution_command = execution_command
|
||||
self.test_purpose = test_purpose
|
||||
self.error = error
|
||||
|
||||
def to_dict(self) -> dict:
|
||||
"""Convert to dictionary for JSON serialization"""
|
||||
return {
|
||||
"test_name": self.test_name,
|
||||
"passed": self.passed,
|
||||
"execution_command": self.execution_command,
|
||||
"test_purpose": self.test_purpose,
|
||||
"error": self.error,
|
||||
}
|
||||
|
||||
@classmethod
|
||||
def from_dict(cls, data: dict) -> "TestResult":
|
||||
"""Create TestResult from dictionary"""
|
||||
return cls(
|
||||
test_name=data["test_name"],
|
||||
passed=data["passed"],
|
||||
execution_command=data["execution_command"],
|
||||
test_purpose=data["test_purpose"],
|
||||
error=data.get("error"),
|
||||
)
|
||||
|
||||
|
||||
async def run_tests(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
work_order_id: str,
|
||||
working_dir: str,
|
||||
bound_logger: "structlog.stdlib.BoundLogger",
|
||||
) -> StepExecutionResult:
|
||||
"""Execute test suite and return results
|
||||
|
||||
Args:
|
||||
executor: Agent CLI executor
|
||||
command_loader: Command loader
|
||||
work_order_id: Work order ID
|
||||
working_dir: Working directory
|
||||
bound_logger: Logger instance
|
||||
|
||||
Returns:
|
||||
StepExecutionResult with test results
|
||||
"""
|
||||
bound_logger.info("test_execution_started")
|
||||
|
||||
# Execute test command
|
||||
result = await executor.execute_command(
|
||||
command_name="test",
|
||||
arguments=[],
|
||||
working_directory=working_dir,
|
||||
logger=bound_logger,
|
||||
)
|
||||
|
||||
if not result.success:
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.TEST,
|
||||
agent_name=TESTER,
|
||||
success=False,
|
||||
output=result.result_text or result.stdout,
|
||||
error_message=f"Test execution failed: {result.error_message}",
|
||||
duration_seconds=result.duration_seconds or 0,
|
||||
session_id=result.session_id,
|
||||
)
|
||||
|
||||
# Parse test results from output
|
||||
test_results, passed_count, failed_count = parse_test_results(
|
||||
result.result_text or result.stdout or "", bound_logger
|
||||
)
|
||||
|
||||
success = failed_count == 0
|
||||
output_summary = f"Tests: {passed_count} passed, {failed_count} failed"
|
||||
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.TEST,
|
||||
agent_name=TESTER,
|
||||
success=success,
|
||||
output=output_summary,
|
||||
error_message=None if success else f"{failed_count} test(s) failed",
|
||||
duration_seconds=result.duration_seconds or 0,
|
||||
session_id=result.session_id,
|
||||
)
|
||||
|
||||
|
||||
def parse_test_results(
|
||||
output: str, logger: "structlog.stdlib.BoundLogger"
|
||||
) -> tuple[list[TestResult], int, int]:
|
||||
"""Parse test results from JSON output
|
||||
|
||||
Args:
|
||||
output: Command output (should be JSON array)
|
||||
logger: Logger instance
|
||||
|
||||
Returns:
|
||||
Tuple of (test_results, passed_count, failed_count)
|
||||
"""
|
||||
try:
|
||||
# Try to parse as JSON
|
||||
data = json.loads(output)
|
||||
|
||||
if not isinstance(data, list):
|
||||
logger.error("test_results_invalid_format", error="Expected JSON array")
|
||||
return [], 0, 0
|
||||
|
||||
test_results = [TestResult.from_dict(item) for item in data]
|
||||
passed_count = sum(1 for t in test_results if t.passed)
|
||||
failed_count = sum(1 for t in test_results if not t.passed)
|
||||
|
||||
logger.info(
|
||||
"test_results_parsed",
|
||||
passed=passed_count,
|
||||
failed=failed_count,
|
||||
total=len(test_results),
|
||||
)
|
||||
|
||||
return test_results, passed_count, failed_count
|
||||
|
||||
except json.JSONDecodeError as e:
|
||||
logger.error("test_results_parse_failed", error=str(e), output_preview=output[:500])
|
||||
return [], 0, 0
|
||||
|
||||
|
||||
async def resolve_failed_test(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
test_result: TestResult,
|
||||
work_order_id: str,
|
||||
working_dir: str,
|
||||
bound_logger: "structlog.stdlib.BoundLogger",
|
||||
) -> StepExecutionResult:
|
||||
"""Resolve a single failed test
|
||||
|
||||
Args:
|
||||
executor: Agent CLI executor
|
||||
command_loader: Command loader
|
||||
test_result: Failed test result
|
||||
work_order_id: Work order ID
|
||||
working_dir: Working directory
|
||||
bound_logger: Logger instance
|
||||
|
||||
Returns:
|
||||
StepExecutionResult with resolution outcome
|
||||
"""
|
||||
bound_logger.info(
|
||||
"test_resolution_started",
|
||||
test_name=test_result.test_name,
|
||||
)
|
||||
|
||||
# Convert test result to JSON for passing to resolve command
|
||||
test_json = json.dumps(test_result.to_dict())
|
||||
|
||||
# Execute resolve_failed_test command
|
||||
result = await executor.execute_command(
|
||||
command_name="resolve_failed_test",
|
||||
arguments=[test_json],
|
||||
working_directory=working_dir,
|
||||
logger=bound_logger,
|
||||
)
|
||||
|
||||
if not result.success:
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.RESOLVE_TEST,
|
||||
agent_name=TESTER,
|
||||
success=False,
|
||||
output=result.result_text or result.stdout,
|
||||
error_message=f"Test resolution failed: {result.error_message}",
|
||||
duration_seconds=result.duration_seconds or 0,
|
||||
session_id=result.session_id,
|
||||
)
|
||||
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.RESOLVE_TEST,
|
||||
agent_name=TESTER,
|
||||
success=True,
|
||||
output=f"Resolved test: {test_result.test_name}",
|
||||
error_message=None,
|
||||
duration_seconds=result.duration_seconds or 0,
|
||||
session_id=result.session_id,
|
||||
)
|
||||
|
||||
|
||||
async def run_tests_with_resolution(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
work_order_id: str,
|
||||
working_dir: str,
|
||||
bound_logger: "structlog.stdlib.BoundLogger",
|
||||
max_attempts: int = 4,
|
||||
) -> tuple[list[TestResult], int, int]:
|
||||
"""Run tests with automatic failure resolution and retry logic
|
||||
|
||||
Args:
|
||||
executor: Agent CLI executor
|
||||
command_loader: Command loader
|
||||
work_order_id: Work order ID
|
||||
working_dir: Working directory
|
||||
bound_logger: Logger instance
|
||||
max_attempts: Maximum retry attempts (default 4)
|
||||
|
||||
Returns:
|
||||
Tuple of (final_test_results, passed_count, failed_count)
|
||||
"""
|
||||
bound_logger.info("test_workflow_started", max_attempts=max_attempts)
|
||||
|
||||
for attempt in range(1, max_attempts + 1):
|
||||
bound_logger.info("test_attempt_started", attempt=attempt)
|
||||
|
||||
# Run tests
|
||||
test_result = await run_tests(
|
||||
executor, command_loader, work_order_id, working_dir, bound_logger
|
||||
)
|
||||
|
||||
if test_result.success:
|
||||
bound_logger.info("test_workflow_completed", attempt=attempt, outcome="all_passed")
|
||||
# Parse final results
|
||||
# Re-run to get the actual test results
|
||||
final_result = await executor.execute_command(
|
||||
command_name="test",
|
||||
arguments=[],
|
||||
working_directory=working_dir,
|
||||
logger=bound_logger,
|
||||
)
|
||||
final_results, passed, failed = parse_test_results(
|
||||
final_result.result_text or final_result.stdout or "", bound_logger
|
||||
)
|
||||
return final_results, passed, failed
|
||||
|
||||
# Parse failures
|
||||
test_execution = await executor.execute_command(
|
||||
command_name="test",
|
||||
arguments=[],
|
||||
working_directory=working_dir,
|
||||
logger=bound_logger,
|
||||
)
|
||||
test_results, passed_count, failed_count = parse_test_results(
|
||||
test_execution.result_text or test_execution.stdout or "", bound_logger
|
||||
)
|
||||
|
||||
if failed_count == 0:
|
||||
# No failures, we're done
|
||||
bound_logger.info("test_workflow_completed", attempt=attempt, outcome="all_passed")
|
||||
return test_results, passed_count, failed_count
|
||||
|
||||
if attempt >= max_attempts:
|
||||
# Max attempts reached
|
||||
bound_logger.warning(
|
||||
"test_workflow_max_attempts_reached",
|
||||
attempt=attempt,
|
||||
failed_count=failed_count,
|
||||
)
|
||||
return test_results, passed_count, failed_count
|
||||
|
||||
# Resolve each failed test
|
||||
failed_tests = [t for t in test_results if not t.passed]
|
||||
bound_logger.info(
|
||||
"test_resolution_batch_started",
|
||||
failed_count=len(failed_tests),
|
||||
)
|
||||
|
||||
for failed_test in failed_tests:
|
||||
resolution_result = await resolve_failed_test(
|
||||
executor,
|
||||
command_loader,
|
||||
failed_test,
|
||||
work_order_id,
|
||||
working_dir,
|
||||
bound_logger,
|
||||
)
|
||||
|
||||
if not resolution_result.success:
|
||||
bound_logger.warning(
|
||||
"test_resolution_failed",
|
||||
test_name=failed_test.test_name,
|
||||
)
|
||||
|
||||
# Should not reach here, but return last results if we do
|
||||
return test_results, passed_count, failed_count
|
||||
@@ -1,7 +1,7 @@
|
||||
"""Workflow Operations
|
||||
|
||||
Atomic operations for workflow execution.
|
||||
Each function executes one discrete agent operation.
|
||||
Command execution functions for user-selectable workflow.
|
||||
Each function loads and executes a command file.
|
||||
"""
|
||||
|
||||
import time
|
||||
@@ -11,134 +11,144 @@ from ..command_loader.claude_command_loader import ClaudeCommandLoader
|
||||
from ..models import StepExecutionResult, WorkflowStep
|
||||
from ..utils.structured_logger import get_logger
|
||||
from .agent_names import (
|
||||
BRANCH_GENERATOR,
|
||||
CLASSIFIER,
|
||||
BRANCH_CREATOR,
|
||||
COMMITTER,
|
||||
IMPLEMENTOR,
|
||||
PLAN_FINDER,
|
||||
PLANNER,
|
||||
PR_CREATOR,
|
||||
REVIEWER,
|
||||
TESTER,
|
||||
)
|
||||
|
||||
logger = get_logger(__name__)
|
||||
|
||||
|
||||
async def classify_issue(
|
||||
async def run_create_branch_step(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
issue_json: str,
|
||||
work_order_id: str,
|
||||
working_dir: str,
|
||||
context: dict,
|
||||
) -> StepExecutionResult:
|
||||
"""Classify issue type using classifier agent
|
||||
"""Execute create-branch.md command
|
||||
|
||||
Returns: StepExecutionResult with issue_class in output (/bug, /feature, /chore)
|
||||
Creates git branch based on user request.
|
||||
|
||||
Args:
|
||||
executor: CLI executor for running claude commands
|
||||
command_loader: Loads command files
|
||||
work_order_id: Work order ID for logging
|
||||
working_dir: Directory to run command in
|
||||
context: Shared context with user_request
|
||||
|
||||
Returns:
|
||||
StepExecutionResult with branch_name in output
|
||||
"""
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
command_file = command_loader.load_command("classifier")
|
||||
command_file = command_loader.load_command("create-branch")
|
||||
|
||||
cli_command, prompt_text = executor.build_command(command_file, args=[issue_json])
|
||||
# Get user request from context
|
||||
user_request = context.get("user_request", "")
|
||||
|
||||
cli_command, prompt_text = executor.build_command(
|
||||
command_file, args=[user_request]
|
||||
)
|
||||
|
||||
result = await executor.execute_async(
|
||||
cli_command, working_dir, prompt_text=prompt_text, work_order_id=work_order_id
|
||||
cli_command, working_dir,
|
||||
prompt_text=prompt_text,
|
||||
work_order_id=work_order_id
|
||||
)
|
||||
|
||||
duration = time.time() - start_time
|
||||
|
||||
if result.success and result.result_text:
|
||||
issue_class = result.result_text.strip()
|
||||
|
||||
branch_name = result.result_text.strip()
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.CLASSIFY,
|
||||
agent_name=CLASSIFIER,
|
||||
step=WorkflowStep.CREATE_BRANCH,
|
||||
agent_name=BRANCH_CREATOR,
|
||||
success=True,
|
||||
output=issue_class,
|
||||
output=branch_name,
|
||||
duration_seconds=duration,
|
||||
session_id=result.session_id,
|
||||
)
|
||||
else:
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.CLASSIFY,
|
||||
agent_name=CLASSIFIER,
|
||||
step=WorkflowStep.CREATE_BRANCH,
|
||||
agent_name=BRANCH_CREATOR,
|
||||
success=False,
|
||||
error_message=result.error_message or "Classification failed",
|
||||
error_message=result.error_message or "Branch creation failed",
|
||||
duration_seconds=duration,
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
duration = time.time() - start_time
|
||||
logger.error("classify_issue_error", error=str(e), exc_info=True)
|
||||
logger.error("create_branch_step_error", error=str(e), exc_info=True)
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.CLASSIFY,
|
||||
agent_name=CLASSIFIER,
|
||||
step=WorkflowStep.CREATE_BRANCH,
|
||||
agent_name=BRANCH_CREATOR,
|
||||
success=False,
|
||||
error_message=str(e),
|
||||
duration_seconds=duration,
|
||||
)
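Every command step in this module follows the same contract: read what it needs from the shared `context`, run its command file through the CLI executor, and wrap the outcome in a `StepExecutionResult`. The standalone sketch below shows that shape with a stubbed step; the dataclass fields mirror the results built above, while the stub logic and the `__main__` harness are purely illustrative.

```python
# Standalone sketch of the per-command contract; only the result fields mirror
# the module above, the branch-name derivation is invented for illustration.
import asyncio
import time
from dataclasses import dataclass


@dataclass
class SketchStepResult:
    step: str
    agent_name: str
    success: bool
    output: str | None = None
    error_message: str | None = None
    duration_seconds: float = 0.0


async def sketch_create_branch_step(context: dict) -> SketchStepResult:
    start = time.time()
    user_request = context.get("user_request", "")
    # A real step would run the create-branch.md command through the CLI executor;
    # here the branch name is simply derived from the request text.
    slug = "-".join(user_request.lower().split()[:4])
    return SketchStepResult(
        step="create-branch",
        agent_name="BranchCreator",
        success=bool(slug),
        output=f"feat/{slug}" if slug else None,
        duration_seconds=time.time() - start,
    )


if __name__ == "__main__":
    result = asyncio.run(sketch_create_branch_step({"user_request": "Add user authentication"}))
    print(result.output)  # feat/add-user-authentication
```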
|
||||
|
||||
|
||||
async def build_plan(
|
||||
async def run_planning_step(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
issue_class: str,
|
||||
issue_number: str,
|
||||
work_order_id: str,
|
||||
issue_json: str,
|
||||
working_dir: str,
|
||||
context: dict,
|
||||
) -> StepExecutionResult:
|
||||
"""Build implementation plan based on issue classification
|
||||
"""Execute planning.md command
|
||||
|
||||
Returns: StepExecutionResult with plan output
|
||||
Creates PRP file based on user request.
|
||||
|
||||
Args:
|
||||
executor: CLI executor for running claude commands
|
||||
command_loader: Loads command files
|
||||
work_order_id: Work order ID for logging
|
||||
working_dir: Directory to run command in
|
||||
context: Shared context with user_request and optional github_issue_number
|
||||
|
||||
Returns:
|
||||
StepExecutionResult with plan_file path in output
|
||||
"""
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
# Map issue class to planner command
|
||||
planner_map = {
|
||||
"/bug": "planner_bug",
|
||||
"/feature": "planner_feature",
|
||||
"/chore": "planner_chore",
|
||||
}
|
||||
command_file = command_loader.load_command("planning")
|
||||
|
||||
planner_command = planner_map.get(issue_class)
|
||||
if not planner_command:
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.PLAN,
|
||||
agent_name=PLANNER,
|
||||
success=False,
|
||||
error_message=f"Unknown issue class: {issue_class}",
|
||||
duration_seconds=time.time() - start_time,
|
||||
)
|
||||
# Get args from context
|
||||
user_request = context.get("user_request", "")
|
||||
github_issue_number = context.get("github_issue_number") or ""
|
||||
|
||||
command_file = command_loader.load_command(planner_command)
|
||||
|
||||
# Pass issue_number, work_order_id, issue_json as arguments
|
||||
cli_command, prompt_text = executor.build_command(
|
||||
command_file, args=[issue_number, work_order_id, issue_json]
|
||||
command_file, args=[user_request, github_issue_number]
|
||||
)
|
||||
|
||||
result = await executor.execute_async(
|
||||
cli_command, working_dir, prompt_text=prompt_text, work_order_id=work_order_id
|
||||
cli_command, working_dir,
|
||||
prompt_text=prompt_text,
|
||||
work_order_id=work_order_id
|
||||
)
|
||||
|
||||
duration = time.time() - start_time
|
||||
|
||||
if result.success:
|
||||
if result.success and result.result_text:
|
||||
plan_file = result.result_text.strip()
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.PLAN,
|
||||
step=WorkflowStep.PLANNING,
|
||||
agent_name=PLANNER,
|
||||
success=True,
|
||||
output=result.result_text or result.stdout or "",
|
||||
output=plan_file,
|
||||
duration_seconds=duration,
|
||||
session_id=result.session_id,
|
||||
)
|
||||
else:
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.PLAN,
|
||||
step=WorkflowStep.PLANNING,
|
||||
agent_name=PLANNER,
|
||||
success=False,
|
||||
error_message=result.error_message or "Planning failed",
|
||||
@@ -147,9 +157,9 @@ async def build_plan(
|
||||
|
||||
except Exception as e:
|
||||
duration = time.time() - start_time
|
||||
logger.error("build_plan_error", error=str(e), exc_info=True)
|
||||
logger.error("planning_step_error", error=str(e), exc_info=True)
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.PLAN,
|
||||
step=WorkflowStep.PLANNING,
|
||||
agent_name=PLANNER,
|
||||
success=False,
|
||||
error_message=str(e),
|
||||
@@ -157,100 +167,62 @@ async def build_plan(
|
||||
)
|
||||
|
||||
|
||||
async def find_plan_file(
|
||||
async def run_execute_step(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
issue_number: str,
|
||||
work_order_id: str,
|
||||
previous_output: str,
|
||||
working_dir: str,
|
||||
context: dict,
|
||||
) -> StepExecutionResult:
|
||||
"""Find plan file created by planner
|
||||
"""Execute execute.md command
|
||||
|
||||
Returns: StepExecutionResult with plan file path in output
|
||||
Implements the PRP plan.
|
||||
|
||||
Args:
|
||||
executor: CLI executor for running claude commands
|
||||
command_loader: Loads command files
|
||||
work_order_id: Work order ID for logging
|
||||
working_dir: Directory to run command in
|
||||
context: Shared context with plan_file from planning step
|
||||
|
||||
Returns:
|
||||
StepExecutionResult with implementation summary in output
|
||||
"""
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
command_file = command_loader.load_command("plan_finder")
|
||||
command_file = command_loader.load_command("execute")
|
||||
|
||||
# Get plan file from context (output of planning step)
|
||||
plan_file = context.get("planning", "")
|
||||
if not plan_file:
|
||||
raise ValueError("No plan file found in context. Planning step must run before execute.")
|
||||
|
||||
cli_command, prompt_text = executor.build_command(
|
||||
command_file, args=[issue_number, work_order_id, previous_output]
|
||||
command_file, args=[plan_file]
|
||||
)
|
||||
|
||||
result = await executor.execute_async(
|
||||
cli_command, working_dir, prompt_text=prompt_text, work_order_id=work_order_id
|
||||
)
|
||||
|
||||
duration = time.time() - start_time
|
||||
|
||||
if result.success and result.result_text and result.result_text.strip() != "0":
|
||||
plan_file_path = result.result_text.strip()
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.FIND_PLAN,
|
||||
agent_name=PLAN_FINDER,
|
||||
success=True,
|
||||
output=plan_file_path,
|
||||
duration_seconds=duration,
|
||||
session_id=result.session_id,
|
||||
)
|
||||
else:
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.FIND_PLAN,
|
||||
agent_name=PLAN_FINDER,
|
||||
success=False,
|
||||
error_message="Plan file not found",
|
||||
duration_seconds=duration,
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
duration = time.time() - start_time
|
||||
logger.error("find_plan_file_error", error=str(e), exc_info=True)
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.FIND_PLAN,
|
||||
agent_name=PLAN_FINDER,
|
||||
success=False,
|
||||
error_message=str(e),
|
||||
duration_seconds=duration,
|
||||
)
|
||||
|
||||
|
||||
async def implement_plan(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
plan_file: str,
|
||||
work_order_id: str,
|
||||
working_dir: str,
|
||||
) -> StepExecutionResult:
|
||||
"""Implement the plan
|
||||
|
||||
Returns: StepExecutionResult with implementation output
|
||||
"""
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
command_file = command_loader.load_command("implementor")
|
||||
|
||||
cli_command, prompt_text = executor.build_command(command_file, args=[plan_file])
|
||||
|
||||
result = await executor.execute_async(
|
||||
cli_command, working_dir, prompt_text=prompt_text, work_order_id=work_order_id
|
||||
cli_command, working_dir,
|
||||
prompt_text=prompt_text,
|
||||
work_order_id=work_order_id
|
||||
)
|
||||
|
||||
duration = time.time() - start_time
|
||||
|
||||
if result.success:
|
||||
implementation_summary = result.result_text or result.stdout or "Implementation completed"
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.IMPLEMENT,
|
||||
step=WorkflowStep.EXECUTE,
|
||||
agent_name=IMPLEMENTOR,
|
||||
success=True,
|
||||
output=result.result_text or result.stdout or "",
|
||||
output=implementation_summary,
|
||||
duration_seconds=duration,
|
||||
session_id=result.session_id,
|
||||
)
|
||||
else:
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.IMPLEMENT,
|
||||
step=WorkflowStep.EXECUTE,
|
||||
agent_name=IMPLEMENTOR,
|
||||
success=False,
|
||||
error_message=result.error_message or "Implementation failed",
|
||||
@@ -259,9 +231,9 @@ async def implement_plan(
|
||||
|
||||
except Exception as e:
|
||||
duration = time.time() - start_time
|
||||
logger.error("implement_plan_error", error=str(e), exc_info=True)
|
||||
logger.error("execute_step_error", error=str(e), exc_info=True)
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.IMPLEMENT,
|
||||
step=WorkflowStep.EXECUTE,
|
||||
agent_name=IMPLEMENTOR,
|
||||
success=False,
|
||||
error_message=str(e),
|
||||
@@ -269,100 +241,52 @@ async def implement_plan(
|
||||
)
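The execute, create-pr, and prp-review steps all fail fast when the context entry they depend on is missing, so an invalid command order surfaces before any CLI call is made. A small sketch of that guard is below; the helper name and the exact error wording are illustrative, the keys match the steps in this module.

```python
# Sketch of the fail-fast context guard used by the dependent steps.
def require_context(context: dict, key: str, producer: str, consumer: str) -> str:
    value = context.get(key, "")
    if not value:
        raise ValueError(f"No {key} found in context. {producer} step must run before {consumer}.")
    return value


# Example: execute needs the plan file the planning step stored under "planning".
context = {"user_request": "Fix login bug", "planning": "PRPs/features/fix-login.md"}
plan_file = require_context(context, "planning", "Planning", "execute")
```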
|
||||
|
||||
|
||||
async def generate_branch(
|
||||
async def run_commit_step(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
issue_class: str,
|
||||
issue_number: str,
|
||||
work_order_id: str,
|
||||
issue_json: str,
|
||||
working_dir: str,
|
||||
context: dict,
|
||||
) -> StepExecutionResult:
|
||||
"""Generate and create git branch
|
||||
"""Execute commit.md command
|
||||
|
||||
Returns: StepExecutionResult with branch name in output
|
||||
Commits changes and pushes to remote.
|
||||
|
||||
Args:
|
||||
executor: CLI executor for running claude commands
|
||||
command_loader: Loads command files
|
||||
work_order_id: Work order ID for logging
|
||||
working_dir: Directory to run command in
|
||||
context: Shared context (no specific args needed)
|
||||
|
||||
Returns:
|
||||
StepExecutionResult with commit_hash and branch_name in output
|
||||
"""
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
command_file = command_loader.load_command("branch_generator")
|
||||
command_file = command_loader.load_command("commit")
|
||||
|
||||
# Commit command doesn't need args (commits all changes)
|
||||
cli_command, prompt_text = executor.build_command(
|
||||
command_file, args=[issue_class, issue_number, work_order_id, issue_json]
|
||||
command_file, args=[]
|
||||
)
|
||||
|
||||
result = await executor.execute_async(
|
||||
cli_command, working_dir, prompt_text=prompt_text, work_order_id=work_order_id
|
||||
cli_command, working_dir,
|
||||
prompt_text=prompt_text,
|
||||
work_order_id=work_order_id
|
||||
)
|
||||
|
||||
duration = time.time() - start_time
|
||||
|
||||
if result.success and result.result_text:
|
||||
branch_name = result.result_text.strip()
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.GENERATE_BRANCH,
|
||||
agent_name=BRANCH_GENERATOR,
|
||||
success=True,
|
||||
output=branch_name,
|
||||
duration_seconds=duration,
|
||||
session_id=result.session_id,
|
||||
)
|
||||
else:
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.GENERATE_BRANCH,
|
||||
agent_name=BRANCH_GENERATOR,
|
||||
success=False,
|
||||
error_message=result.error_message or "Branch generation failed",
|
||||
duration_seconds=duration,
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
duration = time.time() - start_time
|
||||
logger.error("generate_branch_error", error=str(e), exc_info=True)
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.GENERATE_BRANCH,
|
||||
agent_name=BRANCH_GENERATOR,
|
||||
success=False,
|
||||
error_message=str(e),
|
||||
duration_seconds=duration,
|
||||
)
|
||||
|
||||
|
||||
async def create_commit(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
agent_name: str,
|
||||
issue_class: str,
|
||||
issue_json: str,
|
||||
work_order_id: str,
|
||||
working_dir: str,
|
||||
) -> StepExecutionResult:
|
||||
"""Create git commit
|
||||
|
||||
Returns: StepExecutionResult with commit message in output
|
||||
"""
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
command_file = command_loader.load_command("committer")
|
||||
|
||||
cli_command, prompt_text = executor.build_command(
|
||||
command_file, args=[agent_name, issue_class, issue_json]
|
||||
)
|
||||
|
||||
result = await executor.execute_async(
|
||||
cli_command, working_dir, prompt_text=prompt_text, work_order_id=work_order_id
|
||||
)
|
||||
|
||||
duration = time.time() - start_time
|
||||
|
||||
if result.success and result.result_text:
|
||||
commit_message = result.result_text.strip()
|
||||
commit_info = result.result_text.strip()
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.COMMIT,
|
||||
agent_name=COMMITTER,
|
||||
success=True,
|
||||
output=commit_message,
|
||||
output=commit_info,
|
||||
duration_seconds=duration,
|
||||
session_id=result.session_id,
|
||||
)
|
||||
@@ -371,13 +295,13 @@ async def create_commit(
|
||||
step=WorkflowStep.COMMIT,
|
||||
agent_name=COMMITTER,
|
||||
success=False,
|
||||
error_message=result.error_message or "Commit creation failed",
|
||||
error_message=result.error_message or "Commit failed",
|
||||
duration_seconds=duration,
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
duration = time.time() - start_time
|
||||
logger.error("create_commit_error", error=str(e), exc_info=True)
|
||||
logger.error("commit_step_error", error=str(e), exc_info=True)
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.COMMIT,
|
||||
agent_name=COMMITTER,
|
||||
@@ -387,30 +311,47 @@ async def create_commit(
|
||||
)
|
||||
|
||||
|
||||
async def create_pull_request(
|
||||
async def run_create_pr_step(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
branch_name: str,
|
||||
issue_json: str,
|
||||
plan_file: str,
|
||||
work_order_id: str,
|
||||
working_dir: str,
|
||||
context: dict,
|
||||
) -> StepExecutionResult:
|
||||
"""Create GitHub pull request
|
||||
"""Execute create-pr.md command
|
||||
|
||||
Returns: StepExecutionResult with PR URL in output
|
||||
Creates GitHub pull request.
|
||||
|
||||
Args:
|
||||
executor: CLI executor for running claude commands
|
||||
command_loader: Loads command files
|
||||
work_order_id: Work order ID for logging
|
||||
working_dir: Directory to run command in
|
||||
context: Shared context with branch_name and optional plan_file
|
||||
|
||||
Returns:
|
||||
StepExecutionResult with pr_url in output
|
||||
"""
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
command_file = command_loader.load_command("pr_creator")
|
||||
command_file = command_loader.load_command("create-pr")
|
||||
|
||||
# Get args from context
|
||||
branch_name = context.get("create-branch", "")
|
||||
plan_file = context.get("planning", "")
|
||||
|
||||
if not branch_name:
|
||||
raise ValueError("No branch name found in context. create-branch step must run before create-pr.")
|
||||
|
||||
cli_command, prompt_text = executor.build_command(
|
||||
command_file, args=[branch_name, issue_json, plan_file, work_order_id]
|
||||
command_file, args=[branch_name, plan_file]
|
||||
)
|
||||
|
||||
result = await executor.execute_async(
|
||||
cli_command, working_dir, prompt_text=prompt_text, work_order_id=work_order_id
|
||||
cli_command, working_dir,
|
||||
prompt_text=prompt_text,
|
||||
work_order_id=work_order_id
|
||||
)
|
||||
|
||||
duration = time.time() - start_time
|
||||
@@ -436,7 +377,7 @@ async def create_pull_request(
|
||||
|
||||
except Exception as e:
|
||||
duration = time.time() - start_time
|
||||
logger.error("create_pull_request_error", error=str(e), exc_info=True)
|
||||
logger.error("create_pr_step_error", error=str(e), exc_info=True)
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.CREATE_PR,
|
||||
agent_name=PR_CREATOR,
|
||||
@@ -446,149 +387,56 @@ async def create_pull_request(
|
||||
)
|
||||
|
||||
|
||||
async def run_tests(
|
||||
async def run_review_step(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
work_order_id: str,
|
||||
working_dir: str,
|
||||
context: dict,
|
||||
) -> StepExecutionResult:
|
||||
"""Execute test suite
|
||||
"""Execute prp-review.md command
|
||||
|
||||
Returns: StepExecutionResult with test results summary
|
||||
"""
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
command_file = command_loader.load_command("test")
|
||||
|
||||
cli_command, prompt_text = executor.build_command(command_file, args=[])
|
||||
|
||||
result = await executor.execute_async(
|
||||
cli_command, working_dir, prompt_text=prompt_text, work_order_id=work_order_id
|
||||
)
|
||||
|
||||
duration = time.time() - start_time
|
||||
|
||||
if result.success:
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.TEST,
|
||||
agent_name=TESTER,
|
||||
success=True,
|
||||
output=result.result_text or "Tests passed",
|
||||
duration_seconds=duration,
|
||||
session_id=result.session_id,
|
||||
)
|
||||
else:
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.TEST,
|
||||
agent_name=TESTER,
|
||||
success=False,
|
||||
error_message=result.error_message or "Tests failed",
|
||||
output=result.result_text,
|
||||
duration_seconds=duration,
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
duration = time.time() - start_time
|
||||
logger.error("run_tests_error", error=str(e), exc_info=True)
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.TEST,
|
||||
agent_name=TESTER,
|
||||
success=False,
|
||||
error_message=str(e),
|
||||
duration_seconds=duration,
|
||||
)
|
||||
|
||||
|
||||
async def resolve_test_failure(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
test_failure_json: str,
|
||||
work_order_id: str,
|
||||
working_dir: str,
|
||||
) -> StepExecutionResult:
|
||||
"""Resolve a failed test
|
||||
Reviews implementation against PRP specification.
|
||||
|
||||
Args:
|
||||
test_failure_json: JSON string with test failure details
|
||||
executor: CLI executor for running claude commands
|
||||
command_loader: Loads command files
|
||||
work_order_id: Work order ID for logging
|
||||
working_dir: Directory to run command in
|
||||
context: Shared context with plan_file from planning step
|
||||
|
||||
Returns: StepExecutionResult with resolution outcome
|
||||
Returns:
|
||||
StepExecutionResult with review JSON in output
|
||||
"""
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
command_file = command_loader.load_command("resolve_failed_test")
|
||||
command_file = command_loader.load_command("prp-review")
|
||||
|
||||
cli_command, prompt_text = executor.build_command(command_file, args=[test_failure_json])
|
||||
|
||||
result = await executor.execute_async(
|
||||
cli_command, working_dir, prompt_text=prompt_text, work_order_id=work_order_id
|
||||
)
|
||||
|
||||
duration = time.time() - start_time
|
||||
|
||||
if result.success:
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.RESOLVE_TEST,
|
||||
agent_name=TESTER,
|
||||
success=True,
|
||||
output=result.result_text or "Test failure resolved",
|
||||
duration_seconds=duration,
|
||||
session_id=result.session_id,
|
||||
)
|
||||
else:
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.RESOLVE_TEST,
|
||||
agent_name=TESTER,
|
||||
success=False,
|
||||
error_message=result.error_message or "Resolution failed",
|
||||
duration_seconds=duration,
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
duration = time.time() - start_time
|
||||
logger.error("resolve_test_failure_error", error=str(e), exc_info=True)
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.RESOLVE_TEST,
|
||||
agent_name=TESTER,
|
||||
success=False,
|
||||
error_message=str(e),
|
||||
duration_seconds=duration,
|
||||
)
|
||||
|
||||
|
||||
async def run_review(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
spec_file: str,
|
||||
work_order_id: str,
|
||||
working_dir: str,
|
||||
) -> StepExecutionResult:
|
||||
"""Execute review against specification
|
||||
|
||||
Returns: StepExecutionResult with review results
|
||||
"""
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
command_file = command_loader.load_command("review_runner")
|
||||
# Get plan file from context
|
||||
plan_file = context.get("planning", "")
|
||||
if not plan_file:
|
||||
raise ValueError("No plan file found in context. Planning step must run before review.")
|
||||
|
||||
cli_command, prompt_text = executor.build_command(
|
||||
command_file, args=[spec_file, work_order_id]
|
||||
command_file, args=[plan_file]
|
||||
)
|
||||
|
||||
result = await executor.execute_async(
|
||||
cli_command, working_dir, prompt_text=prompt_text, work_order_id=work_order_id
|
||||
cli_command, working_dir,
|
||||
prompt_text=prompt_text,
|
||||
work_order_id=work_order_id
|
||||
)
|
||||
|
||||
duration = time.time() - start_time
|
||||
|
||||
if result.success:
|
||||
review_output = result.result_text or "Review completed"
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.REVIEW,
|
||||
agent_name=REVIEWER,
|
||||
success=True,
|
||||
output=result.result_text or "Review completed",
|
||||
output=review_output,
|
||||
duration_seconds=duration,
|
||||
session_id=result.session_id,
|
||||
)
|
||||
@@ -603,7 +451,7 @@ async def run_review(
|
||||
|
||||
except Exception as e:
|
||||
duration = time.time() - start_time
|
||||
logger.error("run_review_error", error=str(e), exc_info=True)
|
||||
logger.error("review_step_error", error=str(e), exc_info=True)
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.REVIEW,
|
||||
agent_name=REVIEWER,
|
||||
@@ -611,60 +459,3 @@ async def run_review(
|
||||
error_message=str(e),
|
||||
duration_seconds=duration,
|
||||
)
|
||||
|
||||
|
||||
async def resolve_review_issue(
|
||||
executor: AgentCLIExecutor,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
review_issue_json: str,
|
||||
work_order_id: str,
|
||||
working_dir: str,
|
||||
) -> StepExecutionResult:
|
||||
"""Resolve a review blocker issue
|
||||
|
||||
Args:
|
||||
review_issue_json: JSON string with review issue details
|
||||
|
||||
Returns: StepExecutionResult with resolution outcome
|
||||
"""
|
||||
start_time = time.time()
|
||||
|
||||
try:
|
||||
command_file = command_loader.load_command("resolve_failed_review")
|
||||
|
||||
cli_command, prompt_text = executor.build_command(command_file, args=[review_issue_json])
|
||||
|
||||
result = await executor.execute_async(
|
||||
cli_command, working_dir, prompt_text=prompt_text, work_order_id=work_order_id
|
||||
)
|
||||
|
||||
duration = time.time() - start_time
|
||||
|
||||
if result.success:
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.RESOLVE_REVIEW,
|
||||
agent_name=REVIEWER,
|
||||
success=True,
|
||||
output=result.result_text or "Review issue resolved",
|
||||
duration_seconds=duration,
|
||||
session_id=result.session_id,
|
||||
)
|
||||
else:
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.RESOLVE_REVIEW,
|
||||
agent_name=REVIEWER,
|
||||
success=False,
|
||||
error_message=result.error_message or "Resolution failed",
|
||||
duration_seconds=duration,
|
||||
)
|
||||
|
||||
except Exception as e:
|
||||
duration = time.time() - start_time
|
||||
logger.error("resolve_review_issue_error", error=str(e), exc_info=True)
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.RESOLVE_REVIEW,
|
||||
agent_name=REVIEWER,
|
||||
success=False,
|
||||
error_message=str(e),
|
||||
duration_seconds=duration,
|
||||
)
|
||||
|
||||
@@ -3,26 +3,21 @@
|
||||
Main orchestration logic for workflow execution.
|
||||
"""
|
||||
|
||||
import json
|
||||
import re
|
||||
|
||||
from ..agent_executor.agent_cli_executor import AgentCLIExecutor
|
||||
from ..command_loader.claude_command_loader import ClaudeCommandLoader
|
||||
from ..github_integration.github_client import GitHubClient
|
||||
from ..models import (
|
||||
AgentWorkflowType,
|
||||
AgentWorkOrderStatus,
|
||||
SandboxType,
|
||||
StepHistory,
|
||||
WorkflowExecutionError,
|
||||
)
|
||||
from ..sandbox_manager.sandbox_factory import SandboxFactory
|
||||
from ..state_manager.file_state_repository import FileStateRepository
|
||||
from ..state_manager.work_order_repository import WorkOrderRepository
|
||||
from ..utils.id_generator import generate_sandbox_identifier
|
||||
from ..utils.structured_logger import get_logger
|
||||
from . import workflow_operations
|
||||
from .agent_names import IMPLEMENTOR
|
||||
from .workflow_phase_tracker import WorkflowPhaseTracker
|
||||
|
||||
logger = get_logger(__name__)
|
||||
|
||||
@@ -35,14 +30,12 @@ class WorkflowOrchestrator:
|
||||
agent_executor: AgentCLIExecutor,
|
||||
sandbox_factory: SandboxFactory,
|
||||
github_client: GitHubClient,
|
||||
phase_tracker: WorkflowPhaseTracker,
|
||||
command_loader: ClaudeCommandLoader,
|
||||
state_repository: WorkOrderRepository,
|
||||
state_repository: WorkOrderRepository | FileStateRepository,
|
||||
):
|
||||
self.agent_executor = agent_executor
|
||||
self.sandbox_factory = sandbox_factory
|
||||
self.github_client = github_client
|
||||
self.phase_tracker = phase_tracker
|
||||
self.command_loader = command_loader
|
||||
self.state_repository = state_repository
|
||||
self._logger = logger
|
||||
@@ -50,36 +43,42 @@ class WorkflowOrchestrator:
|
||||
async def execute_workflow(
|
||||
self,
|
||||
agent_work_order_id: str,
|
||||
workflow_type: AgentWorkflowType,
|
||||
repository_url: str,
|
||||
sandbox_type: SandboxType,
|
||||
user_request: str,
|
||||
selected_commands: list[str] | None = None,
|
||||
github_issue_number: str | None = None,
|
||||
github_issue_json: str | None = None,
|
||||
) -> None:
|
||||
"""Execute workflow as sequence of atomic operations
|
||||
"""Execute user-selected commands in sequence
|
||||
|
||||
This runs in the background and updates state as it progresses.
|
||||
|
||||
Args:
|
||||
agent_work_order_id: Work order ID
|
||||
workflow_type: Workflow to execute
|
||||
repository_url: Git repository URL
|
||||
sandbox_type: Sandbox environment type
|
||||
user_request: User's description of the work to be done
|
||||
selected_commands: Commands to run in sequence (default: full workflow)
|
||||
github_issue_number: Optional GitHub issue number
|
||||
github_issue_json: Optional GitHub issue JSON
|
||||
"""
|
||||
# Default commands if not provided
|
||||
if selected_commands is None:
|
||||
selected_commands = ["create-branch", "planning", "execute", "commit", "create-pr"]
|
||||
|
||||
bound_logger = self._logger.bind(
|
||||
agent_work_order_id=agent_work_order_id,
|
||||
workflow_type=workflow_type.value,
|
||||
sandbox_type=sandbox_type.value,
|
||||
selected_commands=selected_commands,
|
||||
)
|
||||
|
||||
bound_logger.info("agent_work_order_started")
|
||||
|
||||
# Initialize step history
|
||||
# Initialize step history and context
|
||||
step_history = StepHistory(agent_work_order_id=agent_work_order_id)
|
||||
context = {
|
||||
"user_request": user_request,
|
||||
"github_issue_number": github_issue_number,
|
||||
}
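The shared context starts with only the user's request and the optional issue number; as the command loop runs, each command's output is stored under the command's own name, which is how later commands find the branch name and plan file. An illustrative snapshot of the context after a default run is shown below; the keys are the command names used in this orchestrator, the values are invented examples.

```python
# Illustrative snapshot of the shared context after the default command sequence;
# keys are real command names, values are made-up examples.
context = {
    "user_request": "Add user authentication",
    "github_issue_number": "42",
    "create-branch": "feat/add-user-authentication",          # output of create-branch
    "planning": "PRPs/features/add-user-authentication.md",   # output of planning
    "execute": "Implementation completed",                    # output of execute
    "commit": "abc1234 on feat/add-user-authentication",      # output of commit
    "create-pr": "https://github.com/owner/repo/pull/123",    # output of create-pr
}
```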
|
||||
|
||||
sandbox = None
|
||||
|
||||
@@ -97,246 +96,80 @@ class WorkflowOrchestrator:
|
||||
await sandbox.setup()
|
||||
bound_logger.info("sandbox_created", sandbox_identifier=sandbox_identifier)
|
||||
|
||||
# Parse GitHub issue from user request if mentioned
|
||||
issue_match = re.search(r'(?:issue|#)\s*#?(\d+)', user_request, re.IGNORECASE)
|
||||
if issue_match and not github_issue_number:
|
||||
github_issue_number = issue_match.group(1)
|
||||
bound_logger.info("github_issue_detected_in_request", issue_number=github_issue_number)
|
||||
# Command mapping
|
||||
command_map = {
|
||||
"create-branch": workflow_operations.run_create_branch_step,
|
||||
"planning": workflow_operations.run_planning_step,
|
||||
"execute": workflow_operations.run_execute_step,
|
||||
"commit": workflow_operations.run_commit_step,
|
||||
"create-pr": workflow_operations.run_create_pr_step,
|
||||
"prp-review": workflow_operations.run_review_step,
|
||||
}
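With the map in place, the stitching loop reduces to: look up the coroutine, await it with the shared context, persist the step, stop on the first failure, and record the output under the command name. The condensed, standalone sketch below mirrors that control flow; the stub step and in-memory history are illustrative stand-ins for `workflow_operations` and the state repository.

```python
# Condensed sketch of the command-stitching loop; stub step and history are
# illustrative, the control flow mirrors the orchestrator below.
import asyncio


async def stub_step(name: str, context: dict) -> tuple[bool, str]:
    # Stand-in for workflow_operations.run_<name>_step; returns (success, output).
    return True, f"{name}-output"


async def run_selected(selected_commands: list[str], context: dict) -> list[str]:
    history: list[str] = []
    for command_name in selected_commands:
        success, output = await stub_step(command_name, context)
        history.append(f"{command_name}: {'ok' if success else 'failed'}")
        if not success:
            break                       # stop the whole workflow on the first failure
        context[command_name] = output  # later commands read this by command name
    return history


print(asyncio.run(run_selected(["create-branch", "planning", "execute"], {"user_request": "demo"})))
```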
|
||||
|
||||
# Fetch GitHub issue if number provided
|
||||
if github_issue_number and not github_issue_json:
|
||||
try:
|
||||
issue_data = await self.github_client.get_issue(repository_url, github_issue_number)
|
||||
github_issue_json = json.dumps(issue_data)
|
||||
bound_logger.info("github_issue_fetched", issue_number=github_issue_number)
|
||||
except Exception as e:
|
||||
bound_logger.warning("github_issue_fetch_failed", error=str(e))
|
||||
# Continue without issue data - use user_request only
|
||||
# Execute each command in sequence
|
||||
for command_name in selected_commands:
|
||||
if command_name not in command_map:
|
||||
raise WorkflowExecutionError(f"Unknown command: {command_name}")
|
||||
|
||||
# Prepare classification input: merge user request with issue data if available
|
||||
classification_input = user_request
|
||||
if github_issue_json:
|
||||
issue_data = json.loads(github_issue_json)
|
||||
classification_input = f"User Request: {user_request}\n\nGitHub Issue Details:\nTitle: {issue_data.get('title', '')}\nBody: {issue_data.get('body', '')}"
|
||||
bound_logger.info("command_execution_started", command=command_name)
|
||||
|
||||
# Step 1: Classify issue
|
||||
classify_result = await workflow_operations.classify_issue(
|
||||
self.agent_executor,
|
||||
self.command_loader,
|
||||
classification_input,
|
||||
agent_work_order_id,
|
||||
sandbox.working_dir,
|
||||
)
|
||||
step_history.steps.append(classify_result)
|
||||
command_func = command_map[command_name]
|
||||
|
||||
# Execute command
|
||||
result = await command_func(
|
||||
executor=self.agent_executor,
|
||||
command_loader=self.command_loader,
|
||||
work_order_id=agent_work_order_id,
|
||||
working_dir=sandbox.working_dir,
|
||||
context=context,
|
||||
)
|
||||
|
||||
# Save step result
|
||||
step_history.steps.append(result)
|
||||
await self.state_repository.save_step_history(
|
||||
agent_work_order_id, step_history
|
||||
)
|
||||
|
||||
# Log completion
|
||||
bound_logger.info(
|
||||
"command_execution_completed",
|
||||
command=command_name,
|
||||
success=result.success,
|
||||
duration=result.duration_seconds,
|
||||
)
|
||||
|
||||
# STOP on failure
|
||||
if not result.success:
|
||||
await self.state_repository.update_status(
|
||||
agent_work_order_id,
|
||||
AgentWorkOrderStatus.FAILED,
|
||||
error_message=result.error_message,
|
||||
)
|
||||
raise WorkflowExecutionError(
|
||||
f"Command '{command_name}' failed: {result.error_message}"
|
||||
)
|
||||
|
||||
# Store output in context for next command
|
||||
context[command_name] = result.output
|
||||
|
||||
# Special handling for specific commands
|
||||
if command_name == "create-branch":
|
||||
await self.state_repository.update_git_branch(
|
||||
agent_work_order_id, result.output or ""
|
||||
)
|
||||
elif command_name == "create-pr":
|
||||
await self.state_repository.update_status(
|
||||
agent_work_order_id,
|
||||
AgentWorkOrderStatus.COMPLETED,
|
||||
github_pull_request_url=result.output,
|
||||
)
|
||||
# Save final step history
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
bound_logger.info("agent_work_order_completed", total_steps=len(step_history.steps))
|
||||
return # Exit early if PR created
|
||||
|
||||
# Save final step history
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
|
||||
if not classify_result.success:
|
||||
raise WorkflowExecutionError(
|
||||
f"Classification failed: {classify_result.error_message}"
|
||||
)
|
||||
|
||||
issue_class = classify_result.output
|
||||
bound_logger.info("step_completed", step="classify", issue_class=issue_class)
|
||||
|
||||
# Step 2: Build plan
|
||||
plan_result = await workflow_operations.build_plan(
|
||||
self.agent_executor,
|
||||
self.command_loader,
|
||||
issue_class or "",
|
||||
github_issue_number or "",
|
||||
agent_work_order_id,
|
||||
classification_input,
|
||||
sandbox.working_dir,
|
||||
)
|
||||
step_history.steps.append(plan_result)
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
|
||||
if not plan_result.success:
|
||||
raise WorkflowExecutionError(f"Planning failed: {plan_result.error_message}")
|
||||
|
||||
bound_logger.info("step_completed", step="plan")
|
||||
|
||||
# Step 3: Find plan file
|
||||
plan_finder_result = await workflow_operations.find_plan_file(
|
||||
self.agent_executor,
|
||||
self.command_loader,
|
||||
github_issue_number or "",
|
||||
agent_work_order_id,
|
||||
plan_result.output or "",
|
||||
sandbox.working_dir,
|
||||
)
|
||||
step_history.steps.append(plan_finder_result)
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
|
||||
if not plan_finder_result.success:
|
||||
raise WorkflowExecutionError(
|
||||
f"Plan file not found: {plan_finder_result.error_message}"
|
||||
)
|
||||
|
||||
plan_file = plan_finder_result.output
|
||||
bound_logger.info("step_completed", step="find_plan", plan_file=plan_file)
|
||||
|
||||
# Step 4: Generate branch
|
||||
branch_result = await workflow_operations.generate_branch(
|
||||
self.agent_executor,
|
||||
self.command_loader,
|
||||
issue_class or "",
|
||||
github_issue_number or "",
|
||||
agent_work_order_id,
|
||||
classification_input,
|
||||
sandbox.working_dir,
|
||||
)
|
||||
step_history.steps.append(branch_result)
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
|
||||
if not branch_result.success:
|
||||
raise WorkflowExecutionError(
|
||||
f"Branch creation failed: {branch_result.error_message}"
|
||||
)
|
||||
|
||||
git_branch_name = branch_result.output
|
||||
await self.state_repository.update_git_branch(agent_work_order_id, git_branch_name or "")
|
||||
bound_logger.info("step_completed", step="branch", branch_name=git_branch_name)
|
||||
|
||||
# Step 5: Implement plan
|
||||
implement_result = await workflow_operations.implement_plan(
|
||||
self.agent_executor,
|
||||
self.command_loader,
|
||||
plan_file or "",
|
||||
agent_work_order_id,
|
||||
sandbox.working_dir,
|
||||
)
|
||||
step_history.steps.append(implement_result)
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
|
||||
if not implement_result.success:
|
||||
raise WorkflowExecutionError(
|
||||
f"Implementation failed: {implement_result.error_message}"
|
||||
)
|
||||
|
||||
bound_logger.info("step_completed", step="implement")
|
||||
|
||||
# Step 6: Commit changes
|
||||
commit_result = await workflow_operations.create_commit(
|
||||
self.agent_executor,
|
||||
self.command_loader,
|
||||
IMPLEMENTOR,
|
||||
issue_class or "",
|
||||
classification_input,
|
||||
agent_work_order_id,
|
||||
sandbox.working_dir,
|
||||
)
|
||||
step_history.steps.append(commit_result)
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
|
||||
if not commit_result.success:
|
||||
raise WorkflowExecutionError(f"Commit failed: {commit_result.error_message}")
|
||||
|
||||
bound_logger.info("step_completed", step="commit")
|
||||
|
||||
# Step 7: Run tests (if enabled)
|
||||
from ..config import config
|
||||
if config.ENABLE_TEST_PHASE:
|
||||
from .test_workflow import run_tests_with_resolution
|
||||
|
||||
bound_logger.info("test_phase_started")
|
||||
test_results, passed_count, failed_count = await run_tests_with_resolution(
|
||||
self.agent_executor,
|
||||
self.command_loader,
|
||||
agent_work_order_id,
|
||||
sandbox.working_dir,
|
||||
bound_logger,
|
||||
max_attempts=config.MAX_TEST_RETRY_ATTEMPTS,
|
||||
)
|
||||
|
||||
# Record test execution in step history
|
||||
test_summary = f"Tests: {passed_count} passed, {failed_count} failed"
|
||||
from ..models import StepExecutionResult
|
||||
test_step = StepExecutionResult(
|
||||
step=WorkflowStep.TEST,
|
||||
agent_name="Tester",
|
||||
success=(failed_count == 0),
|
||||
output=test_summary,
|
||||
error_message=f"{failed_count} test(s) failed" if failed_count > 0 else None,
|
||||
duration_seconds=0,
|
||||
)
|
||||
step_history.steps.append(test_step)
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
|
||||
if failed_count > 0:
|
||||
bound_logger.warning("test_phase_completed_with_failures", failed_count=failed_count)
|
||||
else:
|
||||
bound_logger.info("test_phase_completed", passed_count=passed_count)
|
||||
|
||||
# Step 8: Run review (if enabled)
|
||||
if config.ENABLE_REVIEW_PHASE:
|
||||
from .review_workflow import run_review_with_resolution
|
||||
|
||||
# Determine spec file path from plan_file or default
|
||||
spec_file = plan_file if plan_file else f"PRPs/specs/{issue_class}-spec.md"
|
||||
|
||||
bound_logger.info("review_phase_started", spec_file=spec_file)
|
||||
review_result = await run_review_with_resolution(
|
||||
self.agent_executor,
|
||||
self.command_loader,
|
||||
spec_file,
|
||||
agent_work_order_id,
|
||||
sandbox.working_dir,
|
||||
bound_logger,
|
||||
max_attempts=config.MAX_REVIEW_RETRY_ATTEMPTS,
|
||||
)
|
||||
|
||||
# Record review execution in step history
|
||||
blocker_count = review_result.get_blocker_count()
|
||||
review_summary = f"Review: {len(review_result.review_issues)} issues found, {blocker_count} blockers"
|
||||
review_step = StepExecutionResult(
|
||||
step=WorkflowStep.REVIEW,
|
||||
agent_name="Reviewer",
|
||||
success=(blocker_count == 0),
|
||||
output=review_summary,
|
||||
error_message=f"{blocker_count} blocker(s) remaining" if blocker_count > 0 else None,
|
||||
duration_seconds=0,
|
||||
)
|
||||
step_history.steps.append(review_step)
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
|
||||
if blocker_count > 0:
|
||||
bound_logger.warning("review_phase_completed_with_blockers", blocker_count=blocker_count)
|
||||
else:
|
||||
bound_logger.info("review_phase_completed", issue_count=len(review_result.review_issues))
|
||||
|
||||
# Step 9: Create PR
|
||||
pr_result = await workflow_operations.create_pull_request(
|
||||
self.agent_executor,
|
||||
self.command_loader,
|
||||
git_branch_name or "",
|
||||
classification_input,
|
||||
plan_file or "",
|
||||
agent_work_order_id,
|
||||
sandbox.working_dir,
|
||||
)
|
||||
step_history.steps.append(pr_result)
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
|
||||
if pr_result.success:
|
||||
pr_url = pr_result.output
|
||||
await self.state_repository.update_status(
|
||||
agent_work_order_id,
|
||||
AgentWorkOrderStatus.COMPLETED,
|
||||
github_pull_request_url=pr_url,
|
||||
)
|
||||
bound_logger.info("step_completed", step="create_pr", pr_url=pr_url)
|
||||
else:
|
||||
# PR creation failed but workflow succeeded
|
||||
await self.state_repository.update_status(
|
||||
agent_work_order_id,
|
||||
AgentWorkOrderStatus.COMPLETED,
|
||||
error_message=f"PR creation failed: {pr_result.error_message}",
|
||||
)
|
||||
|
||||
# Save step history to state
|
||||
await self.state_repository.save_step_history(agent_work_order_id, step_history)
|
||||
|
||||
bound_logger.info("agent_work_order_completed", total_steps=len(step_history.steps))
|
||||
|
||||
except Exception as e:
|
||||
|
||||
@@ -1,137 +0,0 @@
|
||||
"""Workflow Phase Tracker
|
||||
|
||||
Tracks workflow phases by inspecting git commits.
|
||||
"""
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
from ..models import AgentWorkflowPhase, GitProgressSnapshot
|
||||
from ..utils import git_operations
|
||||
from ..utils.structured_logger import get_logger
|
||||
|
||||
logger = get_logger(__name__)
|
||||
|
||||
|
||||
class WorkflowPhaseTracker:
|
||||
"""Tracks workflow execution phases via git inspection"""
|
||||
|
||||
def __init__(self):
|
||||
self._logger = logger
|
||||
|
||||
async def get_current_phase(
|
||||
self, git_branch_name: str, repo_path: str | Path
|
||||
) -> AgentWorkflowPhase:
|
||||
"""Determine current phase by inspecting git commits
|
||||
|
||||
Args:
|
||||
git_branch_name: Git branch name
|
||||
repo_path: Path to git repository
|
||||
|
||||
Returns:
|
||||
Current workflow phase
|
||||
"""
|
||||
self._logger.info(
|
||||
"workflow_phase_detection_started",
|
||||
git_branch_name=git_branch_name,
|
||||
)
|
||||
|
||||
try:
|
||||
commits = await git_operations.get_commit_count(git_branch_name, repo_path)
|
||||
has_planning = await git_operations.has_planning_commits(
|
||||
git_branch_name, repo_path
|
||||
)
|
||||
|
||||
if has_planning and commits > 0:
|
||||
phase = AgentWorkflowPhase.COMPLETED
|
||||
else:
|
||||
phase = AgentWorkflowPhase.PLANNING
|
||||
|
||||
self._logger.info(
|
||||
"workflow_phase_detected",
|
||||
git_branch_name=git_branch_name,
|
||||
phase=phase.value,
|
||||
commits=commits,
|
||||
has_planning=has_planning,
|
||||
)
|
||||
|
||||
return phase
|
||||
|
||||
except Exception as e:
|
||||
self._logger.error(
|
||||
"workflow_phase_detection_failed",
|
||||
git_branch_name=git_branch_name,
|
||||
error=str(e),
|
||||
exc_info=True,
|
||||
)
|
||||
# Default to PLANNING if detection fails
|
||||
return AgentWorkflowPhase.PLANNING
|
||||
|
||||
async def get_git_progress_snapshot(
|
||||
self,
|
||||
agent_work_order_id: str,
|
||||
git_branch_name: str,
|
||||
repo_path: str | Path,
|
||||
) -> GitProgressSnapshot:
|
||||
"""Get git progress for UI display
|
||||
|
||||
Args:
|
||||
agent_work_order_id: Work order ID
|
||||
git_branch_name: Git branch name
|
||||
repo_path: Path to git repository
|
||||
|
||||
Returns:
|
||||
GitProgressSnapshot with current progress
|
||||
"""
|
||||
self._logger.info(
|
||||
"git_progress_snapshot_started",
|
||||
agent_work_order_id=agent_work_order_id,
|
||||
git_branch_name=git_branch_name,
|
||||
)
|
||||
|
||||
try:
|
||||
current_phase = await self.get_current_phase(git_branch_name, repo_path)
|
||||
commit_count = await git_operations.get_commit_count(
|
||||
git_branch_name, repo_path
|
||||
)
|
||||
files_changed = await git_operations.get_files_changed(
|
||||
git_branch_name, repo_path
|
||||
)
|
||||
latest_commit = await git_operations.get_latest_commit_message(
|
||||
git_branch_name, repo_path
|
||||
)
|
||||
|
||||
snapshot = GitProgressSnapshot(
|
||||
agent_work_order_id=agent_work_order_id,
|
||||
current_phase=current_phase,
|
||||
git_commit_count=commit_count,
|
||||
git_files_changed=files_changed,
|
||||
latest_commit_message=latest_commit,
|
||||
git_branch_name=git_branch_name,
|
||||
)
|
||||
|
||||
self._logger.info(
|
||||
"git_progress_snapshot_completed",
|
||||
agent_work_order_id=agent_work_order_id,
|
||||
phase=current_phase.value,
|
||||
commits=commit_count,
|
||||
files=files_changed,
|
||||
)
|
||||
|
||||
return snapshot
|
||||
|
||||
except Exception as e:
|
||||
self._logger.error(
|
||||
"git_progress_snapshot_failed",
|
||||
agent_work_order_id=agent_work_order_id,
|
||||
error=str(e),
|
||||
exc_info=True,
|
||||
)
|
||||
# Return minimal snapshot on error
|
||||
return GitProgressSnapshot(
|
||||
agent_work_order_id=agent_work_order_id,
|
||||
current_phase=AgentWorkflowPhase.PLANNING,
|
||||
git_commit_count=0,
|
||||
git_files_changed=0,
|
||||
latest_commit_message=None,
|
||||
git_branch_name=git_branch_name,
|
||||
)
|
||||
@@ -72,7 +72,6 @@ def test_agent_work_order_creation():
|
||||
sandbox_identifier="sandbox-wo-test123",
|
||||
git_branch_name="feat-wo-test123",
|
||||
agent_session_id="session-123",
|
||||
workflow_type=AgentWorkflowType.PLAN,
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
github_issue_number="42",
|
||||
status=AgentWorkOrderStatus.RUNNING,
|
||||
@@ -86,7 +85,7 @@ def test_agent_work_order_creation():
|
||||
)
|
||||
|
||||
assert work_order.agent_work_order_id == "wo-test123"
|
||||
assert work_order.workflow_type == AgentWorkflowType.PLAN
|
||||
assert work_order.sandbox_type == SandboxType.GIT_BRANCH
|
||||
assert work_order.status == AgentWorkOrderStatus.RUNNING
|
||||
assert work_order.current_phase == AgentWorkflowPhase.PLANNING
|
||||
|
||||
@@ -96,16 +95,15 @@ def test_create_agent_work_order_request():
|
||||
request = CreateAgentWorkOrderRequest(
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
workflow_type=AgentWorkflowType.PLAN,
|
||||
user_request="Add user authentication feature",
|
||||
github_issue_number="42",
|
||||
)
|
||||
|
||||
assert request.repository_url == "https://github.com/owner/repo"
|
||||
assert request.sandbox_type == SandboxType.GIT_BRANCH
|
||||
assert request.workflow_type == AgentWorkflowType.PLAN
|
||||
assert request.user_request == "Add user authentication feature"
|
||||
assert request.github_issue_number == "42"
|
||||
assert request.selected_commands == ["create-branch", "planning", "execute", "commit", "create-pr"]
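The tests here pin the default `selected_commands`; a caller that also wants the optional review pass would list `prp-review` explicitly. A sketch of such a request is below, using only field names that appear in these tests; the import path follows the test modules in this diff and is otherwise an assumption.

```python
# Sketch of a request that opts into the optional prp-review command; field names
# come from the tests above, the import path is assumed.
from src.agent_work_orders.models import CreateAgentWorkOrderRequest, SandboxType

request = CreateAgentWorkOrderRequest(
    repository_url="https://github.com/owner/repo",
    sandbox_type=SandboxType.GIT_BRANCH,
    user_request="Add user authentication feature",
    selected_commands=["create-branch", "planning", "execute", "commit", "create-pr", "prp-review"],
)
```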
|
||||
|
||||
|
||||
def test_create_agent_work_order_request_optional_fields():
|
||||
@@ -113,12 +111,12 @@ def test_create_agent_work_order_request_optional_fields():
|
||||
request = CreateAgentWorkOrderRequest(
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
workflow_type=AgentWorkflowType.PLAN,
|
||||
user_request="Fix the login bug",
|
||||
)
|
||||
|
||||
assert request.user_request == "Fix the login bug"
|
||||
assert request.github_issue_number is None
|
||||
assert request.selected_commands == ["create-branch", "planning", "execute", "commit", "create-pr"]
|
||||
|
||||
|
||||
def test_create_agent_work_order_request_with_user_request():
|
||||
@@ -126,13 +124,13 @@ def test_create_agent_work_order_request_with_user_request():
|
||||
request = CreateAgentWorkOrderRequest(
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
workflow_type=AgentWorkflowType.PLAN,
|
||||
user_request="Add user authentication with JWT tokens",
|
||||
)
|
||||
|
||||
assert request.user_request == "Add user authentication with JWT tokens"
|
||||
assert request.repository_url == "https://github.com/owner/repo"
|
||||
assert request.github_issue_number is None
|
||||
assert request.selected_commands == ["create-branch", "planning", "execute", "commit", "create-pr"]
|
||||
|
||||
|
||||
def test_create_agent_work_order_request_with_github_issue():
|
||||
@@ -140,43 +138,40 @@ def test_create_agent_work_order_request_with_github_issue():
|
||||
request = CreateAgentWorkOrderRequest(
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
workflow_type=AgentWorkflowType.PLAN,
|
||||
user_request="Implement the feature described in issue #42",
|
||||
github_issue_number="42",
|
||||
)
|
||||
|
||||
assert request.user_request == "Implement the feature described in issue #42"
|
||||
assert request.github_issue_number == "42"
|
||||
assert request.selected_commands == ["create-branch", "planning", "execute", "commit", "create-pr"]
|
||||
|
||||
|
||||
def test_workflow_step_enum():
|
||||
"""Test WorkflowStep enum values"""
|
||||
assert WorkflowStep.CLASSIFY.value == "classify"
|
||||
assert WorkflowStep.PLAN.value == "plan"
|
||||
assert WorkflowStep.FIND_PLAN.value == "find_plan"
|
||||
assert WorkflowStep.IMPLEMENT.value == "implement"
|
||||
assert WorkflowStep.GENERATE_BRANCH.value == "generate_branch"
|
||||
assert WorkflowStep.CREATE_BRANCH.value == "create-branch"
|
||||
assert WorkflowStep.PLANNING.value == "planning"
|
||||
assert WorkflowStep.EXECUTE.value == "execute"
|
||||
assert WorkflowStep.COMMIT.value == "commit"
|
||||
assert WorkflowStep.REVIEW.value == "review"
|
||||
assert WorkflowStep.TEST.value == "test"
|
||||
assert WorkflowStep.CREATE_PR.value == "create_pr"
|
||||
assert WorkflowStep.CREATE_PR.value == "create-pr"
|
||||
assert WorkflowStep.REVIEW.value == "prp-review"
|
||||
|
||||
|
||||
def test_step_execution_result_success():
|
||||
"""Test creating successful StepExecutionResult"""
|
||||
result = StepExecutionResult(
|
||||
step=WorkflowStep.CLASSIFY,
|
||||
agent_name="classifier",
|
||||
step=WorkflowStep.CREATE_BRANCH,
|
||||
agent_name="BranchCreator",
|
||||
success=True,
|
||||
output="/feature",
|
||||
output="feat/add-feature",
|
||||
duration_seconds=1.5,
|
||||
session_id="session-123",
|
||||
)
|
||||
|
||||
assert result.step == WorkflowStep.CLASSIFY
|
||||
assert result.agent_name == "classifier"
|
||||
assert result.step == WorkflowStep.CREATE_BRANCH
|
||||
assert result.agent_name == "BranchCreator"
|
||||
assert result.success is True
|
||||
assert result.output == "/feature"
|
||||
assert result.output == "feat/add-feature"
|
||||
assert result.error_message is None
|
||||
assert result.duration_seconds == 1.5
|
||||
assert result.session_id == "session-123"
|
||||
@@ -186,15 +181,15 @@ def test_step_execution_result_success():
|
||||
def test_step_execution_result_failure():
|
||||
"""Test creating failed StepExecutionResult"""
|
||||
result = StepExecutionResult(
|
||||
step=WorkflowStep.PLAN,
|
||||
agent_name="planner",
|
||||
step=WorkflowStep.PLANNING,
|
||||
agent_name="Planner",
|
||||
success=False,
|
||||
error_message="Planning failed: timeout",
|
||||
duration_seconds=30.0,
|
||||
)
|
||||
|
||||
assert result.step == WorkflowStep.PLAN
|
||||
assert result.agent_name == "planner"
|
||||
assert result.step == WorkflowStep.PLANNING
|
||||
assert result.agent_name == "Planner"
|
||||
assert result.success is False
|
||||
assert result.output is None
|
||||
assert result.error_message == "Planning failed: timeout"
|
||||
@@ -213,18 +208,18 @@ def test_step_history_creation():
|
||||
def test_step_history_with_steps():
|
||||
"""Test StepHistory with multiple steps"""
|
||||
step1 = StepExecutionResult(
|
||||
step=WorkflowStep.CLASSIFY,
|
||||
agent_name="classifier",
|
||||
step=WorkflowStep.CREATE_BRANCH,
|
||||
agent_name="BranchCreator",
|
||||
success=True,
|
||||
output="/feature",
|
||||
output="feat/add-feature",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
|
||||
step2 = StepExecutionResult(
|
||||
step=WorkflowStep.PLAN,
|
||||
agent_name="planner",
|
||||
step=WorkflowStep.PLANNING,
|
||||
agent_name="Planner",
|
||||
success=True,
|
||||
output="Plan created",
|
||||
output="PRPs/features/add-feature.md",
|
||||
duration_seconds=5.0,
|
||||
)
|
||||
|
||||
@@ -232,22 +227,22 @@ def test_step_history_with_steps():
|
||||
|
||||
assert history.agent_work_order_id == "wo-test123"
|
||||
assert len(history.steps) == 2
|
||||
assert history.steps[0].step == WorkflowStep.CLASSIFY
|
||||
assert history.steps[1].step == WorkflowStep.PLAN
|
||||
assert history.steps[0].step == WorkflowStep.CREATE_BRANCH
|
||||
assert history.steps[1].step == WorkflowStep.PLANNING
|
||||
|
||||
|
||||
def test_step_history_get_current_step_initial():
|
||||
"""Test get_current_step returns CLASSIFY when no steps"""
|
||||
"""Test get_current_step returns CREATE_BRANCH when no steps"""
|
||||
history = StepHistory(agent_work_order_id="wo-test123", steps=[])
|
||||
|
||||
assert history.get_current_step() == WorkflowStep.CLASSIFY
|
||||
assert history.get_current_step() == WorkflowStep.CREATE_BRANCH
|
||||
|
||||
|
||||
def test_step_history_get_current_step_retry_failed():
|
||||
"""Test get_current_step returns same step when failed"""
|
||||
failed_step = StepExecutionResult(
|
||||
step=WorkflowStep.PLAN,
|
||||
agent_name="planner",
|
||||
step=WorkflowStep.PLANNING,
|
||||
agent_name="Planner",
|
||||
success=False,
|
||||
error_message="Planning failed",
|
||||
duration_seconds=5.0,
|
||||
@@ -255,22 +250,22 @@ def test_step_history_get_current_step_retry_failed():
|
||||
|
||||
history = StepHistory(agent_work_order_id="wo-test123", steps=[failed_step])
|
||||
|
||||
assert history.get_current_step() == WorkflowStep.PLAN
|
||||
assert history.get_current_step() == WorkflowStep.PLANNING
|
||||
|
||||
|
||||
def test_step_history_get_current_step_next():
|
||||
"""Test get_current_step returns next step after success"""
|
||||
classify_step = StepExecutionResult(
|
||||
step=WorkflowStep.CLASSIFY,
|
||||
agent_name="classifier",
|
||||
branch_step = StepExecutionResult(
|
||||
step=WorkflowStep.CREATE_BRANCH,
|
||||
agent_name="BranchCreator",
|
||||
success=True,
|
||||
output="/feature",
|
||||
output="feat/add-feature",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
|
||||
history = StepHistory(agent_work_order_id="wo-test123", steps=[classify_step])
|
||||
history = StepHistory(agent_work_order_id="wo-test123", steps=[branch_step])
|
||||
|
||||
assert history.get_current_step() == WorkflowStep.PLAN
|
||||
assert history.get_current_step() == WorkflowStep.PLANNING
|
||||
|
||||
|
||||
def test_command_execution_result_with_result_text():
|
||||
|
||||
@@ -1,614 +0,0 @@
|
||||
"""Tests for Workflow Engine"""
|
||||
|
||||
import pytest
|
||||
from pathlib import Path
|
||||
from tempfile import TemporaryDirectory
|
||||
from unittest.mock import AsyncMock, MagicMock, patch
|
||||
|
||||
from src.agent_work_orders.models import (
|
||||
AgentWorkOrderStatus,
|
||||
AgentWorkflowPhase,
|
||||
AgentWorkflowType,
|
||||
SandboxType,
|
||||
WorkflowExecutionError,
|
||||
)
|
||||
from src.agent_work_orders.workflow_engine.workflow_phase_tracker import (
|
||||
WorkflowPhaseTracker,
|
||||
)
|
||||
from src.agent_work_orders.workflow_engine.workflow_orchestrator import (
|
||||
WorkflowOrchestrator,
|
||||
)
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_phase_tracker_planning_phase():
|
||||
"""Test detecting planning phase"""
|
||||
tracker = WorkflowPhaseTracker()
|
||||
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
with patch(
|
||||
"src.agent_work_orders.utils.git_operations.get_commit_count",
|
||||
return_value=0,
|
||||
):
|
||||
with patch(
|
||||
"src.agent_work_orders.utils.git_operations.has_planning_commits",
|
||||
return_value=False,
|
||||
):
|
||||
phase = await tracker.get_current_phase("feat-wo-test", tmpdir)
|
||||
|
||||
assert phase == AgentWorkflowPhase.PLANNING
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_phase_tracker_completed_phase():
|
||||
"""Test detecting completed phase"""
|
||||
tracker = WorkflowPhaseTracker()
|
||||
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
with patch(
|
||||
"src.agent_work_orders.utils.git_operations.get_commit_count",
|
||||
return_value=3,
|
||||
):
|
||||
with patch(
|
||||
"src.agent_work_orders.utils.git_operations.has_planning_commits",
|
||||
return_value=True,
|
||||
):
|
||||
phase = await tracker.get_current_phase("feat-wo-test", tmpdir)
|
||||
|
||||
assert phase == AgentWorkflowPhase.COMPLETED
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_phase_tracker_git_progress_snapshot():
|
||||
"""Test creating git progress snapshot"""
|
||||
tracker = WorkflowPhaseTracker()
|
||||
|
||||
with TemporaryDirectory() as tmpdir:
|
||||
with patch(
|
||||
"src.agent_work_orders.utils.git_operations.get_commit_count",
|
||||
return_value=5,
|
||||
):
|
||||
with patch(
|
||||
"src.agent_work_orders.utils.git_operations.get_files_changed",
|
||||
return_value=10,
|
||||
):
|
||||
with patch(
|
||||
"src.agent_work_orders.utils.git_operations.get_latest_commit_message",
|
||||
return_value="plan: Create implementation plan",
|
||||
):
|
||||
with patch(
|
||||
"src.agent_work_orders.utils.git_operations.has_planning_commits",
|
||||
return_value=True,
|
||||
):
|
||||
snapshot = await tracker.get_git_progress_snapshot(
|
||||
"wo-test123", "feat-wo-test", tmpdir
|
||||
)
|
||||
|
||||
assert snapshot.agent_work_order_id == "wo-test123"
|
||||
assert snapshot.current_phase == AgentWorkflowPhase.COMPLETED
|
||||
assert snapshot.git_commit_count == 5
|
||||
assert snapshot.git_files_changed == 10
|
||||
assert snapshot.latest_commit_message == "plan: Create implementation plan"
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_workflow_orchestrator_success():
|
||||
"""Test successful workflow execution with atomic operations"""
|
||||
from src.agent_work_orders.models import StepExecutionResult, WorkflowStep
|
||||
|
||||
# Create mocks for dependencies
|
||||
mock_agent_executor = MagicMock()
|
||||
mock_sandbox_factory = MagicMock()
|
||||
mock_sandbox = MagicMock()
|
||||
mock_sandbox.setup = AsyncMock()
|
||||
mock_sandbox.cleanup = AsyncMock()
|
||||
mock_sandbox.working_dir = "/tmp/sandbox"
|
||||
mock_sandbox_factory.create_sandbox = MagicMock(return_value=mock_sandbox)
|
||||
|
||||
mock_github_client = MagicMock()
|
||||
mock_phase_tracker = MagicMock()
|
||||
mock_command_loader = MagicMock()
|
||||
|
||||
mock_state_repository = MagicMock()
|
||||
mock_state_repository.update_status = AsyncMock()
|
||||
mock_state_repository.update_git_branch = AsyncMock()
|
||||
mock_state_repository.save_step_history = AsyncMock()
|
||||
|
||||
# Mock workflow operations to return successful results
|
||||
with patch("src.agent_work_orders.workflow_engine.workflow_orchestrator.workflow_operations") as mock_ops:
|
||||
mock_ops.classify_issue = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.CLASSIFY,
|
||||
agent_name="classifier",
|
||||
success=True,
|
||||
output="/feature",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
)
|
||||
mock_ops.build_plan = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.PLAN,
|
||||
agent_name="planner",
|
||||
success=True,
|
||||
output="Plan created",
|
||||
duration_seconds=5.0,
|
||||
)
|
||||
)
|
||||
mock_ops.find_plan_file = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.FIND_PLAN,
|
||||
agent_name="plan_finder",
|
||||
success=True,
|
||||
output="specs/issue-42-wo-test123-planner-feature.md",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
)
|
||||
mock_ops.generate_branch = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.GENERATE_BRANCH,
|
||||
agent_name="branch_generator",
|
||||
success=True,
|
||||
output="feat-issue-42-wo-test123",
|
||||
duration_seconds=2.0,
|
||||
)
|
||||
)
|
||||
mock_ops.implement_plan = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.IMPLEMENT,
|
||||
agent_name="implementor",
|
||||
success=True,
|
||||
output="Implementation completed",
|
||||
duration_seconds=10.0,
|
||||
)
|
||||
)
|
||||
mock_ops.create_commit = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.COMMIT,
|
||||
agent_name="committer",
|
||||
success=True,
|
||||
output="implementor: feat: add feature",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
)
|
||||
mock_ops.create_pull_request = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.CREATE_PR,
|
||||
agent_name="pr_creator",
|
||||
success=True,
|
||||
output="https://github.com/owner/repo/pull/42",
|
||||
duration_seconds=2.0,
|
||||
)
|
||||
)
|
||||
|
||||
orchestrator = WorkflowOrchestrator(
|
||||
agent_executor=mock_agent_executor,
|
||||
sandbox_factory=mock_sandbox_factory,
|
||||
github_client=mock_github_client,
|
||||
phase_tracker=mock_phase_tracker,
|
||||
command_loader=mock_command_loader,
|
||||
state_repository=mock_state_repository,
|
||||
)
|
||||
|
||||
# Execute workflow
|
||||
await orchestrator.execute_workflow(
|
||||
agent_work_order_id="wo-test123",
|
||||
workflow_type=AgentWorkflowType.PLAN,
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
user_request="Add new user authentication feature",
|
||||
github_issue_number="42",
|
||||
github_issue_json='{"title": "Add feature"}',
|
||||
)
|
||||
|
||||
# Verify all workflow operations were called
|
||||
mock_ops.classify_issue.assert_called_once()
|
||||
mock_ops.build_plan.assert_called_once()
|
||||
mock_ops.find_plan_file.assert_called_once()
|
||||
mock_ops.generate_branch.assert_called_once()
|
||||
mock_ops.implement_plan.assert_called_once()
|
||||
mock_ops.create_commit.assert_called_once()
|
||||
mock_ops.create_pull_request.assert_called_once()
|
||||
|
||||
# Verify sandbox operations
|
||||
mock_sandbox_factory.create_sandbox.assert_called_once()
|
||||
mock_sandbox.setup.assert_called_once()
|
||||
mock_sandbox.cleanup.assert_called_once()
|
||||
|
||||
# Verify state updates
|
||||
assert mock_state_repository.update_status.call_count >= 2
|
||||
mock_state_repository.update_git_branch.assert_called_once_with(
|
||||
"wo-test123", "feat-issue-42-wo-test123"
|
||||
)
|
||||
# Verify step history was saved incrementally (7 steps + 1 final save = 8 total)
|
||||
assert mock_state_repository.save_step_history.call_count == 8
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_workflow_orchestrator_agent_failure():
|
||||
"""Test workflow execution with step failure"""
|
||||
from src.agent_work_orders.models import StepExecutionResult, WorkflowStep
|
||||
|
||||
# Create mocks for dependencies
|
||||
mock_agent_executor = MagicMock()
|
||||
mock_sandbox_factory = MagicMock()
|
||||
mock_sandbox = MagicMock()
|
||||
mock_sandbox.setup = AsyncMock()
|
||||
mock_sandbox.cleanup = AsyncMock()
|
||||
mock_sandbox.working_dir = "/tmp/sandbox"
|
||||
mock_sandbox_factory.create_sandbox = MagicMock(return_value=mock_sandbox)
|
||||
|
||||
mock_github_client = MagicMock()
|
||||
mock_phase_tracker = MagicMock()
|
||||
mock_command_loader = MagicMock()
|
||||
|
||||
mock_state_repository = MagicMock()
|
||||
mock_state_repository.update_status = AsyncMock()
|
||||
mock_state_repository.save_step_history = AsyncMock()
|
||||
|
||||
# Mock workflow operations - classification fails
|
||||
with patch("src.agent_work_orders.workflow_engine.workflow_orchestrator.workflow_operations") as mock_ops:
|
||||
mock_ops.classify_issue = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.CLASSIFY,
|
||||
agent_name="classifier",
|
||||
success=False,
|
||||
error_message="Classification failed",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
)
|
||||
|
||||
orchestrator = WorkflowOrchestrator(
|
||||
agent_executor=mock_agent_executor,
|
||||
sandbox_factory=mock_sandbox_factory,
|
||||
github_client=mock_github_client,
|
||||
phase_tracker=mock_phase_tracker,
|
||||
command_loader=mock_command_loader,
|
||||
state_repository=mock_state_repository,
|
||||
)
|
||||
|
||||
# Execute workflow
|
||||
await orchestrator.execute_workflow(
|
||||
agent_work_order_id="wo-test123",
|
||||
workflow_type=AgentWorkflowType.PLAN,
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
user_request="Fix the critical bug in login system",
|
||||
github_issue_json='{"title": "Test"}',
|
||||
)
|
||||
|
||||
# Verify classification was attempted
|
||||
mock_ops.classify_issue.assert_called_once()
|
||||
|
||||
# Verify cleanup happened
|
||||
mock_sandbox.cleanup.assert_called_once()
|
||||
|
||||
# Verify step history was saved even on failure (incremental + error handler = 2 times)
|
||||
assert mock_state_repository.save_step_history.call_count == 2
|
||||
|
||||
# Check that status was updated to FAILED
|
||||
calls = [call for call in mock_state_repository.update_status.call_args_list]
|
||||
assert any(
|
||||
call[0][1] == AgentWorkOrderStatus.FAILED or call.kwargs.get("status") == AgentWorkOrderStatus.FAILED
|
||||
for call in calls
|
||||
)
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_workflow_orchestrator_pr_creation_failure():
|
||||
"""Test workflow execution with PR creation failure"""
|
||||
from src.agent_work_orders.models import StepExecutionResult, WorkflowStep
|
||||
|
||||
# Create mocks for dependencies
|
||||
mock_agent_executor = MagicMock()
|
||||
mock_sandbox_factory = MagicMock()
|
||||
mock_sandbox = MagicMock()
|
||||
mock_sandbox.setup = AsyncMock()
|
||||
mock_sandbox.cleanup = AsyncMock()
|
||||
mock_sandbox.working_dir = "/tmp/sandbox"
|
||||
mock_sandbox_factory.create_sandbox = MagicMock(return_value=mock_sandbox)
|
||||
|
||||
mock_github_client = MagicMock()
|
||||
mock_phase_tracker = MagicMock()
|
||||
mock_command_loader = MagicMock()
|
||||
|
||||
mock_state_repository = MagicMock()
|
||||
mock_state_repository.update_status = AsyncMock()
|
||||
mock_state_repository.update_git_branch = AsyncMock()
|
||||
mock_state_repository.save_step_history = AsyncMock()
|
||||
|
||||
# Mock workflow operations - all succeed except PR creation
|
||||
with patch("src.agent_work_orders.workflow_engine.workflow_orchestrator.workflow_operations") as mock_ops:
|
||||
mock_ops.classify_issue = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.CLASSIFY,
|
||||
agent_name="classifier",
|
||||
success=True,
|
||||
output="/feature",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
)
|
||||
mock_ops.build_plan = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.PLAN,
|
||||
agent_name="planner",
|
||||
success=True,
|
||||
output="Plan created",
|
||||
duration_seconds=5.0,
|
||||
)
|
||||
)
|
||||
mock_ops.find_plan_file = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.FIND_PLAN,
|
||||
agent_name="plan_finder",
|
||||
success=True,
|
||||
output="specs/plan.md",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
)
|
||||
mock_ops.generate_branch = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.GENERATE_BRANCH,
|
||||
agent_name="branch_generator",
|
||||
success=True,
|
||||
output="feat-issue-42",
|
||||
duration_seconds=2.0,
|
||||
)
|
||||
)
|
||||
mock_ops.implement_plan = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.IMPLEMENT,
|
||||
agent_name="implementor",
|
||||
success=True,
|
||||
output="Implementation completed",
|
||||
duration_seconds=10.0,
|
||||
)
|
||||
)
|
||||
mock_ops.create_commit = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.COMMIT,
|
||||
agent_name="committer",
|
||||
success=True,
|
||||
output="implementor: feat: add feature",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
)
|
||||
# PR creation fails
|
||||
mock_ops.create_pull_request = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.CREATE_PR,
|
||||
agent_name="pr_creator",
|
||||
success=False,
|
||||
error_message="GitHub API error",
|
||||
duration_seconds=2.0,
|
||||
)
|
||||
)
|
||||
|
||||
orchestrator = WorkflowOrchestrator(
|
||||
agent_executor=mock_agent_executor,
|
||||
sandbox_factory=mock_sandbox_factory,
|
||||
github_client=mock_github_client,
|
||||
phase_tracker=mock_phase_tracker,
|
||||
command_loader=mock_command_loader,
|
||||
state_repository=mock_state_repository,
|
||||
)
|
||||
|
||||
# Execute workflow
|
||||
await orchestrator.execute_workflow(
|
||||
agent_work_order_id="wo-test123",
|
||||
workflow_type=AgentWorkflowType.PLAN,
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
user_request="Implement feature from issue 42",
|
||||
github_issue_number="42",
|
||||
github_issue_json='{"title": "Add feature"}',
|
||||
)
|
||||
|
||||
# Verify PR creation was attempted
|
||||
mock_ops.create_pull_request.assert_called_once()
|
||||
|
||||
# Verify workflow still marked as completed (PR failure is not critical)
|
||||
calls = [call for call in mock_state_repository.update_status.call_args_list]
|
||||
assert any(
|
||||
call[0][1] == AgentWorkOrderStatus.COMPLETED or call.kwargs.get("status") == AgentWorkOrderStatus.COMPLETED
|
||||
for call in calls
|
||||
)
|
||||
|
||||
# Verify step history was saved incrementally (7 steps + 1 final save = 8 total)
|
||||
assert mock_state_repository.save_step_history.call_count == 8
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_orchestrator_saves_step_history_incrementally():
|
||||
"""Test that step history is saved after each step, not just at the end"""
|
||||
from src.agent_work_orders.models import (
|
||||
CommandExecutionResult,
|
||||
StepExecutionResult,
|
||||
WorkflowStep,
|
||||
)
|
||||
from src.agent_work_orders.workflow_engine.agent_names import CLASSIFIER
|
||||
|
||||
# Create mocks
|
||||
mock_executor = MagicMock()
|
||||
mock_sandbox_factory = MagicMock()
|
||||
mock_github_client = MagicMock()
|
||||
mock_phase_tracker = MagicMock()
|
||||
mock_command_loader = MagicMock()
|
||||
mock_state_repository = MagicMock()
|
||||
|
||||
# Track save_step_history calls
|
||||
save_calls = []
|
||||
async def track_save(wo_id, history):
|
||||
save_calls.append(len(history.steps))
|
||||
|
||||
mock_state_repository.save_step_history = AsyncMock(side_effect=track_save)
|
||||
mock_state_repository.update_status = AsyncMock()
|
||||
mock_state_repository.update_git_branch = AsyncMock()
|
||||
|
||||
# Mock sandbox
|
||||
mock_sandbox = MagicMock()
|
||||
mock_sandbox.working_dir = "/tmp/test"
|
||||
mock_sandbox.setup = AsyncMock()
|
||||
mock_sandbox.cleanup = AsyncMock()
|
||||
mock_sandbox_factory.create_sandbox = MagicMock(return_value=mock_sandbox)
|
||||
|
||||
# Mock GitHub client
|
||||
mock_github_client.get_issue = AsyncMock(return_value={
|
||||
"title": "Test Issue",
|
||||
"body": "Test body"
|
||||
})
|
||||
|
||||
# Create orchestrator
|
||||
orchestrator = WorkflowOrchestrator(
|
||||
agent_executor=mock_executor,
|
||||
sandbox_factory=mock_sandbox_factory,
|
||||
github_client=mock_github_client,
|
||||
phase_tracker=mock_phase_tracker,
|
||||
command_loader=mock_command_loader,
|
||||
state_repository=mock_state_repository,
|
||||
)
|
||||
|
||||
# Mock workflow operations to return success for all steps
|
||||
with patch("src.agent_work_orders.workflow_engine.workflow_orchestrator.workflow_operations") as mock_ops:
|
||||
# Mock successful results for each step
|
||||
mock_ops.classify_issue = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.CLASSIFY,
|
||||
agent_name=CLASSIFIER,
|
||||
success=True,
|
||||
output="/feature",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
)
|
||||
|
||||
mock_ops.build_plan = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.PLAN,
|
||||
agent_name="planner",
|
||||
success=True,
|
||||
output="Plan created",
|
||||
duration_seconds=2.0,
|
||||
)
|
||||
)
|
||||
|
||||
mock_ops.find_plan_file = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.FIND_PLAN,
|
||||
agent_name="plan_finder",
|
||||
success=True,
|
||||
output="specs/plan.md",
|
||||
duration_seconds=0.5,
|
||||
)
|
||||
)
|
||||
|
||||
mock_ops.generate_branch = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.GENERATE_BRANCH,
|
||||
agent_name="branch_generator",
|
||||
success=True,
|
||||
output="feat-issue-1-wo-test",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
)
|
||||
|
||||
mock_ops.implement_plan = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.IMPLEMENT,
|
||||
agent_name="implementor",
|
||||
success=True,
|
||||
output="Implementation complete",
|
||||
duration_seconds=5.0,
|
||||
)
|
||||
)
|
||||
|
||||
mock_ops.create_commit = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.COMMIT,
|
||||
agent_name="committer",
|
||||
success=True,
|
||||
output="Commit created",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
)
|
||||
|
||||
mock_ops.create_pull_request = AsyncMock(
|
||||
return_value=StepExecutionResult(
|
||||
step=WorkflowStep.CREATE_PR,
|
||||
agent_name="pr_creator",
|
||||
success=True,
|
||||
output="https://github.com/owner/repo/pull/1",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
)
|
||||
|
||||
# Execute workflow
|
||||
await orchestrator.execute_workflow(
|
||||
agent_work_order_id="wo-test",
|
||||
workflow_type=AgentWorkflowType.PLAN,
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
user_request="Test feature request",
|
||||
)
|
||||
|
||||
# Verify save_step_history was called after EACH step (7 times) + final save (8 total)
|
||||
# OR at minimum, verify it was called MORE than just once at the end
|
||||
assert len(save_calls) >= 7, f"Expected at least 7 incremental saves, got {len(save_calls)}"
|
||||
|
||||
# Verify the progression: 1 step, 2 steps, 3 steps, etc.
|
||||
assert save_calls[0] == 1, "First save should have 1 step"
|
||||
assert save_calls[1] == 2, "Second save should have 2 steps"
|
||||
assert save_calls[2] == 3, "Third save should have 3 steps"
|
||||
assert save_calls[3] == 4, "Fourth save should have 4 steps"
|
||||
assert save_calls[4] == 5, "Fifth save should have 5 steps"
|
||||
assert save_calls[5] == 6, "Sixth save should have 6 steps"
|
||||
assert save_calls[6] == 7, "Seventh save should have 7 steps"
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_step_history_visible_during_execution():
|
||||
"""Test that step history can be retrieved during workflow execution"""
|
||||
from src.agent_work_orders.models import StepHistory
|
||||
|
||||
# Create real state repository (in-memory)
|
||||
from src.agent_work_orders.state_manager.work_order_repository import WorkOrderRepository
|
||||
state_repo = WorkOrderRepository()
|
||||
|
||||
# Create empty step history
|
||||
step_history = StepHistory(agent_work_order_id="wo-test")
|
||||
|
||||
# Simulate incremental saves during workflow
|
||||
from src.agent_work_orders.models import StepExecutionResult, WorkflowStep
|
||||
|
||||
# Step 1: Classify
|
||||
step_history.steps.append(StepExecutionResult(
|
||||
step=WorkflowStep.CLASSIFY,
|
||||
agent_name="classifier",
|
||||
success=True,
|
||||
output="/feature",
|
||||
duration_seconds=1.0,
|
||||
))
|
||||
await state_repo.save_step_history("wo-test", step_history)
|
||||
|
||||
# Retrieve and verify
|
||||
retrieved = await state_repo.get_step_history("wo-test")
|
||||
assert retrieved is not None
|
||||
assert len(retrieved.steps) == 1
|
||||
assert retrieved.steps[0].step == WorkflowStep.CLASSIFY
|
||||
|
||||
# Step 2: Plan
|
||||
step_history.steps.append(StepExecutionResult(
|
||||
step=WorkflowStep.PLAN,
|
||||
agent_name="planner",
|
||||
success=True,
|
||||
output="Plan created",
|
||||
duration_seconds=2.0,
|
||||
))
|
||||
await state_repo.save_step_history("wo-test", step_history)
|
||||
|
||||
# Retrieve and verify progression
|
||||
retrieved = await state_repo.get_step_history("wo-test")
|
||||
assert len(retrieved.steps) == 2
|
||||
assert retrieved.steps[1].step == WorkflowStep.PLAN
|
||||
|
||||
# Verify both steps are present
|
||||
assert retrieved.steps[0].step == WorkflowStep.CLASSIFY
|
||||
assert retrieved.steps[1].step == WorkflowStep.PLAN
|
||||
@@ -1,4 +1,4 @@
|
||||
"""Tests for Workflow Operations"""
|
||||
"""Tests for Workflow Operations - Refactored Command Stitching Architecture"""
|
||||
|
||||
import pytest
|
||||
from unittest.mock import AsyncMock, MagicMock, patch
|
||||
@@ -9,398 +9,385 @@ from src.agent_work_orders.models import (
|
||||
)
|
||||
from src.agent_work_orders.workflow_engine import workflow_operations
|
||||
from src.agent_work_orders.workflow_engine.agent_names import (
|
||||
BRANCH_GENERATOR,
|
||||
CLASSIFIER,
|
||||
BRANCH_CREATOR,
|
||||
COMMITTER,
|
||||
IMPLEMENTOR,
|
||||
PLAN_FINDER,
|
||||
PLANNER,
|
||||
PR_CREATOR,
|
||||
REVIEWER,
|
||||
)
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_classify_issue_success():
|
||||
"""Test successful issue classification"""
|
||||
async def test_run_create_branch_step_success():
|
||||
"""Test successful branch creation"""
|
||||
mock_executor = MagicMock()
|
||||
mock_executor.build_command = MagicMock(return_value=("cli command", "prompt"))
|
||||
mock_executor.execute_async = AsyncMock(
|
||||
return_value=CommandExecutionResult(
|
||||
success=True,
|
||||
stdout="/feature",
|
||||
result_text="/feature",
|
||||
stderr=None,
|
||||
result_text="feat/add-feature",
|
||||
stdout="feat/add-feature",
|
||||
exit_code=0,
|
||||
session_id="session-123",
|
||||
)
|
||||
)
|
||||
|
||||
mock_loader = MagicMock()
|
||||
mock_loader.load_command = MagicMock(return_value="/path/to/classifier.md")
|
||||
mock_command_loader = MagicMock()
|
||||
mock_command_loader.load_command = MagicMock(return_value=MagicMock(file_path="create-branch.md"))
|
||||
|
||||
result = await workflow_operations.classify_issue(
|
||||
mock_executor,
|
||||
mock_loader,
|
||||
'{"title": "Add feature"}',
|
||||
"wo-test",
|
||||
"/tmp/working",
|
||||
context = {"user_request": "Add new feature"}
|
||||
|
||||
result = await workflow_operations.run_create_branch_step(
|
||||
executor=mock_executor,
|
||||
command_loader=mock_command_loader,
|
||||
work_order_id="wo-test",
|
||||
working_dir="/tmp/test",
|
||||
context=context,
|
||||
)
|
||||
|
||||
assert result.step == WorkflowStep.CLASSIFY
|
||||
assert result.agent_name == CLASSIFIER
|
||||
assert result.success is True
|
||||
assert result.output == "/feature"
|
||||
assert result.session_id == "session-123"
|
||||
mock_loader.load_command.assert_called_once_with("classifier")
|
||||
assert result.step == WorkflowStep.CREATE_BRANCH
|
||||
assert result.agent_name == BRANCH_CREATOR
|
||||
assert result.output == "feat/add-feature"
|
||||
mock_command_loader.load_command.assert_called_once_with("create-branch")
|
||||
mock_executor.build_command.assert_called_once()
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_classify_issue_failure():
|
||||
"""Test failed issue classification"""
|
||||
async def test_run_create_branch_step_failure():
|
||||
"""Test branch creation failure"""
|
||||
mock_executor = MagicMock()
|
||||
mock_executor.build_command = MagicMock(return_value=("cli command", "prompt"))
|
||||
mock_executor.execute_async = AsyncMock(
|
||||
return_value=CommandExecutionResult(
|
||||
success=False,
|
||||
stdout=None,
|
||||
stderr="Error",
|
||||
error_message="Branch creation failed",
|
||||
exit_code=1,
|
||||
error_message="Classification failed",
|
||||
)
|
||||
)
|
||||
|
||||
mock_loader = MagicMock()
|
||||
mock_loader.load_command = MagicMock(return_value="/path/to/classifier.md")
|
||||
mock_command_loader = MagicMock()
|
||||
mock_command_loader.load_command = MagicMock(return_value=MagicMock())
|
||||
|
||||
result = await workflow_operations.classify_issue(
|
||||
mock_executor,
|
||||
mock_loader,
|
||||
'{"title": "Add feature"}',
|
||||
"wo-test",
|
||||
"/tmp/working",
|
||||
context = {"user_request": "Add new feature"}
|
||||
|
||||
result = await workflow_operations.run_create_branch_step(
|
||||
executor=mock_executor,
|
||||
command_loader=mock_command_loader,
|
||||
work_order_id="wo-test",
|
||||
working_dir="/tmp/test",
|
||||
context=context,
|
||||
)
|
||||
|
||||
assert result.step == WorkflowStep.CLASSIFY
|
||||
assert result.agent_name == CLASSIFIER
|
||||
assert result.success is False
|
||||
assert result.error_message == "Classification failed"
|
||||
assert result.error_message == "Branch creation failed"
|
||||
assert result.step == WorkflowStep.CREATE_BRANCH
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_build_plan_feature_success():
|
||||
"""Test successful feature plan creation"""
|
||||
async def test_run_planning_step_success():
|
||||
"""Test successful planning step"""
|
||||
mock_executor = MagicMock()
|
||||
mock_executor.build_command = MagicMock(return_value=("cli command", "prompt"))
|
||||
mock_executor.execute_async = AsyncMock(
|
||||
return_value=CommandExecutionResult(
|
||||
success=True,
|
||||
stdout="Plan created successfully",
|
||||
result_text="Plan created successfully",
|
||||
stderr=None,
|
||||
result_text="PRPs/features/add-feature.md",
|
||||
exit_code=0,
|
||||
session_id="session-123",
|
||||
)
|
||||
)
|
||||
|
||||
mock_loader = MagicMock()
|
||||
mock_loader.load_command = MagicMock(return_value="/path/to/planner_feature.md")
|
||||
mock_command_loader = MagicMock()
|
||||
mock_command_loader.load_command = MagicMock(return_value=MagicMock())
|
||||
|
||||
result = await workflow_operations.build_plan(
|
||||
mock_executor,
|
||||
mock_loader,
|
||||
"/feature",
|
||||
"42",
|
||||
"wo-test",
|
||||
'{"title": "Add feature"}',
|
||||
"/tmp/working",
|
||||
context = {
|
||||
"user_request": "Add authentication",
|
||||
"github_issue_number": "123"
|
||||
}
|
||||
|
||||
result = await workflow_operations.run_planning_step(
|
||||
executor=mock_executor,
|
||||
command_loader=mock_command_loader,
|
||||
work_order_id="wo-test",
|
||||
working_dir="/tmp/test",
|
||||
context=context,
|
||||
)
|
||||
|
||||
assert result.step == WorkflowStep.PLAN
|
||||
assert result.success is True
|
||||
assert result.step == WorkflowStep.PLANNING
|
||||
assert result.agent_name == PLANNER
|
||||
assert result.success is True
|
||||
assert result.output == "Plan created successfully"
|
||||
mock_loader.load_command.assert_called_once_with("planner_feature")
|
||||
assert result.output == "PRPs/features/add-feature.md"
|
||||
mock_command_loader.load_command.assert_called_once_with("planning")
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_build_plan_bug_success():
|
||||
"""Test successful bug plan creation"""
|
||||
async def test_run_planning_step_with_none_issue_number():
|
||||
"""Test planning step handles None issue number"""
|
||||
mock_executor = MagicMock()
|
||||
mock_executor.build_command = MagicMock(return_value=("cli command", "prompt"))
|
||||
mock_executor.execute_async = AsyncMock(
|
||||
return_value=CommandExecutionResult(
|
||||
success=True,
|
||||
stdout="Bug plan created",
|
||||
result_text="Bug plan created",
|
||||
stderr=None,
|
||||
result_text="PRPs/features/add-feature.md",
|
||||
exit_code=0,
|
||||
)
|
||||
)
|
||||
|
||||
mock_loader = MagicMock()
|
||||
mock_loader.load_command = MagicMock(return_value="/path/to/planner_bug.md")
|
||||
mock_command_loader = MagicMock()
|
||||
mock_command_loader.load_command = MagicMock(return_value=MagicMock())
|
||||
|
||||
result = await workflow_operations.build_plan(
|
||||
mock_executor,
|
||||
mock_loader,
|
||||
"/bug",
|
||||
"42",
|
||||
"wo-test",
|
||||
'{"title": "Fix bug"}',
|
||||
"/tmp/working",
|
||||
context = {
|
||||
"user_request": "Add authentication",
|
||||
"github_issue_number": None # None should be converted to ""
|
||||
}
|
||||
|
||||
result = await workflow_operations.run_planning_step(
|
||||
executor=mock_executor,
|
||||
command_loader=mock_command_loader,
|
||||
work_order_id="wo-test",
|
||||
working_dir="/tmp/test",
|
||||
context=context,
|
||||
)
|
||||
|
||||
assert result.success is True
|
||||
mock_loader.load_command.assert_called_once_with("planner_bug")
|
||||
# Verify build_command was called with ["user_request", ""] not None
|
||||
args_used = mock_executor.build_command.call_args[1]["args"]
|
||||
assert args_used[1] == "" # github_issue_number should be empty string
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_build_plan_invalid_class():
|
||||
"""Test plan creation with invalid issue class"""
|
||||
mock_executor = MagicMock()
|
||||
mock_loader = MagicMock()
|
||||
|
||||
result = await workflow_operations.build_plan(
|
||||
mock_executor,
|
||||
mock_loader,
|
||||
"/invalid",
|
||||
"42",
|
||||
"wo-test",
|
||||
'{"title": "Test"}',
|
||||
"/tmp/working",
|
||||
)
|
||||
|
||||
assert result.step == WorkflowStep.PLAN
|
||||
assert result.success is False
|
||||
assert "Unknown issue class" in result.error_message
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_find_plan_file_success():
|
||||
"""Test successful plan file finding"""
|
||||
async def test_run_execute_step_success():
|
||||
"""Test successful execute step"""
|
||||
mock_executor = MagicMock()
|
||||
mock_executor.build_command = MagicMock(return_value=("cli command", "prompt"))
|
||||
mock_executor.execute_async = AsyncMock(
|
||||
return_value=CommandExecutionResult(
|
||||
success=True,
|
||||
stdout="specs/issue-42-wo-test-planner-feature.md",
|
||||
result_text="specs/issue-42-wo-test-planner-feature.md",
|
||||
stderr=None,
|
||||
exit_code=0,
|
||||
)
|
||||
)
|
||||
|
||||
mock_loader = MagicMock()
|
||||
mock_loader.load_command = MagicMock(return_value="/path/to/plan_finder.md")
|
||||
|
||||
result = await workflow_operations.find_plan_file(
|
||||
mock_executor,
|
||||
mock_loader,
|
||||
"42",
|
||||
"wo-test",
|
||||
"Previous output",
|
||||
"/tmp/working",
|
||||
)
|
||||
|
||||
assert result.step == WorkflowStep.FIND_PLAN
|
||||
assert result.agent_name == PLAN_FINDER
|
||||
assert result.success is True
|
||||
assert result.output == "specs/issue-42-wo-test-planner-feature.md"
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_find_plan_file_not_found():
|
||||
"""Test plan file not found"""
|
||||
mock_executor = MagicMock()
|
||||
mock_executor.build_command = MagicMock(return_value=("cli command", "prompt"))
|
||||
mock_executor.execute_async = AsyncMock(
|
||||
return_value=CommandExecutionResult(
|
||||
success=True,
|
||||
stdout="0",
|
||||
result_text="0",
|
||||
stderr=None,
|
||||
exit_code=0,
|
||||
)
|
||||
)
|
||||
|
||||
mock_loader = MagicMock()
|
||||
mock_loader.load_command = MagicMock(return_value="/path/to/plan_finder.md")
|
||||
|
||||
result = await workflow_operations.find_plan_file(
|
||||
mock_executor,
|
||||
mock_loader,
|
||||
"42",
|
||||
"wo-test",
|
||||
"Previous output",
|
||||
"/tmp/working",
|
||||
)
|
||||
|
||||
assert result.success is False
|
||||
assert result.error_message == "Plan file not found"
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_implement_plan_success():
|
||||
"""Test successful plan implementation"""
|
||||
mock_executor = MagicMock()
|
||||
mock_executor.build_command = MagicMock(return_value=("cli command", "prompt"))
|
||||
mock_executor.execute_async = AsyncMock(
|
||||
return_value=CommandExecutionResult(
|
||||
success=True,
|
||||
stdout="Implementation completed",
|
||||
result_text="Implementation completed",
|
||||
stderr=None,
|
||||
exit_code=0,
|
||||
session_id="session-123",
|
||||
)
|
||||
)
|
||||
|
||||
mock_loader = MagicMock()
|
||||
mock_loader.load_command = MagicMock(return_value="/path/to/implementor.md")
|
||||
mock_command_loader = MagicMock()
|
||||
mock_command_loader.load_command = MagicMock(return_value=MagicMock())
|
||||
|
||||
result = await workflow_operations.implement_plan(
|
||||
mock_executor,
|
||||
mock_loader,
|
||||
"specs/plan.md",
|
||||
"wo-test",
|
||||
"/tmp/working",
|
||||
context = {"planning": "PRPs/features/add-feature.md"}
|
||||
|
||||
result = await workflow_operations.run_execute_step(
|
||||
executor=mock_executor,
|
||||
command_loader=mock_command_loader,
|
||||
work_order_id="wo-test",
|
||||
working_dir="/tmp/test",
|
||||
context=context,
|
||||
)
|
||||
|
||||
assert result.step == WorkflowStep.IMPLEMENT
|
||||
assert result.success is True
|
||||
assert result.step == WorkflowStep.EXECUTE
|
||||
assert result.agent_name == IMPLEMENTOR
|
||||
assert result.success is True
|
||||
assert result.output == "Implementation completed"
|
||||
assert "completed" in result.output.lower()
|
||||
mock_command_loader.load_command.assert_called_once_with("execute")
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_generate_branch_success():
|
||||
"""Test successful branch generation"""
|
||||
async def test_run_execute_step_missing_plan_file():
|
||||
"""Test execute step fails when plan file missing from context"""
|
||||
mock_executor = MagicMock()
|
||||
mock_command_loader = MagicMock()
|
||||
|
||||
context = {} # No plan file
|
||||
|
||||
result = await workflow_operations.run_execute_step(
|
||||
executor=mock_executor,
|
||||
command_loader=mock_command_loader,
|
||||
work_order_id="wo-test",
|
||||
working_dir="/tmp/test",
|
||||
context=context,
|
||||
)
|
||||
|
||||
assert result.success is False
|
||||
assert "No plan file" in result.error_message
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_run_commit_step_success():
|
||||
"""Test successful commit step"""
|
||||
mock_executor = MagicMock()
|
||||
mock_executor.build_command = MagicMock(return_value=("cli command", "prompt"))
|
||||
mock_executor.execute_async = AsyncMock(
|
||||
return_value=CommandExecutionResult(
|
||||
success=True,
|
||||
stdout="feat-issue-42-wo-test-add-feature",
|
||||
result_text="feat-issue-42-wo-test-add-feature",
|
||||
stderr=None,
|
||||
result_text="Commit: abc123\nBranch: feat/add-feature\nPushed: Yes",
|
||||
exit_code=0,
|
||||
)
|
||||
)
|
||||
|
||||
mock_loader = MagicMock()
|
||||
mock_loader.load_command = MagicMock(return_value="/path/to/branch_generator.md")
|
||||
mock_command_loader = MagicMock()
|
||||
mock_command_loader.load_command = MagicMock(return_value=MagicMock())
|
||||
|
||||
result = await workflow_operations.generate_branch(
|
||||
mock_executor,
|
||||
mock_loader,
|
||||
"/feature",
|
||||
"42",
|
||||
"wo-test",
|
||||
'{"title": "Add feature"}',
|
||||
"/tmp/working",
|
||||
context = {}
|
||||
|
||||
result = await workflow_operations.run_commit_step(
|
||||
executor=mock_executor,
|
||||
command_loader=mock_command_loader,
|
||||
work_order_id="wo-test",
|
||||
working_dir="/tmp/test",
|
||||
context=context,
|
||||
)
|
||||
|
||||
assert result.step == WorkflowStep.GENERATE_BRANCH
|
||||
assert result.agent_name == BRANCH_GENERATOR
|
||||
assert result.success is True
|
||||
assert result.output == "feat-issue-42-wo-test-add-feature"
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_create_commit_success():
|
||||
"""Test successful commit creation"""
|
||||
mock_executor = MagicMock()
|
||||
mock_executor.build_command = MagicMock(return_value=("cli command", "prompt"))
|
||||
mock_executor.execute_async = AsyncMock(
|
||||
return_value=CommandExecutionResult(
|
||||
success=True,
|
||||
stdout="implementor: feat: add user authentication",
|
||||
result_text="implementor: feat: add user authentication",
|
||||
stderr=None,
|
||||
exit_code=0,
|
||||
)
|
||||
)
|
||||
|
||||
mock_loader = MagicMock()
|
||||
mock_loader.load_command = MagicMock(return_value="/path/to/committer.md")
|
||||
|
||||
result = await workflow_operations.create_commit(
|
||||
mock_executor,
|
||||
mock_loader,
|
||||
"implementor",
|
||||
"/feature",
|
||||
'{"title": "Add auth"}',
|
||||
"wo-test",
|
||||
"/tmp/working",
|
||||
)
|
||||
|
||||
assert result.step == WorkflowStep.COMMIT
|
||||
assert result.agent_name == COMMITTER
|
||||
assert result.success is True
|
||||
assert result.output == "implementor: feat: add user authentication"
|
||||
mock_command_loader.load_command.assert_called_once_with("commit")
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_create_pull_request_success():
|
||||
async def test_run_create_pr_step_success():
|
||||
"""Test successful PR creation"""
|
||||
mock_executor = MagicMock()
|
||||
mock_executor.build_command = MagicMock(return_value=("cli command", "prompt"))
|
||||
mock_executor.execute_async = AsyncMock(
|
||||
return_value=CommandExecutionResult(
|
||||
success=True,
|
||||
stdout="https://github.com/owner/repo/pull/123",
|
||||
result_text="https://github.com/owner/repo/pull/123",
|
||||
stderr=None,
|
||||
exit_code=0,
|
||||
)
|
||||
)
|
||||
|
||||
mock_loader = MagicMock()
|
||||
mock_loader.load_command = MagicMock(return_value="/path/to/pr_creator.md")
|
||||
mock_command_loader = MagicMock()
|
||||
mock_command_loader.load_command = MagicMock(return_value=MagicMock())
|
||||
|
||||
result = await workflow_operations.create_pull_request(
|
||||
mock_executor,
|
||||
mock_loader,
|
||||
"feat-issue-42",
|
||||
'{"title": "Add feature"}',
|
||||
"specs/plan.md",
|
||||
"wo-test",
|
||||
"/tmp/working",
|
||||
context = {
|
||||
"create-branch": "feat/add-feature",
|
||||
"planning": "PRPs/features/add-feature.md"
|
||||
}
|
||||
|
||||
result = await workflow_operations.run_create_pr_step(
|
||||
executor=mock_executor,
|
||||
command_loader=mock_command_loader,
|
||||
work_order_id="wo-test",
|
||||
working_dir="/tmp/test",
|
||||
context=context,
|
||||
)
|
||||
|
||||
assert result.success is True
|
||||
assert result.step == WorkflowStep.CREATE_PR
|
||||
assert result.agent_name == PR_CREATOR
|
||||
assert result.success is True
|
||||
assert result.output == "https://github.com/owner/repo/pull/123"
|
||||
assert "github.com" in result.output
|
||||
mock_command_loader.load_command.assert_called_once_with("create-pr")
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_create_pull_request_failure():
|
||||
"""Test failed PR creation"""
|
||||
async def test_run_create_pr_step_missing_branch():
|
||||
"""Test PR creation fails when branch name missing"""
|
||||
mock_executor = MagicMock()
|
||||
mock_command_loader = MagicMock()
|
||||
|
||||
context = {"planning": "PRPs/features/add-feature.md"} # No branch name
|
||||
|
||||
result = await workflow_operations.run_create_pr_step(
|
||||
executor=mock_executor,
|
||||
command_loader=mock_command_loader,
|
||||
work_order_id="wo-test",
|
||||
working_dir="/tmp/test",
|
||||
context=context,
|
||||
)
|
||||
|
||||
assert result.success is False
|
||||
assert "No branch name" in result.error_message
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_run_review_step_success():
|
||||
"""Test successful review step"""
|
||||
mock_executor = MagicMock()
|
||||
mock_executor.build_command = MagicMock(return_value=("cli command", "prompt"))
|
||||
mock_executor.execute_async = AsyncMock(
|
||||
return_value=CommandExecutionResult(
|
||||
success=False,
|
||||
stdout=None,
|
||||
stderr="PR creation failed",
|
||||
exit_code=1,
|
||||
error_message="GitHub API error",
|
||||
success=True,
|
||||
result_text='{"blockers": [], "tech_debt": []}',
|
||||
exit_code=0,
|
||||
)
|
||||
)
|
||||
|
||||
mock_loader = MagicMock()
|
||||
mock_loader.load_command = MagicMock(return_value="/path/to/pr_creator.md")
|
||||
mock_command_loader = MagicMock()
|
||||
mock_command_loader.load_command = MagicMock(return_value=MagicMock())
|
||||
|
||||
result = await workflow_operations.create_pull_request(
|
||||
mock_executor,
|
||||
mock_loader,
|
||||
"feat-issue-42",
|
||||
'{"title": "Add feature"}',
|
||||
"specs/plan.md",
|
||||
"wo-test",
|
||||
"/tmp/working",
|
||||
context = {"planning": "PRPs/features/add-feature.md"}
|
||||
|
||||
result = await workflow_operations.run_review_step(
|
||||
executor=mock_executor,
|
||||
command_loader=mock_command_loader,
|
||||
work_order_id="wo-test",
|
||||
working_dir="/tmp/test",
|
||||
context=context,
|
||||
)
|
||||
|
||||
assert result.success is True
|
||||
assert result.step == WorkflowStep.REVIEW
|
||||
assert result.agent_name == REVIEWER
|
||||
mock_command_loader.load_command.assert_called_once_with("prp-review")
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_run_review_step_missing_plan():
|
||||
"""Test review step fails when plan file missing"""
|
||||
mock_executor = MagicMock()
|
||||
mock_command_loader = MagicMock()
|
||||
|
||||
context = {} # No plan file
|
||||
|
||||
result = await workflow_operations.run_review_step(
|
||||
executor=mock_executor,
|
||||
command_loader=mock_command_loader,
|
||||
work_order_id="wo-test",
|
||||
working_dir="/tmp/test",
|
||||
context=context,
|
||||
)
|
||||
|
||||
assert result.success is False
|
||||
assert result.error_message == "GitHub API error"
|
||||
assert "No plan file" in result.error_message
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_context_passing_between_steps():
|
||||
"""Test that context is properly used across steps"""
|
||||
mock_executor = MagicMock()
|
||||
mock_executor.build_command = MagicMock(return_value=("cli command", "prompt"))
|
||||
mock_executor.execute_async = AsyncMock(
|
||||
return_value=CommandExecutionResult(
|
||||
success=True,
|
||||
result_text="output",
|
||||
exit_code=0,
|
||||
)
|
||||
)
|
||||
|
||||
mock_command_loader = MagicMock()
|
||||
mock_command_loader.load_command = MagicMock(return_value=MagicMock())
|
||||
|
||||
# Test context flow: create-branch -> planning
|
||||
context = {"user_request": "Test feature"}
|
||||
|
||||
# Step 1: Create branch
|
||||
branch_result = await workflow_operations.run_create_branch_step(
|
||||
executor=mock_executor,
|
||||
command_loader=mock_command_loader,
|
||||
work_order_id="wo-test",
|
||||
working_dir="/tmp/test",
|
||||
context=context,
|
||||
)
|
||||
|
||||
# Simulate orchestrator storing output
|
||||
context["create-branch"] = "feat/test-feature"
|
||||
|
||||
# Step 2: Planning should have access to branch name via context
|
||||
planning_result = await workflow_operations.run_planning_step(
|
||||
executor=mock_executor,
|
||||
command_loader=mock_command_loader,
|
||||
work_order_id="wo-test",
|
||||
working_dir="/tmp/test",
|
||||
context=context,
|
||||
)
|
||||
|
||||
assert branch_result.success is True
|
||||
assert planning_result.success is True
|
||||
assert "create-branch" in context
|
||||
|
||||
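The operations tests above pin down the command-stitching contract: every step helper takes the same keyword arguments (`executor`, `command_loader`, `work_order_id`, `working_dir`, `context`), and the orchestrator is expected to record each command's output in the shared context under the command's name before the next command runs. A minimal sketch of that loop, assuming illustrative names like `COMMAND_STEPS`, `DEFAULT_COMMANDS`, and `run_selected_commands` (the real orchestrator may be organised differently):

```
# Minimal sketch of the command-stitching loop the tests above assume.
# COMMAND_STEPS, DEFAULT_COMMANDS and run_selected_commands are illustrative
# names, not the real orchestrator API.
from src.agent_work_orders.models import WorkflowExecutionError
from src.agent_work_orders.workflow_engine import workflow_operations

COMMAND_STEPS = {
    "create-branch": workflow_operations.run_create_branch_step,
    "planning": workflow_operations.run_planning_step,
    "execute": workflow_operations.run_execute_step,
    "commit": workflow_operations.run_commit_step,
    "create-pr": workflow_operations.run_create_pr_step,
    "prp-review": workflow_operations.run_review_step,
}
DEFAULT_COMMANDS = ["create-branch", "planning", "execute", "commit", "create-pr"]


async def run_selected_commands(executor, command_loader, work_order_id,
                                working_dir, user_request, selected_commands=None):
    """Run the selected commands in order, passing outputs forward via context."""
    context = {"user_request": user_request}
    for command in selected_commands or DEFAULT_COMMANDS:
        step_fn = COMMAND_STEPS.get(command)
        if step_fn is None:
            raise WorkflowExecutionError(f"Unknown command: {command}")
        result = await step_fn(
            executor=executor,
            command_loader=command_loader,
            work_order_id=work_order_id,
            working_dir=working_dir,
            context=context,
        )
        if not result.success:
            raise WorkflowExecutionError(result.error_message or f"{command} failed")
        # Later commands read earlier outputs by command name, e.g. context["create-branch"].
        context[command] = result.output
    return context
```

With this shape, selecting only `["create-branch", "planning"]` runs two commands and leaves the branch name in `context["create-branch"]` for the planner, which is what `test_context_passing_between_steps` checks.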
375
python/tests/agent_work_orders/test_workflow_orchestrator.py
Normal file
@@ -0,0 +1,375 @@
|
||||
"""Tests for Workflow Orchestrator - Command Stitching Architecture"""
|
||||
|
||||
import pytest
|
||||
from unittest.mock import AsyncMock, MagicMock, patch
|
||||
|
||||
from src.agent_work_orders.models import (
|
||||
AgentWorkOrderStatus,
|
||||
SandboxType,
|
||||
StepExecutionResult,
|
||||
StepHistory,
|
||||
WorkflowExecutionError,
|
||||
WorkflowStep,
|
||||
)
|
||||
from src.agent_work_orders.workflow_engine.workflow_orchestrator import WorkflowOrchestrator
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def mock_dependencies():
|
||||
"""Create mocked dependencies for orchestrator"""
|
||||
mock_executor = MagicMock()
|
||||
mock_sandbox_factory = MagicMock()
|
||||
mock_github_client = MagicMock()
|
||||
mock_command_loader = MagicMock()
|
||||
mock_state_repository = MagicMock()
|
||||
|
||||
# Mock sandbox
|
||||
mock_sandbox = MagicMock()
|
||||
mock_sandbox.working_dir = "/tmp/test-sandbox"
|
||||
mock_sandbox.setup = AsyncMock()
|
||||
mock_sandbox.cleanup = AsyncMock()
|
||||
mock_sandbox_factory.create_sandbox.return_value = mock_sandbox
|
||||
|
||||
# Mock state repository
|
||||
mock_state_repository.update_status = AsyncMock()
|
||||
mock_state_repository.save_step_history = AsyncMock()
|
||||
mock_state_repository.update_git_branch = AsyncMock()
|
||||
|
||||
orchestrator = WorkflowOrchestrator(
|
||||
agent_executor=mock_executor,
|
||||
sandbox_factory=mock_sandbox_factory,
|
||||
github_client=mock_github_client,
|
||||
command_loader=mock_command_loader,
|
||||
state_repository=mock_state_repository,
|
||||
)
|
||||
|
||||
return orchestrator, {
|
||||
"executor": mock_executor,
|
||||
"sandbox_factory": mock_sandbox_factory,
|
||||
"github_client": mock_github_client,
|
||||
"command_loader": mock_command_loader,
|
||||
"state_repository": mock_state_repository,
|
||||
"sandbox": mock_sandbox,
|
||||
}
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_execute_workflow_default_commands(mock_dependencies):
|
||||
"""Test workflow with default command selection"""
|
||||
orchestrator, mocks = mock_dependencies
|
||||
|
||||
# Mock all command steps to succeed
|
||||
with patch("src.agent_work_orders.workflow_engine.workflow_operations.run_create_branch_step") as mock_branch, \
|
||||
patch("src.agent_work_orders.workflow_engine.workflow_operations.run_planning_step") as mock_plan, \
|
||||
patch("src.agent_work_orders.workflow_engine.workflow_operations.run_execute_step") as mock_execute, \
|
||||
patch("src.agent_work_orders.workflow_engine.workflow_operations.run_commit_step") as mock_commit, \
|
||||
patch("src.agent_work_orders.workflow_engine.workflow_operations.run_create_pr_step") as mock_pr:
|
||||
|
||||
# Set up mock returns
|
||||
mock_branch.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.CREATE_BRANCH,
|
||||
agent_name="BranchCreator",
|
||||
success=True,
|
||||
output="feat/test-feature",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
|
||||
mock_plan.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.PLANNING,
|
||||
agent_name="Planner",
|
||||
success=True,
|
||||
output="PRPs/features/test.md",
|
||||
duration_seconds=5.0,
|
||||
)
|
||||
|
||||
mock_execute.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.EXECUTE,
|
||||
agent_name="Implementor",
|
||||
success=True,
|
||||
output="Implementation completed",
|
||||
duration_seconds=30.0,
|
||||
)
|
||||
|
||||
mock_commit.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.COMMIT,
|
||||
agent_name="Committer",
|
||||
success=True,
|
||||
output="Commit: abc123",
|
||||
duration_seconds=2.0,
|
||||
)
|
||||
|
||||
mock_pr.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.CREATE_PR,
|
||||
agent_name="PrCreator",
|
||||
success=True,
|
||||
output="https://github.com/owner/repo/pull/1",
|
||||
duration_seconds=3.0,
|
||||
)
|
||||
|
||||
# Execute workflow with default commands (None = default)
|
||||
await orchestrator.execute_workflow(
|
||||
agent_work_order_id="wo-test",
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
user_request="Test feature",
|
||||
selected_commands=None, # Should use default
|
||||
)
|
||||
|
||||
# Verify all 5 default commands were executed
|
||||
assert mock_branch.called
|
||||
assert mock_plan.called
|
||||
assert mock_execute.called
|
||||
assert mock_commit.called
|
||||
assert mock_pr.called
|
||||
|
||||
# Verify status updates
|
||||
assert mocks["state_repository"].update_status.call_count >= 2
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_execute_workflow_custom_commands(mock_dependencies):
|
||||
"""Test workflow with custom command selection"""
|
||||
orchestrator, mocks = mock_dependencies
|
||||
|
||||
with patch("src.agent_work_orders.workflow_engine.workflow_operations.run_create_branch_step") as mock_branch, \
|
||||
patch("src.agent_work_orders.workflow_engine.workflow_operations.run_planning_step") as mock_plan:
|
||||
|
||||
mock_branch.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.CREATE_BRANCH,
|
||||
agent_name="BranchCreator",
|
||||
success=True,
|
||||
output="feat/test",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
|
||||
mock_plan.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.PLANNING,
|
||||
agent_name="Planner",
|
||||
success=True,
|
||||
output="PRPs/features/test.md",
|
||||
duration_seconds=5.0,
|
||||
)
|
||||
|
||||
# Execute with only 2 commands
|
||||
await orchestrator.execute_workflow(
|
||||
agent_work_order_id="wo-test",
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
user_request="Test feature",
|
||||
selected_commands=["create-branch", "planning"],
|
||||
)
|
||||
|
||||
# Verify only 2 commands were executed
|
||||
assert mock_branch.called
|
||||
assert mock_plan.called
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_execute_workflow_stop_on_failure(mock_dependencies):
|
||||
"""Test workflow stops on first failure"""
|
||||
orchestrator, mocks = mock_dependencies
|
||||
|
||||
with patch("src.agent_work_orders.workflow_engine.workflow_operations.run_create_branch_step") as mock_branch, \
|
||||
patch("src.agent_work_orders.workflow_engine.workflow_operations.run_planning_step") as mock_plan, \
|
||||
patch("src.agent_work_orders.workflow_engine.workflow_operations.run_execute_step") as mock_execute:
|
||||
|
||||
# First command succeeds
|
||||
mock_branch.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.CREATE_BRANCH,
|
||||
agent_name="BranchCreator",
|
||||
success=True,
|
||||
output="feat/test",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
|
||||
# Second command fails
|
||||
mock_plan.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.PLANNING,
|
||||
agent_name="Planner",
|
||||
success=False,
|
||||
error_message="Planning failed: timeout",
|
||||
duration_seconds=5.0,
|
||||
)
|
||||
|
||||
# Execute workflow - should stop at planning
|
||||
with pytest.raises(WorkflowExecutionError, match="Planning failed"):
|
||||
await orchestrator.execute_workflow(
|
||||
agent_work_order_id="wo-test",
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
user_request="Test feature",
|
||||
selected_commands=["create-branch", "planning", "execute"],
|
||||
)
|
||||
|
||||
# Verify only first 2 commands executed, not the third
|
||||
assert mock_branch.called
|
||||
assert mock_plan.called
|
||||
assert not mock_execute.called
|
||||
|
||||
# Verify failure status was set
|
||||
calls = [call for call in mocks["state_repository"].update_status.call_args_list
|
||||
if call[0][1] == AgentWorkOrderStatus.FAILED]
|
||||
assert len(calls) > 0
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_execute_workflow_context_passing(mock_dependencies):
|
||||
"""Test context is passed correctly between commands"""
|
||||
orchestrator, mocks = mock_dependencies
|
||||
|
||||
captured_contexts = []
|
||||
|
||||
async def capture_branch_context(executor, command_loader, work_order_id, working_dir, context):
|
||||
captured_contexts.append(("branch", dict(context)))
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.CREATE_BRANCH,
|
||||
agent_name="BranchCreator",
|
||||
success=True,
|
||||
output="feat/test",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
|
||||
async def capture_plan_context(executor, command_loader, work_order_id, working_dir, context):
|
||||
captured_contexts.append(("planning", dict(context)))
|
||||
return StepExecutionResult(
|
||||
step=WorkflowStep.PLANNING,
|
||||
agent_name="Planner",
|
||||
success=True,
|
||||
output="PRPs/features/test.md",
|
||||
duration_seconds=5.0,
|
||||
)
|
||||
|
||||
with patch("src.agent_work_orders.workflow_engine.workflow_operations.run_create_branch_step", side_effect=capture_branch_context), \
|
||||
patch("src.agent_work_orders.workflow_engine.workflow_operations.run_planning_step", side_effect=capture_plan_context):
|
||||
|
||||
await orchestrator.execute_workflow(
|
||||
agent_work_order_id="wo-test",
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
user_request="Test feature",
|
||||
selected_commands=["create-branch", "planning"],
|
||||
)
|
||||
|
||||
# Verify context was passed correctly
|
||||
assert len(captured_contexts) == 2
|
||||
|
||||
# First command should have initial context
|
||||
branch_context = captured_contexts[0][1]
|
||||
assert "user_request" in branch_context
|
||||
assert branch_context["user_request"] == "Test feature"
|
||||
|
||||
# Second command should have previous command's output
|
||||
planning_context = captured_contexts[1][1]
|
||||
assert "user_request" in planning_context
|
||||
assert "create-branch" in planning_context
|
||||
assert planning_context["create-branch"] == "feat/test"
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_execute_workflow_updates_git_branch(mock_dependencies):
|
||||
"""Test that git branch name is updated after create-branch"""
|
||||
orchestrator, mocks = mock_dependencies
|
||||
|
||||
with patch("src.agent_work_orders.workflow_engine.workflow_operations.run_create_branch_step") as mock_branch:
|
||||
|
||||
mock_branch.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.CREATE_BRANCH,
|
||||
agent_name="BranchCreator",
|
||||
success=True,
|
||||
output="feat/awesome-feature",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
|
||||
await orchestrator.execute_workflow(
|
||||
agent_work_order_id="wo-test",
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
user_request="Test feature",
|
||||
selected_commands=["create-branch"],
|
||||
)
|
||||
|
||||
# Verify git branch was updated
|
||||
mocks["state_repository"].update_git_branch.assert_called_once_with(
|
||||
"wo-test", "feat/awesome-feature"
|
||||
)
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_execute_workflow_updates_pr_url(mock_dependencies):
|
||||
"""Test that PR URL is saved after create-pr"""
|
||||
orchestrator, mocks = mock_dependencies
|
||||
|
||||
with patch("src.agent_work_orders.workflow_engine.workflow_operations.run_create_branch_step") as mock_branch, \
|
||||
patch("src.agent_work_orders.workflow_engine.workflow_operations.run_create_pr_step") as mock_pr:
|
||||
|
||||
mock_branch.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.CREATE_BRANCH,
|
||||
agent_name="BranchCreator",
|
||||
success=True,
|
||||
output="feat/test",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
|
||||
mock_pr.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.CREATE_PR,
|
||||
agent_name="PrCreator",
|
||||
success=True,
|
||||
output="https://github.com/owner/repo/pull/42",
|
||||
duration_seconds=3.0,
|
||||
)
|
||||
|
||||
await orchestrator.execute_workflow(
|
||||
agent_work_order_id="wo-test",
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
user_request="Test feature",
|
||||
selected_commands=["create-branch", "create-pr"],
|
||||
)
|
||||
|
||||
# Verify PR URL was saved with COMPLETED status
|
||||
status_calls = [call for call in mocks["state_repository"].update_status.call_args_list
|
||||
if call[0][1] == AgentWorkOrderStatus.COMPLETED]
|
||||
assert any("github_pull_request_url" in str(call) for call in status_calls)
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_execute_workflow_unknown_command(mock_dependencies):
|
||||
"""Test that unknown commands raise error"""
|
||||
orchestrator, mocks = mock_dependencies
|
||||
|
||||
with pytest.raises(WorkflowExecutionError, match="Unknown command"):
|
||||
await orchestrator.execute_workflow(
|
||||
agent_work_order_id="wo-test",
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
user_request="Test feature",
|
||||
selected_commands=["invalid-command"],
|
||||
)
|
||||
|
||||
|
||||
@pytest.mark.asyncio
|
||||
async def test_execute_workflow_sandbox_cleanup(mock_dependencies):
|
||||
"""Test that sandbox is cleaned up even on failure"""
|
||||
orchestrator, mocks = mock_dependencies
|
||||
|
||||
with patch("src.agent_work_orders.workflow_engine.workflow_operations.run_create_branch_step") as mock_branch:
|
||||
|
||||
mock_branch.return_value = StepExecutionResult(
|
||||
step=WorkflowStep.CREATE_BRANCH,
|
||||
agent_name="BranchCreator",
|
||||
success=False,
|
||||
error_message="Failed",
|
||||
duration_seconds=1.0,
|
||||
)
|
||||
|
||||
with pytest.raises(WorkflowExecutionError):
|
||||
await orchestrator.execute_workflow(
|
||||
agent_work_order_id="wo-test",
|
||||
repository_url="https://github.com/owner/repo",
|
||||
sandbox_type=SandboxType.GIT_BRANCH,
|
||||
user_request="Test feature",
|
||||
selected_commands=["create-branch"],
|
||||
)
|
||||
|
||||
# Verify sandbox cleanup was called
|
||||
assert mocks["sandbox"].cleanup.called
|
||||
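The orchestrator tests in the new file also fix the surrounding lifecycle: the sandbox is set up before any command runs, a failed step marks the work order FAILED and propagates a `WorkflowExecutionError`, and cleanup happens on both paths. A compressed sketch of that shape, assuming the collaborators the tests mock (call signatures beyond those asserted in the tests are assumptions):

```
# Rough sketch of the lifecycle the orchestrator tests verify.
# Collaborator call signatures beyond those asserted in the tests are assumed.
from src.agent_work_orders.models import AgentWorkOrderStatus, WorkflowExecutionError


async def execute_with_cleanup(deps, work_order_id, run_commands):
    """Run a work order, always cleaning up the sandbox and recording failures."""
    sandbox = deps.sandbox_factory.create_sandbox()
    await sandbox.setup()
    try:
        await run_commands(sandbox.working_dir)
        await deps.state_repository.update_status(
            work_order_id, AgentWorkOrderStatus.COMPLETED
        )
    except WorkflowExecutionError:
        # The failure tests expect a FAILED status before the error propagates.
        await deps.state_repository.update_status(
            work_order_id, AgentWorkOrderStatus.FAILED
        )
        raise
    finally:
        # Cleanup is asserted on both the success and the failure path.
        await sandbox.cleanup()
```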