feat: Standalone Docker container architecture with parallel agents #4
Merged
Major architectural change to enable parallel agent execution in separate Docker containers:

## Backend Changes
- registry.py: Changed from path to git_url storage, added Container model
- container_manager.py: Updated to use git_url instead of volume mounts
- agent.py: Fixed to use git_url with container manager
- projects.py: Added container control and edit mode endpoints
- New services: BeadsSyncManager, LocalProjectManager for host-side operations

## Template Changes
- initializer_prompt: Added `bd init --branch beads-sync` for parallel workflow
- coding_prompt: Added distributed lock feature claiming and random verification

## UI Changes
- New ContainerControl component: Slider (1-10 agents) + control buttons
- New ContainerList component: Shows running containers with status
- App.tsx: Integrated new components with hooks for container management
- Removed FolderBrowser (no longer needed with git URL approach)

## Removed Files
- server/routers/filesystem.py: Filesystem browsing no longer needed
- ui/src/components/FolderBrowser.tsx: Replaced by git URL input

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix TypeScript build errors:
  - Change project.path to project.local_path in DeleteProjectModal
  - Simplify ExistingRepoModal to git URL only flow (remove FolderBrowser)
  - Add missing paused/completed status configs in AgentControl
- Add thread safety:
  - Add threading.Lock() to beads_sync_manager.py global registry
  - Add threading.Lock() to local_project_manager.py global registry
- Convert blocking I/O to async:
  - Wrap all subprocess.run() calls with asyncio.to_thread()
  - Affects beads_sync_manager.py and local_project_manager.py
- Fix container stop loop bug:
  - Remove unnecessary loop in stop_all_containers endpoint
  - Call container manager once instead of per-container
- Fix git merge branch reference:
  - Store branch name in variable before checkout to main
  - Prevents merge from referencing wrong branch
- Add logging to silent failures:
  - Log JSON parse errors in issues.jsonl reading
  - Log agent config read failures
  - Change return value on read error to True (safer default)
- Fix type mismatches:
  - Add 'paused' and 'completed' to AgentStatus type
  - Update WizardStep literal from "name"/"folder" to "mode"/"details"
- Add task_id validation:
  - Add validate_task_id() function with regex check
  - Validate task_id in update_task and delete_task endpoints

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
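Two of the fixes in this commit (task_id validation and moving blocking subprocess calls off the event loop) can be illustrated with a minimal sketch. The regex below is an assumption: the commit only says "regex check" and does not state the pattern. `run_command` is a hypothetical helper name, not a function from the repository.

```python
import asyncio
import re
import subprocess
import sys

# Hypothetical pattern: the commit message does not say which regex is used.
TASK_ID_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]{0,63}")


def validate_task_id(task_id: str) -> bool:
    """Return True only if the whole task_id matches the allowed pattern."""
    return bool(TASK_ID_PATTERN.fullmatch(task_id))


async def run_command(args: list[str]) -> subprocess.CompletedProcess:
    """Run a blocking subprocess.run() call in a worker thread.

    asyncio.to_thread keeps the event loop responsive while the
    child process runs.
    """
    return await asyncio.to_thread(
        subprocess.run, args, capture_output=True, text=True, check=False
    )


if __name__ == "__main__":
    print(validate_task_id("fix-o4c"))
    result = asyncio.run(run_command([sys.executable, "-c", "print('ok')"]))
    print(result.stdout.strip())
```

Validating before the ID reaches any shell or CLI invocation keeps malformed input from being passed to external commands.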
Critical fixes:
- Fix distributed lock bash bugs in coding_prompt.template.md:
  - Capture exit code immediately after command (claim_status=$?)
  - Send error messages to stderr (>&2) not stdout
  - Remove duplicate feature claiming in Steps 2.5 and 3
  - Add merge conflict detection
- Fix false success returns in git operations:
  - beads_sync_manager.pull_latest() now returns False on failure
  - local_project_manager.pull_latest() checks checkout result
  - local_project_manager.push_changes() checks add/commit results
- Implement multi-container registry pattern:
  - ContainerManager now accepts container_number parameter
  - Container naming: zerocoder-{project}-{number}
  - Registry: dict[str, dict[int, ContainerManager]]
  - Added get_all_container_managers() helper
  - Updated all callers in agent.py, websocket.py, features.py
- Add beads-sync branch management methods:
  - ensure_beads_sync_branch(): create if not exists
  - pre_agent_sync(): sync before agent starts
  - post_agent_cleanup(): cleanup feature branches
  - recover_stuck_features(): reset in_progress → open
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
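The multi-container registry pattern named in the commit above can be sketched roughly as follows. Only the registry shape (`dict[str, dict[int, ContainerManager]]`), the naming scheme, and the helper name `get_all_container_managers` come from the commit message. The class body, the `get_container_manager` accessor, and all signatures are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ContainerManager:
    """Stand-in for the real manager; only the naming logic is modeled."""

    project_name: str
    container_number: int = 1

    @property
    def container_name(self) -> str:
        # Naming scheme from the commit: zerocoder-{project}-{number}
        return f"zerocoder-{self.project_name}-{self.container_number}"


# project name -> container number -> manager
_registry: dict[str, dict[int, ContainerManager]] = {}


def get_container_manager(project_name: str, container_number: int = 1) -> ContainerManager:
    """Return the manager for this slot, creating it on first use (hypothetical accessor)."""
    managers = _registry.setdefault(project_name, {})
    if container_number not in managers:
        managers[container_number] = ContainerManager(project_name, container_number)
    return managers[container_number]


def get_all_container_managers(project_name: str) -> list[ContainerManager]:
    """Return all managers for a project, ordered by container number."""
    return [manager for _, manager in sorted(_registry.get(project_name, {}).items())]
```

Keying the inner dict by container number lets callers address one container or iterate over all of them without parsing container names.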
Fixes remaining issues from comprehensive code review:

P2 Issues Fixed:
- fix-o4c: Per-container log filtering in AgentLogViewer with container tabs, badges, and proper WebSocket callback cleanup
- fix-61b: Edit mode conflict resolution with git stash/pop flow
- fix-dxl: Container registry restoration on server restart

P3 Issues Fixed:
- fix-4up: UI accessibility (ARIA attributes on slider, aria-pressed on buttons, error handling for async callbacks, removed unused hook)
- fix-85d: SQLAlchemy CheckConstraints for type enforcement at DB level (container_type, status, target_container_count, feature status)

Key changes:
- WebSocket now properly tracks callbacks per container for cleanup
- AgentLogViewer shows container filter tabs when multiple containers
- LocalProjectManager handles stash/pop for uncommitted changes
- registry.py enforces data constraints at database level

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
_get_registry_functions() returns 11 values but several call sites only unpacked 10. Updated all call sites to correctly unpack all 11 values (added missing list_project_containers to unpacking). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Phase 1/2 container startup:
- New /start-all endpoint runs init container FIRST
- Init container waits for repo clone before pre_agent_sync
- Coding containers spawned AFTER init completes

Key changes:
- agent.py: Add start_all_containers endpoint with 2-phase startup
- agent.py: Update stop/graceful-stop to affect ALL containers
- container_manager.py: Add _is_init_container flag and special handling
- container_manager.py: Wait for repo clone before git operations
- feature_poller.py: Fix iteration over nested dict structure
- api.ts: Route startAgent to /start-all endpoint

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The existing project recovery path in start_all_containers() was calling pre_agent_sync() and recover_stuck_features() without waiting for the container to clone the repo first. This caused git operations to fail. Added wait-for-clone check (test -d /project/.git) before running recovery operations in the else branch. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix /project directory permissions for coder user in Dockerfile
- Set up SSH key for both root and coder users in entrypoint
- Run git clone as coder user who has SSH key configured
- Fix tilde expansion in start-app.sh for env var loading
- Pull latest changes to local clone before checking project state
- Add SSH key mount logging

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Container Selector Fix:
- Send registered containers list via WebSocket on connect
- UI shows all container tabs immediately instead of deriving from logs
- Added ContainerInfo type and containers state to useWebSocket hook

Beads Sync Integration:
- Replace feature_poller with beads_sync_manager for reading feature state
- Initialize beads-sync clones for all registered projects on startup
- Poll beads-sync every 15s only for projects with active containers
- Add get_cached_stats() and get_cached_features() compatibility functions
- Fix container naming for multi-container architecture (init + coding)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This file has been fully replaced by beads_sync_manager.py which reads task state directly from local beads-sync branch clone instead of querying containers via docker exec. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Beads issues store content in 'description' field, not 'body'. This fixes task descriptions not loading in the UI. Also fixes import of list_valid_projects (was list_all_projects). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add staggered 10s delay between container starts to prevent race condition where all containers claim the same feature
- Make start() non-blocking by spawning agent in background task
- Add registry calls for container lifecycle tracking (create, start, stop)
- Add dynamic polling interval (5s active, 15s idle) for faster UI updates

Fixes issues found during parallel container testing:
1. Race condition in feature claiming (staggered startup)
2. False failure reports (non-blocking start)
3. Containers not tracked in registry
4. Slow stats updates when containers running

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
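The staggered, non-blocking startup described in this commit could look roughly like the sketch below. `start_staggered` and its parameters are hypothetical names. The real start() method in container_manager.py is not shown in this PR conversation, so this only models the timing pattern.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def start_staggered(
    start_fns: Sequence[Callable[[], Awaitable[object]]],
    delay_seconds: float = 10.0,
) -> list[asyncio.Task]:
    """Launch each start coroutine as a background task, pausing between launches.

    The pause gives an earlier agent time to claim a feature before the
    next agent looks for work. Tasks are returned without being awaited,
    so the caller is not blocked on agent completion.
    """
    tasks: list[asyncio.Task] = []
    for index, start_fn in enumerate(start_fns):
        if index > 0:
            await asyncio.sleep(delay_seconds)
        tasks.append(asyncio.create_task(start_fn()))
    return tasks
```

A caller that needs the results can `await asyncio.gather(*tasks)` later. An endpoint that only needs to acknowledge the request can return as soon as the tasks are created.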
- Fix registry tracking: use self.project_name, self.container_number, self.container_type instead of underscore-prefixed versions
- Fix View Logs button: pass container.container_number instead of container.id to filter logs correctly

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
WebSocket was only registering log callbacks at connection time. If containers were created after the UI loaded, their logs wouldn't be sent to the UI. Add background task that periodically checks for new container managers and registers callbacks dynamically. This ensures logs appear for containers created while the WebSocket is connected. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Issue: fix-6os

1. coding_prompt.template.md:
   - Add safe_bd_json() helper for clean JSON extraction with stderr suppression
   - Add safe_bd_sync() helper for quiet sync operations
   - Update all bd command usages to use safe helpers
   - Add KEY RULE #9 about suppressing bd stderr
2. beads_commands.py:
   - Update parse_json_output() to return (data, error) tuple
   - Include stderr in error messages for debugging
   - Update callers to handle the error tuple
3. prompts.py:
   - Add CLAUDE.md refresh to refresh_project_prompts()
   - Smart merge: preserves user content above BEADS WORKFLOW section
   - Replaces or appends beads workflow section from template
4. opencode_config/agent/*.md:
   - Add CLAUDE.md reference to coder, overseer, and hound agents
   - OpenCode agents now get beads workflow instructions like Claude agents

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
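The `(data, error)` return shape described for parse_json_output() can be sketched as below. The function name and the tuple convention come from the commit message. The parameters and the exact error wording are assumptions.

```python
import json
from typing import Any, Optional


def parse_json_output(stdout: str, stderr: str = "") -> tuple[Optional[Any], Optional[str]]:
    """Parse CLI output as JSON and return (data, error).

    Exactly one element of the tuple is None: on success the error is
    None, on failure the data is None and the error includes any stderr
    text to help with debugging.
    """
    try:
        return json.loads(stdout), None
    except json.JSONDecodeError as exc:
        message = f"invalid JSON: {exc}"
        if stderr.strip():
            message += f" (stderr: {stderr.strip()})"
        return None, message
```

Callers then unpack both values and branch on `error is not None` instead of treating an empty result as success.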
The verbose output from bd sync goes to stdout, not stderr. Updated safe_bd_sync() to suppress stdout instead of stderr. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Updated Step 2 to ensure dependencies are installed before running servers. Detects package manager (pnpm, yarn, npm). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
After refresh_project_prompts() copies templates to the project dir, commit and push them to git so containers get the latest when they clone/pull the repo. This ensures the safe_bd_json and safe_bd_sync fixes reach agents. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add git safe.directory config to Dockerfile for both root and coder users
- Add bd sync after clone/pull in entrypoint
- Filter bd ready to only open status (not in_progress) to prevent race conditions
- Use shuf to randomize feature selection across parallel agents

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The _stream_logs() method already broadcasts container output via docker logs -f. The send_instruction() method was also broadcasting, causing duplicate messages in UI. Removed the broadcast call from send_instruction() while keeping stdout consumption for activity updates. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Agents were pushing feature branches but not cleaning them up. Added step 5 to delete local and remote feature branches after merge. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
bd show --json returns an array [{...}], not an object.
Fixed all jq commands to access .[0].title instead of .title
to properly extract feature titles for branch names.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
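The same array-versus-object distinction applies to any consumer of that output, not only jq. A minimal Python sketch, assuming the output has the `[{...}]` shape stated above (`extract_title` is a hypothetical helper name):

```python
import json


def extract_title(bd_show_output: str) -> str:
    """Return the title of the first issue in a JSON array of issues."""
    issues = json.loads(bd_show_output)
    if not isinstance(issues, list) or not issues:
        raise ValueError("expected a non-empty JSON array")
    # Equivalent of the jq path .[0].title
    return issues[0]["title"]
```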
- Initializer creates scripts/safe_bd_json.sh and scripts/safe_bd_sync.sh
- Coding prompt now uses these scripts instead of inline function definitions
- Scripts are committed by initializer if they don't exist
- Coding prompt validates scripts exist before proceeding

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously all containers shared a single .agent_started marker file,
causing race conditions where one container completing would remove
the marker for all containers, breaking health monitor auto-recovery.
Now each container gets its own marker: .agent_started.{container_number}
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
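A per-container marker scheme like the one described can be sketched as follows. Only the file name pattern `.agent_started.{container_number}` comes from the commit message. The helper names and the choice of directory are assumptions.

```python
from pathlib import Path


def marker_path(project_dir: Path, container_number: int) -> Path:
    """Path of the marker file owned by a single container."""
    return project_dir / f".agent_started.{container_number}"


def set_marker(project_dir: Path, container_number: int) -> None:
    marker_path(project_dir, container_number).touch()


def clear_marker(project_dir: Path, container_number: int) -> None:
    # Removing one container's marker leaves the others untouched.
    marker_path(project_dir, container_number).unlink(missing_ok=True)


def has_marker(project_dir: Path, container_number: int) -> bool:
    return marker_path(project_dir, container_number).exists()
```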
Server restarts no longer kill active agent work. User-started containers with open features are preserved and will be restored when the server restarts. The health monitor will restart their agents automatically. This fixes the issue where restarting the server (e.g., to apply code changes) would kill all running agents. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The _user_started flag was only set at manager creation time, causing the health monitor to ignore containers whose marker files were created after the manager was initialized. Now _sync_status() refreshes the flag from the marker file, allowing recovery to work even when markers are created externally (e.g., during server restart recovery). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Added is_agent_stuck() method and health monitor check for agents that are running but haven't produced any output for AGENT_STUCK_TIMEOUT_MINUTES (default 10 minutes).

This catches scenarios like:
- OpenCode API hung/not responding
- Network timeouts
- Agent blocked waiting on external service

The health monitor will now restart these stuck agents automatically.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
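A stuck-agent check of this kind reduces to comparing the time of the last output against a timeout. The sketch below is written as a free function with explicit arguments so it can be tested without a container. In the PR it is a method, and its real signature is not shown, so the parameters here are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

AGENT_STUCK_TIMEOUT_MINUTES = 10  # default stated in the commit message


def is_agent_stuck(
    last_output_at: Optional[datetime],
    is_running: bool,
    now: Optional[datetime] = None,
    timeout_minutes: int = AGENT_STUCK_TIMEOUT_MINUTES,
) -> bool:
    """Return True if a running agent has been silent longer than the timeout."""
    if not is_running or last_output_at is None:
        # A stopped agent, or one with no recorded output yet, is not "stuck".
        return False
    current = now if now is not None else datetime.now()
    return current - last_output_at > timedelta(minutes=timeout_minutes)
```

Treating "no output recorded yet" as not stuck is a design choice in this sketch. A real monitor might instead measure from the agent's start time.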
- Move feature ID into the colored category badge
- Show category as plain text beside the badge
- Change priority display from #N to PN format

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…s_cache The column was defined in the model but missing from the migration function, causing errors when querying the table. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Without this, all containers defaulted to container 1, causing all feature claims to update the same DB record. Now each container correctly identifies itself for proper feature tracking. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The ContainerStatus and AgentStatus schemas were missing 'reviewer' in their agent_type Literal, causing validation errors when the containers endpoint returned reviewer agent data. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…_feature Use container's current_feature field from registry as primary indicator for in-progress status instead of relying solely on beads status. This is more reliable since current_feature is set on claim and cleared on close. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The stop() method was resetting _user_started=False on every call, including during restart_agent(), restart_with_reviewer(), and restart_with_overseer(). This caused containers to stop auto-restarting after the first restart cycle (e.g., reviewer → coder transition). Added preserve_user_started parameter to stop() method. Restart methods now pass preserve_user_started=True to maintain auto-restart capability. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
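The effect of the `preserve_user_started` parameter can be shown with a reduced model of the manager. The class below is a stand-in that keeps only the flag and status. The parameter name and the restart behavior follow the commit message, and everything else is assumed.

```python
class ContainerManagerSketch:
    """Reduced model: only the auto-restart flag and status are tracked."""

    def __init__(self) -> None:
        self._user_started = False
        self.status = "stopped"

    def start(self) -> None:
        self._user_started = True
        self.status = "running"

    def stop(self, preserve_user_started: bool = False) -> None:
        self.status = "stopped"
        if not preserve_user_started:
            # An explicit user stop disables auto-restart.
            self._user_started = False

    def restart_agent(self) -> None:
        # Internal restarts keep the flag so the health monitor
        # continues to treat the container as user-started.
        self.stop(preserve_user_started=True)
        self.status = "running"
```

Defaulting the parameter to False keeps existing stop() callers unchanged, and only the restart paths opt in.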
Replace direct issues.jsonl file reads with BeadsManager API calls for consistency. Host-side code now uses get_cached_stats() and get_cached_features() which run live bd commands.

Changes:
- progress.py: Remove JSONL/cache fallbacks from 4 functions
- features.py: Remove read_local_beads_features() and fallbacks
- container_manager.py: Remove _has_open_features_direct() method

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add apiFailureLogged flag to only log first API failure
- Check graceful stop API every 10 seconds instead of every 1 second
- Prevents "[WARN] Failed to check graceful stop via API" spam when API unreachable

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create SVG favicon with stylized Z and circuit patterns
- Generate PNG versions (16x16, 32x32) for browser compatibility
- Add apple-touch-icon (180x180) for iOS devices
- Remove default Vite favicon

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add banner image to welcome screen when no project selected
- Add logo icon to header next to title
- Optimize banner image for web (5.5MB → 228KB)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add requests module to Dockerfile.project pip install (was causing agent_app.py to fail with ModuleNotFoundError)
- Fix session endpoint logic: init containers should only run when NO features exist yet (new project), not based on open features
- Remove fixed [FEATURE_COUNT] placeholder from initializer prompt, replace with flexible guidance to cover the full spec
- Add memory limits (4GB) to containers to prevent OOM crashes
- Add git lock cleanup on container restart
- Add timeout settings to opencode MCP config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add minimax-m2-1 to UI model selector with OpenCode badge
- Add backend validation for minimax-m2-1 model
- Update container_manager to detect MiniMax as OpenCode model
- Pass MINIMAX_API_KEY to containers
- Configure minimax-coding-plan provider in OpenCode config
- Add MiniMax MCP server configuration (uvx minimax-coding-plan-mcp)
- Add model routing in opencode_agent_app.ts to dynamically switch between GLM-4.7 and MiniMax-M2.1 based on .agent_config.json
- Install uv in Dockerfile for uvx command support
- Set thinking budget_tokens: 32000 for MiniMax model

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Logs were being truncated too aggressively in multiple places:
- Thinking blocks: 150 → 500 chars
- Text output: 200 → 500 chars
- Tool results: 500 → 2000 chars
- UI log buffer: 500 → 1000 entries

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add Context7 MCP server to both Claude and OpenCode agents for up-to-date library documentation access. Also document GIT_SSH_KEY_PATH environment variable for custom SSH key paths in Docker builds.

- Add Context7 to opencode_config/config.json for GLM/MiniMax agents
- Add `claude mcp add context7` to Dockerfile.project for Claude agents
- Add usage instructions to coding_prompt.template.md
- Document GIT_SSH_KEY_PATH in CLAUDE.md and .env.example

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The bd CLI expects `comments add <id> "msg"` not `comments <id> --add "msg"`. Updated all references across server, client script, templates, and tests. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests were using stale function signatures and invalid status values:
- register_project() no longer takes a path parameter
- update_container_status/get_container need container_type argument
- "completed" is not a valid container status (use "stopped")
- _notify_status_change dispatches async tasks that need event loop time
- count_passing_tests requires DB cache, not file-based reading
- Agent router mock targets updated to actual function names

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Increase Docker container memory limit from 4GB to 64GB (host has 128GB)
- Enable MCP servers for all agent types (coder, reviewer, overseer)
- Pass agentType to updateOpencodeConfig for future per-type configuration
- Fix WebSocket stale state when switching projects (reset all state, ignore stale messages via currentProjectRef)
- Add container_type to update_container_status calls for proper tracking
- Remove unused MiniMax thinking options from config.json

The OOM crashes were caused by insufficient container memory (4GB) when MCP servers spawn alongside the agent. Rather than disabling MCP, we allocate 64GB per container since the host has 128GB available.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ispatch

Two race conditions caused containers to restart unexpectedly after graceful stop:

1. _handle_agent_exit() could fire after stop() already ran, seeing _graceful_stop_requested=False (cleared by stop) and exit code 137 (from SIGKILL), then restarting. Fixed by returning early if container status is already "stopped" or "not_created".
2. A second start() call during SDK readiness wait would dispatch a duplicate agent (status already "running"). Fixed with an _agent_dispatched flag that prevents double-dispatch and resets on stop.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
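Both guards described in this commit can be modeled in a few lines. The sketch below is a simplified, synchronous stand-in: the flag name `_agent_dispatched` and the status values come from the commit message, while the class name, the counters, and the return values are hypothetical and exist only to make the behavior observable.

```python
class AgentLifecycleSketch:
    """Simplified model of the two guards; not the real ContainerManager."""

    def __init__(self) -> None:
        self.status = "not_created"
        self._agent_dispatched = False
        self.dispatch_count = 0
        self.restart_count = 0

    def start(self) -> bool:
        """Dispatch an agent unless one was already dispatched."""
        self.status = "running"
        if self._agent_dispatched:
            return False  # second start() during readiness wait: no duplicate
        self._agent_dispatched = True
        self.dispatch_count += 1
        return True

    def stop(self) -> None:
        self.status = "stopped"
        self._agent_dispatched = False  # reset so a later start() can dispatch

    def handle_agent_exit(self, exit_code: int) -> bool:
        """Return True if the agent should be restarted."""
        if self.status in ("stopped", "not_created"):
            # stop() already ran; a late exit event (e.g. code 137) is ignored.
            return False
        self.restart_count += 1
        return True
```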
…ceful stop The mutation hooks in useContainers.ts used 'agentStatus' as the query key but the actual query uses 'agent-status', so cache invalidation after start/stop never triggered an immediate refetch. Also adds optimistic update to useGracefulStopAgent for instant UI feedback. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Prevents stale button state when switching between projects by resetting agent-status and containers queries, and forcing ContainerControl remount to clear local isStarting/isStopping state. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add list_issues, update_issue, close_issue, reopen_issue, delete_issue, and add_dependency tools to the issue MCP server. Rename server from "issue-creator" to "issue-manager" to reflect broader capabilities. Update assistant prompt template with new workflow and tool descriptions. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pull latest changes (git pull --ff-only) when starting an assistant session so beads issues are up-to-date, and sync issues back to remote (bd sync + commit + push) when the session closes. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When bd list fails with "out of sync" error (e.g., after git pull brings new commits), run bd sync --import-only to reconcile the database before retrying. Fixes issues not showing for projects with stale SQLite state. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
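The retry-after-reconcile flow might look like the following sketch. The commands `bd list` and `bd sync --import-only` and the "out of sync" error text come from the commit message. The function name, the injectable `run` parameter, and the assumption that the error text appears on stderr are illustrative choices.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], subprocess.CompletedProcess]


def _run(args: Sequence[str]) -> subprocess.CompletedProcess:
    return subprocess.run(list(args), capture_output=True, text=True, check=False)


def list_issues_with_resync(run: Runner = _run) -> subprocess.CompletedProcess:
    """Run `bd list`; on an "out of sync" failure, import and retry once."""
    result = run(["bd", "list"])
    # Assumption: the error message is reported on stderr.
    if result.returncode != 0 and "out of sync" in (result.stderr or "").lower():
        run(["bd", "sync", "--import-only"])
        result = run(["bd", "list"])
    return result
```

Passing the runner as a parameter makes the retry logic testable without the bd CLI installed. Retrying only once avoids looping if the import does not resolve the mismatch.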
Previously cleanup deleted all non-protected branches, which was too
aggressive. Now only branches with the feature/ prefix are targeted,
matching the naming convention agents use (feature/{id}-{title-slug}).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
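The narrowed selection rule amounts to a prefix filter. A minimal sketch, where `branches_to_delete` is a hypothetical name and the branch list is assumed to be already collected as plain strings:

```python
def branches_to_delete(branches: list[str], prefix: str = "feature/") -> list[str]:
    """Select only branches that follow the agents' feature/ naming convention."""
    return [branch for branch in branches if branch.startswith(prefix)]
```

With this rule, branches such as `main`, `beads-sync`, or a manually created `hotfix/...` branch are never candidates for cleanup, whether or not they appear on a protected list.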
The agent model config should be per-instance, not shared via git. Pass model to containers via AGENT_MODEL env var instead of relying on the config file being in the repo.

- Remove git commit/push from settings endpoint
- Add prompts/.gitignore template to exclude config files
- Copy gitignore in refresh_project_prompts()
- Untrack config in _push_template_updates() for existing projects
- Pass AGENT_MODEL env var in all docker exec paths
- Add AGENT_MODEL env var support to opencode_agent_app.ts
- Fix DEFAULT_AGENT_MODEL to claude-sonnet-4-5-20250514

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…d format Replace the multi-section markdown template (Summary/Context/Implementation Notes/Acceptance Criteria) with the initializer's simpler format: brief description + numbered Steps that serve as both implementation guide and verification criteria. The coding agent is optimized for this format. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Template refresh now compares content before copying, eliminating spurious commits on every container start. Push logic fetches and rebases from remote first, with claude-based conflict resolution as fallback for diverged repos. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
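The compare-before-copy part of this change can be sketched as below. `copy_if_changed` is a hypothetical helper, and the real refresh logic in prompts.py is not shown in this conversation. The fetch, rebase, and conflict-resolution steps are not modeled.

```python
from pathlib import Path


def copy_if_changed(source: Path, destination: Path) -> bool:
    """Copy source over destination only when the bytes differ.

    Returns True if the destination was written. When it returns False
    the working tree is untouched, so there is nothing new to commit.
    """
    new_content = source.read_bytes()
    if destination.exists() and destination.read_bytes() == new_content:
        return False
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_bytes(new_content)
    return True
```

A caller can collect the return values for all templates and skip the commit and push entirely when none of them is True.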
The claude-code npm package install path changed. Switch to the official curl installer which puts the binary in ~/.local/bin/, and update PATH accordingly. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Prevents stale modals/panels from showing wrong project context and cross-project sound triggers when switching between projects. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
Test plan
🤖 Generated with Claude Code