feat: detect and kill stuck agents after 20 minutes of inactivity #173

Open
maswa wants to merge 1 commit into AutoForgeAI:master from maswa:feat/stuck-agent-detection
maswa commented Feb 7, 2026

What this does

When running in parallel mode, agents sometimes hang indefinitely — usually waiting for an API response that never comes back, or stuck in some internal loop. This leaves features permanently marked as "in progress" and wastes a concurrency slot.

This PR adds a simple inactivity timeout: if an agent produces no output for 20 minutes, it gets killed and the feature is released back to the queue.

How it works

  • Tracks a last_activity timestamp per agent, updated every time stdout produces output
  • A _check_stuck_agents() method runs each iteration of the main orchestrator loop
  • If an agent exceeds 1200 seconds (20 min) of silence, its process tree is killed
  • Activity tracking is cleaned up when agents complete normally
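
The bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: only AGENT_INACTIVITY_TIMEOUT and the last_activity idea come from the description, while the ActivityTracker class, its method names, and the injectable clock are hypothetical and exist only to make the logic easy to follow and test.

```python
import time

AGENT_INACTIVITY_TIMEOUT = 1200  # seconds (20 minutes), the constant named in this PR


class ActivityTracker:
    """Hypothetical helper; the PR keeps this state inside the orchestrator."""

    def __init__(self, timeout=AGENT_INACTIVITY_TIMEOUT, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.last_activity = {}  # feature_id -> time of the most recent output

    def mark_spawned(self, feature_id):
        # Start the inactivity window when the agent is launched.
        self.last_activity[feature_id] = self.clock()

    def mark_output(self, feature_id):
        # Called whenever the agent's stdout produces output.
        self.last_activity[feature_id] = self.clock()

    def mark_completed(self, feature_id):
        # Clean up tracking when the agent finishes normally.
        self.last_activity.pop(feature_id, None)

    def check_stuck_agents(self):
        # Return the feature IDs whose agents have been silent longer than the timeout.
        now = self.clock()
        return [fid for fid, ts in self.last_activity.items()
                if now - ts > self.timeout]
```

In this sketch the orchestrator loop would call check_stuck_agents() once per iteration, kill the process tree of each returned agent, and release its feature. A monotonic clock is assumed here so that wall-clock adjustments cannot trigger a false kill.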

Why 20 minutes?

Working agents produce continuous output — tool calls, code generation, thinking blocks. Even complex features that take 1-2 hours always have activity. 20 minutes of complete silence reliably indicates something is wrong.

Changes

  • parallel_orchestrator.py — 80 lines added (constant, tracking dict, spawn hooks, check method, cleanup, main loop integration)

Test plan

  • Run parallel mode with --max-concurrency 2 and verify agents complete normally (no false kills)
  • Manually test by dropping the network connection while an agent is running to verify that stuck detection triggers
  • Verify killed features become available for other agents to pick up
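
As a possible complement to the manual network test, the kill path could also be exercised with a deliberately silent child process and a short timeout. The sketch below is an assumption-laden stand-in, not part of this PR: run_with_inactivity_timeout is a hypothetical function, it uses only the standard library, and it kills just the direct child rather than the whole process tree as the PR does.

```python
import subprocess
import sys
import threading
import time


def run_with_inactivity_timeout(cmd, timeout_s, poll_s=0.05):
    """Run cmd; return True if it was killed for producing no output in time."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    last_activity = [time.monotonic()]

    def pump():
        # Every line of output counts as activity and resets the timer.
        for _line in proc.stdout:
            last_activity[0] = time.monotonic()

    reader = threading.Thread(target=pump, daemon=True)
    reader.start()

    killed = False
    while proc.poll() is None:
        if time.monotonic() - last_activity[0] > timeout_s:
            proc.kill()  # direct child only; the PR kills the full process tree
            killed = True
            break
        time.sleep(poll_s)

    proc.wait()
    reader.join(timeout=1)
    return killed


if __name__ == "__main__":
    silent = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_inactivity_timeout(silent, timeout_s=0.5))
```

A child that sleeps without printing is reported as killed, while a child that prints and exits promptly is not, which mirrors the "no false kills" and "stuck detection triggers" items in the test plan.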

Agents that hang without producing output are now automatically detected
and killed after 20 minutes of inactivity. This prevents features from
being stuck indefinitely when an agent hangs.

Changes:
- Add AGENT_INACTIVITY_TIMEOUT constant (1200 seconds)
- Track last activity timestamp per agent
- Kill agents with no output for 20+ minutes and release their features back to the queue
- Clean up tracking on agent completion

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CoreAspectStu added a commit to CoreAspectStu/autocoder-custom that referenced this pull request Feb 9, 2026
ISSUE:
------
The production systemd service was starting uvicorn directly without
ensuring the React frontend was built. This caused UI changes to not
appear until someone manually ran 'npm run build'.

ROOT CAUSE:
-----------
The ExecStart line in autocoder-ui.service bypassed start_ui.py,
which contains smart build detection logic:
  ExecStart=/home/stu/.../venv/bin/python -m uvicorn server.main:app

SOLUTION:
---------
Created a production wrapper script that:
1. Runs 'npm run build' to compile TypeScript and bundle React app
2. Starts uvicorn server
3. Ensures UI changes are reflected on every service restart

FILES CREATED:
--------------
1. start_ui_production.sh - Production launcher for systemd
   - Builds frontend before starting server
   - Reports build status in logs
   - Fails fast if build fails

2. docs/BUILD_PROCESS.md - Comprehensive documentation
   - Problem description and solution
   - How build process works
   - Troubleshooting guide
   - Verification steps

3. verify_feature_173.py - Automated verification script
   - Tests wrapper script exists and is executable
   - Verifies systemd service configuration
   - Tests TypeScript compilation
   - Confirms dist directory is created

SYSTEMD CHANGES:
----------------
Modified: ~/.config/systemd/user/autocoder-ui.service
  ExecStart: /home/stu/projects/autocoder/venv/bin/python ...
  → ExecStart: /home/stu/projects/autocoder/start_ui_production.sh

VERIFICATION:
-------------
All 6/6 checks passed:
✅ Wrapper script exists and is executable
✅ Systemd service uses wrapper script
✅ Wrapper script contains build command
✅ TypeScript strict mode enabled
✅ TypeScript compilation succeeds (7.03s)
✅ dist directory created with assets

IMPACT:
-------
Before: UI changes required manual 'npm run build' → service restart
After:  UI changes automatically built on every service start

Build time: ~7 seconds (TypeScript + Vite bundling)
Output: ui/dist/ with optimized assets (~1.2 MB gzipped)

Marked feature AutoForgeAI#173 as PASSING.