feat: multi-agent foundation with recovery, quality gates, and migration compatibility #415

Open
NextDoorLaoHuang-HF wants to merge 4 commits into RightNow-AI:main from NextDoorLaoHuang-HF:feat/multiagent-foundation-v1

Conversation

@NextDoorLaoHuang-HF

Summary

This PR delivers the multi-agent foundation requested for long-running, resumable Codex workflows and OpenClaw migration compatibility.

Main scope:

  • Durable workflow/task state + recovery snapshot/resume primitives.
  • Declarative workflow routing + fan-out/fan-in execution.
  • Review reject-and-return control loop + retry/block escalation.
  • Step-level quality gate enforcement + gate execution logs.
  • Session isolation guardrails across kernel/runtime/API (including process scope isolation).
  • Approval enforcement + workflow audit/trace + observability metrics.
  • Shadow-run comparison + rollout/rollback controls.
  • OpenClaw migration compatibility hardening (identity/provider alias/bindings route variants).
  • Docs + e2e/focused tests for the above.
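
To make the review reject-and-return loop with retry/block escalation concrete, here is a minimal sketch of the control flow as a small state machine. It is illustrative only: `Step`, `Verdict`, `ReviewLoop`, and `max_rejections` are hypothetical names, not the types used in openfang-kernel.

```rust
/// Hypothetical workflow steps for a plan -> review loop.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Step {
    Planning,
    Review,
    Done,
    Blocked,
}

/// Hypothetical reviewer verdicts.
#[derive(Debug, Clone, Copy)]
enum Verdict {
    Approve,
    Reject,
}

/// Tracks the current step and how many times review has rejected the plan.
struct ReviewLoop {
    step: Step,
    rejections: u32,
    max_rejections: u32,
}

impl ReviewLoop {
    fn new(max_rejections: u32) -> Self {
        Self { step: Step::Planning, rejections: 0, max_rejections }
    }

    /// Planning hands its output to review; ignored in any other step.
    fn submit_plan(&mut self) {
        if self.step == Step::Planning {
            self.step = Step::Review;
        }
    }

    /// Approve finishes the workflow. Reject returns to planning until the
    /// rejection budget is exceeded, then escalates to Blocked.
    fn review(&mut self, verdict: Verdict) {
        if self.step != Step::Review {
            return;
        }
        match verdict {
            Verdict::Approve => self.step = Step::Done,
            Verdict::Reject => {
                self.rejections += 1;
                self.step = if self.rejections > self.max_rejections {
                    Step::Blocked
                } else {
                    Step::Planning
                };
            }
        }
    }
}
```

With `ReviewLoop::new(1)`, the first rejection sends the workflow back to `Planning` and the second one moves it to `Blocked`, so a human or supervising agent has to intervene instead of the loop retrying forever.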

Why

OpenFang needs stronger multi-step orchestration and reliability foundations to support 24h+ autonomous runs with deterministic recovery, strict quality control, and safer migration from existing OpenClaw layouts.

Validation (comprehensive pre-PR review gate)

All commands below were re-run successfully before opening this PR:

  • cargo fmt --all -- --check
  • cargo test -p openfang-kernel workflow::tests::test_route_workflow_by_channel_task_type_and_risk -- --exact
  • cargo test -p openfang-kernel workflow::tests::test_review_reject_and_return_to_planning -- --exact
  • cargo test -p openfang-kernel --test session_resume_integration_test test_multi_session_e2e_session_summaries_stay_scoped -- --exact
  • cargo test -p openfang-api --test api_integration_test test_workflow_shadow_run_compares_against_production_output -- --exact
  • cargo test -p openfang-api --test api_integration_test test_workflow_rollout_controls_promote_and_rollback_with_checklist -- --exact
  • cargo test -p openfang-runtime tool_runner::tests::test_approval_required_approved -- --exact
  • cargo test -p openfang-migrate provider_alias_compatibility
  • cargo check -p openfang-api

Risks

  • Large cross-crate surface area (kernel/runtime/api/migrate) with behavior changes in workflow control-flow and session semantics.
  • Rollout should remain staged with shadow mode and rollback checklist.

Rollback

  • Revert by planned slice order (state/recovery -> routing -> review/retry -> session/approval/audit -> shadow/rollout -> migrate compatibility).
  • Operational fallback: restore stable production path and disable shadow promotion while keeping migration isolated.

NextDoorLaoHuang-HF marked this pull request as draft on March 7, 2026, 16:20
@NextDoorLaoHuang-HF

Pre-PR comprehensive review gate has been completed, and this PR is intentionally kept in Draft because unresolved high-severity findings were identified.

Blocking findings summary:

  1. Session ownership/isolation risk in attachment injection path (routes.rs) — caller-supplied session_id is not fully ownership-guarded and missing sessions can be auto-created.
  2. Streaming multi-session compaction risk (kernel.rs) — compaction path may target default session instead of requested session under concurrent sessions.
  3. Provider normalization compatibility risk (openclaw.rs) — global - to _ normalization can rewrite providers into unsupported runtime names.
  4. Rollout/rollback control-plane vs execution-path drift risk (workflow.rs) — rollout state management needs stronger enforcement in routing/execution path.
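
To illustrate finding 3, the sketch below contrasts a global `-` to `_` rewrite with an explicit alias table. It is an assumption-laden example, not the code in openclaw.rs: the function names, the sample provider `my-custom-llm`, and the choice of `claude-code` as the canonical spelling are all hypothetical.

```rust
/// Naive normalization: rewrites every '-' to '_' for any provider name.
/// An unknown provider can end up with a name the runtime does not recognise.
fn normalize_global(provider: &str) -> String {
    provider.replace('-', "_")
}

/// Explicit normalization: only known aliases are rewritten.
/// Anything else passes through unchanged so the runtime sees the original name.
fn normalize_explicit(provider: &str) -> String {
    match provider {
        // Hypothetical alias pair; which spelling is canonical is an assumption.
        "claude_code" | "claude-code" => "claude-code".to_string(),
        other => other.to_string(),
    }
}
```

Under these assumptions, `normalize_global("my-custom-llm")` yields `my_custom_llm`, while `normalize_explicit("my-custom-llm")` leaves the name untouched and only the listed alias is mapped.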

Next action: fix blockers, rerun comprehensive pre-PR review, then move PR out of Draft.

@NextDoorLaoHuang-HF

High-severity blocker fixes are now landed on feat/multiagent-foundation-v1.

Fixed items:

  1. routes/ws attachment session safety
    • Enforced explicit session ownership checks before attachment injection.
    • Rejected unknown explicit sessions (no implicit arbitrary session creation).
    • WebSocket invalid session_id now returns explicit error (no silent fallback).
  2. kernel multi-session compaction targeting
    • Added compact_agent_session_in_session.
    • Streaming pre/post compaction now targets resolved_session_id instead of default session.
  3. migrate provider normalization
    • Removed the global "-" to "_" rewrite for unknown providers.
    • Added explicit claude_code/claude-code handling and compatibility assertions.
  4. workflow/kernel rollout execution enforcement
    • Added route_workflow_for_primary_path and enforced Openfang primary path in routed execution.
    • Made route score arithmetic overflow-safe (saturating_add).
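
For context on the last bullet, here is a rough sketch of why `saturating_add` matters when accumulating a route score. The `RouteMatch` struct, the `route_score` function, and the weights 4/2/1 are hypothetical and chosen only for illustration; the real scoring in workflow.rs may differ.

```rust
/// Hypothetical record of which routing criteria matched for a candidate workflow.
#[derive(Debug, Clone, Copy)]
struct RouteMatch {
    channel: bool,
    task_type: bool,
    risk: bool,
}

/// Adds a weight per matched criterion on top of a base score.
/// `saturating_add` clamps at `u32::MAX`, whereas plain `+` would panic in
/// debug builds and wrap around in release builds on overflow.
fn route_score(base: u32, m: RouteMatch) -> u32 {
    let mut score = base;
    if m.channel {
        score = score.saturating_add(4);
    }
    if m.task_type {
        score = score.saturating_add(2);
    }
    if m.risk {
        score = score.saturating_add(1);
    }
    score
}
```

A base score close to `u32::MAX` then stays pinned at the maximum instead of wrapping to a small value, which would otherwise let a high-priority route lose to a low-priority one.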

Validation:

  • Full comprehensive pre-PR gate rerun passed (run_pre_pr_review.py, 11/11 commands).
  • Latest gate log:
    • .codex-tasks/openfang-multiagent-foundation/logs/pre-pr-review-20260308-124253.log
  • Additional targeted regressions passed for:
    • attachment session ownership checks
    • cross-agent compaction rejection
    • rollout-primary routing enforcement
    • provider mapping compatibility

T028 (comprehensive pre-PR review gate) is back to DONE.

NextDoorLaoHuang-HF marked this pull request as ready for review on March 8, 2026, 04:47
@NextDoorLaoHuang-HF

Post-fix update for blocking review items (corrected formatting):

  • Fixed all High findings from the comprehensive pre-PR review in commit b64964b.
  • Added/updated regressions for session ownership, compaction session targeting, provider alias normalization, and rollout primary-path enforcement.
  • Also fixed Medium issue: websocket invalid session_id now returns explicit error instead of silent fallback.
  • Re-ran comprehensive pre-PR gate with GO decision (log: .codex-tasks/openfang-multiagent-foundation/logs/pre-pr-review-20260308-124253.log).
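
For reviewers, the session ownership and explicit-error behaviour described above can be sketched roughly as follows. This is a simplified model under stated assumptions: `SessionStore`, `SessionError`, and `resolve` are hypothetical names, and the real checks in routes.rs and the WebSocket handler are not reproduced here.

```rust
use std::collections::HashMap;

/// Hypothetical errors returned instead of silently falling back.
#[derive(Debug, PartialEq)]
enum SessionError {
    /// The caller named a session that does not exist; it is not auto-created.
    Unknown(String),
    /// The session exists but belongs to a different agent.
    NotOwned { session_id: String, agent_id: String },
}

/// Hypothetical store mapping session_id -> owning agent_id.
struct SessionStore {
    owners: HashMap<String, String>,
}

impl SessionStore {
    /// Resolves the session to use for a request.
    /// With no explicit session, the agent's default session is used.
    /// An explicit session must already exist and be owned by the caller.
    fn resolve(
        &self,
        agent_id: &str,
        requested: Option<&str>,
        default_session: &str,
    ) -> Result<String, SessionError> {
        let id = match requested {
            None => return Ok(default_session.to_string()),
            Some(id) => id,
        };
        match self.owners.get(id) {
            None => Err(SessionError::Unknown(id.to_string())),
            Some(owner) if owner.as_str() != agent_id => Err(SessionError::NotOwned {
                session_id: id.to_string(),
                agent_id: agent_id.to_string(),
            }),
            Some(_) => Ok(id.to_string()),
        }
    }
}
```

The point of the sketch is that an explicit `session_id` never falls through to the default session: it either resolves to a session the caller owns or produces an error the API can report.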

This PR is now Ready for review.

@NextDoorLaoHuang-HF

Process hardening update pushed in 99d7d3b:

  • Added pre-pr-review-gate CI workflow to enforce required PR review sections/checklist on pull_request -> main.
  • Added PR template with mandatory sections for validation evidence, findings, risks, and rollback.
  • Added branch-protection automation helper scripts/ci/configure_branch_protection.sh and verified it on fork NextDoorLaoHuang-HF/openfang:main.
  • Added docs/pr-quality-gates.md and wired it into CONTRIBUTING.md / docs/README.md.

Note: per local policy preference, codex longrun runtime scripts/logs remain local-only in ignored paths (.codex-tasks/, .longrun/) and are not committed.
