entireio · alishakawaguchi · Mar 6, 2026
@@ -0,0 +1,5 @@
+{
+  "name": "e2e",
+  "description": "E2E test triage, debugging, and fix implementation toolkit",
+  "version": "1.0.0"
+}
@@ -0,0 +1,15 @@
+# E2E Plugin
+
+Local plugin providing individual commands for E2E test triage and debugging.
+
+## Commands
+
+| Command | Description |
+|---------|-------------|
+| `/e2e:triage-ci` | Run failing tests locally, classify flaky vs real-bug, present findings report |
+| `/e2e:debug` | Deep-dive artifact analysis for root cause diagnosis |
+| `/e2e:implement` | Apply fixes from triage/debug findings, verify with E2E tests |
+
+## Related
+
+- Orchestrator skill: `.claude/skills/e2e/SKILL.md` (`/e2e` — runs triage-ci then implement)
@@ -0,0 +1,7 @@
+---
+description: "Deep-dive artifact analysis for diagnosing E2E test failures"
+---
+
+# Debug Command
+
+Read and follow the full procedure from `.claude/skills/e2e/debug.md`.
@@ -0,0 +1,7 @@
+---
+description: "Apply fixes from triage/debug findings, verify with scoped E2E tests"
+---
+
+# Implement Command
+
+Read and follow the full procedure from `.claude/skills/e2e/implement.md`.
@@ -0,0 +1,7 @@
+---
+description: "Run failing E2E tests locally, classify flaky vs real-bug, present findings report"
+---
+
+# Triage CI Command
+
+Read and follow the full procedure from `.claude/skills/e2e/triage-ci.md`.
@@ -52,7 +52,7 @@ This skill enforces strict E2E-first test-driven development. The rules:
 3. **Unit tests are written last.** After all E2E tiers pass (Step 14), you write unit tests using real data collected from E2E runs as golden fixtures.
 4. **If you didn't watch it fail, you don't know if it tests the right thing.** Never write a test you haven't seen fail first.
 5. **Minimum viable fix.** At each E2E failure, implement only the code needed to fix that failure. Don't anticipate future tiers.
-6. **`/debug-e2e` is your debugger.** When an E2E test fails, use the artifact directory with `/debug-e2e` before guessing at fixes.
+6. **`/e2e:debug` is your debugger.** When an E2E test fails, use the artifact directory with `/e2e:debug` before guessing at fixes.
 
 ## Pipeline
 

@@ -13,7 +13,7 @@ Build the agent Go package using strict E2E-first TDD. Unit tests are written ON
 1. **E2E tests are the spec.** The existing `ForEachAgent` test scenarios define "working". You implement until they pass.
 2. **Watch it fail first.** Every E2E tier starts by running the test and observing the failure. If you haven't seen the failure, you don't understand what needs fixing.
 3. **Minimum viable fix.** At each failure, implement only the code needed to make that specific assertion pass. Don't anticipate future tiers.
-4. **`/debug-e2e` is your debugger.** When an E2E test fails, use the artifact directory with `/debug-e2e` before guessing at fixes.
+4. **`/e2e:debug` is your debugger.** When an E2E test fails, use the artifact directory with `/e2e:debug` before guessing at fixes.
 5. **No unit tests during Steps 4-13.** Unit tests are written in Step 14 after all E2E tiers pass, using real data from E2E runs as golden fixtures.
 6. **Format and lint, don't unit test.** Between E2E tiers, run `mise run fmt && mise run lint` to keep code clean. Any earlier `mise run test` invocations (e.g., in Step 3) are strictly compile-only sanity checks — no `mise run test` between E2E tiers (Steps 4-13).
 7. **If you didn't watch it fail, you don't know if it tests the right thing.**
@@ -83,7 +83,7 @@ This test requires no agent prompts — it only exercises hooks, so it's the fas
 
 1. Run: `mise run test:e2e --agent $AGENT_SLUG TestHumanOnlyChangesAndCommits`
 2. **Watch it fail** — read the failure output carefully
-3. If there are artifact dirs, use `/debug-e2e {artifact-dir}` to understand what happened
+3. If there are artifact dirs, use `/e2e:debug {artifact-dir}` to understand what happened
 4. Implement the minimum code to fix the first failure
 5. Repeat until the test passes
 
@@ -105,7 +105,7 @@ The foundational test. This exercises the full agent lifecycle: start session
 
 1. Run: `mise run test:e2e --agent $AGENT_SLUG TestSingleSessionManualCommit`
 2. **Watch it fail** — read the failure output carefully
-3. Use `/debug-e2e {artifact-dir}` to understand what happened
+3. Use `/e2e:debug {artifact-dir}` to understand what happened
 4. Implement the minimum code to fix the first failure
 5. Repeat until the test passes
 
@@ -127,7 +127,7 @@ Validates transcript quality: JSONL validity, content hash correctness, prompt e
 
 1. Run: `mise run test:e2e --agent $AGENT_SLUG TestCheckpointMetadataDeepValidation`
 2. **Watch it fail** — this test often exposes subtle transcript formatting bugs
-3. Use `/debug-e2e {artifact-dir}` on any failures
+3. Use `/e2e:debug {artifact-dir}` on any failures
 4. Fix and repeat
 
 Run: `mise run fmt && mise run lint`
@@ -146,7 +146,7 @@ Agent creates files and commits them within a single prompt turn. Tests the in-t
 **Cycle:**
 
 1. Run: `mise run test:e2e --agent $AGENT_SLUG TestSingleSessionAgentCommitInTurn`
-2. **Watch it fail** — use `/debug-e2e {artifact-dir}` on failures
+2. **Watch it fail** — use `/e2e:debug {artifact-dir}` on failures
 3. Fix and repeat — if the agent doesn't support committing, skip this test
 
 Run: `mise run fmt && mise run lint`
@@ -164,7 +164,7 @@ Run these tests to validate multi-session behavior:
 **Cycle (for each test):**
 
 1. Run: `mise run test:e2e --agent $AGENT_SLUG TestMultiSessionManualCommit`
-2. **Watch it fail** — use `/debug-e2e {artifact-dir}` on failures
+2. **Watch it fail** — use `/e2e:debug {artifact-dir}` on failures
 3. Fix and repeat
 4. Move to next test
 
@@ -183,7 +183,7 @@ Run these tests for file operation correctness:
 - `TestDeletedFilesCommitDeletion` — Agent deletes a file, user commits the deletion
 - `TestMixedNewAndModifiedFiles` — Agent both creates and modifies files
 
-**Cycle:** Same as above — run each test, **watch it fail**, use `/debug-e2e` on failures, fix, repeat.
+**Cycle:** Same as above — run each test, **watch it fail**, use `/e2e:debug` on failures, fix, repeat.
 
 Run: `mise run fmt && mise run lint`
 
@@ -215,7 +215,7 @@ Run these if the agent supports interactive multi-step sessions:
 - `TestRewindAfterCommit` — Rewind to a checkpoint after committing
 - `TestRewindMultipleFiles` — Rewind with multiple files changed
 
-**Cycle:** Same pattern — run, **watch it fail**, `/debug-e2e` on failures, fix, repeat.
+**Cycle:** Same pattern — run, **watch it fail**, `/e2e:debug` on failures, fix, repeat.
 
 Run: `mise run fmt && mise run lint`
 
@@ -256,7 +256,7 @@ mise run test:e2e --agent $AGENT_SLUG TestFailingTestName
 
 If a test passes when run individually but fails in the full suite, it's a flaky failure — not a real error. Only investigate failures that reproduce consistently when run in isolation.
 
-Fix any real failures before proceeding — the same cycle applies: read the failure, use `/debug-e2e {artifact-dir}`, implement the minimum fix, re-run.
+Fix any real failures before proceeding — the same cycle applies: read the failure, use `/e2e:debug {artifact-dir}`, implement the minimum fix, re-run.
 
 All E2E tests must pass before writing unit tests.
 
@@ -321,7 +321,7 @@ At every E2E failure, follow this protocol:
 
 1. **Read the test output** — the assertion message often tells you exactly what's wrong
 2. **Find the artifact directory** — E2E tests save artifacts (logs, transcripts, git state) to a temp dir printed in the output
-3. **Run `/debug-e2e {artifact-dir}`** — this skill analyzes artifacts and diagnoses the root cause
+3. **Run `/e2e:debug {artifact-dir}`** -- this command analyzes the artifacts and diagnoses the root cause
 4. **Implement the minimum fix** — don't over-engineer; fix only what the test demands
 5. **Re-run the failing test** — not the whole suite, just the one test
 

@@ -199,7 +199,7 @@ Use `/commit` to commit all files.
 - **Interactive tests**: Use `s.StartSession`, `s.Send`, `s.WaitFor` — tmux pane is auto-captured in artifacts
 - **Run commands**: `mise run test:e2e --agent ${slug} TestName` — see `e2e/README.md` for all options
 - **E2E tests are run during the implement phase**: This phase only creates the runner. The implement phase runs E2E tests at each tier to drive development.
-- **Debugging failures**: If tests fail during the implement phase, use `/debug-e2e` with the artifact directory to diagnose CLI-level issues (hooks, checkpoints, session phases, attribution)
+- **Debugging failures**: If tests fail during the implement phase, use `/e2e:debug` with the artifact directory to diagnose CLI-level issues (hooks, checkpoints, session phases, attribution)
 
 ## Output
 

@@ -0,0 +1,32 @@
+---
+name: e2e
+description: >
+  Orchestrate E2E test triage and fix implementation: runs triage-ci then implement sequentially.
+  Accepts test names, --agent, artifact path, or CI run reference.
+  For individual phases, use /e2e:triage-ci, /e2e:debug, or /e2e:implement.
+  Use when the user says "triage e2e", "fix e2e failures", or wants the full triage-to-fix pipeline.
+---
+
+# E2E Triage & Fix — Full Pipeline
+
+Run triage-ci then implement sequentially. Parameters are collected once and reused across both phases.
+
+## Parameters
+
+The user provides one or more of:
+- **Test name(s)** -- e.g., `TestInteractiveMultiStep`
+- **`--agent <agent>`** -- optional, defaults to all agents that previously failed
+- **A local artifact path** -- skip straight to analysis of existing artifacts
+- **CI run reference** -- `latest`, a run ID, or a run URL
+
+## Phase 1: Triage CI
+
+Read and follow the full procedure from `.claude/skills/e2e/triage-ci.md`.
+
+This produces a findings report with classifications (flaky/real-bug/test-bug) for each test+agent pair.
+
+## Phase 2: Implement Fixes
+
+Read and follow the full procedure from `.claude/skills/e2e/implement.md`.
+
+Uses the findings from Phase 1 (already in conversation context) to propose, apply, and verify fixes.
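
As a rough illustration of how the parameter kinds listed under "Parameters" might be told apart, the sketch below classifies a single argument. The function name `classify_param` and its matching rules are assumptions made for illustration; the skill text does not specify them, and `--agent` is assumed to be parsed separately as a flag.

```bash
# Hypothetical helper: classify one argument as a CI run reference,
# a test name, or a local artifact path. Rules are illustrative only.
classify_param() {
  arg="$1"
  case "$arg" in
    latest|http://*|https://*) echo "ci-run"; return ;;   # keyword or run URL
    Test*) echo "test-name"; return ;;                     # Go test names start with "Test"
    ""|*[!0-9]*) ;;                                        # empty or not purely numeric: check below
    *) echo "ci-run"; return ;;                            # all digits: treat as a run ID
  esac
  # Anything left is treated as an artifact path only if the directory exists.
  if [ -d "$arg" ]; then
    echo "artifact-path"
  else
    echo "unknown"
  fi
}
```

For example, `classify_param TestInteractiveMultiStep` prints `test-name` and `classify_param latest` prints `ci-run`.
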
@@ -1,17 +1,12 @@
----
-name: debug-e2e
-description: Use when investigating E2E test failures from artifacts to diagnose bugs in the Entire CLI, or when pointed at an artifact path for root cause analysis
----
-
 # Debug Entire CLI via E2E Artifacts
 
 Diagnose Entire CLI bugs using captured artifacts from the E2E test suite. Artifacts are written to `e2e/artifacts/` locally or downloaded from CI via GitHub Actions.
 
 ## Inputs
 
 The user provides either:
-- **A test run directory:** `e2e/artifacts/{timestamp}/` — triage all failures
-- **A specific test directory:** `e2e/artifacts/{timestamp}/{TestName}-{agent}/` — debug one test
+- **A test run directory:** `e2e/artifacts/{timestamp}/` -- triage all failures
+- **A specific test directory:** `e2e/artifacts/{timestamp}/{TestName}-{agent}/` -- debug one test
 
 ## Artifact Layout
 
@@ -32,7 +27,7 @@ e2e/artifacts/{timestamp}/
 
 ## Preserved Repo
 
-When the test run was executed with `E2E_KEEP_REPOS=1`, each test's artifact directory contains a `repo` symlink pointing to the preserved temporary git repository. This is the actual repo the test operated on — you can inspect it directly.
+When the test run was executed with `E2E_KEEP_REPOS=1`, each test's artifact directory contains a `repo` symlink pointing to the preserved temporary git repository. This is the actual repo the test operated on -- you can inspect it directly.
 
 **Navigate via the symlink** (e.g., `{artifact-dir}/repo/`) rather than resolving the `/tmp/...` path. The symlink lives inside the artifact directory so permissions and paths stay consistent.
 
@@ -42,7 +37,7 @@ The preserved repo contains:
 - The `.claude/` directory (if Claude Code was the agent)
 - All files the agent created or modified, in their final state
 
-This is the most powerful debugging tool — you can run `git log`, `git diff`, `git show`, inspect `.entire/` internals, and see exactly what the CLI left behind.
+This is the most powerful debugging tool -- you can run `git log`, `git diff`, `git show`, inspect `.entire/` internals, and see exactly what the CLI left behind.
 
 ## Debugging Workflow
 
@@ -53,9 +48,9 @@ Read `report.nocolor.txt` to identify failures and their error messages. Each en
 ### 2. Read console.log (most important)
 
 Full transcript of every operation:
-- `> claude -p "..." ...` — agent prompts with stdout/stderr
-- `> git add/commit/...` — git commands
-- `> send: ...` — interactive session inputs
+- `> claude -p "..." ...` -- agent prompts with stdout/stderr
+- `> git add/commit/...` -- git commands
+- `> send: ...` -- interactive session inputs
 
 This tells you what happened chronologically.
 
@@ -78,7 +73,7 @@ Cross-reference console.log (what happened) with the test (what should have happ
 
 ### 5. Deep dive files
 
-- **entire-logs/entire.log**: Structured JSON logs — hook lifecycle, session phases (`active` → `idle` → `ended`), warnings, errors. Key fields: `component`, `hook`, `strategy`, `session_id`.
+- **entire-logs/entire.log**: Structured JSON logs -- hook lifecycle, session phases (`active` -> `idle` -> `ended`), warnings, errors. Key fields: `component`, `hook`, `strategy`, `session_id`.
 - **git-log.txt**: Commit graph showing main branch, `entire/checkpoints/v1`, checkpoint initialization.
 - **git-tree.txt**: Files at HEAD vs checkpoint branch (separated by `--- entire/checkpoints/v1 ---`).
 - **checkpoint-metadata/**: `metadata.json` has `checkpoint_id`, `strategy`, `files_touched`, `token_usage`, and `sessions` array. Session subdirs have per-session details.

@@ -0,0 +1,113 @@
+# E2E Implement Fixes
+
+Apply fixes for E2E test failures and verify them with scoped E2E tests.
+
+> **IMPORTANT: Running real E2E tests is a HARD REQUIREMENT of this procedure.**
+> Every fix MUST be verified with real E2E tests before the summary step.
+> Canary tests use the Vogon fake agent and cannot catch agent-specific issues.
+> Do NOT skip E2E verification unless the user explicitly declines due to cost.
+
+## Inputs
+
+This procedure accepts findings from one of:
+- **`/e2e:triage-ci` output** -- findings report already in conversation context
+- **`/e2e:debug` output** -- root cause analysis already in conversation context
+- **Standalone description** -- the user describes a known failure and the desired fix
+
+## Step 1: Identify Fixes
+
+From the findings in context, identify actionable fixes:
+
+### For `flaky` failures: describe the proposed fix
+
+For agent-behavior flaky issues, fixes typically modify test prompts. For test-bug flaky issues, fixes target `e2e/` infrastructure code (harness setup, helpers, env propagation).
+
+```
+**Proposed fix:** <description>
+  - File: <path to test file or e2e infrastructure file>
+  - Change: <what will be modified -- e.g., append "Do not ask for confirmation" to prompt, or fix env propagation in NewTmuxSession>
+```
+
+Common flaky fixes:
+- Agent asked for confirmation -> append "Do not ask for confirmation" to prompt
+- Agent wrote to wrong path -> be more explicit about paths in prompt
+- Agent committed when it shouldn't have -> add "Do not commit" to prompt
+- Checkpoint wait timeout -> increase timeout argument
+- Agent timeout (signal: killed) -> increase per-test timeout, simplify prompt
+- Auth/env not propagated -> fix test harness env setup in `e2e/` code
+- Test helper bug (wrong assertion, bad glob) -> fix test helper in `e2e/`
+- tmux session setup issue -> fix `NewTmuxSession` or session config in `e2e/`
+
+### For `real-bug` failures: describe root cause analysis
+
+```
+**Root cause analysis:**
+  - Component: <hooks | session | checkpoint | strategy | agent>
+  - Suspected location: <file:function>
+  - Description: <what's wrong and why>
+  - Proposed fix: <what code change would address it>
+```
+
+## Step 2: Ask the User
+
+Prompt the user:
+
+> **Should I fix these?**
+> - [list of tests with classifications and proposed fixes]
+> - You can select all, specific tests, or skip.
+
+Wait for user response before proceeding.
+
+## Step 3: Apply Fixes
+
+For **flaky** fixes the user approved:
+1. Apply fixes directly in the working tree (no branch creation)
+2. Run static checks and canary tests:
+   ```bash
+   mise run fmt && mise run lint
+   mise run test:e2e:canary   # Must pass
+   ```
+3. **Run real E2E tests to verify the fix.** Scope depends on what was changed:
+   - **Agent-specific fix** (e.g., `e2e/agents/cursor_cli.go`, one agent's config/trust/env): run the full suite for that agent only:
+     ```bash
+     mise run test:e2e --agent <agent>
+     ```
+   - **Shared test infra fix** (e.g., `e2e/agents/agent.go`, `e2e/testutil/`, `TmuxSession`, test helpers): run the full suite for all agents that failed, since the fix could affect any of them:
+     ```bash
+     mise run test:e2e --agent <agent1>
+     mise run test:e2e --agent <agent2>
+     # ... for each agent that had failures
+     ```
+   - **Test prompt fix** (e.g., changed wording in a specific test): run that test across all agents that failed it:
+     ```bash
+     mise run test:e2e --agent <agent> <TestName>
+     ```
+   **This step is MANDATORY** -- canary tests use the Vogon fake agent and cannot verify agent-specific behavior (trust dialogs, env propagation, config directories, etc.).
+4. If any step fails, investigate and adjust. Report what happened to the user.
+
+For **real-bug** fixes the user approved:
+1. Apply the fix directly in the working tree (no branch creation)
+2. Run static checks, unit tests, and canary tests:
+   ```bash
+   mise run fmt && mise run lint
+   mise run test        # Unit tests
+   mise run test:e2e:canary  # Canary tests
+   ```
+3. **Run real E2E tests to verify the fix (MANDATORY).** Same scoping rules as flaky fixes above:
+   - **Agent-specific change** -> full suite for that agent
+   - **Shared CLI/infra change** -> full suite for all agents that failed
+   - **Narrow change** (single test affected) -> just that test across affected agents
+4. Report results to the user.
+
+**GATE: Do NOT proceed to the summary until real E2E tests have been run and results reported for every fix applied above.** If E2E tests were not run, go back and run them now.
+
+## Step 4: Summary
+
+Print a summary table:
+```
+| Test | Agent(s) | Classification | Action Taken |
+|------|----------|----------------|--------------|
+| TestFoo | claude-code | flaky | Fixed in working tree |
+| TestBar | all agents | real-bug | Fix applied, tests passing |
+| TestBaz | opencode | flaky | Skipped (user declined) |
+```
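
The Step 3 scoping rules can be sketched as a small dry-run helper that prints the verification commands instead of running them. The `mise run test:e2e --agent ...` invocations come from the procedure above; the function name `e2e_verify_cmds` and the scope labels `agent-specific`, `shared-infra`, and `test-prompt` are assumptions introduced for illustration.

```bash
# Hypothetical dry-run helper: print the E2E commands implied by a fix's scope.
# Usage: e2e_verify_cmds <scope> <test-name-or-"-"> <agent>...
e2e_verify_cmds() {
  scope="$1"
  test_name="$2"
  shift 2
  for agent in "$@"; do
    case "$scope" in
      agent-specific|shared-infra)
        # Full suite for each listed agent.
        echo "mise run test:e2e --agent $agent" ;;
      test-prompt)
        # Only the changed test, for each agent that failed it.
        echo "mise run test:e2e --agent $agent $test_name" ;;
      *)
        echo "unknown scope: $scope" >&2
        return 1 ;;
    esac
  done
}
```

Calling `e2e_verify_cmds test-prompt TestFoo claude-code opencode` prints one scoped command per agent. Printing rather than executing keeps the sketch safe to try, since real E2E runs cost money.
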