Draft
23 commits
b4cd09c
Add E2E triage automation: skill, CI workflow, and artifact download …
alishakawaguchi Mar 6, 2026
7121885
Add Slack notifications to E2E triage workflow
alishakawaguchi Mar 6, 2026
3edf654
Fix JSON injection bug and add webhook guards in E2E triage Slack not…
alishakawaguchi Mar 6, 2026
1e50aba
Add local mode and CI re-run verification to E2E triage skill
alishakawaguchi Mar 6, 2026
1ca90ee
Split E2E triage Take Action and Summary steps by local vs CI mode
alishakawaguchi Mar 6, 2026
76ebb5d
Clarify real-bug vs flaky/test-bug classification in E2E triage skill
alishakawaguchi Mar 6, 2026
6bdf24d
Add README to E2E triage skill
alishakawaguchi Mar 6, 2026
fad002b
Delegate artifact analysis from e2e-triage to debug-e2e skill
alishakawaguchi Mar 6, 2026
39d1d8f
For testing purposes only
alishakawaguchi Mar 6, 2026
4f54947
For testing purposes only
alishakawaguchi Mar 6, 2026
270d48d
Fix allowed tools
alishakawaguchi Mar 6, 2026
86af343
Switch E2E triage to manual workflow_dispatch, remove auto-trigger an…
alishakawaguchi Mar 7, 2026
1cd2b0e
Remove CI mode from e2e-triage skill, keep local-only debugging
alishakawaguchi Mar 7, 2026
649f765
Add CI artifact download support to e2e-triage skill
alishakawaguchi Mar 7, 2026
790a8f2
Cache downloaded CI artifacts to avoid redundant downloads
alishakawaguchi Mar 7, 2026
fefa994
Fix cursor-cli ENOENT on CI and require real E2E verification in tria…
alishakawaguchi Mar 7, 2026
3e27584
Restructure e2e-triage and debug-e2e skills into e2e plugin
alishakawaguchi Mar 7, 2026
f20a222
Merge branch 'main' into alisha/e2e-triage-local-only
alishakawaguchi Mar 7, 2026
3373091
Pre-seed cursor cli-config.json to avoid ENOENT race
alishakawaguchi Mar 7, 2026
a258927
Isolate cursor config dir per-session to fix ENOENT race
alishakawaguchi Mar 8, 2026
e90e3f9
revert e2e changes
alishakawaguchi Mar 9, 2026
f1bd2f5
Fix cursor-cli E2E: isolate config dir and fix wait pattern
alishakawaguchi Mar 9, 2026
d2a9b63
Fix deferred condensation when live transcript is empty at commit time
alishakawaguchi Mar 9, 2026
5 changes: 5 additions & 0 deletions .claude/plugins/e2e/.claude-plugin/plugin.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
{
"name": "e2e",
"description": "E2E test triage, debugging, and fix implementation toolkit",
"version": "1.0.0"
}
15 changes: 15 additions & 0 deletions .claude/plugins/e2e/README.md
@@ -0,0 +1,15 @@
# E2E Plugin

Local plugin providing individual commands for E2E test triage and debugging.

## Commands

| Command | Description |
|---------|-------------|
| `/e2e:triage-ci` | Run failing tests locally, classify flaky vs real-bug, present findings report |
| `/e2e:debug` | Deep-dive artifact analysis for root cause diagnosis |
| `/e2e:implement` | Apply fixes from triage/debug findings, verify with E2E tests |

## Related

- Orchestrator skill: `.claude/skills/e2e/SKILL.md` (`/e2e` — runs triage-ci then implement)
7 changes: 7 additions & 0 deletions .claude/plugins/e2e/commands/debug.md
@@ -0,0 +1,7 @@
---
description: "Deep-dive artifact analysis for diagnosing E2E test failures"
---

# Debug Command

Read and follow the full procedure from `.claude/skills/e2e/debug.md`.
7 changes: 7 additions & 0 deletions .claude/plugins/e2e/commands/implement.md
@@ -0,0 +1,7 @@
---
description: "Apply fixes from triage/debug findings, verify with scoped E2E tests"
---

# Implement Command

Read and follow the full procedure from `.claude/skills/e2e/implement.md`.
7 changes: 7 additions & 0 deletions .claude/plugins/e2e/commands/triage-ci.md
@@ -0,0 +1,7 @@
---
description: "Run failing E2E tests locally, classify flaky vs real-bug, present findings report"
---

# Triage CI Command

Read and follow the full procedure from `.claude/skills/e2e/triage-ci.md`.
2 changes: 1 addition & 1 deletion .claude/skills/agent-integration/SKILL.md
@@ -52,7 +52,7 @@ This skill enforces strict E2E-first test-driven development. The rules:
3. **Unit tests are written last.** After all E2E tiers pass (Step 14), you write unit tests using real data collected from E2E runs as golden fixtures.
4. **If you didn't watch it fail, you don't know if it tests the right thing.** Never write a test you haven't seen fail first.
5. **Minimum viable fix.** At each E2E failure, implement only the code needed to fix that failure. Don't anticipate future tiers.
-6. **`/debug-e2e` is your debugger.** When an E2E test fails, use the artifact directory with `/debug-e2e` before guessing at fixes.
+6. **`/e2e:debug` is your debugger.** When an E2E test fails, use the artifact directory with `/e2e:debug` before guessing at fixes.

## Pipeline

20 changes: 10 additions & 10 deletions .claude/skills/agent-integration/implementer.md
@@ -13,7 +13,7 @@ Build the agent Go package using strict E2E-first TDD. Unit tests are written ON
1. **E2E tests are the spec.** The existing `ForEachAgent` test scenarios define "working". You implement until they pass.
2. **Watch it fail first.** Every E2E tier starts by running the test and observing the failure. If you haven't seen the failure, you don't understand what needs fixing.
3. **Minimum viable fix.** At each failure, implement only the code needed to make that specific assertion pass. Don't anticipate future tiers.
-4. **`/debug-e2e` is your debugger.** When an E2E test fails, use the artifact directory with `/debug-e2e` before guessing at fixes.
+4. **`/e2e:debug` is your debugger.** When an E2E test fails, use the artifact directory with `/e2e:debug` before guessing at fixes.
5. **No unit tests during Steps 4-13.** Unit tests are written in Step 14 after all E2E tiers pass, using real data from E2E runs as golden fixtures.
6. **Format and lint, don't unit test.** Between E2E tiers, run `mise run fmt && mise run lint` to keep code clean. Any earlier `mise run test` invocations (e.g., in Step 3) are strictly compile-only sanity checks — no `mise run test` between E2E tiers (Steps 4-13).
7. **If you didn't watch it fail, you don't know if it tests the right thing.**
@@ -83,7 +83,7 @@ This test requires no agent prompts — it only exercises hooks, so it's the fas

1. Run: `mise run test:e2e --agent $AGENT_SLUG TestHumanOnlyChangesAndCommits`
2. **Watch it fail** — read the failure output carefully
-3. If there are artifact dirs, use `/debug-e2e {artifact-dir}` to understand what happened
+3. If there are artifact dirs, use `/e2e:debug {artifact-dir}` to understand what happened
4. Implement the minimum code to fix the first failure
5. Repeat until the test passes

@@ -105,7 +105,7 @@ The foundational test. This exercises the full agent lifecycle: start session

1. Run: `mise run test:e2e --agent $AGENT_SLUG TestSingleSessionManualCommit`
2. **Watch it fail** — read the failure output carefully
-3. Use `/debug-e2e {artifact-dir}` to understand what happened
+3. Use `/e2e:debug {artifact-dir}` to understand what happened
4. Implement the minimum code to fix the first failure
5. Repeat until the test passes

@@ -127,7 +127,7 @@ Validates transcript quality: JSONL validity, content hash correctness, prompt e

1. Run: `mise run test:e2e --agent $AGENT_SLUG TestCheckpointMetadataDeepValidation`
2. **Watch it fail** — this test often exposes subtle transcript formatting bugs
-3. Use `/debug-e2e {artifact-dir}` on any failures
+3. Use `/e2e:debug {artifact-dir}` on any failures
4. Fix and repeat

Run: `mise run fmt && mise run lint`
@@ -146,7 +146,7 @@ Agent creates files and commits them within a single prompt turn. Tests the in-t
**Cycle:**

1. Run: `mise run test:e2e --agent $AGENT_SLUG TestSingleSessionAgentCommitInTurn`
-2. **Watch it fail** — use `/debug-e2e {artifact-dir}` on failures
+2. **Watch it fail** — use `/e2e:debug {artifact-dir}` on failures
3. Fix and repeat — if the agent doesn't support committing, skip this test

Run: `mise run fmt && mise run lint`
@@ -164,7 +164,7 @@ Run these tests to validate multi-session behavior:
**Cycle (for each test):**

1. Run: `mise run test:e2e --agent $AGENT_SLUG TestMultiSessionManualCommit`
-2. **Watch it fail** — use `/debug-e2e {artifact-dir}` on failures
+2. **Watch it fail** — use `/e2e:debug {artifact-dir}` on failures
3. Fix and repeat
4. Move to next test

@@ -183,7 +183,7 @@ Run these tests for file operation correctness:
- `TestDeletedFilesCommitDeletion` — Agent deletes a file, user commits the deletion
- `TestMixedNewAndModifiedFiles` — Agent both creates and modifies files

-**Cycle:** Same as above — run each test, **watch it fail**, use `/debug-e2e` on failures, fix, repeat.
+**Cycle:** Same as above — run each test, **watch it fail**, use `/e2e:debug` on failures, fix, repeat.

Run: `mise run fmt && mise run lint`

@@ -215,7 +215,7 @@ Run these if the agent supports interactive multi-step sessions:
- `TestRewindAfterCommit` — Rewind to a checkpoint after committing
- `TestRewindMultipleFiles` — Rewind with multiple files changed

-**Cycle:** Same pattern — run, **watch it fail**, `/debug-e2e` on failures, fix, repeat.
+**Cycle:** Same pattern — run, **watch it fail**, `/e2e:debug` on failures, fix, repeat.

Run: `mise run fmt && mise run lint`

@@ -256,7 +256,7 @@ mise run test:e2e --agent $AGENT_SLUG TestFailingTestName

If a test passes when run individually but fails in the full suite, it's a flaky failure — not a real error. Only investigate failures that reproduce consistently when run in isolation.

-Fix any real failures before proceeding — the same cycle applies: read the failure, use `/debug-e2e {artifact-dir}`, implement the minimum fix, re-run.
+Fix any real failures before proceeding — the same cycle applies: read the failure, use `/e2e:debug {artifact-dir}`, implement the minimum fix, re-run.

All E2E tests must pass before writing unit tests.

@@ -321,7 +321,7 @@ At every E2E failure, follow this protocol:

1. **Read the test output** — the assertion message often tells you exactly what's wrong
2. **Find the artifact directory** — E2E tests save artifacts (logs, transcripts, git state) to a temp dir printed in the output
-3. **Run `/debug-e2e {artifact-dir}`** — this skill analyzes artifacts and diagnoses the root cause
+3. **Run `/e2e:debug {artifact-dir}`** — this skill analyzes artifacts and diagnoses the root cause
4. **Implement the minimum fix** — don't over-engineer; fix only what the test demands
5. **Re-run the failing test** — not the whole suite, just the one test

2 changes: 1 addition & 1 deletion .claude/skills/agent-integration/test-writer.md
@@ -199,7 +199,7 @@ Use `/commit` to commit all files.
- **Interactive tests**: Use `s.StartSession`, `s.Send`, `s.WaitFor` — tmux pane is auto-captured in artifacts
- **Run commands**: `mise run test:e2e --agent ${slug} TestName` — see `e2e/README.md` for all options
- **E2E tests are run during the implement phase**: This phase only creates the runner. The implement phase runs E2E tests at each tier to drive development.
-- **Debugging failures**: If tests fail during the implement phase, use `/debug-e2e` with the artifact directory to diagnose CLI-level issues (hooks, checkpoints, session phases, attribution)
+- **Debugging failures**: If tests fail during the implement phase, use `/e2e:debug` with the artifact directory to diagnose CLI-level issues (hooks, checkpoints, session phases, attribution)

## Output

32 changes: 32 additions & 0 deletions .claude/skills/e2e/SKILL.md
@@ -0,0 +1,32 @@
---
name: e2e
description: >
Orchestrate E2E test triage and fix implementation: runs triage-ci then implement sequentially.
Accepts test names, --agent, artifact path, or CI run reference.
For individual phases, use /e2e:triage-ci, /e2e:debug, or /e2e:implement.
Use when the user says "triage e2e", "fix e2e failures", or wants the full triage-to-fix pipeline.
---

# E2E Triage & Fix — Full Pipeline

Run triage-ci then implement sequentially. Parameters are collected once and reused across both phases.

## Parameters

The user provides one or more of:
- **Test name(s)** -- e.g., `TestInteractiveMultiStep`
- **`--agent <agent>`** -- optional, defaults to all agents that previously failed
- **A local artifact path** -- skip straight to analysis of existing artifacts
- **CI run reference** -- `latest`, a run ID, or a run URL
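
The three CI run reference forms can be reduced to either `latest` or a numeric run ID before any artifact download. The following is a minimal sketch only: `normalize_run_ref` is a hypothetical helper that is not part of this skill, and it assumes run URLs follow the GitHub Actions pattern `.../actions/runs/<id>`.

```bash
# Hypothetical helper (not part of the skill): normalize a CI run reference.
# Prints "latest" or a numeric run ID; returns 1 for anything else.
normalize_run_ref() {
  ref=$1
  case "$ref" in
    latest)
      echo "latest"
      return 0 ;;
    */actions/runs/*)
      # Keep what follows "/actions/runs/", then cut at the first non-digit
      # (drops a trailing "/job/..." path or "?query" suffix).
      id=${ref##*/actions/runs/}
      id=${id%%[!0-9]*}
      ;;
    *)
      id=$ref ;;
  esac
  case "$id" in
    ''|*[!0-9]*)
      echo "unrecognized run reference: $ref" >&2
      return 1 ;;
  esac
  echo "$id"
}
```

With this sketch, `normalize_run_ref 12345` prints `12345`, and a run URL ending in `/actions/runs/987/job/1` prints `987`. Anything that is neither `latest`, numeric, nor a run URL is rejected instead of being guessed at.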

## Phase 1: Triage CI

Read and follow the full procedure from `.claude/skills/e2e/triage-ci.md`.

This produces a findings report with classifications (flaky/real-bug/test-bug) for each test+agent pair.

## Phase 2: Implement Fixes

Read and follow the full procedure from `.claude/skills/e2e/implement.md`.

This phase uses the findings from Phase 1 (already in conversation context) to propose, apply, and verify fixes.
Original file line number Diff line number Diff line change
@@ -1,17 +1,12 @@
----
-name: debug-e2e
-description: Use when investigating E2E test failures from artifacts to diagnose bugs in the Entire CLI, or when pointed at an artifact path for root cause analysis
----
-
# Debug Entire CLI via E2E Artifacts

Diagnose Entire CLI bugs using captured artifacts from the E2E test suite. Artifacts are written to `e2e/artifacts/` locally or downloaded from CI via GitHub Actions.

## Inputs

The user provides either:
-- **A test run directory:** `e2e/artifacts/{timestamp}/` triage all failures
-- **A specific test directory:** `e2e/artifacts/{timestamp}/{TestName}-{agent}/` debug one test
+- **A test run directory:** `e2e/artifacts/{timestamp}/` -- triage all failures
+- **A specific test directory:** `e2e/artifacts/{timestamp}/{TestName}-{agent}/` -- debug one test
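
Because agent slugs can themselves contain hyphens (for example `claude-code`), a `{TestName}-{agent}` directory name is safest to split at the first hyphen. The sketch below assumes test names never contain a hyphen. `split_test_dir` is a hypothetical helper, not something the skill provides.

```bash
# Hypothetical helper: split "{TestName}-{agent}" at the FIRST hyphen.
# Assumption: test names never contain "-"; agent slugs may.
split_test_dir() {
  name=$(basename "$1")              # drops leading path and trailing slash
  echo "${name%%-*} ${name#*-}"      # "<TestName> <agent>"
}
```

For a path ending in `TestRewindAfterCommit-claude-code/`, this prints `TestRewindAfterCommit claude-code`. A name with no hyphen at all is echoed twice, so callers may want to treat that case as malformed.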

## Artifact Layout

@@ -32,7 +27,7 @@ e2e/artifacts/{timestamp}/

## Preserved Repo

-When the test run was executed with `E2E_KEEP_REPOS=1`, each test's artifact directory contains a `repo` symlink pointing to the preserved temporary git repository. This is the actual repo the test operated on you can inspect it directly.
+When the test run was executed with `E2E_KEEP_REPOS=1`, each test's artifact directory contains a `repo` symlink pointing to the preserved temporary git repository. This is the actual repo the test operated on -- you can inspect it directly.

**Navigate via the symlink** (e.g., `{artifact-dir}/repo/`) rather than resolving the `/tmp/...` path. The symlink lives inside the artifact directory so permissions and paths stay consistent.

@@ -42,7 +37,7 @@ The preserved repo contains:
- The `.claude/` directory (if Claude Code was the agent)
- All files the agent created or modified, in their final state

-This is the most powerful debugging tool you can run `git log`, `git diff`, `git show`, inspect `.entire/` internals, and see exactly what the CLI left behind.
+This is the most powerful debugging tool -- you can run `git log`, `git diff`, `git show`, inspect `.entire/` internals, and see exactly what the CLI left behind.

## Debugging Workflow

@@ -53,9 +48,9 @@ Read `report.nocolor.txt` to identify failures and their error messages. Each en
### 2. Read console.log (most important)

Full transcript of every operation:
-- `> claude -p "..." ...` agent prompts with stdout/stderr
-- `> git add/commit/...` git commands
-- `> send: ...` interactive session inputs
+- `> claude -p "..." ...` -- agent prompts with stdout/stderr
+- `> git add/commit/...` -- git commands
+- `> send: ...` -- interactive session inputs

This tells you what happened chronologically.

@@ -78,7 +73,7 @@ Cross-reference console.log (what happened) with the test (what should have happ

### 5. Deep dive files

-- **entire-logs/entire.log**: Structured JSON logs hook lifecycle, session phases (`active` `idle` `ended`), warnings, errors. Key fields: `component`, `hook`, `strategy`, `session_id`.
+- **entire-logs/entire.log**: Structured JSON logs -- hook lifecycle, session phases (`active` -> `idle` -> `ended`), warnings, errors. Key fields: `component`, `hook`, `strategy`, `session_id`.
- **git-log.txt**: Commit graph showing main branch, `entire/checkpoints/v1`, checkpoint initialization.
- **git-tree.txt**: Files at HEAD vs checkpoint branch (separated by `--- entire/checkpoints/v1 ---`).
- **checkpoint-metadata/**: `metadata.json` has `checkpoint_id`, `strategy`, `files_touched`, `token_usage`, and `sessions` array. Session subdirs have per-session details.
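
Since `git-tree.txt` holds two listings divided by the `--- entire/checkpoints/v1 ---` separator, it can help to view each half on its own. This is a rough sketch that assumes the separator sits alone on its own line. The function names are illustrative and not part of the skill.

```bash
# Illustrative helpers (names are assumptions, not part of the skill).

# Print the HEAD section: every line before the separator.
tree_head_section() {
  awk '$0 == "--- entire/checkpoints/v1 ---" { exit } { print }' "$1"
}

# Print the checkpoint-branch section: every line after the separator.
tree_checkpoint_section() {
  awk 'found { print } $0 == "--- entire/checkpoints/v1 ---" { found = 1 }' "$1"
}
```

For example, `tree_checkpoint_section {artifact-dir}/git-tree.txt` shows only the files on the checkpoint branch, which makes it easier to check whether the expected metadata was written.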
113 changes: 113 additions & 0 deletions .claude/skills/e2e/implement.md
@@ -0,0 +1,113 @@
# E2E Implement Fixes

Apply fixes for E2E test failures, verify with scoped E2E tests.

> **IMPORTANT: Running real E2E tests is a HARD REQUIREMENT of this procedure.**
> Every fix MUST be verified with real E2E tests before the summary step.
> Canary tests use the Vogon fake agent and cannot catch agent-specific issues.
> Do NOT skip E2E verification unless the user explicitly declines due to cost.

## Inputs

This procedure accepts findings from one of:
- **`/e2e:triage-ci` output** -- findings report already in conversation context
- **`/e2e:debug` output** -- root cause analysis already in conversation context
- **Standalone description** -- user describes known failure and desired fix

## Step 1: Identify Fixes

From the findings in context, identify actionable fixes:

### For `flaky` failures: describe the proposed fix

For agent-behavior flaky issues, fixes typically modify test prompts. For test-bug flaky issues, fixes target `e2e/` infrastructure code (harness setup, helpers, env propagation).

```
**Proposed fix:** <description>
- File: <path to test file or e2e infrastructure file>
- Change: <what will be modified -- e.g., append "Do not ask for confirmation" to prompt, or fix env propagation in NewTmuxSession>
```

Common flaky fixes:
- Agent asked for confirmation -> append "Do not ask for confirmation" to prompt
- Agent wrote to wrong path -> be more explicit about paths in prompt
- Agent committed when it shouldn't have -> add "Do not commit" to prompt
- Checkpoint wait timeout -> increase timeout argument
- Agent timeout (signal: killed) -> increase per-test timeout, simplify prompt
- Auth/env not propagated -> fix test harness env setup in `e2e/` code
- Test helper bug (wrong assertion, bad glob) -> fix test helper in `e2e/`
- tmux session setup issue -> fix `NewTmuxSession` or session config in `e2e/`

### For `real-bug` failures: describe root cause analysis

```
**Root cause analysis:**
- Component: <hooks | session | checkpoint | strategy | agent>
- Suspected location: <file:function>
- Description: <what's wrong and why>
- Proposed fix: <what code change would address it>
```

## Step 2: Ask the User

Prompt the user:

> **Should I fix these?**
> - [list of tests with classifications and proposed fixes]
> - You can select all, specific tests, or skip.

Wait for user response before proceeding.

## Step 3: Apply Fixes

For **flaky** fixes the user approved:
1. Apply fixes directly in the working tree (no branch creation)
2. Run static checks and canary tests:
```bash
mise run fmt && mise run lint
mise run test:e2e:canary # Must pass
```
3. **Run real E2E tests to verify the fix.** Scope depends on what was changed:
- **Agent-specific fix** (e.g., `e2e/agents/cursor_cli.go`, one agent's config/trust/env): run the full suite for that agent only:
```bash
mise run test:e2e --agent <agent>
```
- **Shared test infra fix** (e.g., `e2e/agents/agent.go`, `e2e/testutil/`, `TmuxSession`, test helpers): run the full suite for all agents that failed, since the fix could affect any of them:
```bash
mise run test:e2e --agent <agent1>
mise run test:e2e --agent <agent2>
# ... for each agent that had failures
```
- **Test prompt fix** (e.g., changed wording in a specific test): run that test across all agents that failed it:
```bash
mise run test:e2e --agent <agent> <TestName>
```
**This step is MANDATORY** -- canary tests use the Vogon fake agent and cannot verify agent-specific behavior (trust dialogs, env propagation, config directories, etc.).
4. If any step fails, investigate and adjust. Report what happened to the user.

For **real-bug** fixes the user approved:
1. Apply the fix directly in the working tree (no branch creation)
2. Run static checks, unit tests, and canary tests:
```bash
mise run fmt && mise run lint
mise run test # Unit tests
mise run test:e2e:canary # Canary tests
```
3. **Run real E2E tests to verify the fix (MANDATORY).** Same scoping rules as flaky fixes above:
- **Agent-specific change** -> full suite for that agent
- **Shared CLI/infra change** -> full suite for all agents that failed
- **Narrow change** (single test affected) -> just that test across affected agents
4. Report results to the user.

**GATE: Do NOT proceed to the summary until real E2E tests have been run and results reported for every fix applied above.** If E2E tests were not run, go back and run them now.
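
The scoping rules in Step 3 can be summarized as a small dry-run helper that prints the verification commands without running them. This is a sketch under assumptions: `e2e_verify_commands` and its `agent` / `shared` / `test` scope labels are hypothetical, and only the `mise run test:e2e --agent ...` command shapes come from this procedure.

```bash
# Hypothetical dry-run helper: print (do not run) the E2E commands for a fix.
# Usage: e2e_verify_commands <scope> <test-name-or-empty> <agent>...
#   scope "agent"  -> full suite for the one affected agent
#   scope "shared" -> full suite for every agent that had failures
#   scope "test"   -> a single test across every agent that failed it
e2e_verify_commands() {
  scope=$1
  test_name=$2
  shift 2
  for agent in "$@"; do
    case "$scope" in
      agent|shared)
        echo "mise run test:e2e --agent $agent" ;;
      test)
        echo "mise run test:e2e --agent $agent $test_name" ;;
      *)
        echo "unknown scope: $scope" >&2
        return 1 ;;
    esac
  done
}
```

For instance, `e2e_verify_commands test TestFoo claude-code opencode` prints one `mise run test:e2e --agent <agent> TestFoo` line per agent. Printing rather than executing leaves the cost decision with the user.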

## Step 4: Summary

Print a summary table:
```
| Test | Agent(s) | Classification | Action Taken |
|------|----------|----------------|--------------|
| TestFoo | claude-code | flaky | Fixed in working tree |
| TestBar | all agents | real-bug | Fix applied, tests passing |
| TestBaz | opencode | flaky | Skipped (user declined) |
```