Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
207 changes: 207 additions & 0 deletions docs/perfmon-live-data-audit.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,207 @@
# PerfMon Live-Data Audit

Date: 2026-02-08
Scope: Read-only audit of current PerfMon implementation and minimal path to live telemetry.

## 1) Current PerfMon Rendering And Toggle

- PerfMon is mounted in the top header controls in `ui/src/App.tsx:369` as `<PerfMonPanel isLive={isAgentLive} />`.
- The header is sticky and high stacking context (`ui/src/App.tsx:265`), with a shared tooltip provider wrapping controls (`ui/src/App.tsx:267`, `ui/src/App.tsx:394`).
- `isAgentLive` is derived from websocket agent state in `ui/src/App.tsx:87`:
- `wsState.agentStatus === 'running' || wsState.agentStatus === 'paused'`
- The panel open state is local in `ui/src/components/PerfMonPanel.tsx:25`.
- Open/close behavior:
- Button toggles state (`ui/src/components/PerfMonPanel.tsx:58`)
- Outside click closes (`ui/src/components/PerfMonPanel.tsx:32`)
- `Escape` closes (`ui/src/components/PerfMonPanel.tsx:38`)
- Panel mount behavior:
- Not portaled; rendered as absolute element under button (`ui/src/components/PerfMonPanel.tsx:80`)
- Uses `z-40` inside header context.
- Tooltip behavior:
- PerfMon button uses Radix tooltip (`ui/src/components/PerfMonPanel.tsx:69`)
- Tooltip content is portaled in shared wrapper (`ui/src/components/ui/tooltip.tsx:42`, `ui/src/components/ui/tooltip.tsx:44`).

## 2) Current Data Flow

### `usePerfMonMock` shape and cadence

`ui/src/hooks/usePerfMonMock.ts` returns:

- `tokens`
- `currentRun`
- `totalSession`
- `usagePercent`
- `cpu`
- `percent`
- `memory`
- `used`
- `total`
- `percent`
- `gpu`
- `available`
- `percent`
- `used`
- `total`

Generation model:

- Deterministic smooth wave + clamp (`const wave`, `const clamp`) in `ui/src/hooks/usePerfMonMock.ts:3`, `ui/src/hooks/usePerfMonMock.ts:6`.
- No random noise spikes.
- GPU availability flips periodically (`ui/src/hooks/usePerfMonMock.ts:53`).

Update cadence:

- `1s` while running, `3s` while idle (`ui/src/hooks/usePerfMonMock.ts:85`).
- Interval-based updates (`ui/src/hooks/usePerfMonMock.ts:94`).

### PerfMonPanel consumption

- `PerfMonPanel` consumes `usePerfMonMock(isLive)` directly (`ui/src/components/PerfMonPanel.tsx:27`).
- `isLive` affects:
- mock update rate (through hook)
- button/panel badge variant and label (`Live`/`Idle`) (`ui/src/components/PerfMonPanel.tsx:72`, `ui/src/components/PerfMonPanel.tsx:83`).

Assumption:

- “Agent running” is inferred only from websocket `agentStatus` in `App`.

## 3) Existing Live Sources In Repo

### WebSocket channel already used in app

- Client hook: `useProjectWebSocket(projectName)` in `ui/src/hooks/useWebSocket.ts:61`.
- Connects to `/ws/projects/{project_name}` (`ui/src/hooks/useWebSocket.ts:88`).
- Current state already tracked in UI hook:
- `progress`, `agentStatus`, `isConnected`, `activeAgents`, `orchestratorStatus`, `devServerStatus` (`ui/src/hooks/useWebSocket.ts:33`, `ui/src/hooks/useWebSocket.ts:39`, `ui/src/hooks/useWebSocket.ts:41`, `ui/src/hooks/useWebSocket.ts:46`, `ui/src/hooks/useWebSocket.ts:54`).
- Message handlers exist for:
- `progress` (`ui/src/hooks/useWebSocket.ts:104`)
- `agent_status` (`ui/src/hooks/useWebSocket.ts:116`)
- `orchestrator_update` (`ui/src/hooks/useWebSocket.ts:284`)
- `dev_server_status` (`ui/src/hooks/useWebSocket.ts:321`)

### Backend websocket wiring

- WebSocket endpoint in `server/main.py:167`, delegated to `project_websocket` (`server/main.py:170`).
- Handler in `server/websocket.py:719`.
- Existing push sources:
- progress polling task (`server/websocket.py:685`, `server/websocket.py:843`)
- agent output callback (`server/websocket.py:758`, registered at `server/websocket.py:810`)
- agent status callback (`server/websocket.py:795`, registered at `server/websocket.py:811`)
- dev server status callback (`server/websocket.py:827`, registered at `server/websocket.py:840`)

### REST status endpoints and polling patterns

- Agent status endpoint:
- UI call `getAgentStatus` at `ui/src/lib/api.ts:232`
- Backend route `server/routers/agent.py:70`
- Polling hook every 3s at `ui/src/hooks/useProjects.ts:140`, `ui/src/hooks/useProjects.ts:145`
- Features polling every 5s at `ui/src/hooks/useProjects.ts:82`, `ui/src/hooks/useProjects.ts:87`
- Project stats endpoint exists (`/stats`) for pass/in-progress totals only (`server/routers/projects.py:367`).

### Current WS contract does not include perf metrics

- `WSMessageType` and `WSMessage` in `ui/src/lib/types.ts:243`, `ui/src/lib/types.ts:318` do not include perf telemetry fields.

## 4) Minimal Live-Data Integration Plan

### Preferred transport

- Prefer existing project WebSocket path over new polling endpoint.
- Reason:
- Current page already keeps one WS open.
- Existing architecture already emits multiple live message types over this channel.
- Lower surface area than adding new endpoint + polling hook + cache invalidation path.

### Proposed API contract

Message type: `perf_metrics`

```json
{
"type": "perf_metrics",
"timestamp": "2026-02-08T20:15:30.123Z",
"project": "my-project",
"run": {
"status": "running",
"pid": 12345,
"started_at": "2026-02-08T20:10:00Z",
"run_id": "12345-2026-02-08T20:10:00Z"
},
"tokens": {
"current_run": 1420,
"total_session": 9180,
"available": false
},
"cpu": {
"percent": 37.2
},
"memory": {
"used_gb": 6.2,
"total_gb": 16.0,
"percent": 38.8
},
"gpu": {
"available": false,
"percent": null,
"vram_used_gb": null,
"vram_total_gb": null
}
}
```

### Error/empty states and dev fallback

- UI behavior:
- If no live payload yet, show empty placeholders (`Not available`) without errors.
- If stale payload (e.g. >10s old), mark as stale and dim values.
- Dev fallback to mock only:
- `import.meta.env.DEV && import.meta.env.VITE_PERFMON_MOCK !== '0'`
- Use mock when live payload absent or disabled by backend.
- Production:
- No automatic fake fallback; show unavailable states instead.

## 5) Risks / Maintainer Concerns

### Security

- Host-level metrics may expose machine characteristics.
- In remote mode (`AUTOFORGE_ALLOW_REMOTE` in `server/main.py:97`), exposure risk is higher.
- Keep payload coarse and project-scoped, avoid file paths/usernames/process cmdline leakage.
- Ensure project-name validation and existing WS auth/path checks remain the gate.

### Performance

- Target 1s updates while running, 3s when idle to match current UI cadence expectations.
- Avoid rerender churn:
- do not update PerfMon state if values change minimally
- optionally pause updates when panel is closed (or keep lower-frequency store updates).

### Cross-platform

- CPU/memory are feasible with existing server dependency `psutil`.
- GPU/VRAM portability is inconsistent across OS/vendors.
- Contract should support `gpu.available=false` and null GPU values.

## 6) Implementation Checklist (5-8 Small Steps)

1. Add perf telemetry message types to shared UI contracts in `ui/src/lib/types.ts` (`WSPerfMetricsMessage`, union update).
2. Extend websocket UI state in `ui/src/hooks/useWebSocket.ts` with `perfMetrics` and add `case 'perf_metrics'`.
3. Add backend schema class(es) in `server/schemas.py` for perf payload (optional but recommended for contract clarity).
4. In `server/websocket.py`, add a lightweight perf sampling loop inside `project_websocket` that emits `perf_metrics` at 1s/3s cadence.
5. Source initial live fields from existing manager state:
- `status`, `pid`, `started_at` from process manager already used in `server/routers/agent.py:78`.
- CPU/memory from `psutil`.
- GPU optional/null when unavailable.
6. Update `ui/src/components/PerfMonPanel.tsx` to consume live metrics from WS first, with mock fallback in dev only.
7. Add UI stale/unavailable states and keep existing `Live`/`Idle` badge behavior.
8. Add a focused test for panel rendering with telemetry payload shape (or hook-level test), without adding a new test framework.

## Recommended Next PR Scope

Preferred scope: **add server websocket perf message + UI wiring** in one small PR.

Rationale:

- Delivers true live data immediately.
- Reuses established transport and state patterns already central to this screen.
- Keeps diff localized to websocket contract + PerfMon consumption code, avoiding new endpoint and polling complexity.
61 changes: 61 additions & 0 deletions server/websocket.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@
from typing import Set

from fastapi import WebSocket, WebSocketDisconnect
import psutil

from .schemas import AGENT_MASCOTS
from .services.chat_constants import ROOT_DIR
Expand Down Expand Up @@ -716,6 +717,60 @@ async def poll_progress(websocket: WebSocket, project_name: str, project_dir: Pa
break


def _collect_perf_metrics(agent_manager) -> dict:
"""Collect coarse, non-sensitive metrics for PerfMon."""
# System-level metrics are robust across platforms and do not expose paths/usernames.
cpu_percent = float(psutil.cpu_percent(interval=None))
memory = psutil.virtual_memory()

started_at = None
if agent_manager.started_at is not None:
started_at = agent_manager.started_at.isoformat()

return {
"type": "perf_metrics",
"timestamp": datetime.now().isoformat(),
"run": {
"status": agent_manager.status,
"pid": agent_manager.pid,
"started_at": started_at,
},
"tokens": {
"available": False,
"current_run": None,
"total_session": None,
},
"cpu": {
"percent": round(cpu_percent, 1),
},
"memory": {
"used_gb": round(memory.used / (1024 ** 3), 1),
"total_gb": round(memory.total / (1024 ** 3), 1),
"percent": round(float(memory.percent), 1),
},
"gpu": {
"available": False,
"percent": None,
"vram_used_gb": None,
"vram_total_gb": None,
},
}


async def poll_perf_metrics(websocket: WebSocket, agent_manager):
"""Send PerfMon telemetry over WebSocket at adaptive cadence."""
while True:
try:
await websocket.send_json(_collect_perf_metrics(agent_manager))
interval = 1 if agent_manager.status in ("running", "paused") else 3
await asyncio.sleep(interval)
except asyncio.CancelledError:
raise
except Exception as e:
logger.warning(f"Perf metrics polling error: {e}")
break


async def project_websocket(websocket: WebSocket, project_name: str):
"""
WebSocket endpoint for project updates.
Expand Down Expand Up @@ -841,6 +896,7 @@ async def on_dev_status_change(status: str):

# Start progress polling task
poll_task = asyncio.create_task(poll_progress(websocket, project_name, project_dir))
perf_task = asyncio.create_task(poll_perf_metrics(websocket, agent_manager))

try:
# Send initial agent status
Expand Down Expand Up @@ -889,10 +945,15 @@ async def on_dev_status_change(status: str):
finally:
# Clean up
poll_task.cancel()
perf_task.cancel()
try:
await poll_task
except asyncio.CancelledError:
pass
try:
await perf_task
except asyncio.CancelledError:
pass

# Unregister agent callbacks
agent_manager.remove_output_callback(on_output)
Expand Down
4 changes: 4 additions & 0 deletions ui/src/App.tsx
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ import { KeyboardShortcutsHelp } from './components/KeyboardShortcutsHelp'
import { ThemeSelector } from './components/ThemeSelector'
import { ResetProjectModal } from './components/ResetProjectModal'
import { ProjectSetupRequired } from './components/ProjectSetupRequired'
import { PerfMonPanel } from './components/PerfMonPanel'
import { getDependencyGraph, startAgent } from './lib/api'
import { Loader2, Settings, Moon, Sun, RotateCcw, BookOpen } from 'lucide-react'
import type { Feature } from './lib/types'
Expand Down Expand Up @@ -83,6 +84,7 @@ function App() {
useAgentStatus(selectedProject) // Keep polling for status updates
const wsState = useProjectWebSocket(selectedProject)
const { theme, setTheme, darkMode, toggleDarkMode, themes } = useTheme()
const isAgentLive = wsState.agentStatus === 'running' || wsState.agentStatus === 'paused'

// Get has_spec from the selected project
const selectedProjectData = projects?.find(p => p.name === selectedProject)
Expand Down Expand Up @@ -364,6 +366,8 @@ function App() {
<TooltipContent>Docs</TooltipContent>
</Tooltip>

<PerfMonPanel isLive={isAgentLive} perfMetrics={wsState.perfMetrics} />

{/* Theme selector */}
<ThemeSelector
themes={themes}
Expand Down
Loading