feat: add ambient-control-plane with Kubernetes reconcilers and gRPC watch #815

Open
markturansky wants to merge 8 commits into main from feat/ambient-control-plane

@markturansky
Contributor

Summary

Adds the ambient-control-plane component that watches the ambient-api-server via gRPC streams and reconciles desired state into Kubernetes. This implements the controller pattern similar to kube-controller-manager.

  • Initial list-sync + gRPC watch: performs paginated list calls via the Go SDK, then subscribes to gRPC watch streams for real-time updates
  • Resource reconcilers: Session, Project, and ProjectSettings reconcilers for Kubernetes CRD management
  • Two operating modes: kube (production K8s reconciliation) and test (tally reconcilers for testing)
  • Graceful shutdown: signal handling with context cancellation propagation
  • Thread-safe caching: mutex-protected resource caches with event synthesis
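
To make the "mutex-protected resource caches with event synthesis" item concrete, here is a minimal sketch of the idea. It is not the code in this PR: `Cache`, `Session`, `EventType`, and the `Upsert`/`Delete` methods are hypothetical names chosen for illustration. The sketch assumes the cache decides between an Added and a Modified event by checking whether the ID was already present.

```go
package main

import (
	"fmt"
	"sync"
)

// EventType, Session, ResourceEvent and Cache are hypothetical names for this sketch.
type EventType string

const (
	Added    EventType = "ADDED"
	Modified EventType = "MODIFIED"
	Deleted  EventType = "DELETED"
)

type Session struct {
	ID    string
	Phase string
}

type ResourceEvent struct {
	Type   EventType
	Object Session
}

// Cache is a mutex-protected map of sessions keyed by ID.
type Cache struct {
	mu    sync.Mutex
	items map[string]Session
}

func NewCache() *Cache { return &Cache{items: make(map[string]Session)} }

// Upsert stores s and synthesizes Added or Modified depending on
// whether the ID was already cached. The lock is released before returning
// so the caller can dispatch the event without holding it.
func (c *Cache) Upsert(s Session) ResourceEvent {
	c.mu.Lock()
	_, existed := c.items[s.ID]
	c.items[s.ID] = s
	c.mu.Unlock()

	if existed {
		return ResourceEvent{Type: Modified, Object: s}
	}
	return ResourceEvent{Type: Added, Object: s}
}

// Delete removes id and returns a Deleted event carrying the last cached
// value. The boolean is false when the ID was not cached.
func (c *Cache) Delete(id string) (ResourceEvent, bool) {
	c.mu.Lock()
	old, ok := c.items[id]
	delete(c.items, id)
	c.mu.Unlock()

	if !ok {
		return ResourceEvent{}, false
	}
	return ResourceEvent{Type: Deleted, Object: old}, true
}

func main() {
	c := NewCache()
	fmt.Println(c.Upsert(Session{ID: "s1", Phase: "Pending"}).Type)
	fmt.Println(c.Upsert(Session{ID: "s1", Phase: "Running"}).Type)
	ev, _ := c.Delete("s1")
	fmt.Println(ev.Type, ev.Object.Phase)
}
```

The same shape works for both the initial list-sync and the live watch stream, since either source only needs to call `Upsert` or `Delete`.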

Architecture

ambient-api-server (REST + gRPC)
        │
        │ initial sync: SDK list calls (paginated)  
        │ live updates: gRPC watch streams
        ▼
   ┌─────────┐
   │ Informer│──── cache + event synthesis ──→ ResourceEvent
   └─────────┘                                      │
        │                                           ▼
        │                              ┌──────────────────┐
        └──────────────────────────────│    Reconcilers    │
                                       └──────────────────┘
                                       Session | Project | ProjectSettings

Test Plan

  • Unit tests pass with race detector: make test
  • Code formatting and linting: make fmt && make lint (0 issues)
  • Binary builds successfully: make binary
  • Stress test handles 1000 concurrent events correctly
  • Integration test against live ambient-api-server
  • E2E test of full session lifecycle (pending API server deployment)

🤖 Generated with Claude Code

markturansky and others added 2 commits March 4, 2026 21:29
Add SessionWatcher for real-time session event streaming via gRPC,
session watch types, and a watch example. Adds google.golang.org/grpc
and ambient-api-server proto dependencies.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Fix defer cancel() blocker: store timeoutCancel on watcher, called in Stop()
- Replace deprecated grpc.DialContext/WithBlock with grpc.NewClient
- Replace fragile string trimming in deriveGRPCAddress with url.Parse + net.JoinHostPort
- Extract hardcoded port 4434 to grpcDefaultPort constant
- Run go mod tidy to correct indirect markers on direct deps

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
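
For readers following the `deriveGRPCAddress` change, the commit's approach (url.Parse plus net.JoinHostPort, with the port in a `grpcDefaultPort` constant) can be sketched as below. The function body is an illustration rather than the code in this PR, and it assumes the gRPC port always replaces whatever port the HTTP URL carries.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// grpcDefaultPort mirrors the constant named in the commit message.
const grpcDefaultPort = "4434"

// deriveGRPCAddress maps an API base URL to a host:port gRPC target.
// Assumption for this sketch: the HTTP port is discarded and replaced
// by grpcDefaultPort.
func deriveGRPCAddress(apiURL string) (string, error) {
	u, err := url.Parse(apiURL)
	if err != nil {
		return "", fmt.Errorf("parse API URL %q: %w", apiURL, err)
	}
	host := u.Hostname() // strips the port and any IPv6 brackets
	if host == "" {
		return "", fmt.Errorf("API URL %q has no host", apiURL)
	}
	// JoinHostPort re-adds brackets for IPv6 literals.
	return net.JoinHostPort(host, grpcDefaultPort), nil
}

func main() {
	for _, raw := range []string{
		"https://api.example.com/api/ambient/v1",
		"http://localhost:8000",
		"http://[::1]:8000",
	} {
		addr, err := deriveGRPCAddress(raw)
		fmt.Println(addr, err)
	}
}
```

Compared with string trimming, this handles paths, explicit ports, and IPv6 literals without special cases.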
  Introduces a Go-based control plane that watches the ambient-api-server and reconciles state into Kubernetes, following the controller pattern. Supports three modes: kube (K8s reconciliation), local (direct process spawning), and test (tally only).

  Key features:
  - gRPC watch streams for real-time event processing
  - Multi-mode operation with environment-based configuration
  - Session, Project, and ProjectSettings reconcilers
  - Local mode with AG-UI proxy and process management
  - Comprehensive test coverage and documentation
@markturansky force-pushed the feat/ambient-control-plane branch from 7817469 to 377176e on March 5, 2026 02:54
@github-actions
Contributor

github-actions bot commented Mar 5, 2026

test

@github-actions
Contributor

github-actions bot commented Mar 5, 2026

Claude Code Review

Summary

This PR adds the ambient-control-plane component - a new Go service that watches ambient-api-server via gRPC streams and reconciles desired state into Kubernetes CRDs. The architecture is well-designed: informer + cache + dispatcher is sound, exponential-backoff reconnection is correct, and write-back echo detection via UpdatedAt timestamps is elegant. The kubeclient and tally unit tests are thorough.

The new Deployment manifest ships without a container SecurityContext, the core reconciler logic lacks unit tests, and the Makefile silently removes image-tag override support.


Issues by Severity

Blocker Issues: None.


Critical Issues

1. Deployment manifest missing container SecurityContext
File: components/manifests/base/ambient-control-plane-service.yml
The ambient-control-plane container has no securityContext block. Per security-standards.md, all containers must set allowPrivilegeEscalation: false and capabilities.drop: [ALL]. This also violates the Container SecurityContext requirements in CLAUDE.md.
Fix: add to the container spec: allowPrivilegeEscalation: false, readOnlyRootFilesystem: true, capabilities.drop: [ALL].


Major Issues

2. Core reconciler logic has no unit tests
File: components/ambient-control-plane/internal/reconciler/reconciler.go (935 lines)
SessionReconciler, ProjectReconciler, and ProjectSettingsReconciler have zero test coverage. handleAdded/handleModified/handleDeleted, writeStatusToAPI, isWritebackEcho are all untested despite excellent tests for tally.go and kubeclient.go. buildSpec, sessionToUnstructured, namespaceForSession are pure functions that are straightforwardly testable.
Violates: backend-development.md Pre-Commit Checklist.
Fix: Add reconciler_test.go covering: buildSpec field combinations, crNameForSession/namespaceForSession edge cases, SessionReconciler.handleAdded via fake KubeClient, isWritebackEcho timestamp equality, ProjectReconciler.ensureNamespace paths.

3. MODE validation rejects documented local mode
File: components/ambient-control-plane/internal/config/config.go:978-981
Component CLAUDE.md and architecture.md document three modes: kube, local, test. config.go only accepts kube and test, with the error message "must be one of kube, test", which directly contradicts the docs.
Fix: Either implement local mode or update CLAUDE.md/architecture.md to remove it until implemented.

4. Makefile removes IMAGE_TAG override support
File: Makefile:59-78
The IMAGE_TAG ?= latest variable (enabling make build-all IMAGE_TAG=v1.2.3) is removed. All image names are now hardcoded to :latest. CI pipelines using this override will silently produce :latest images.
Fix: Restore IMAGE_TAG ?= latest and use it for all image names including the new CONTROL_PLANE_IMAGE.

5. PostgreSQL pod-level SecurityContext removed without justification
File: components/manifests/base/ambient-api-server-db.yml
Pod-level securityContext (runAsUser: 999, runAsGroup: 999, fsGroup: 999) removed. PGDATA env var also dropped, changing the default data directory to the image default. Removing explicit identity settings makes the security posture implicit.
Violates: security-standards.md principle of least privilege.


Minor Issues

6. Duplicate deploy script added without extension
File: components/manifests/deploy (new 444-line file, identical to deploy.sh)
Both files are updated in this PR, so changes must now be maintained in two places. Fix: keep only deploy.sh.

7. WatchEvent.Object uses any, violating the component standard
File: components/ambient-control-plane/internal/watcher/watcher.go
The field is declared as Object any, and any is an alias for interface{}. Component CLAUDE.md says "No interface{} in new code". architecture.md acknowledges this as a known limitation, but at minimum the field needs a // TODO: replace with typed union or generics comment.

8. Documentation import path mismatch
File: components/ambient-control-plane/docs/api-surface.md
Doc shows github.com/ambient/platform/... but go.mod uses github.com/ambient-code/platform/... (missing -code).

9. Direct type assertion in test code
File: components/ambient-control-plane/internal/kubeclient/kubeclient_test.go
session.Object["spec"].(map[string]interface{}) is a direct type assertion. backend-development.md requires unstructured.NestedMap() instead, even in tests.


Positive Highlights

  • Write-back echo detection: storing UpdatedAt.Truncate(time.Microsecond) in a sync.Map and comparing on subsequent watch events is an elegant solution to the infinite-update-loop problem.
  • ResourceObject union type in informer.go: avoids any at the informer/reconciler boundary via a typed struct with constructor helpers.
  • Exponential backoff with jitter: 30s cap with 25% jitter, correct thundering-herd prevention.
  • Conflict retry loop in updateCR: re-fetches on 409 Conflict before retrying, up to maxUpdateRetries - correct K8s optimistic concurrency.
  • kubeclient_test.go: comprehensive CRUD, namespace isolation, GVR constant verification, and round-trip spec tests.
  • Input validation via isValidK8sName applied consistently before all K8s resource creation.
  • Dockerfile: UBI9 base, non-root USER 1001, stripped binary (-s -w ldflags).

Recommendations (priority order)

  1. Add container securityContext to ambient-control-plane-service.yml (required by platform standards).
  2. Add unit tests for the three reconcilers using the existing fake KubeClient pattern.
  3. Resolve the local MODE inconsistency - implement it or remove it from the docs.
  4. Restore IMAGE_TAG in the Makefile.
  5. Remove the duplicate deploy script.

@github-actions
Contributor

github-actions bot commented Mar 5, 2026

Summary

This PR introduces the ambient-control-plane component: a Go service that watches the ambient-api-server via gRPC streams and reconciles desired state into Kubernetes CRDs, namespaces, and role bindings. The architecture is well-designed (informer + reconciler pattern mirroring kube-controller-manager), the concurrency model is sound, and test coverage is strong. However, one security gap in the Kubernetes manifest requires a fix before merge.


Issues by Severity

Blocker Issues

1. Missing container SecurityContext on ambient-control-plane Deployment

  • File: components/manifests/base/ambient-control-plane-service.yml
  • Problem: The container spec has no securityContext. Per security-standards.md, all containers must set allowPrivilegeEscalation: false and capabilities.drop: [ALL]. The container currently runs without these hardening settings.
  • Standard violated: security-standards.md Container Security section.
  • Fix — add to the container spec:
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop: ["ALL"]

Critical Issues

2. any type in WatchEvent.Object violates the component's own standards

  • File: components/ambient-control-plane/internal/watcher/watcher.go (WatchEvent struct)
  • Problem: WatchEvent.Object any requires unsafe type assertions throughout informer.go. The component CLAUDE.md states: "No interface{} in new code — use generics or concrete types." Architecture docs acknowledge this as a known limitation, but that does not exempt new code from the standard.
  • Suggested fix: Open a follow-up issue and add // TODO(issue-NNN): replace any with generic type comments in watcher.go and informer.go before merge.

3. local mode documented but rejected by config validation

  • Files: components/ambient-control-plane/internal/config/config.go lines 978-981 and component CLAUDE.md
  • Problem: The component CLAUDE.md documents three modes: kube, local, and test. The config switch only permits kube and test. Any deployment using MODE=local fails at startup. This is a broken documentation contract.
  • Fix: Either add "local" to the switch case, or remove local from CLAUDE.md until it is implemented.

Major Issues

4. Removal of IMAGE_TAG override from root Makefile

  • File: Makefile
  • Problem: IMAGE_TAG ?= latest was removed; :latest is hardcoded in all image variables. This breaks make build-all IMAGE_TAG=v1.2.3 for versioned CI/CD releases.
  • Fix: Restore IMAGE_TAG ?= latest and use it in all image names including the new CONTROL_PLANE_IMAGE.

5. Pod-level SecurityContext removed from PostgreSQL Deployment

  • File: components/manifests/base/ambient-api-server-db.yml
  • Problem: Pod-level securityContext (runAsUser: 999, runAsGroup: 999, fsGroup: 999) was removed. The UBI PostgreSQL image may depend on these for /var/lib/pgsql/data ownership, potentially causing volume permission errors on non-OCP clusters.
  • Suggestion: Investigate whether the UBI image handles this via OCP SCC. If yes, document why it was removed; if no, restore the pod-level securityContext.

6. Direct type assertion in test code

  • File: components/ambient-control-plane/internal/kubeclient/kubeclient_test.go (~line 1860 of diff)
  • Problem: spec := session.Object["spec"].(map[string]interface{}) is the exact anti-pattern banned by backend-development.md (Type-Safe Unstructured Access). It will panic if spec is absent or is not a map.
  • Fix: Use unstructured.NestedMap(session.Object, "spec") with proper error/not-found handling.
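
To illustrate why the single-value assertion is risky, here is a stdlib-only sketch. `nestedMap` below is a simplified stand-in written for this example, not the apimachinery implementation; it mirrors the (value, found, error) return shape of unstructured.NestedMap but, unlike the real helper, does not deep-copy the result.

```go
package main

import "fmt"

// nestedMap walks obj along fields using comma-ok assertions so that a
// missing or mistyped field is reported instead of causing a panic.
// Simplified stand-in for unstructured.NestedMap (no deep copy).
func nestedMap(obj map[string]interface{}, fields ...string) (map[string]interface{}, bool, error) {
	cur := obj
	for _, f := range fields {
		v, ok := cur[f]
		if !ok {
			return nil, false, nil // field absent: not found, no error
		}
		m, ok := v.(map[string]interface{})
		if !ok {
			return nil, false, fmt.Errorf("field %q is %T, not a map", f, v)
		}
		cur = m
	}
	return cur, true, nil
}

func main() {
	session := map[string]interface{}{
		"spec": map[string]interface{}{"prompt": "hello"},
	}
	spec, found, err := nestedMap(session, "spec")
	fmt.Println(spec, found, err)

	// A missing field is reported, where a direct assertion would panic.
	_, found, err = nestedMap(map[string]interface{}{}, "spec")
	fmt.Println(found, err)
}
```

In a test, the not-found and error results can be turned into explicit failures, which gives a clearer message than a panic stack trace.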

7. New component not wired into .pre-commit-config.yaml

  • File: .pre-commit-config.yaml (not modified by this PR)
  • Problem: components/ambient-control-plane is absent from all Go lint hooks (gofmt, go vet, golangci-lint), so commits to this component bypass repo-level lint gates entirely.
  • Fix: Add ambient-control-plane to the relevant Go hook files matchers in .pre-commit-config.yaml.

Minor Issues

8. Silent error discard in buildRestConfig

  • File: components/ambient-control-plane/internal/kubeclient/kubeclient.go
  • Problem: home, _ := os.UserHomeDir() silently discards the error. Per error-handling.md, errors must not be swallowed. On failure localPath becomes "/.kube/config" with no diagnostic.
  • Fix: Log the error at debug level before continuing the fallback chain.

9. No liveness/readiness probes on Deployment

  • File: components/manifests/base/ambient-control-plane-service.yml
  • Problem: Without probes, Kubernetes cannot detect if the control plane is wedged or disconnected from the gRPC stream.

10. Trailing blank lines in RBAC kustomization

  • File: components/manifests/base/rbac/kustomization.yaml
  • Problem: Two trailing blank lines at end of file. Would be caught by end-of-file-fixer pre-commit if the component were wired in (see issue 7).

Positive Highlights

  • Excellent concurrency design. Write lock held during cache mutations and released before dispatch avoids deadlock. The dispatchBlocking channel pattern (buffered at 256) correctly prioritizes event delivery over backpressure, and the architecture doc explains exactly why.

  • Write-back echo detection via lastWritebackAt sync.Map cleanly prevents infinite update loops, well-documented in both CLAUDE.md and architecture.md.

  • Thorough test coverage. kubeclient_test.go (601 lines) covers all CRUD, namespace isolation, and round-trip fidelity. tally_test.go covers concurrent access, snapshot isolation, and ID deduplication. Race detector enabled in CI.

  • Correct use of errors.IsNotFound / errors.IsConflict throughout reconcilers. The conflict retry loop in updateCR with re-fetch is the correct Kubernetes optimistic concurrency pattern per error-handling.md.

  • Strong input validation. isValidK8sName with a compiled regex is called before every K8s resource creation, preventing name injection from API server data.

  • Context-aware backoff with jitter in watcher.go (exponential up to 30s, 25% jitter) is the correct gRPC stream reconnection pattern.

  • Dockerfile hardening: multi-stage build, USER 1001, stripped binary, UBI minimal base image.


Recommendations

  1. [Must fix before merge] Add securityContext to container in ambient-control-plane-service.yml: allowPrivilegeEscalation: false and capabilities.drop: [ALL] (Blocker 1).
  2. [Should fix before merge] Resolve local mode mismatch between config.go and CLAUDE.md (Critical 3).
  3. [Should fix before merge] Restore IMAGE_TAG override in root Makefile (Major 4).
  4. [Should address] Open a tracked issue for any-typed WatchEvent.Object and add TODO comments in affected files (Critical 2).
  5. [Should address] Wire ambient-control-plane into .pre-commit-config.yaml (Major 7).
  6. [Consider] Investigate and document the PostgreSQL pod SecurityContext removal (Major 5).

Review generated by Claude Code (claude-sonnet-4-6) using amber.review

markturansky and others added 5 commits March 4, 2026 22:11
SDK go.mod updated from 1.21 to 1.24.0 (required by gRPC deps);
sync CLI module to match.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…dcoding

Generator now parses the base path (e.g. /api/ambient/v1) from the spec's
path keys instead of hardcoding /api/ambient-api-server/v1. Updates model,
parser, and all three language templates (Go, Python, TypeScript). Regenerates
all SDK output with the correct /api/ambient/v1 base path.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…ace condition

- Add handlePhaseTransition() method to detect "Stopping" phase and set desired-phase=Stopped annotation
- Remove phase write-back in crStatusToStatusPatch() to prevent race conditions
- Add auto-creation of ambient-runner-secrets in operator namespace handler
- Bridge gap between API server /stop endpoint and Kubernetes operator pod deletion

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add k8sinformer package for watching AgenticSession CRs across namespaces
- Implement HandleK8sCREvent for operator→API status writeback
- Add WatchAgenticSessions and GetDynamicClient to kubeclient
- Update production kustomization to use OpenShift internal registry
- Add Dockerfile.simple for streamlined container builds
- Update Go dependencies (gRPC 1.79.1, OAuth2, OpenTelemetry)
- Fix phase writeback race condition by avoiding "Stopping" phase sync

Enables real-time synchronization between Kubernetes operator state
changes and API server database, completing the control plane integration.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@ambient-code

ambient-code bot commented Mar 5, 2026

Merge Readiness — Blockers Found

Check               Status  Detail
CI                  FAIL    Failing: Unit Tests
Merge conflicts     pass
Review comments     FAIL    Blocker: Missing SecurityContext on ambient-control-plane
Jira hygiene        warn    No Jira reference found
Staleness           pass
Diff overlap risk   pass

This comment is auto-generated by the PR Overview workflow and will be updated when the PR changes.
