
feat: complete enterprise hardening and storage lock reliability#34

Open
ndycode wants to merge 5 commits into feat/enterprise-hardening from fix/pr32-feedback

Conversation


@ndycode ndycode commented Mar 3, 2026

Summary

  • add enterprise hardening baseline (audit, ABAC, idempotency, retention, secret scan hooks, CI gates)
  • harden secret key derivation and rotation command handling
  • add file-based lock stale cleanup to prevent storage deadlocks
  • add runbooks and operational docs

Validation

  • npm test
  • npm run lint
  • npm run typecheck
  • npm run build

note: greptile review for oc-chatgpt-multi-auth. cite files like lib/foo.ts:123. confirm regression tests + windows concurrency/token redaction coverage.

Greptile Summary

this pr applies a broad enterprise hardening baseline across storage concurrency, token redaction, data retention resilience, supply-chain gating, and operational runbooks. the core storage change introduces a three-layer lock ordering (accountFileMutex → file lock → storageMutex) via withStorageSerializedFileLock, which makes cross-process save ordering deterministic and explicit. accompanying changes add EDECRYPT fast-fail, releaseStorageLockFallback, url-stripping in audit logs, error-message sanitization in background jobs, and a token-set license check fix.

key issues found:

  • lib/file-lock.ts — when handle.writeFile throws (e.g. windows EPERM from av-scanner), the lock file is left on disk empty/partial. stale-detection cannot parse its pid and skips the unlink, blocking all further acquisitions for up to 120 s. both the async and sync paths are affected. no vitest coverage for this branch.
  • lib/data-retention.ts — withRetentionIoRetry wraps the entire recursive pruneDirectoryByAge call rather than individual leaf i/o ops, causing exponential retry amplification on deep trees and under-counting removals across aborted passes. the new throw on non-ENOENT errors also changes behaviour from skip-and-continue to abort-entire-cycle, which can stop a full retention run if one file is persistently av-locked on windows.
  • lib/unified-settings.ts — sync temp path uses no random suffix (${pid}.${Date.now()}.tmp), creating a cross-process collision risk for same-millisecond writes; the async path already uses a random suffix.
  • lib/background-jobs.ts — sanitizeErrorMessage token regex character class [A-Z0-9._-] misses standard base64 chars + and /, allowing partial token leakage for non-url-safe tokens.
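A minimal sketch of the widened redaction pattern for the last issue above; the 20-character minimum run length and the exact class are assumptions, and the real pattern in lib/background-jobs.ts may differ:

```typescript
// Hypothetical widened pattern: adds "+", "/", and "=" so non-url-safe base64
// tokens are fully redacted; the length threshold (20) is an assumption.
const TOKEN_PATTERN = /[A-Za-z0-9._+\/=-]{20,}/g;

function redactTokens(message: string): string {
  return message.replace(TOKEN_PATTERN, "[redacted]");
}
```

Short words stay untouched because they never reach the length threshold; only long contiguous runs of token-looking characters are replaced.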

Confidence Score: 2/5

  • not safe to merge — two logic bugs can cause cross-process lock deadlocks and abort full retention cycles on windows deployments.
  • the file-lock orphaned-file bug is a credible 120 s deadlock path on any windows filesystem where av-scanner triggers EPERM on write. the data-retention recursive-retry wrapping is a correctness and availability issue. both are in core storage paths that run on every desktop install. the rest of the hardening (audit redaction, license check, cursor validation, ci gates) is solid.
  • lib/file-lock.ts and lib/data-retention.ts need fixes before merge; lib/unified-settings.ts sync temp path is a lower-priority hardening item.

Important Files Changed

Filename Overview
lib/file-lock.ts adds write/close error separation to guarantee fd is always closed — but leaves the lock file on disk when writeFile throws, risking a 120 s deadlock window on windows EPERM scenarios where stale-detection cannot parse an empty lock file.
lib/data-retention.ts adds windows-friendly EBUSY/EPERM retry wrapper — but wraps the entire recursive pruneDirectoryByAge call in withRetentionIoRetry, causing exponential retry amplification on deep trees and under-counted removal totals; it also adds a throw on non-ENOENT errors that can abort full prune cycles on persistent locks.
lib/storage.ts introduces withStorageSerializedFileLock (queue mutex → file lock → storage mutex), EDECRYPT fast-fail error code, releaseStorageLockFallback for ENOENT-guarded cleanup, and isStoredAccountCandidate guard — concurrency reasoning is explicit and lock ordering is consistent across all mutation paths.
lib/unified-settings.ts moves file lock from inside writeSettingsRecord* to callers to cover the full read-modify-write; write is correctly protected, but sync temp path lacks random suffix — same-millisecond cross-process writes could collide.
lib/background-jobs.ts adds sanitizeErrorMessage to redact emails/tokens in dead-letter and warning logs, and adds 429 as a retryable status code; token regex misses standard base64 chars (+/) but covers the common jwt/oauth2 surface area.
lib/codex-manager.ts adds digit-only guard to decodePaginationCursor before parseInt — prevents crafted base64 payloads from slipping through the finite/non-negative checks; clean fix.
index.ts strips query params from url before audit logging via URL.origin+pathname, replaces raw error message with errorType in audit metadata — good token-leakage reduction; fallback to raw url on parse failure is a minor residual risk.
.github/workflows/ci.yml coverage-gate job now depends on test job (needs: test), downloads the dist artifact, and uses the correct npm run test:coverage script — ordering and artifact handoff look correct.
scripts/license-policy-check.js replaces substring includes check with token-set membership check, handles object-style license fields and arrays — fixes false positives like MIT-0 matching MIT.
test/storage.test.ts adds EDECRYPT fast-fail test, EBUSY/EPERM lock.release fallback tests, and a concurrent-save serialization test — good coverage of the new storage lock paths; no race test for the new three-layer mutex ordering.
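The license-policy fix in the table above can be sketched as a token-set membership test. The tokenizer below is an assumption about how scripts/license-policy-check.js splits SPDX-style expressions:

```typescript
// Sketch: split an SPDX-ish license expression into tokens and test exact
// membership, so "MIT-0" no longer matches an allowlist entry "MIT" by
// substring. Operator handling here is an assumption.
function licenseTokens(expr: string): string[] {
  const operators = new Set(["AND", "OR", "WITH"]);
  return expr
    .split(/[()\s]+/)
    .filter((token) => token.length > 0 && !operators.has(token.toUpperCase()));
}

function isAllowed(expr: string, allowlist: Set<string>): boolean {
  const tokens = licenseTokens(expr);
  return tokens.length > 0 && tokens.every((token) => allowlist.has(token));
}
```

With this shape, the old substring check's false positive ("MIT-0".includes("MIT") is true) disappears, while compound expressions like "(MIT OR Apache-2.0)" still pass when every token is allowlisted.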

Sequence Diagram

sequenceDiagram
    participant C as Caller
    participant QM as accountFileMutex<br/>(in-process queue)
    participant FL as acquireFileLock<br/>(cross-process .lock file)
    participant SM as storageMutex<br/>(in-process queue)
    participant FS as Filesystem

    C->>QM: withAccountFileMutex()
    QM-->>C: queued (waits for prev)
    C->>FL: acquireFileLock(path.lock)
    Note over FL,FS: wx open → write PID → close
    FL-->>C: lock handle
    C->>SM: withStorageLock()
    SM-->>C: queued (waits for prev)
    C->>FS: loadAccountsInternal / saveAccountsUnlocked
    FS-->>C: result
    C->>SM: release storageMutex
    C->>FL: lock.release() [unlink .lock]
    alt release throws non-ENOENT
        FL-->>C: error
        C->>FS: releaseStorageLockFallback (fs.rm --force)
    end
    C->>QM: release accountFileMutex

Comments Outside Diff (2)

  1. General comment

    orphaned lock file on write failure — windows filesystem deadlock risk

    when handle.writeFile throws (common on windows due to av-scanner holding the fd briefly — EPERM), the file was already created by fs.open("wx"). the error path closes the handle and throws, but never unlinks the file. subsequent acquireFileLock calls hit EEXIST and fall into the stale-detection path, which tries to JSON.parse the lock file. an empty or partially-written file causes the parse to throw, and the stale cleanup skips the unlink — leaving the lock stuck for the full staleAfterMs (120 s).

    same problem exists in acquireFileLockSync at the equivalent block.

    no vitest coverage for this branch — add a test that injects EPERM on handle.writeFile and asserts the lock file is removed before the next attempt.
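A hedged sketch of the requested fix; acquireLockSketch and its injectable write parameter are hypothetical stand-ins for the real acquireFileLock in lib/file-lock.ts, and exist only so the failure can be simulated:

```typescript
import { promises as fs } from "node:fs";
import type { FileHandle } from "node:fs/promises";

// Sketch: if the PID write fails after fs.open("wx") created the file, close
// the handle AND unlink the half-created lock file before rethrowing, so the
// next acquirer never sees an empty, unparseable lock.
async function acquireLockSketch(
  lockPath: string,
  write: (handle: FileHandle, data: string) => Promise<void> = (h, d) => h.writeFile(d),
): Promise<void> {
  const handle = await fs.open(lockPath, "wx"); // throws EEXIST if already held
  try {
    await write(handle, JSON.stringify({ pid: process.pid, acquiredAt: Date.now() }));
  } catch (error) {
    await handle.close().catch(() => {});
    await fs.unlink(lockPath).catch(() => {}); // the missing cleanup step
    throw error;
  }
  await handle.close();
}
```

The same unlink-on-failure would apply to the sync variant with fs.unlinkSync.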

    Location: lib/file-lock.ts, lines 311-332.
  2. General comment

    write happens outside the file lock in the refactored writeSettingsRecordSync

    the old code held the file lock for the entire temp-write → rename sequence. the refactored writeSettingsRecordSync writes the temp file and renames before any lock is acquired — the lock is now only acquired in the callers (saveUnifiedPluginConfigSync, etc.). since writeSettingsRecordSync is called inside the locked block in the callers, the write is actually protected today.

    however, the rename loop in writeSettingsRecordSync happens while the lock is held, which is correct. but the temp path for the sync variant is ${path}.${process.pid}.${Date.now()}.tmp — no random suffix. two same-process sync writes occurring at the same millisecond timestamp would overwrite each other's temp file. the file lock serialises same-process callers, but a cross-process scenario (multiple terminals starting at the same ms) could clobber the temp file before rename. the async variant already uses a random suffix — apply the same pattern here for consistency and safety.
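A minimal sketch of the suggested suffix, mirroring the pattern the comment attributes to the async variant; the exact format is hypothetical:

```typescript
import { randomBytes } from "node:crypto";

// Sketch: add a random suffix to the sync temp path so two processes writing
// in the same millisecond cannot share (and clobber) a temp file.
function tempPathFor(targetPath: string): string {
  const suffix = randomBytes(6).toString("hex");
  return `${targetPath}.${process.pid}.${Date.now()}.${suffix}.tmp`;
}
```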

    Location: lib/unified-settings.ts, lines 133-162.


Last reviewed commit: 0c48823

Greptile also left 1 inline comment on this PR.

Context used:

  • Rule from dashboard - What: Every code change must explain how it defends against Windows filesystem concurrency bugs and ... (source)

- cleanup stale/dead process lock artifacts before acquiring account lock

- ensure lock release always attempts fallback cleanup

- keep clearAccounts/saveTransactions serialized across file and memory locks

Co-authored-by: Codex <noreply@openai.com>

coderabbitai bot commented Mar 3, 2026

📝 Walkthrough

Summary

This PR implements enterprise hardening with critical storage lock reliability improvements to resolve race conditions and file system failures in multi-process scenarios. While the changes are major in scope and affect core storage architecture, comprehensive regression tests (1990+ lines in storage tests alone) validate the three-layer lock serialization, encryption failure handling, and concurrent save scenarios, mitigating data-loss and corruption risks.

Key Architectural Changes

Three-Layer Lock Serialization: Introduces deterministic lock acquisition order (in-process accountFileMutex → file lock → storage mutex) via withStorageSerializedFileLock wrapper applied to all storage mutations (saveAccounts, withAccountStorageTransaction, clearAccounts). This prevents concurrent file corruption and race conditions during parallel saves.
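The three-layer ordering can be sketched with a minimal promise-queue mutex. Only the names accountFileMutex, storageMutex, and withStorageSerializedFileLock come from the PR; the implementations below, including the file-lock stand-in, are assumptions:

```typescript
// Minimal in-process queue mutex: callers are chained onto a single promise
// tail, so at most one fn runs at a time, in arrival order.
class QueueMutex {
  private tail: Promise<void> = Promise.resolve();

  run<T>(fn: () => Promise<T>): Promise<T> {
    const result = this.tail.then(fn);
    // keep the chain alive whether fn resolves or rejects
    this.tail = result.then(() => undefined, () => undefined);
    return result;
  }
}

const accountFileMutex = new QueueMutex();
const storageMutex = new QueueMutex();

// stand-in for the cross-process .lock-file acquisition
async function withFileLock<T>(fn: () => Promise<T>): Promise<T> {
  return fn();
}

async function withStorageSerializedFileLock<T>(fn: () => Promise<T>): Promise<T> {
  // fixed acquisition order: in-process queue, then file lock, then storage mutex
  return accountFileMutex.run(() => withFileLock(() => storageMutex.run(fn)));
}
```

Because every mutation path takes the layers in the same order, two callers can never hold them in opposite orders, which is what rules out lock-order deadlocks.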

Stale Lock Cleanup: Implements file-based recovery from crashed processes with 120-second timeout and PID liveness checks. When withAccountFileLock runs cleanup before acquiring, it removes stale locks from dead processes, addressing Windows filesystem and EBUSY/EPERM scenarios.
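The PID liveness probe can be sketched with signal 0, which performs existence and permission checks without delivering a signal; how lib/storage.ts combines this with the 120-second age threshold is assumed:

```typescript
// Sketch: ESRCH means no such process (safe to treat the lock as stale);
// EPERM means the process exists but is owned by another user (treat as alive).
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (error) {
    const code = (error as NodeJS.ErrnoException).code;
    return code === "EPERM";
  }
}
```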

Encryption-First Error Handling: Introduces StorageError class and STORAGE_DECRYPT_ERROR_CODE (EDECRYPT) to propagate decryption failures explicitly, ensuring failed decrypt attempts bypass fallback migrations and clearly signal misconfiguration of CODEX_AUTH_ENCRYPTION_KEY.
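A hedged sketch of the error shape described above; any field beyond the code, and the idempotency behavior, are assumptions about lib/storage.ts:

```typescript
// Hypothetical shapes matching the names in the summary: a typed error class
// plus a classifier that tags decrypt failures with EDECRYPT.
const STORAGE_DECRYPT_ERROR_CODE = "EDECRYPT";

class StorageError extends Error {
  constructor(message: string, readonly code: string) {
    super(message);
    this.name = "StorageError";
  }
}

function toStorageDecryptError(cause: unknown): StorageError {
  if (cause instanceof StorageError && cause.code === STORAGE_DECRYPT_ERROR_CODE) {
    return cause; // already classified: pass through unchanged
  }
  return new StorageError(
    "failed to decrypt stored accounts; check CODEX_AUTH_ENCRYPTION_KEY",
    STORAGE_DECRYPT_ERROR_CODE,
  );
}
```

Callers can then check error.code === "EDECRYPT" and fail fast instead of falling through to migration paths.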

Best-Effort Lock Release: Adds releaseStorageLockFallback using fs.rm(..., { force: true }) on non-ENOENT release failures (EBUSY/EPERM), ensuring cleanup even when normal release fails on Windows or during filesystem contention.
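A minimal sketch of the fallback, relying on fs.rm's documented force: true behavior of suppressing missing-path errors; the logging call is a placeholder for whatever debug logger the module uses:

```typescript
import { promises as fs } from "node:fs";

// Sketch: best-effort cleanup. force: true makes a missing lock file a no-op,
// and any remaining failure (e.g. EBUSY) is logged, never rethrown.
async function releaseStorageLockFallback(lockPath: string): Promise<void> {
  try {
    await fs.rm(lockPath, { force: true });
  } catch (error) {
    console.debug(`storage lock fallback cleanup failed: ${lockPath}`, error);
  }
}
```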

Data Protection & Test Coverage

  • Backup & WAL (Write-Ahead Log): Rotating backup mechanism with .wal journal files and recovery candidates validation preserves historical backups and enables recovery from incomplete operations.
  • Regression Tests: 136 new lines in test/storage.test.ts cover encrypted storage decryption failures, lock release EBUSY/EPERM fallback, concurrent save serialization, and backup rotation integrity. Tests in test/file-lock.test.ts, test/background-jobs.test.ts, and test/data-retention.test.ts validate retry logic and transient error handling.
  • Concurrent Save Serialization: Tests explicitly verify that concurrent saves do not corrupt accounts and that lock acquisition is properly serialized.

Public API Changes

Breaking Changes to Lock Signatures: acquireFileLock() and acquireFileLockSync() now accept an options object ({ maxAttempts?, baseDelayMs?, maxDelayMs?, staleAfterMs? }); all call sites must be updated. Test harness was refactored to dynamically import and validate this new signature.
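A sketch of how baseDelayMs and maxDelayMs from the new options object might combine; the actual formula and any jitter in lib/file-lock.ts are assumptions:

```typescript
// Hypothetical capped exponential backoff: delay doubles per attempt, never
// exceeding maxDelayMs.
function backoffDelayMs(attempt: number, baseDelayMs: number, maxDelayMs: number): number {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
}
```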

New Exports: StorageError class and error handling via toStorageDecryptError helper provide explicit error classification for encryption failures.

Security & Operational Hardening

  • Audit Logging: Replaces raw URLs with safeAuditResource (normalized from URL with fallback), replaces error objects with errorType classification to prevent sensitive data leakage in audit logs.
  • Error Redaction: sanitizeErrorMessage in background jobs redacts sensitive content before logging DLQ entries and warning messages.
  • Idempotent Secret Rotation: Runbook updated to include --idempotency-key parameter for rotate-secrets command with guidance on stable run-id sourcing (e.g., weekly-YYYYMMDD).
  • License Policy & Supply Chain: Improved token-based license matching and added devDependency @cyclonedx/cyclonedx-npm@4.2.0; CI workflows now enforce supply-chain checks before coverage gates.

Risk Assessment

Blockers: None identified; stale lock cleanup only triggers on non-ENOENT errors, and encryption failures are explicitly handled without silent fallbacks.

Data Loss Risks: Mitigated by WAL journaling, rotating backups, and explicit test coverage for concurrent scenarios and recovery paths.

Breaking Changes: Lock API signatures require updates in all call sites (already tested in updated worker harness).

walkthrough

introduces serialized, platform-aware storage locking and error handling. adds withStorageSerializedFileLock, an in-process accountFileMutex, StorageError with decrypt error code handling, and best-effort lock-release fallback. replaces direct file-lock usage in account mutation paths. (lib/storage.ts:1)

changes

Cohort / File(s) Summary
account storage locking & errors
lib/storage.ts
adds StorageError, STORAGE_DECRYPT_ERROR_CODE, toStorageDecryptError, accountFileMutex, withStorageSerializedFileLock, withAccountFileMutex, and releaseStorageLockFallback. refactors withAccountStorageTransaction, saveAccounts, clearAccounts, and normalization/decrypt paths to use serialized lock ordering and to rethrow decrypt errors. (lib/storage.ts:1)
file lock primitives
lib/file-lock.ts
captures write and close errors separately in both async and sync acquireFileLock implementations to avoid masking failures and improve error propagation. (lib/file-lock.ts:1)
tests touching storage & locks
test/storage.test.ts, test/file-lock.test.ts, test/unified-settings.test.ts
adds tests for decrypt-fail fast path, lock.release fallback behavior, deterministic concurrent saves, and worker-based dynamic import for file-lock. test harness updated to pass module URL to worker and to assert file-lock acquisition ordering. (test/storage.test.ts:1, test/file-lock.test.ts:1, test/unified-settings.test.ts:1)
retention & retry hygiene
lib/data-retention.ts, test/data-retention.test.ts, lib/file-lock.ts
adds retry wrappers for transient IO errors (with exponential backoff) around retention IO ops and test helpers that assert retry behavior. (lib/data-retention.ts:1, test/data-retention.test.ts:1)

sequence diagram(s)

sequenceDiagram
    autonumber
    participant caller as "caller"
    participant inproc as "in-process mutex\n(rgba(70,130,180,0.5))"
    participant filequeue as "file queue / file-lock\n(rgba(34,139,34,0.5))"
    participant storagemtx as "storage mutex\n(rgba(218,165,32,0.5))"
    participant fs as "filesystem\n(rgba(128,0,128,0.5))"

    caller->>inproc: request serialized lock
    inproc->>filequeue: enqueue / acquire file lock
    filequeue->>fs: create/acquire lock file
    filequeue-->>inproc: lock acquired
    inproc->>storagemtx: acquire storage mutex
    storagemtx-->>caller: perform storage mutation
    caller->>storagemtx: release storage mutex
    storagemtx-->>inproc: signal release
    inproc->>filequeue: release file lock
    filequeue->>fs: remove lock file (or leave for fallback)
    filequeue-->>inproc: fully released

estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

notes for review

  • concurrency risk: verify deterministic acquisition order and absence of deadlocks across withStorageSerializedFileLock, accountFileMutex, and file locks. inspect lib/storage.ts:1 for ordering and all error/release branches.
  • missing regression tests: ensure explicit contention and crash/recovery tests exist for serialized locking and fallback cleanup. add tests targeting withStorageSerializedFileLock and releaseStorageLockFallback (see test/storage.test.ts:1 for related coverage; more focused tests appear required).
  • windows edge cases: validate lock file removal, rename, and fallback semantics under windows (delete/rename failures). review lib/storage.ts:1 and lib/file-lock.ts:1 for platform-aware hints and ENOENT handling.
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check (⚠️ Warning): description covers summary and what changed comprehensively; validation checklist is present but incomplete (missing build checkmark and all docs/governance items unchecked); risk/rollback section missing. Resolution: complete the validation checklist with all npm run commands; add risk level and rollback plan sections; check relevant docs/governance items or explicitly explain why they don't apply.
  • Docstring Coverage (⚠️ Warning): docstring coverage is 15.79%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

  • Title check (✅ Passed): title follows conventional commits format with type 'feat' and a lowercase imperative summary under 72 characters; clearly relates to the main changes in lock reliability and hardening.


Ensure account-storage mutations keep deterministic ordering while preserving the historical file-lock-before-in-process-mutex acquisition sequence.

Co-authored-by: Codex <noreply@openai.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/storage.ts`:
- Around line 382-386: withAccountFileLock currently calls
cleanupDeadProcessStorageLock(lockPath) before calling
acquireFileLock(lockPath,...), which creates a race where two processes can both
remove a stale lock then race to create it; instead remove the external cleanup
call and rely on acquireFileLock to handle stale-lock removal (or move the
cleanup logic into acquireFileLock). Update withAccountFileLock to stop calling
cleanupDeadProcessStorageLock and ensure acquireFileLock (in lib/file-lock.ts)
invokes removeIfStale/remove-or-cleanup atomically so stale locks are handled
inside acquireFileLock rather than externally.
- Around line 373-379: The catch block around the storage read should not call
releaseStorageLockFallback when the error is an EBUSY from
fs.readFile/JSON.parse; update the catch in the function that reads the lock
(the block that currently inspects (error as NodeJS.ErrnoException).code) to
explicitly handle error.code === "ENOENT" (return) and error.code === "EBUSY"
(return or rethrow as appropriate) and only call
releaseStorageLockFallback(lockPath) for other error types (e.g., parse errors).
Reference the existing catch scope that uses (error as
NodeJS.ErrnoException).code and the releaseStorageLockFallback(lockPath) call to
implement this conditional behavior.
- Around line 123-139: The comment claiming "file lock -> in-process mutex"
contradicts the implementation in withStorageSerializedFileLock (which currently
calls withAccountFileMutex → withAccountFileLock → withStorageLock), so either
update the comment to state the actual acquisition order (in-process mutex ->
file lock -> storage lock) or reorder the calls to match the comment;
specifically, either change the comment near withStorageSerializedFileLock to
reflect the real sequence or change the implementation to call
withAccountFileLock(path, () => withAccountFileMutex(() => withStorageLock(fn)))
so the acquisition becomes file lock -> in-process mutex -> storage lock (and
ensure variable/closure usage still compiles).
- Around line 340-346: The catch block in releaseStorageLockFallback currently
swallows all errors from fs.rm; update it to log a debug-level message including
the lockPath and the caught error so failed cleanup is observable (e.g., use the
existing logger or processLogger if available), while preserving the
"best-effort" behavior by not rethrowing; ensure you reference
releaseStorageLockFallback and the fs.rm call so the log includes both the path
and error details.
- Around line 355-362: When detecting and removing a stale lock (the branch that
computes lockPid and isDeadProcess using process.kill) and the separate
age-based cleanup branch, add observability logs using the existing logger
(e.g., processLogger or storage logger used elsewhere in this module) that
record the action and context: include the lock path/name, lockPid, whether
isDeadProcess was true, the lock age (timestamp or computed age), and a concise
reason ("stale: dead PID" or "stale: age threshold exceeded"). Place one log
right before or immediately after the dead-process cleanup path (where
isDeadProcess is true) and another log in the age-based cleanup path to make
both events visible to operators. Ensure the log messages are structured and
include these fields so they match ops runbook expectations.
- Around line 390-401: The current finally always calls
releaseStorageLockFallback(lockPath) even when lock.release() succeeds; change
the flow so the fallback is only invoked when lock.release() throws or otherwise
fails: try calling await lock.release() and on catch (error) check the error
code like in the existing block, log the warning, then call await
releaseStorageLockFallback(lockPath) from inside that catch (or when a boolean
“failedToRelease” flag is set) so the fallback only runs on failure; update
references to lock.release, releaseStorageLockFallback, lockPath and preserve
the ENOENT handling/logging behavior.
- Around line 348-380: Add explicit debug/info/error logs inside
cleanupDeadProcessStorageLock to record: when the lock file is read and parsed
(include pid and acquiredAt), when process.kill(pid, 0) indicates the process is
alive (skipping cleanup) or throws ESRCH (releasing lock), when EPERM or other
errors occur (and whether lock is stale), when
releaseStorageLockFallback(lockPath) is invoked, and when ENOENT or JSON parse
errors are encountered; use the existing processLogger (or the module logger)
and include lockPath and relevant metadata in each message. Also add vitest
regression tests for cleanupDeadProcessStorageLock covering: (1) process exists
and lock is fresh — assert no releaseStorageLockFallback call, (2) process dead
(mock process.kill to throw ESRCH) — assert releaseStorageLockFallback called,
(3) process exists but process.kill throws EPERM — assert no release unless lock
is stale (simulate acquiredAt older than
ACCOUNT_STORAGE_LOCK_OPTIONS.staleAfterMs), and (4) missing or corrupted lock
file — assert fallback release or ENOENT path; mock fs.readFile, process.kill,
and releaseStorageLockFallback to verify log calls and behavior.

ℹ️ Review info
Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0d15b434-50c2-4cf6-8218-c272deab5dd3

📥 Commits

Reviewing files that changed from the base of the PR and between d1c603e and d267377.

📒 Files selected for processing (1)
  • lib/storage.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (1)
lib/**

⚙️ CodeRabbit configuration file

focus on auth rotation, windows filesystem IO, and concurrency. verify every change cites affected tests (vitest) and that new queues handle EBUSY/429 scenarios. check for logging that leaks tokens or emails.

Files:

  • lib/storage.ts
🧬 Code graph analysis (1)
lib/storage.ts (1)
lib/file-lock.ts (1)
  • acquireFileLock (57-107)
🔇 Additional comments (2)
lib/storage.ts (2)

112-112: lgtm - mutex declaration is straightforward.

the accountFileMutex follows the same pattern as storageMutex at line 111.


1318-1321: regression tests do exist for the locking pattern—they just weren't cited in the pr.

test/storage.test.ts:175 directly exercises withAccountStorageTransaction with concurrent promise.all calls and verifies serialization without data loss. test/storage-async.test.ts:295 covers the mutex behavior with rapid sequential saves and concurrent writes. test/storage.test.ts:1407+ tests ebusy/eperm retry logic, covering windows filesystem edge cases. these tests validate the lock-order changes in withAccountStorageTransaction, saveAccounts, and clearAccounts at lib/storage.ts:1318, 1334, 1346.

Likely an incorrect or invalid review comment.

Drop pre-acquire dead-process lock cleanup and only run fallback lock-file deletion when lock.release() fails. Also align lock-order comment and add debug observability for fallback cleanup failures.

Co-authored-by: Codex <noreply@openai.com>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@lib/storage.ts`:
- Around line 123-138: The new file-lock serialization helpers
withAccountFileMutex and withStorageSerializedFileLock and their use in
saveAccounts lack tests covering lock.release() error paths; add unit tests in
test/storage.test.ts that mock the lock.release() implementation to throw EBUSY
and EPERM and assert that releaseStorageLockFallback is invoked and saveAccounts
still completes (no uncaught rejection), add a test that simulates concurrent
saveAccounts calls to verify ordering under lock contention (use mocked locks
that delay release) and add a Windows-specific test case where lock.release()
throws EPERM to validate the fallback cleanup path; reference the functions
withAccountFileMutex, withStorageSerializedFileLock, saveAccounts,
releaseStorageLockFallback and the lock.release() call when adding the mocks and
assertions.

ℹ️ Review info
Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3d9bb546-70b4-4fee-b0a7-5413a4e724b5

📥 Commits

Reviewing files that changed from the base of the PR and between d267377 and 1f516d1.

📒 Files selected for processing (1)
  • lib/storage.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (1)
lib/**

⚙️ CodeRabbit configuration file

focus on auth rotation, windows filesystem IO, and concurrency. verify every change cites affected tests (vitest) and that new queues handle EBUSY/429 scenarios. check for logging that leaks tokens or emails.

Files:

  • lib/storage.ts
🧬 Code graph analysis (1)
lib/storage.ts (1)
lib/file-lock.ts (1)
  • acquireFileLock (57-107)
🔇 Additional comments (7)
lib/storage.ts (7)

112-112: lgtm: in-process mutex matches existing storageMutex pattern.

the accountFileMutex at lib/storage.ts:112 and withAccountFileMutex at lib/storage.ts:123-130 mirror the existing storageMutex/withStorageLock implementation. pattern is correct for serializing async operations within a single process.

Also applies to: 123-130


132-138: lock order comment now matches implementation.

lib/storage.ts:133-134 correctly states acquisition order: file-queue mutex → file lock → storage mutex. code at lines 135-137 does exactly that: withAccountFileMutexwithAccountFileLockwithStorageLock. past review comment resolved.


339-348: fallback now has debug observability as requested.

lib/storage.ts:343-346 logs cleanup failures at debug level with path and error. addresses past review comment about silent swallowing.


350-370: release fallback only fires on failure—double cleanup fixed.

lib/storage.ts:357-367 now only calls releaseStorageLockFallback when lock.release() throws (and it's not ENOENT). this addresses the past review about unconditional double cleanup.

one edge case worth noting: if lock.release() throws EBUSY (windows antivirus holding the file), releaseStorageLockFallback uses fs.rm({ force: true }) which will also likely fail with EBUSY and log at debug level. lock file may remain until next acquire cleans it via removeIfStale in lib/file-lock.ts:91-93. this is probably acceptable for "best-effort" but could leave stale locks on heavily-contested windows systems.


1285-1288: withAccountStorageTransaction now uses serialized lock path.

lib/storage.ts:1285-1288 routes through withStorageSerializedFileLock, ensuring consistent acquisition order with saveAccounts and clearAccounts. this eliminates the possibility of lock-order deadlocks between these paths.


1299-1304: saveAccounts uses serialized lock—consistent with transaction path.

lib/storage.ts:1301-1303 wraps saveAccountsUnlocked in withStorageSerializedFileLock. matches the acquisition order used by withAccountStorageTransaction.


1311-1336: clearAccounts uses serialized lock—consistent with other mutation paths.

lib/storage.ts:1313-1335 wraps the entire clear operation in withStorageSerializedFileLock. all three mutation paths (withAccountStorageTransaction, saveAccounts, clearAccounts) now share the same lock acquisition sequence.
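the shared acquisition sequence can be illustrated with an in-process sketch — AsyncMutex and withFileLock below are simplified stand-ins, and the real second layer takes a cross-process lock on disk:

```typescript
// Simplified sketch of the serialized order: file-queue mutex -> file lock
// -> storage mutex. Not the actual lib/storage.ts implementation.
type Release = () => void;

class AsyncMutex {
	private tail: Promise<void> = Promise.resolve();
	async lock(): Promise<Release> {
		let release!: Release;
		const next = new Promise<void>((resolve) => { release = resolve; });
		const prev = this.tail;
		this.tail = prev.then(() => next);
		await prev;
		return release;
	}
}

const accountFileMutex = new AsyncMutex();
const storageMutex = new AsyncMutex();
const order: string[] = [];

// Stand-in for the cross-process file lock; real code touches the disk here.
async function withFileLock<T>(fn: () => Promise<T>): Promise<T> {
	order.push("fileLock");
	try { return await fn(); } finally { order.push("fileUnlock"); }
}

async function withStorageSerializedFileLock<T>(fn: () => Promise<T>): Promise<T> {
	const releaseQueue = await accountFileMutex.lock(); // layer 1: per-process queue
	order.push("accountFileMutex");
	try {
		return await withFileLock(async () => {               // layer 2: cross-process lock
			const releaseStorage = await storageMutex.lock(); // layer 3: storage mutex
			order.push("storageMutex");
			try { return await fn(); } finally { releaseStorage(); }
		});
	} finally { releaseQueue(); }
}
```

because every mutation path funnels through the same nesting, no two paths can ever hold the three layers in different orders, which is what rules out lock-order deadlocks.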

Add regression tests for lock.release() EBUSY/EPERM fallback cleanup and concurrent save lock serialization to close outstanding PR feedback.

Co-authored-by: Codex <noreply@openai.com>
@ndycode
Owner Author

ndycode commented Mar 4, 2026

@greptile pls review

Address unresolved review threads by hardening audit/log redaction, lock scope, cursor validation, retention retries, and storage decrypt behavior; align workflows/docs/scripts; and add regression tests for all actionable items.

Co-authored-by: Codex <noreply@openai.com>
Comment on lines 78 to 79
const fullPath = join(path, entry.name);
try {

wrapping entire recursive call in withRetentionIoRetry amplifies retries and under-counts removals

pruneDirectoryByAge is recursive. wrapping the whole call in withRetentionIoRetry means a single transient EBUSY on any file anywhere in the subtree causes the entire tree-traversal to restart from scratch (up to 5 times per nesting level → 5^depth max attempts on deep trees).

this also breaks the removed counter: files successfully deleted in an aborted pass appear as ENOENT in the retry pass (correctly skipped) but their removal is not counted in the final return value, causing the caller to under-report removals.

the throw error added at line ~102 further means a persistently-locked file (e.g., av-locked log on windows that never clears in 5 retries) now aborts the entire prune cycle rather than skipping that file as the old code did.

recommend applying withRetentionIoRetry only around the individual leaf i/o calls (stat, unlink, rmdir) rather than the recursive call, and restoring the per-file skip behaviour for non-retryable errors:

// apply retry only to the leaf ops — not to the entire recursive call
removed += await pruneDirectoryByAge(fullPath, maxAgeMs);
const childEntries = await withRetentionIoRetry(() => fs.readdir(fullPath));

and in the per-file catch block, log and continue rather than re-throw for non-ENOENT errors so one locked file cannot abort the whole cycle.
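a hedged sketch of that recommendation — retry wraps only the leaf i/o calls, the recursive call is not wrapped, and a persistently failing file is skipped rather than aborting the cycle (helper names are illustrative, not the actual lib/data-retention.ts exports):

```typescript
import { promises as fs } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const RETRYABLE = new Set(["EBUSY", "EPERM", "EACCES", "EAGAIN"]);

// Illustrative stand-in for withRetentionIoRetry: retries one i/o operation,
// never a whole recursive traversal.
async function withIoRetry<T>(op: () => Promise<T>, attempts = 5): Promise<T> {
	for (let i = 0; ; i++) {
		try {
			return await op();
		} catch (error) {
			const code = (error as { code?: string }).code ?? "";
			if (i >= attempts - 1 || !RETRYABLE.has(code)) throw error;
			await new Promise((resolve) => setTimeout(resolve, 2 ** i * 10));
		}
	}
}

// Sketch of the recommended shape: retry wraps only readdir/stat/unlink.
async function pruneDirectoryByAge(path: string, maxAgeMs: number): Promise<number> {
	let removed = 0;
	const entries = await withIoRetry(() => fs.readdir(path, { withFileTypes: true }));
	for (const entry of entries) {
		const fullPath = join(path, entry.name);
		try {
			if (entry.isDirectory()) {
				removed += await pruneDirectoryByAge(fullPath, maxAgeMs); // no retry wrapper here
				continue;
			}
			const stats = await withIoRetry(() => fs.stat(fullPath));
			if (Date.now() - stats.mtimeMs > maxAgeMs) {
				await withIoRetry(() => fs.unlink(fullPath));
				removed += 1;
			}
		} catch (error) {
			const code = (error as { code?: string }).code;
			if (code === "ENOENT") continue; // removed concurrently: fine
			console.warn(`retention: skipping ${fullPath}: ${String(code)}`); // skip, don't abort
		}
	}
	return removed;
}
```

with this shape a transient EBUSY costs at most 5 attempts on one leaf op, the removed counter only counts deletions this pass actually performed, and one av-locked file cannot stop the rest of the prune.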



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
lib/storage.ts (1)

930-947: ⚠️ Potential issue | 🟠 Major

v1 migration output is bypassed when rebuilding normalized accounts.

lib/storage.ts:930 computes baseStorage via migrateV1ToV3, but lib/storage.ts:936 iterates rawAccounts instead of baseStorage.accounts. this can skip account-level migration normalization before decrypt/dedupe.

proposed fix
-  const rawActiveIndex = clampIndex(activeIndexValue, rawAccounts.length);
-  const activeKey = extractActiveKey(rawAccounts, rawActiveIndex);
+  const sourceAccounts = Array.isArray(baseStorage.accounts) ? baseStorage.accounts : [];
+  const rawActiveIndex = clampIndex(activeIndexValue, sourceAccounts.length);
+  const activeKey = extractActiveKey(sourceAccounts as unknown[], rawActiveIndex);
@@
-	const validAccounts: AccountMetadataV3[] = [];
-	for (const account of rawAccounts) {
+	const validAccounts: AccountMetadataV3[] = [];
+	for (const account of sourceAccounts) {
 		if (!isStoredAccountCandidate(account)) {
 			continue;
 		}
@@
-    const clampedRawIndex = clampIndex(rawIndex, rawAccounts.length);
-    const familyKey = extractActiveKey(rawAccounts, clampedRawIndex);
+    const clampedRawIndex = clampIndex(rawIndex, sourceAccounts.length);
+    const familyKey = extractActiveKey(sourceAccounts as unknown[], clampedRawIndex);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/storage.ts` around lines 930 - 947, The code builds baseStorage using
fromVersion and migrateV1ToV3 but then iterates rawAccounts, bypassing any
migration/normalization; change the loop to iterate baseStorage.accounts (or the
normalized accounts array on AccountStorageV3) when populating validAccounts so
decryptAccountSensitiveFields runs on migrated data, keeping the existing
try/catch that converts errors via toStorageDecryptError and preserving
validAccounts accumulation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@index.ts`:
- Around line 1725-1732: The fallback for safeAuditResource currently returns
the raw url when new URL(url) throws, which can leak query/hash into auditLog
(see auditLog in lib/audit.ts); update the fallback in the safeAuditResource
IIFE (and the analogous spot around line 1739) to strip query and hash from the
raw string before returning (e.g., remove everything from the first '?' or '#'
onward) so only origin+pathname or sanitized path is logged; ensure the
sanitized value is what gets passed to auditLog/resource.

In `@lib/background-jobs.ts`:
- Around line 54-57: getDelayMs currently uses a deterministic exponential
backoff which causes worker synchronization; modify getDelayMs in
lib/background-jobs.ts to compute the capped exponential base delay (as it does
now) then apply a randomized jitter of about ±20% (e.g., multiply by 1 +
randomBetween(-0.2, 0.2)) and return an integer ms to avoid synchronized retries
on 429/EBUSY. Keep the existing cap logic and ensure jitter is applied after
capping; update tests in test/background-jobs.test.ts to include a
concurrent-retry assertion that verifies returned delays for the same attempt
can differ (i.e., show jitter is present) while preserving existing behavior for
EPERM/other cases.
- Around line 39-47: The token-redaction regex in sanitizeErrorMessage currently
uses [A-Z0-9._-]+ which misses base64 chars like /, +, =; update the
token-capturing patterns in sanitizeErrorMessage (the
/\b(?:access|refresh|id)?_?token(?:=|:)?\s*([A-Z0-9._-]+)/gi and the Bearer
pattern) to use a non-whitespace matcher such as \S+ so tokens are consumed
until whitespace (e.g., /\b(?:access|refresh|id)?_?token(?:=|:)?\s*\S+/gi and
/\b(Bearer)\s+\S+/gi), and add a unit test in the "redacts sensitive error text"
test to assert that tokens with base64-like suffixes (containing /, +, =) are
fully redacted.

In `@lib/data-retention.ts`:
- Line 21: The RETRYABLE_RETENTION_CODES set is missing "ENOTEMPTY", causing
production code to skip retries while tests expect it; update the
RETRYABLE_RETENTION_CODES constant to include "ENOTEMPTY" (matching the test's
RETRYABLE_REMOVE_CODES) so the retry logic that checks this set (used around the
fs.rmdir call) will treat ENOTEMPTY as transient and retry accordingly.

In `@lib/file-lock.ts`:
- Around line 69-89: When write or close fails while acquiring the lock in
lib/file-lock.ts (the block using handle.writeFile and handle.close that sets
writeError/closeError), perform a best-effort unlink of the lock file before
rethrowing the error to avoid leaving a stale lock on disk; do the same for the
analogous block around lines 154-175, and add a vitest regression that simulates
both a write failure and a close failure during acquisition to assert the lock
file is removed (no orphan lock) even when errors are thrown.

In `@lib/unified-settings.ts`:
- Around line 265-272: The current pattern around
acquireFileLockSync/lock.release in unified-settings (used with
UNIFIED_SETTINGS_LOCK_PATH and SETTINGS_LOCK_OPTIONS and wrapping
readSettingsRecordSync/writeSettingsRecordSync) can propagate lock.release
errors (EPERM/EBUSY) and fail after a successful write; update all call sites
(the blocks around the lock.release calls) to catch errors from lock.release(),
treat EPERM/EBUSY as non-fatal by logging a warning and swallowing them while
rethrowing unexpected errors, ensuring the writeSettingsRecordSync result is not
rolled back; also add vitest tests in test/unified-settings.test.ts that
stub/mock lock.release to throw EPERM and EBUSY and assert the write succeeds
and the lock file does not leave the process in a failing state.

In `@scripts/license-policy-check.js`:
- Around line 48-54: The exact-token denylist check is missing common SPDX
variants (e.g., "gpl-2.0-only", "gpl-2.0-or-later") causing blocked licenses to
slip through; update the matching logic that uses extractLicenseTokens so that
for each token you also derive and check a normalized base SPDX id (strip
suffixes like "-only" and "-or-later", and any trailing "+") against the
denylist (i.e., when evaluating in the function that performs the denylist
lookup near the token match), ensuring you test both the original token and the
stripped/base form before allowing it.

In `@test/background-jobs.test.ts`:
- Around line 141-160: The test currently asserts that the thrown error contains
PII ("person@example.com"); change the assertion to avoid checking for secrets
by expecting a generic failure instead—e.g., assert that
runBackgroundJobWithRetry (from runBackgroundJobWithRetry /
getBackgroundJobDlqPath) rejects (using .rejects.toThrow() with no PII-specific
message, or .rejects.toThrowError()/toBeRejected) and keep any checks about
redaction focused on logs/DLQ contents rather than the thrown error text; update
the assertion in the failing test block so it no longer requires the email/token
to appear.

In `@test/data-retention.test.ts`:
- Around line 101-156: Add deterministic Vitest cases exercising the remaining
retry branches in enforceDataRetention: simulate transient EPERM/EACCES/EAGAIN
errors on directory operations (fs.stat/fs.rmdir) and a transient
EBUSY/EACCES-like error on single-file removal (fs.unlink) to verify the retry
logic in the functions implemented in lib/data-retention.ts (refer to
enforceDataRetention and its internal directory-entry pruning and single-file
unlink paths). For each test, spy/mock the specific fs method (stat, rmdir,
unlink) to throw the transient error once for the target path then succeed,
assert that the retry flag was hit, assert the file/dir is ultimately removed
(expect ENOENT) and restore spies in finally blocks to keep tests deterministic
and isolated.

---

Outside diff comments:
In `@lib/storage.ts`:
- Around line 930-947: The code builds baseStorage using fromVersion and
migrateV1ToV3 but then iterates rawAccounts, bypassing any
migration/normalization; change the loop to iterate baseStorage.accounts (or the
normalized accounts array on AccountStorageV3) when populating validAccounts so
decryptAccountSensitiveFields runs on migrated data, keeping the existing
try/catch that converts errors via toStorageDecryptError and preserving
validAccounts accumulation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d85dbde0-bd54-490e-b59f-2b0c10f91215

📥 Commits

Reviewing files that changed from the base of the PR and between 1f516d1 and 0c48823.

⛔ Files ignored due to path filters (1)
  • package-lock.json is excluded by !**/package-lock.json
📒 Files selected for processing (20)
  • .github/workflows/ci.yml
  • .github/workflows/supply-chain.yml
  • docs/development/TESTING.md
  • docs/runbooks/operations.md
  • index.ts
  • lib/background-jobs.ts
  • lib/codex-manager.ts
  • lib/data-retention.ts
  • lib/file-lock.ts
  • lib/storage.ts
  • lib/unified-settings.ts
  • package.json
  • scripts/license-policy-check.js
  • test/authorization.test.ts
  • test/background-jobs.test.ts
  • test/codex-manager-cli.test.ts
  • test/data-retention.test.ts
  • test/file-lock.test.ts
  • test/storage.test.ts
  • test/unified-settings.test.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (3)
docs/**

⚙️ CodeRabbit configuration file

keep README, SECURITY, and docs consistent with actual CLI flags and workflows. whenever behavior changes, require updated upgrade notes and mention new npm scripts.

Files:

  • docs/runbooks/operations.md
  • docs/development/TESTING.md
test/**

⚙️ CodeRabbit configuration file

tests must stay deterministic and use vitest. demand regression cases that reproduce concurrency bugs, token refresh races, and windows filesystem behavior. reject changes that mock real secrets or skip assertions.

Files:

  • test/authorization.test.ts
  • test/codex-manager-cli.test.ts
  • test/data-retention.test.ts
  • test/background-jobs.test.ts
  • test/storage.test.ts
  • test/file-lock.test.ts
  • test/unified-settings.test.ts
lib/**

⚙️ CodeRabbit configuration file

focus on auth rotation, windows filesystem IO, and concurrency. verify every change cites affected tests (vitest) and that new queues handle EBUSY/429 scenarios. check for logging that leaks tokens or emails.

Files:

  • lib/background-jobs.ts
  • lib/storage.ts
  • lib/unified-settings.ts
  • lib/file-lock.ts
  • lib/codex-manager.ts
  • lib/data-retention.ts
🧬 Code graph analysis (8)
index.ts (2)
lib/logger.ts (1)
  • logWarn (341-346)
lib/audit.ts (1)
  • auditLog (123-153)
test/codex-manager-cli.test.ts (2)
lib/codex-manager.ts (1)
  • runCodexMultiAuthCli (4529-4610)
scripts/codex.js (1)
  • runCodexMultiAuthCli (501-501)
test/data-retention.test.ts (1)
lib/data-retention.ts (2)
  • RetentionPolicy (6-12)
  • enforceDataRetention (119-146)
lib/storage.ts (3)
lib/file-lock.ts (1)
  • acquireFileLock (57-123)
lib/storage/migrations.ts (1)
  • AccountMetadataV3 (40-57)
lib/utils.ts (1)
  • isRecord (11-13)
test/background-jobs.test.ts (1)
lib/background-jobs.ts (2)
  • runBackgroundJobWithRetry (85-129)
  • getBackgroundJobDlqPath (81-83)
test/storage.test.ts (1)
lib/storage.ts (2)
  • saveAccounts (1326-1331)
  • loadAccounts (1006-1008)
lib/data-retention.ts (1)
lib/utils.ts (1)
  • sleep (65-67)
test/unified-settings.test.ts (1)
lib/unified-settings.ts (2)
  • getUnifiedSettingsPath (229-231)
  • saveUnifiedPluginConfig (285-296)
🪛 LanguageTool
docs/runbooks/operations.md

[uncategorized] ~28-~28: The official name of this software platform is spelled with a capital “H”.
Context: ...ce (for example weekly-YYYYMMDD or CI ${{ github.run_id }}) - remove previous key af...

(GITHUB)

🔇 Additional comments (17)
docs/runbooks/operations.md (1)

27-28: good update for rotation idempotency workflow.

this command and stable run-id guidance are clear and consistent with the weekly ops flow in docs/runbooks/operations.md (line 27 and line 28).

as per coding guidelines, docs/**: keep README, SECURITY, and docs consistent with actual CLI flags and workflows. whenever behavior changes, require updated upgrade notes and mention new npm scripts.

lib/data-retention.ts (1)

67-69: good retry coverage across retention io callsites

wrapping readdir/stat/unlink/rmdir and single-file pruning paths is consistent in lib/data-retention.ts:67 (line 67), lib/data-retention.ts:81 (line 81), and lib/data-retention.ts:106 (line 106). this meaningfully hardens transient io failures.

Also applies to: 81-92, 106-110

test/data-retention.test.ts (1)

7-27: cleanup retry helper looks solid and deterministic

test/data-retention.test.ts:7 (line 7) and test/data-retention.test.ts:45 (line 45) improve teardown stability under transient windows fs lock errors while preserving deterministic assertions.

Also applies to: 45-45

lib/background-jobs.ts (1)

104-104: good hardening on error sanitization wiring.

lib/background-jobs.ts:104 and lib/background-jobs.ts:119 consistently sanitize error strings before dlq/log output.

Also applies to: 119-119

package.json (1)

69-69: no issue: workflow correctly installs dev deps before sbom script

the sbom workflow in .github/workflows/supply-chain.yml:72 runs npm ci without --omit=dev, so devDependencies (including cyclonedx-npm from package.json:107) are installed before the sbom script executes at line 75. the setup is correct.

lib/codex-manager.ts (1)

1521-1521: the guard at lib/codex-manager.ts:1521 is solid and already has regression coverage.

test/codex-manager-cli.test.ts:2123 creates a cursor by encoding "12junk" as base64, which decodes to non-digits and correctly triggers the "Invalid --cursor value" error at test/codex-manager-cli.test.ts:2132. this covers the exact scenario the regex guard protects against.

docs/development/TESTING.md (1)

53-60: docs update is aligned with the new local gate

this section clearly documents the ordering change and the added npm run audit:ci / npm run license:check checks.

As per coding guidelines, docs/**: "whenever behavior changes, require updated upgrade notes and mention new npm scripts."

.github/workflows/supply-chain.yml (1)

11-13: nice cleanup on supply-chain workflow config

centralizing the denylist and using npm run sbom improves maintainability and keeps policy settings in one place.

Also applies to: 32-32, 75-75

.github/workflows/ci.yml (1)

55-60: ci flow update looks good

artifact handoff plus needs: test makes the coverage gate ordering explicit, and npm run test:coverage is clearer for coverage intent.

Also applies to: 68-68, 74-78, 90-90

test/authorization.test.ts (1)

59-68: good test isolation hardening

using full auth-env snapshot/restore makes these vitest cases deterministic and prevents cross-test env bleed in test/authorization.test.ts:59.

Also applies to: 72-84, 88-100, 127-127

test/codex-manager-cli.test.ts (2)

2053-2064: good pagination contract assertions.

test/codex-manager-cli.test.ts:2053 and test/codex-manager-cli.test.ts:2082 now pin schemaVersion, pagination.cursor, and pagination.pageSize, which tightens the json api contract for cursor paging.

Also applies to: 2082-2093


2103-2147: good malformed cursor regression coverage.

test/codex-manager-cli.test.ts:2103 validates both malformed decoded payloads and invalid base64 input with explicit exit-code and error assertions.

test/file-lock.test.ts (1)

88-122: worker contention path now exercises the real lock api cleanly.

test/file-lock.test.ts:88-122 and test/file-lock.test.ts:134 validate lock acquisition/release through the transpiled module in separate processes, then assert no lost/duplicate writes under contention.

Also applies to: 134-166

test/unified-settings.test.ts (2)

6-25: cleanup retry helper is a good stability addition.

test/unified-settings.test.ts:6 and test/unified-settings.test.ts:44 improve teardown reliability for transient windows remove errors (ebusy/eperm/enotempty).

Also applies to: 44-44


249-302: good lock-before-read regression test.

test/unified-settings.test.ts:249 verifies saveUnifiedPluginConfig reads settings while the file lock is held, which directly covers the new lock ordering in lib/unified-settings.ts.

lib/storage.ts (1)

350-359: lock release fallback handling is cleaner now.

lib/storage.ts:350 and lib/storage.ts:368 now keep best-effort cleanup observable and only trigger forced fallback on non-enoent release failures.

Also applies to: 368-378

test/storage.test.ts (1)

680-714: strong regression coverage for decrypt failures and lock edge cases.

test/storage.test.ts:680 plus test/storage.test.ts:1579-1676 cover fail-fast decrypt behavior, lock.release() fallback on windows-style errors, and concurrent save lock serialization.

Also applies to: 1579-1676

Comment on lines +1725 to +1732
const safeAuditResource = (() => {
try {
const parsed = new URL(url);
return `${parsed.origin}${parsed.pathname}`;
} catch {
return url;
}
})();

⚠️ Potential issue | 🟠 Major

sanitize fallback resource when url parsing fails

line 1730 falls back to raw url. if parsing fails, query params can be written to audit logs because auditLog persists resource verbatim (lib/audit.ts:122). strip query/hash in the fallback path too.

proposed fix
-								const safeAuditResource = (() => {
+								const safeAuditResource = (() => {
 									try {
 										const parsed = new URL(url);
 										return `${parsed.origin}${parsed.pathname}`;
 									} catch {
-										return url;
+										const raw = String(url);
+										return raw.split("?")[0]?.split("#")[0] ?? "unknown";
 									}
 								})();

As per coding guidelines, lib/**: "check for logging that leaks tokens or emails."

Also applies to: 1739-1739
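the proposed fallback, extracted so it can be exercised standalone (assumes the same behavior as the diff above; not copied verbatim from the repo):

```typescript
// Strip query/hash even when URL parsing fails, so raw query params never
// reach auditLog's resource field.
function safeAuditResource(url: string): string {
	try {
		const parsed = new URL(url);
		return `${parsed.origin}${parsed.pathname}`;
	} catch {
		// fallback path: drop everything from the first '?' or '#' onward
		const raw = String(url);
		return raw.split("?")[0]?.split("#")[0] ?? "unknown";
	}
}
```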

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@index.ts` around lines 1725 - 1732, The fallback for safeAuditResource
currently returns the raw url when new URL(url) throws, which can leak
query/hash into auditLog (see auditLog in lib/audit.ts); update the fallback in
the safeAuditResource IIFE (and the analogous spot around line 1739) to strip
query and hash from the raw string before returning (e.g., remove everything
from the first '?' or '#' onward) so only origin+pathname or sanitized path is
logged; ensure the sanitized value is what gets passed to auditLog/resource.

Comment on lines +39 to +47
function sanitizeErrorMessage(message: string): string {
return message
.replace(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi, "***REDACTED***")
.replace(
/\b(?:access|refresh|id)?_?token(?:=|:)?\s*([A-Z0-9._-]+)/gi,
"token=***REDACTED***",
)
.replace(/\b(Bearer)\s+[A-Z0-9._-]+\b/gi, "$1 ***REDACTED***");
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

head -60 lib/background-jobs.ts | tail -30

Repository: ndycode/codex-multi-auth

Length of output: 1146


🏁 Script executed:

wc -l lib/background-jobs.ts

Repository: ndycode/codex-multi-auth

Length of output: 94


🏁 Script executed:

sed -n '35,120p' lib/background-jobs.ts

Repository: ndycode/codex-multi-auth

Length of output: 2887


🏁 Script executed:

fd -e test.ts -e test.js -e spec.ts | head -20

Repository: ndycode/codex-multi-auth

Length of output: 607


🏁 Script executed:

rg -l "sanitizeErrorMessage" --type ts

Repository: ndycode/codex-multi-auth

Length of output: 90


🏁 Script executed:

cat test/background-jobs.test.ts

Repository: ndycode/codex-multi-auth

Length of output: 5755


🏁 Script executed:

python3 << 'PY'
import re

# Current patterns
current_token = re.compile(r'\b(?:access|refresh|id)?_?token(?:=|:)?\s*([A-Z0-9._-]+)', re.I)
current_bearer = re.compile(r'\b(Bearer)\s+[A-Z0-9._-]+\b', re.I)

# Proposed patterns
proposed_token = re.compile(r'\b(?:access|refresh|id)?_?token(?:=|:)?\s*([^\s,;]+)', re.I)
proposed_bearer = re.compile(r'\b(Bearer)\s+[^\s,;]+\b', re.I)

# Test cases with base64-like tokens (containing / + =)
test_cases = [
    "refresh_token=abc/def+ghi==",
    "access_token: xyz+123/abc=",
    "id_token=something+more/data==",
    "Authorization: Bearer abc/xyz+123==",
    "error with token=valid-value_123",
    "Bearer token_abc.def-ghi",
]

print("CURRENT PATTERNS:")
for test in test_cases:
    result = current_token.sub("token=***REDACTED***", test)
    result = current_bearer.sub(r"\1 ***REDACTED***", result)
    leaked = "✗ LEAKED" if any(c in result for c in ['/', '+', '=']) and "token" in test else "✓ OK"
    print(f"{leaked}: {test} → {result}")

print("\nPROPOSED PATTERNS:")
for test in test_cases:
    result = proposed_token.sub("token=***REDACTED***", test)
    result = proposed_bearer.sub(r"\1 ***REDACTED***", result)
    leaked = "✗ LEAKED" if any(c in result for c in ['/', '+', '=']) and ("token" in test or "Bearer" in test) else "✓ OK"
    print(f"{leaked}: {test} → {result}")
PY

Repository: ndycode/codex-multi-auth

Length of output: 970


token redaction regex leaks base64-like secret suffixes.

lib/background-jobs.ts:43-45 uses [A-Z0-9._-]+ to match tokens, which stops at /, +, and = characters. this means tokens like refresh_token=abc/def+ghi== would only redact the prefix, leaking the base64 tail.

the test in test/background-jobs.test.ts:106 covers email and token redaction but uses simple tokens (sk_test_123, rt_456) without base64 chars, so it doesn't catch this. add a test case with base64-like tokens to the "redacts sensitive error text" test, then apply the fix to match everything until whitespace:

 function sanitizeErrorMessage(message: string): string {
 	return message
 		.replace(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi, "***REDACTED***")
 		.replace(
-			/\b(?:access|refresh|id)?_?token(?:=|:)?\s*([A-Z0-9._-]+)/gi,
+			/\b(?:access|refresh|id)?_?token(?:=|:)?\s*([^\s]+)/gi,
 			"token=***REDACTED***",
 		)
-		.replace(/\b(Bearer)\s+[A-Z0-9._-]+\b/gi, "$1 ***REDACTED***");
+		.replace(/\b(Bearer)\s+([^\s]+)/gi, "$1 ***REDACTED***");
 }
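the fixed sanitizer from the diff above, extracted so it can be exercised standalone (mirrors the suggestion; not copied verbatim from the repo):

```typescript
// Token and Bearer patterns consume until whitespace (\S+), so base64-like
// tails containing /, +, = are fully redacted.
function sanitizeErrorMessage(message: string): string {
	return message
		.replace(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi, "***REDACTED***")
		.replace(/\b(?:access|refresh|id)?_?token(?:=|:)?\s*\S+/gi, "token=***REDACTED***")
		.replace(/\b(Bearer)\s+\S+/gi, "$1 ***REDACTED***");
}
```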
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/background-jobs.ts` around lines 39 - 47, The token-redaction regex in
sanitizeErrorMessage currently uses [A-Z0-9._-]+ which misses base64 chars like
/, +, =; update the token-capturing patterns in sanitizeErrorMessage (the
/\b(?:access|refresh|id)?_?token(?:=|:)?\s*([A-Z0-9._-]+)/gi and the Bearer
pattern) to use a non-whitespace matcher such as \S+ so tokens are consumed
until whitespace (e.g., /\b(?:access|refresh|id)?_?token(?:=|:)?\s*\S+/gi and
/\b(Bearer)\s+\S+/gi), and add a unit test in the "redacts sensitive error text"
test to assert that tokens with base64-like suffixes (containing /, +, =) are
fully redacted.

Comment on lines +54 to +57
const statusCode = (error as { statusCode?: unknown } | undefined)?.statusCode;
if (typeof statusCode === "number" && statusCode === 429) {
return true;
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "*.ts" | grep -E "(background|job)" | head -20

Repository: ndycode/codex-multi-auth

Length of output: 123


🏁 Script executed:

cat -n lib/background-jobs.ts | head -100

Repository: ndycode/codex-multi-auth

Length of output: 3990


🏁 Script executed:

rg -n "429|EBUSY|EPERM|getDelayMs|runBackgroundJobWithRetry" --type=ts lib/

Repository: ndycode/codex-multi-auth

Length of output: 3935


🏁 Script executed:

fd -e test.ts -e spec.ts -o -e test.js -o -e spec.js | xargs rg -l "429|EBUSY|EPERM|background.*job|getDelayMs" 2>/dev/null

Repository: ndycode/codex-multi-auth

Length of output: 2227


🏁 Script executed:

cat -n test/background-jobs.test.ts

Repository: ndycode/codex-multi-auth

Length of output: 7015


🏁 Script executed:

rg -n "delay|jitter|timing|sleep" test/background-jobs.test.ts

Repository: ndycode/codex-multi-auth

Length of output: 50


🏁 Script executed:

rg -n "getDelayMs" test/background-jobs.test.ts

Repository: ndycode/codex-multi-auth

Length of output: 50


add jitter to exponential backoff in lib/background-jobs.ts:49-50 to avoid worker synchronization on 429/EBUSY.

getDelayMs uses deterministic 2^attempt backoff, which causes parallel workers retrying after 429/EBUSY to synchronize and amplify throttling. regression tests exist (test/background-jobs.test.ts:111-139 for 429, test/background-jobs.test.ts:47-70 for EBUSY, test/background-jobs.test.ts:72-109 for EPERM), but they don't verify delay timing. add randomized jitter (e.g., ±20% of capped delay) to getDelayMs and add a test that verifies jitter is present for concurrent retry scenarios.
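a sketch of the suggested change, with illustrative base/cap constants (the real getDelayMs values may differ):

```typescript
// Capped exponential backoff with ±20% jitter applied after the cap, so
// workers that hit 429/EBUSY at the same time desynchronize on retry.
function getDelayMs(attempt: number, baseMs = 250, capMs = 30_000): number {
	const capped = Math.min(capMs, baseMs * 2 ** attempt); // existing cap logic
	const jitter = 1 + (Math.random() * 0.4 - 0.2);        // randomBetween(-0.2, 0.2)
	return Math.round(capped * jitter);
}
```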

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/background-jobs.ts` around lines 54 - 57, getDelayMs currently uses a
deterministic exponential backoff which causes worker synchronization; modify
getDelayMs in lib/background-jobs.ts to compute the capped exponential base
delay (as it does now) then apply a randomized jitter of about ±20% (e.g.,
multiply by 1 + randomBetween(-0.2, 0.2)) and return an integer ms to avoid
synchronized retries on 429/EBUSY. Keep the existing cap logic and ensure jitter
is applied after capping; update tests in test/background-jobs.test.ts to
include a concurrent-retry assertion that verifies returned delays for the same
attempt can differ (i.e., show jitter is present) while preserving existing
behavior for EPERM/other cases.

quotaCacheDays: 14,
dlqDays: 30,
};
const RETRYABLE_RETENTION_CODES = new Set(["EBUSY", "EPERM", "EACCES", "EAGAIN"]);

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Get the relevant code sections
echo "=== lib/data-retention.ts ===" 
sed -n '15,90p' lib/data-retention.ts || echo "file not found"

echo -e "\n=== test/data-retention.test.ts existence and error code mentions ===" 
if [ -f test/data-retention.test.ts ]; then
  wc -l test/data-retention.test.ts
  rg -n 'ENOTEMPTY|EBUSY|EPERM|EACCES|EAGAIN|rmdir|RETRYABLE' test/data-retention.test.ts || echo "no matches found"
else
  echo "test file not found"
fi

echo -e "\n=== Searching all lib files for RETRYABLE_RETENTION_CODES usage ===" 
rg -n 'RETRYABLE_RETENTION_CODES' lib/ || echo "no matches"

Repository: ndycode/codex-multi-auth

Length of output: 3614


add enotempty to retryable retention codes; tests already expect this behavior

lib/data-retention.ts:21 excludes ENOTEMPTY from RETRYABLE_RETENTION_CODES, but test/data-retention.test.ts:7 already defines RETRYABLE_REMOVE_CODES with ENOTEMPTY included. fs.rmdir at lib/data-retention.ts:84 can transiently fail with ENOTEMPTY under concurrent writers on windows; production code skips retry while tests expect it to be handled, creating test/impl parity gap.

proposed fix
-const RETRYABLE_RETENTION_CODES = new Set(["EBUSY", "EPERM", "EACCES", "EAGAIN"]);
+const RETRYABLE_RETENTION_CODES = new Set([
+	"EBUSY",
+	"EPERM",
+	"EACCES",
+	"EAGAIN",
+	"ENOTEMPTY",
+]);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/data-retention.ts` at line 21, The RETRYABLE_RETENTION_CODES set is
missing "ENOTEMPTY", causing production code to skip retries while tests expect
it; update the RETRYABLE_RETENTION_CODES constant to include "ENOTEMPTY"
(matching the test's RETRYABLE_REMOVE_CODES) so the retry logic that checks this
set (used around the fs.rmdir call) will treat ENOTEMPTY as transient and retry
accordingly.
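a retry wrapper treating ENOTEMPTY as transient can be sketched like this; `withFsRetry` is an assumed helper name for illustration, not the repo's `withRetentionIoRetry`:

```typescript
// mirrors the review's suggested set, with ENOTEMPTY included for rmdir races
const RETRYABLE = new Set(["EBUSY", "EPERM", "EACCES", "EAGAIN", "ENOTEMPTY"]);

// retry a single leaf i/o op (not a whole recursive pass) on transient codes
async function withFsRetry<T>(op: () => Promise<T>, attempts = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (error) {
      const code = (error as { code?: string }).code;
      if (attempt + 1 >= attempts || code === undefined || !RETRYABLE.has(code)) {
        throw error; // non-retryable or out of attempts: surface the error
      }
      await new Promise((resolve) => setTimeout(resolve, 10 * 2 ** attempt));
    }
  }
}
```

wrapping the individual `fs.rmdir`/`fs.unlink` calls this way also avoids the retry amplification noted in the summary, since a deep recursive pass is never replayed wholesale.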

Comment on lines +69 to +89
			let writeError: unknown;
			let closeError: unknown;
			try {
				await handle.writeFile(
					`${JSON.stringify({ pid: process.pid, acquiredAt: Date.now() })}\n`,
					"utf8",
				);
			} catch (error) {
				writeError = error;
			}
			try {
				await handle.close();
			} catch (error) {
				closeError = error;
			}
			if (writeError !== undefined) {
				throw writeError;
			}
			if (closeError !== undefined) {
				throw closeError;
			}

⚠️ Potential issue | 🔴 Critical

cleanup lock file when write/close fails during acquisition

line 85 and line 171 can throw after a failed write/close while leaving the lock file on disk. that creates a stale lock and blocks other writers (concurrency risk), especially noisy on windows eperm/ebusy paths in lib/file-lock.ts:69 and lib/file-lock.ts:154.

please unlink best-effort before rethrowing, and add a vitest regression that forces write failure + close failure to verify no orphan lock remains.

proposed fix
-			if (writeError !== undefined) {
-				throw writeError;
-			}
-			if (closeError !== undefined) {
-				throw closeError;
-			}
+			if (writeError !== undefined || closeError !== undefined) {
+				try {
+					await fs.unlink(path);
+				} catch {
+					// best-effort cleanup; keep original acquisition error precedence
+				}
+				throw writeError ?? closeError;
+			}
@@
-			if (writeError !== undefined) {
-				throw writeError;
-			}
-			if (closeError !== undefined) {
-				throw closeError;
-			}
+			if (writeError !== undefined || closeError !== undefined) {
+				try {
+					unlinkSync(path);
+				} catch {
+					// best-effort cleanup; keep original acquisition error precedence
+				}
+				throw writeError ?? closeError;
+			}

As per coding guidelines, lib/**: "focus on auth rotation, windows filesystem IO, and concurrency."

Also applies to: 154-175

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/file-lock.ts` around lines 69 - 89, When write or close fails while
acquiring the lock in lib/file-lock.ts (the block using handle.writeFile and
handle.close that sets writeError/closeError), perform a best-effort unlink of
the lock file before rethrowing the error to avoid leaving a stale lock on disk;
do the same for the analogous block around lines 154-175, and add a vitest
regression that simulates both a write failure and a close failure during
acquisition to assert the lock file is removed (no orphan lock) even when errors
are thrown.
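the cleanup-on-failed-write pattern can be sketched as a self-contained helper; `acquireLock` is an illustrative name, and (unlike the real code) close errors are swallowed here for brevity:

```typescript
import { promises as fs } from "node:fs";

// if the metadata write fails after the lock file is created, unlink the
// partial file before rethrowing so a half-written lock never blocks other
// writers for the stale-lock timeout window.
async function acquireLock(path: string): Promise<void> {
  const handle = await fs.open(path, "wx"); // "wx": fail if lock already exists
  let writeError: unknown;
  try {
    await handle.writeFile(
      `${JSON.stringify({ pid: process.pid, acquiredAt: Date.now() })}\n`,
      "utf8",
    );
  } catch (error) {
    writeError = error;
  } finally {
    await handle.close().catch(() => {}); // close must not mask the write error
  }
  if (writeError !== undefined) {
    await fs.unlink(path).catch(() => {}); // best-effort: no orphan lock remains
    throw writeError;
  }
}
```

on a failed write the caller sees the original error, and the next acquisition attempt starts from a clean disk state instead of waiting out stale-detection.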

Comment on lines +265 to +272
	const lock = acquireFileLockSync(UNIFIED_SETTINGS_LOCK_PATH, SETTINGS_LOCK_OPTIONS);
	try {
		const record = readSettingsRecordSync() ?? {};
		record.pluginConfig = { ...pluginConfig };
		writeSettingsRecordSync(record);
	} finally {
		lock.release();
	}

⚠️ Potential issue | 🟠 Major

handle lock release failures without failing successful writes.

lib/unified-settings.ts:271, lib/unified-settings.ts:293, and lib/unified-settings.ts:341 propagate lock.release() errors directly. on windows, eperm/ebusy here can fail the command after data is already persisted and strand settings.json.lock.

proposed fix
+function releaseUnifiedSettingsLockSync(lockPath: string, release: () => void): void {
+  try {
+    release();
+  } catch (error) {
+    const code = (error as NodeJS.ErrnoException).code;
+    if (code !== "ENOENT") {
+      try {
+        unlinkSync(lockPath);
+      } catch {
+        // best-effort lock cleanup.
+      }
+    }
+  }
+}
+
+async function releaseUnifiedSettingsLockAsync(
+  lockPath: string,
+  release: () => Promise<void>,
+): Promise<void> {
+  try {
+    await release();
+  } catch (error) {
+    const code = (error as NodeJS.ErrnoException).code;
+    if (code !== "ENOENT") {
+      try {
+        await fs.rm(lockPath, { force: true });
+      } catch {
+        // best-effort lock cleanup.
+      }
+    }
+  }
+}
+
 export function saveUnifiedPluginConfigSync(pluginConfig: JsonRecord): void {
 	const lock = acquireFileLockSync(UNIFIED_SETTINGS_LOCK_PATH, SETTINGS_LOCK_OPTIONS);
 	try {
 		const record = readSettingsRecordSync() ?? {};
 		record.pluginConfig = { ...pluginConfig };
 		writeSettingsRecordSync(record);
 	} finally {
-		lock.release();
+		releaseUnifiedSettingsLockSync(UNIFIED_SETTINGS_LOCK_PATH, () => lock.release());
 	}
 }
@@
 		try {
 			const record = (await readSettingsRecordAsync()) ?? {};
 			record.pluginConfig = { ...pluginConfig };
 			await writeSettingsRecordAsync(record);
 		} finally {
-			await lock.release();
+			await releaseUnifiedSettingsLockAsync(UNIFIED_SETTINGS_LOCK_PATH, () => lock.release());
 		}
@@
 		try {
 			const record = (await readSettingsRecordAsync()) ?? {};
 			record.dashboardDisplaySettings = { ...dashboardDisplaySettings };
 			await writeSettingsRecordAsync(record);
 		} finally {
-			await lock.release();
+			await releaseUnifiedSettingsLockAsync(UNIFIED_SETTINGS_LOCK_PATH, () => lock.release());
 		}

also add vitest regressions for lock.release() throwing ebusy/eperm in test/unified-settings.test.ts.

As per coding guidelines lib/**: focus on auth rotation, windows filesystem IO, and concurrency. verify every change cites affected tests (vitest) and that new queues handle EBUSY/429 scenarios.

Also applies to: 287-294, 335-342

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@lib/unified-settings.ts` around lines 265 - 272, The current pattern around
acquireFileLockSync/lock.release in unified-settings (used with
UNIFIED_SETTINGS_LOCK_PATH and SETTINGS_LOCK_OPTIONS and wrapping
readSettingsRecordSync/writeSettingsRecordSync) can propagate lock.release
errors (EPERM/EBUSY) and fail after a successful write; update all call sites
(the blocks around the lock.release calls) to catch errors from lock.release(),
treat EPERM/EBUSY as non-fatal by logging a warning and swallowing them while
rethrowing unexpected errors, ensuring the writeSettingsRecordSync result is not
rolled back; also add vitest tests in test/unified-settings.test.ts that
stub/mock lock.release to throw EPERM and EBUSY and assert the write succeeds
and the lock file does not leave the process in a failing state.
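the non-fatal release behaviour described above can be sketched as a small wrapper; the function and `warn` callback are assumptions for illustration, not the repo's actual API:

```typescript
// treat windows EPERM/EBUSY (and already-gone ENOENT) on lock release as
// non-fatal once the settings write has persisted; rethrow anything else.
function releaseLockNonFatal(release: () => void, warn: (msg: string) => void): void {
  try {
    release();
  } catch (error) {
    const code = (error as { code?: string }).code;
    if (code === "EPERM" || code === "EBUSY" || code === "ENOENT") {
      warn(`lock release failed with ${code}; settings already persisted`);
      return;
    }
    throw error; // unexpected failure: do not swallow
  }
}
```

this keeps a successful `writeSettingsRecordSync` from being reported as a command failure while still surfacing genuinely unexpected release errors.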

Comment on lines +48 to +54
function extractLicenseTokens(rawLicense) {
	return rawLicense
		.toUpperCase()
		.split(/[^A-Z0-9.-]+/)
		.map((value) => value.trim())
		.filter((value) => value.length > 0);
}

⚠️ Potential issue | 🟠 Major

denylist matching regressed for spdx variants

the new exact token match misses denied variants like gpl-2.0-only and gpl-2.0-or-later in scripts/license-policy-check.js:70. this can let blocked licenses pass.

proposed fix
-	const tokens = new Set(extractLicenseTokens(normalized));
+	const tokens = extractLicenseTokens(normalized);
 	for (const denied of denyList) {
-		if (tokens.has(denied)) {
+		if (tokens.some((token) => token === denied || token.startsWith(`${denied}-`))) {
 			violations.push(`${name}@${version} (${rawLicense})`);
 			break;
 		}
 	}

Also applies to: 68-71

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/license-policy-check.js` around lines 48 - 54, The exact-token
denylist check is missing common SPDX variants (e.g., "gpl-2.0-only",
"gpl-2.0-or-later") causing blocked licenses to slip through; update the
matching logic that uses extractLicenseTokens so that for each token you also
derive and check a normalized base SPDX id (strip suffixes like "-only" and
"-or-later", and any trailing "+") against the denylist (i.e., when evaluating
in the function that performs the denylist lookup near the token match),
ensuring you test both the original token and the stripped/base form before
allowing it.
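the base-id normalization the prompt describes can be sketched as follows, assuming the denylist stores uppercase base SPDX ids (helper names are illustrative):

```typescript
// strip SPDX modifier suffixes so "GPL-2.0-only", "GPL-2.0-or-later", and
// "GPL-2.0+" all normalize to the base id "GPL-2.0".
function spdxBaseId(token: string): string {
  return token
    .toUpperCase()
    .replace(/\+$/, "")
    .replace(/-(ONLY|OR-LATER)$/, "");
}

// check both the original token and its stripped base form against the denylist
function isDenied(token: string, denyList: Set<string>): boolean {
  const upper = token.toUpperCase();
  return denyList.has(upper) || denyList.has(spdxBaseId(upper));
}
```

checking both forms means an exact denylist entry like `GPL-2.0-only` still matches, while the plain `GPL-2.0` entry catches all modifier variants.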

Comment on lines +141 to +160
	it("redacts sensitive error text in dead-letter entries and warning logs", async () => {
		vi.resetModules();
		const warnMock = vi.fn();
		vi.doMock("../lib/logger.js", () => ({
			logWarn: warnMock,
		}));
		try {
			const { runBackgroundJobWithRetry, getBackgroundJobDlqPath } =
				await import("../lib/background-jobs.js");
			await expect(
				runBackgroundJobWithRetry({
					name: "test.retry-sensitive-error",
					task: async () => {
						throw new Error(
							"network failed for person@example.com Bearer sk_test_123 refresh_token=rt_456",
						);
					},
					maxAttempts: 1,
				}),
			).rejects.toThrow("person@example.com");

⚠️ Potential issue | 🟠 Major

avoid asserting that thrown errors contain pii/secrets

line 160 currently requires person@example.com to be present in the thrown error. that bakes in a sensitive-data leak contract in test/background-jobs.test.ts:160. this should assert a generic failure instead, and keep secret-bearing strings out of externally surfaced errors.

proposed test adjustment
-			).rejects.toThrow("person@example.com");
+			).rejects.toThrow("network failed");

As per coding guidelines, lib/**: "check for logging that leaks tokens or emails."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/background-jobs.test.ts` around lines 141 - 160, The test currently
asserts that the thrown error contains PII ("person@example.com"); change the
assertion to avoid checking for secrets by expecting a generic failure
instead—e.g., assert that runBackgroundJobWithRetry (from
runBackgroundJobWithRetry / getBackgroundJobDlqPath) rejects (using
.rejects.toThrow() with no PII-specific message, or
.rejects.toThrowError()/toBeRejected) and keep any checks about redaction
focused on logs/DLQ contents rather than the thrown error text; update the
assertion in the failing test block so it no longer requires the email/token to
appear.
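the redaction contract the comment wants for surfaced errors can be sketched with a small helper; the regexes here are assumptions for illustration and not the repo's actual sanitizer:

```typescript
// strip emails, bearer tokens, and refresh tokens from error text before it
// is surfaced in thrown errors, warning logs, or dead-letter entries.
function redactSensitive(message: string): string {
  return message
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, "[redacted-email]")
    .replace(/Bearer\s+\S+/g, "Bearer [redacted]")
    .replace(/refresh_token=\S+/g, "refresh_token=[redacted]");
}
```

with this in place the test can assert on the generic prefix ("network failed") plus the absence of the email/token substrings, rather than requiring PII in the thrown message.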

Comment on lines +101 to +156
	it("retries transient EBUSY during directory entry retention pruning", async () => {
		const { enforceDataRetention } = await import("../lib/data-retention.js");
		const oldDate = new Date(Date.now() - 3 * 24 * 60 * 60_000);
		const logsDir = join(tempDir, "logs");
		const nestedDir = join(logsDir, "nested");
		const staleLog = join(nestedDir, "stale.log");

		await fs.mkdir(nestedDir, { recursive: true });
		await fs.writeFile(staleLog, "old", "utf8");
		await fs.utimes(staleLog, oldDate, oldDate);

		const originalStat = fs.stat.bind(fs);
		const statSpy = vi.spyOn(fs, "stat");
		let statBusyInjected = false;
		statSpy.mockImplementation(async (path, options) => {
			if (!statBusyInjected && path === staleLog) {
				statBusyInjected = true;
				const error = new Error("busy") as NodeJS.ErrnoException;
				error.code = "EBUSY";
				throw error;
			}
			return originalStat(path, options as { bigint?: boolean });
		});

		const originalRmdir = fs.rmdir.bind(fs);
		const rmdirSpy = vi.spyOn(fs, "rmdir");
		let rmdirBusyInjected = false;
		rmdirSpy.mockImplementation(async (path) => {
			if (!rmdirBusyInjected && path === nestedDir) {
				rmdirBusyInjected = true;
				const error = new Error("busy") as NodeJS.ErrnoException;
				error.code = "EBUSY";
				throw error;
			}
			return originalRmdir(path);
		});

		try {
			const policy: RetentionPolicy = {
				logDays: 1,
				cacheDays: 90,
				flaggedDays: 90,
				quotaCacheDays: 90,
				dlqDays: 90,
			};
			const result = await enforceDataRetention(policy);
			expect(result.removedLogs).toBe(1);
			expect(statBusyInjected).toBe(true);
			expect(rmdirBusyInjected).toBe(true);
			await expect(fs.stat(staleLog)).rejects.toMatchObject({ code: "ENOENT" });
			await expect(fs.stat(nestedDir)).rejects.toMatchObject({ code: "ENOENT" });
		} finally {
			statSpy.mockRestore();
			rmdirSpy.mockRestore();
		}
	});

⚠️ Potential issue | 🟠 Major

expand regression coverage to all new retry branches

test/data-retention.test.ts:101 (line 101) only validates transient EBUSY in directory pruning. new branches in lib/data-retention.ts:21 (line 21) and lib/data-retention.ts:104 (line 104) are still untested (EPERM/EACCES/EAGAIN, plus single-file unlink retry path). this leaves a windows edge-case and concurrency-race regression gap.

suggested test additions
+it.each(["EPERM", "EACCES", "EAGAIN"] as const)(
+	"retries transient %s during state-file pruning",
+	async (code) => {
+		// create stale quota/flagged file
+		// inject one-time error on fs.unlink for that file
+		// assert enforceDataRetention succeeds and file is removed
+	},
+);
+
+it("throws after max retries for persistent EBUSY", async () => {
+	// inject persistent EBUSY on fs.stat or fs.rmdir
+	// assert enforceDataRetention rejects with code EBUSY
+});

as per coding guidelines test/**: "tests must stay deterministic and use vitest. demand regression cases that reproduce concurrency bugs, token refresh races, and windows filesystem behavior. reject changes that mock real secrets or skip assertions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/data-retention.test.ts` around lines 101 - 156, Add deterministic Vitest
cases exercising the remaining retry branches in enforceDataRetention: simulate
transient EPERM/EACCES/EAGAIN errors on directory operations (fs.stat/fs.rmdir)
and a transient EBUSY/EACCES-like error on single-file removal (fs.unlink) to
verify the retry logic in the functions implemented in lib/data-retention.ts
(refer to enforceDataRetention and its internal directory-entry pruning and
single-file unlink paths). For each test, spy/mock the specific fs method (stat,
rmdir, unlink) to throw the transient error once for the target path then
succeed, assert that the retry flag was hit, assert the file/dir is ultimately
removed (expect ENOENT) and restore spies in finally blocks to keep tests
deterministic and isolated.
