feat: complete enterprise hardening and storage lock reliability #34

ndycode wants to merge 5 commits into feat/enterprise-hardening from
Conversation
- cleanup stale/dead process lock artifacts before acquiring account lock
- ensure lock release always attempts fallback cleanup
- keep clearAccounts/saveTransactions serialized across file and memory locks

Co-authored-by: Codex <noreply@openai.com>
📝 Walkthrough

Summary

This PR implements enterprise hardening with critical storage lock reliability improvements to resolve race conditions and file system failures in multi-process scenarios. While the changes are major in scope and affect core storage architecture, comprehensive regression tests (1990+ lines in storage tests alone) validate the three-layer lock serialization, encryption failure handling, and concurrent save scenarios, mitigating data-loss and corruption risks.

Key Architectural Changes

Three-Layer Lock Serialization: Introduces a deterministic lock acquisition order (in-process mutex → file lock → storage mutex).

Stale Lock Cleanup: Implements file-based recovery from crashed processes with a 120-second timeout and PID liveness checks.

Encryption-First Error Handling: Introduces an EDECRYPT fast-fail path so decryption failures surface explicitly instead of falling back silently.

Best-Effort Lock Release: Adds releaseStorageLockFallback so lock release always attempts cleanup even when the primary release fails.

Data Protection & Test Coverage
Public API Changes

Breaking Changes to Lock Signatures

New Exports

Security & Operational Hardening
Risk Assessment

Blockers: None identified; stale lock cleanup only triggers on non-ENOENT errors, and encryption failures are explicitly handled without silent fallbacks.

Data Loss Risks: Mitigated by WAL journaling, rotating backups, and explicit test coverage for concurrent scenarios and recovery paths.

Breaking Changes: Lock API signatures require updates in all call sites (already tested in the updated worker harness).

Walkthrough: Introduces serialized, platform-aware storage locking and error handling.
sequence diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant caller as "caller"
    participant inproc as "in-process mutex\n(rgba(70,130,180,0.5))"
    participant filequeue as "file queue / file-lock\n(rgba(34,139,34,0.5))"
    participant storagemtx as "storage mutex\n(rgba(218,165,32,0.5))"
    participant fs as "filesystem\n(rgba(128,0,128,0.5))"
    caller->>inproc: request serialized lock
    inproc->>filequeue: enqueue / acquire file lock
    filequeue->>fs: create/acquire lock file
    filequeue-->>inproc: lock acquired
    inproc->>storagemtx: acquire storage mutex
    storagemtx-->>caller: perform storage mutation
    caller->>storagemtx: release storage mutex
    storagemtx-->>inproc: signal release
    inproc->>filequeue: release file lock
    filequeue->>fs: remove lock file (or leave for fallback)
    filequeue-->>inproc: fully released
```
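For orientation, the flow above maps onto a nested wrapper roughly like the following minimal TypeScript sketch. The helper names (`withStorageSerializedFileLock`, `acquireFileLock`) come from the review threads below, but the mutex implementation and all signatures here are illustrative assumptions, not the PR's actual code.

```typescript
// minimal sketch of the three-layer acquisition order, not the PR's code.
import { promises as fs } from "node:fs";

type Release = () => Promise<void>;

// a simple promise-chain mutex: callers run strictly one after another.
class SimpleMutex {
  private tail: Promise<void> = Promise.resolve();
  async run<T>(fn: () => Promise<T>): Promise<T> {
    const prev = this.tail;
    let done!: () => void;
    this.tail = new Promise<void>((resolve) => (done = resolve));
    await prev; // wait for earlier holders in this process
    try {
      return await fn();
    } finally {
      done();
    }
  }
}

const accountFileMutex = new SimpleMutex(); // layer 1: in-process queue
const storageMutex = new SimpleMutex(); // layer 3: storage mutation guard

// layer 2: cross-process file lock; "wx" fails with EEXIST if already held.
async function acquireFileLock(path: string): Promise<{ release: Release }> {
  const handle = await fs.open(path, "wx");
  await handle.writeFile(`${JSON.stringify({ pid: process.pid })}\n`, "utf8");
  await handle.close();
  return { release: async () => fs.rm(path, { force: true }) };
}

async function withStorageSerializedFileLock<T>(
  lockPath: string,
  fn: () => Promise<T>,
): Promise<T> {
  return accountFileMutex.run(async () => {
    const lock = await acquireFileLock(lockPath);
    try {
      return await storageMutex.run(fn);
    } finally {
      await lock.release(); // mirrors steps 21-25 in the diagram
    }
  });
}
```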
estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

notes for review
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Ensure account-storage mutations keep deterministic ordering while preserving the historical file-lock-before-in-process-mutex acquisition sequence.

Co-authored-by: Codex <noreply@openai.com>
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@lib/storage.ts`:
- Around line 382-386: withAccountFileLock currently calls
cleanupDeadProcessStorageLock(lockPath) before calling
acquireFileLock(lockPath,...), which creates a race where two processes can both
remove a stale lock then race to create it; instead remove the external cleanup
call and rely on acquireFileLock to handle stale-lock removal (or move the
cleanup logic into acquireFileLock). Update withAccountFileLock to stop calling
cleanupDeadProcessStorageLock and ensure acquireFileLock (in lib/file-lock.ts)
invokes removeIfStale/remove-or-cleanup atomically so stale locks are handled
inside acquireFileLock rather than externally.
- Around line 373-379: The catch block around the storage read should not call
releaseStorageLockFallback when the error is an EBUSY from
fs.readFile/JSON.parse; update the catch in the function that reads the lock
(the block that currently inspects (error as NodeJS.ErrnoException).code) to
explicitly handle error.code === "ENOENT" (return) and error.code === "EBUSY"
(return or rethrow as appropriate) and only call
releaseStorageLockFallback(lockPath) for other error types (e.g., parse errors).
Reference the existing catch scope that uses (error as
NodeJS.ErrnoException).code and the releaseStorageLockFallback(lockPath) call to
implement this conditional behavior.
- Around line 123-139: The comment claiming "file lock -> in-process mutex"
contradicts the implementation in withStorageSerializedFileLock (which currently
calls withAccountFileMutex → withAccountFileLock → withStorageLock), so either
update the comment to state the actual acquisition order (in-process mutex ->
file lock -> storage lock) or reorder the calls to match the comment;
specifically, either change the comment near withStorageSerializedFileLock to
reflect the real sequence or change the implementation to call
withAccountFileLock(path, () => withAccountFileMutex(() => withStorageLock(fn)))
so the acquisition becomes file lock -> in-process mutex -> storage lock (and
ensure variable/closure usage still compiles).
- Around line 340-346: The catch block in releaseStorageLockFallback currently
swallows all errors from fs.rm; update it to log a debug-level message including
the lockPath and the caught error so failed cleanup is observable (e.g., use the
existing logger or processLogger if available), while preserving the
"best-effort" behavior by not rethrowing; ensure you reference
releaseStorageLockFallback and the fs.rm call so the log includes both the path
and error details.
- Around line 355-362: When detecting and removing a stale lock (the branch that
computes lockPid and isDeadProcess using process.kill) and the separate
age-based cleanup branch, add observability logs using the existing logger
(e.g., processLogger or storage logger used elsewhere in this module) that
record the action and context: include the lock path/name, lockPid, whether
isDeadProcess was true, the lock age (timestamp or computed age), and a concise
reason ("stale: dead PID" or "stale: age threshold exceeded"). Place one log
right before or immediately after the dead-process cleanup path (where
isDeadProcess is true) and another log in the age-based cleanup path to make
both events visible to operators. Ensure the log messages are structured and
include these fields so they match ops runbook expectations.
- Around line 390-401: The current finally always calls
releaseStorageLockFallback(lockPath) even when lock.release() succeeds; change
the flow so the fallback is only invoked when lock.release() throws or otherwise
fails: try calling await lock.release() and on catch (error) check the error
code like in the existing block, log the warning, then call await
releaseStorageLockFallback(lockPath) from inside that catch (or when a boolean
“failedToRelease” flag is set) so the fallback only runs on failure; update
references to lock.release, releaseStorageLockFallback, lockPath and preserve
the ENOENT handling/logging behavior.
- Around line 348-380: Add explicit debug/info/error logs inside
cleanupDeadProcessStorageLock to record: when the lock file is read and parsed
(include pid and acquiredAt), when process.kill(pid, 0) indicates the process is
alive (skipping cleanup) or throws ESRCH (releasing lock), when EPERM or other
errors occur (and whether lock is stale), when
releaseStorageLockFallback(lockPath) is invoked, and when ENOENT or JSON parse
errors are encountered; use the existing processLogger (or the module logger)
and include lockPath and relevant metadata in each message. Also add vitest
regression tests for cleanupDeadProcessStorageLock covering: (1) process exists
and lock is fresh — assert no releaseStorageLockFallback call, (2) process dead
(mock process.kill to throw ESRCH) — assert releaseStorageLockFallback called,
(3) process exists but process.kill throws EPERM — assert no release unless lock
is stale (simulate acquiredAt older than
ACCOUNT_STORAGE_LOCK_OPTIONS.staleAfterMs), and (4) missing or corrupted lock
file — assert fallback release or ENOENT path; mock fs.readFile, process.kill,
and releaseStorageLockFallback to verify log calls and behavior.
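A sketch of regression case (2) from that list (dead PID, so the fallback release must fire) could look like the test below. It assumes `cleanupDeadProcessStorageLock` is exported from lib/storage.ts for testing, which may not match the module's real surface; the lock-file shape follows the `{ pid, acquiredAt }` format quoted elsewhere in this review.

```typescript
import { it, expect, vi, afterEach } from "vitest";
import { promises as fs } from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";

afterEach(() => {
  vi.restoreAllMocks();
});

it("removes the lock file when the owning process is dead (ESRCH)", async () => {
  const dir = await fs.mkdtemp(join(tmpdir(), "storage-lock-"));
  const lockPath = join(dir, "accounts.lock");
  await fs.writeFile(
    lockPath,
    `${JSON.stringify({ pid: 999_999, acquiredAt: Date.now() })}\n`,
    "utf8",
  );
  // pretend the recorded pid no longer exists.
  vi.spyOn(process, "kill").mockImplementation(() => {
    throw Object.assign(new Error("no such process"), { code: "ESRCH" });
  });
  const { cleanupDeadProcessStorageLock } = await import("../lib/storage.js");
  await cleanupDeadProcessStorageLock(lockPath);
  // the fallback release should have deleted the stale lock file.
  await expect(fs.stat(lockPath)).rejects.toMatchObject({ code: "ENOENT" });
});
```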
ℹ️ Review info
Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0d15b434-50c2-4cf6-8218-c272deab5dd3
📒 Files selected for processing (1)
lib/storage.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (1)
lib/**
⚙️ CodeRabbit configuration file
focus on auth rotation, windows filesystem IO, and concurrency. verify every change cites affected tests (vitest) and that new queues handle EBUSY/429 scenarios. check for logging that leaks tokens or emails.
Files:
lib/storage.ts
🧬 Code graph analysis (1)
lib/storage.ts (1)
lib/file-lock.ts (1)
acquireFileLock(57-107)
🔇 Additional comments (2)
lib/storage.ts (2)
112-112: lgtm - mutex declaration is straightforward. the `accountFileMutex` follows the same pattern as `storageMutex` at line 111.

1318-1321: regression tests do exist for the locking pattern; they just weren't cited in the pr. test/storage.test.ts:175 directly exercises `withAccountStorageTransaction` with concurrent `Promise.all` calls and verifies serialization without data loss. test/storage-async.test.ts:295 covers the mutex behavior with rapid sequential saves and concurrent writes. test/storage.test.ts:1407+ tests EBUSY/EPERM retry logic, covering windows filesystem edge cases. these tests validate the lock-order changes in `withAccountStorageTransaction`, `saveAccounts`, and `clearAccounts` at lib/storage.ts:1318, 1334, 1346.

> Likely an incorrect or invalid review comment.
Drop pre-acquire dead-process lock cleanup and only run fallback lock-file deletion when lock.release() fails. Also align lock-order comment and add debug observability for fallback cleanup failures.

Co-authored-by: Codex <noreply@openai.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@lib/storage.ts`:
- Around line 123-138: The new file-lock serialization helpers
withAccountFileMutex and withStorageSerializedFileLock and their use in
saveAccounts lack tests covering lock.release() error paths; add unit tests in
test/storage.test.ts that mock the lock.release() implementation to throw EBUSY
and EPERM and assert that releaseStorageLockFallback is invoked and saveAccounts
still completes (no uncaught rejection), add a test that simulates concurrent
saveAccounts calls to verify ordering under lock contention (use mocked locks
that delay release) and add a Windows-specific test case where lock.release()
throws EPERM to validate the fallback cleanup path; reference the functions
withAccountFileMutex, withStorageSerializedFileLock, saveAccounts,
releaseStorageLockFallback and the lock.release() call when adding the mocks and
assertions.
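One shape such a test could take, as a sketch: mock lib/file-lock.js so `lock.release()` rejects with EBUSY and assert `saveAccounts` still settles. The mocked module path and the `saveAccounts([])` argument shape are assumptions for illustration, not the project's exact API.

```typescript
import { it, expect, vi } from "vitest";

it("saveAccounts completes even when lock.release() throws EBUSY", async () => {
  vi.resetModules();
  const release = vi
    .fn()
    .mockRejectedValue(Object.assign(new Error("busy"), { code: "EBUSY" }));
  vi.doMock("../lib/file-lock.js", () => ({
    acquireFileLock: vi.fn().mockResolvedValue({ release }),
  }));
  const { saveAccounts } = await import("../lib/storage.js");
  await saveAccounts([]); // must not reject despite the failed release
  expect(release).toHaveBeenCalledTimes(1);
});
```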
ℹ️ Review info
Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 3d9bb546-70b4-4fee-b0a7-5413a4e724b5
📒 Files selected for processing (1)
lib/storage.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (1)
lib/**
⚙️ CodeRabbit configuration file
focus on auth rotation, windows filesystem IO, and concurrency. verify every change cites affected tests (vitest) and that new queues handle EBUSY/429 scenarios. check for logging that leaks tokens or emails.
Files:
lib/storage.ts
🧬 Code graph analysis (1)
lib/storage.ts (1)
lib/file-lock.ts (1)
acquireFileLock(57-107)
🔇 Additional comments (7)
lib/storage.ts (7)
112-112: lgtm: in-process mutex matches the existing `storageMutex` pattern. the `accountFileMutex` at lib/storage.ts:112 and `withAccountFileMutex` at lib/storage.ts:123-130 mirror the existing `storageMutex`/`withStorageLock` implementation. the pattern is correct for serializing async operations within a single process.

Also applies to: 123-130

132-138: lock order comment now matches implementation. lib/storage.ts:133-134 correctly states the acquisition order: file-queue mutex → file lock → storage mutex. the code at lines 135-137 does exactly that: `withAccountFileMutex` → `withAccountFileLock` → `withStorageLock`. past review comment resolved.

339-348: fallback now has debug observability as requested. lib/storage.ts:343-346 logs cleanup failures at debug level with path and error. addresses the past review comment about silent swallowing.

350-370: release fallback only fires on failure; double cleanup fixed. lib/storage.ts:357-367 now only calls `releaseStorageLockFallback` when `lock.release()` throws (and it's not ENOENT). this addresses the past review about unconditional double cleanup.

one edge case worth noting: if `lock.release()` throws EBUSY (windows antivirus holding the file), `releaseStorageLockFallback` uses `fs.rm({ force: true })`, which will also likely fail with EBUSY and log at debug level. the lock file may remain until the next acquire cleans it via `removeIfStale` in lib/file-lock.ts:91-93. this is probably acceptable for "best-effort" but could leave stale locks on heavily-contested windows systems.

1285-1288: `withAccountStorageTransaction` now uses the serialized lock path. lib/storage.ts:1285-1288 routes through `withStorageSerializedFileLock`, ensuring consistent acquisition order with `saveAccounts` and `clearAccounts`. this eliminates the possibility of lock-order deadlocks between these paths.

1299-1304: `saveAccounts` uses the serialized lock, consistent with the transaction path. lib/storage.ts:1301-1303 wraps `saveAccountsUnlocked` in `withStorageSerializedFileLock`, matching the acquisition order used by `withAccountStorageTransaction`.

1311-1336: `clearAccounts` uses the serialized lock, consistent with the other mutation paths. lib/storage.ts:1313-1335 wraps the entire clear operation in `withStorageSerializedFileLock`. all three mutation paths (`withAccountStorageTransaction`, `saveAccounts`, `clearAccounts`) now share the same lock acquisition sequence.
Add regression tests for lock.release() EBUSY/EPERM fallback cleanup and concurrent save lock serialization to close outstanding PR feedback.

Co-authored-by: Codex <noreply@openai.com>
@greptile pls review
Address unresolved review threads by hardening audit/log redaction, lock scope, cursor validation, retention retries, and storage decrypt behavior; align workflows/docs/scripts; and add regression tests for all actionable items.

Co-authored-by: Codex <noreply@openai.com>
```typescript
const fullPath = join(path, entry.name);
try {
```
wrapping entire recursive call in withRetentionIoRetry amplifies retries and under-counts removals
pruneDirectoryByAge is recursive. wrapping the whole call in withRetentionIoRetry means a single transient EBUSY on any file anywhere in the subtree causes the entire tree-traversal to restart from scratch (up to 5 times per nesting level → 5^depth max attempts on deep trees).
this also breaks the removed counter: files successfully deleted in an aborted pass appear as ENOENT in the retry pass (correctly skipped) but their removal is not counted in the final return value, causing the caller to under-report removals.
the throw error added at line ~102 further means a persistently-locked file (e.g., av-locked log on windows that never clears in 5 retries) now aborts the entire prune cycle rather than skipping that file as the old code did.
recommend applying withRetentionIoRetry only around the individual leaf i/o calls (stat, unlink, rmdir) rather than the recursive call, and restoring the per-file skip behaviour for non-retryable errors:
```typescript
// apply retry only to the leaf ops — not to the entire recursive call
removed += await pruneDirectoryByAge(fullPath, maxAgeMs);
const childEntries = await withRetentionIoRetry(() => fs.readdir(fullPath));
```

and in the per-file catch block, log and `continue` rather than re-throw for non-ENOENT errors so one locked file cannot abort the whole cycle.
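Expanded into a self-contained sketch of that shape: the retry loop, error-code set, and age check below are assumptions modeled on this thread, not the PR's actual implementation.

```typescript
import { promises as fs } from "node:fs";
import { join } from "node:path";

const RETRYABLE = new Set(["EBUSY", "EPERM", "EACCES", "EAGAIN"]);

// retry a single leaf operation with capped exponential backoff.
async function withRetentionIoRetry<T>(op: () => Promise<T>, attempts = 5): Promise<T> {
  for (let i = 0; ; i += 1) {
    try {
      return await op();
    } catch (error) {
      const code = (error as NodeJS.ErrnoException).code ?? "";
      if (i >= attempts - 1 || !RETRYABLE.has(code)) throw error;
      await new Promise((resolve) => setTimeout(resolve, 2 ** i * 10));
    }
  }
}

async function pruneDirectoryByAge(path: string, maxAgeMs: number): Promise<number> {
  let removed = 0;
  const entries = await withRetentionIoRetry(() =>
    fs.readdir(path, { withFileTypes: true }),
  );
  for (const entry of entries) {
    const fullPath = join(path, entry.name);
    try {
      if (entry.isDirectory()) {
        // recurse without an outer retry wrapper: a transient error deep in
        // the tree retries only that leaf, never the whole traversal.
        removed += await pruneDirectoryByAge(fullPath, maxAgeMs);
        continue;
      }
      const stats = await withRetentionIoRetry(() => fs.stat(fullPath));
      if (Date.now() - stats.mtimeMs > maxAgeMs) {
        await withRetentionIoRetry(() => fs.unlink(fullPath));
        removed += 1;
      }
    } catch (error) {
      // per-file skip: one persistently locked file cannot abort the cycle.
      if ((error as NodeJS.ErrnoException).code !== "ENOENT") {
        // log here in real code, then fall through to the next entry.
      }
    }
  }
  return removed;
}
```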
Prompt To Fix With AI
This is a comment left during a code review.
Path: lib/data-retention.ts
Line: 78-79
How can I resolve this? If you propose a fix, please make it concise.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lib/storage.ts (1)
930-947: ⚠️ Potential issue | 🟠 Major

v1 migration output is bypassed when rebuilding normalized accounts. lib/storage.ts:930 computes `baseStorage` via `migrateV1ToV3`, but lib/storage.ts:936 iterates `rawAccounts` instead of `baseStorage.accounts`. this can skip account-level migration normalization before decrypt/dedupe.

proposed fix

```diff
- const rawActiveIndex = clampIndex(activeIndexValue, rawAccounts.length);
- const activeKey = extractActiveKey(rawAccounts, rawActiveIndex);
+ const sourceAccounts = Array.isArray(baseStorage.accounts) ? baseStorage.accounts : [];
+ const rawActiveIndex = clampIndex(activeIndexValue, sourceAccounts.length);
+ const activeKey = extractActiveKey(sourceAccounts as unknown[], rawActiveIndex);
@@
- const validAccounts: AccountMetadataV3[] = [];
- for (const account of rawAccounts) {
+ const validAccounts: AccountMetadataV3[] = [];
+ for (const account of sourceAccounts) {
    if (!isStoredAccountCandidate(account)) {
      continue;
    }
@@
- const clampedRawIndex = clampIndex(rawIndex, rawAccounts.length);
- const familyKey = extractActiveKey(rawAccounts, clampedRawIndex);
+ const clampedRawIndex = clampIndex(rawIndex, sourceAccounts.length);
+ const familyKey = extractActiveKey(sourceAccounts as unknown[], clampedRawIndex);
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@lib/storage.ts` around lines 930 - 947, The code builds baseStorage using fromVersion and migrateV1ToV3 but then iterates rawAccounts, bypassing any migration/normalization; change the loop to iterate baseStorage.accounts (or the normalized accounts array on AccountStorageV3) when populating validAccounts so decryptAccountSensitiveFields runs on migrated data, keeping the existing try/catch that converts errors via toStorageDecryptError and preserving validAccounts accumulation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@index.ts`:
- Around line 1725-1732: The fallback for safeAuditResource currently returns
the raw url when new URL(url) throws, which can leak query/hash into auditLog
(see auditLog in lib/audit.ts); update the fallback in the safeAuditResource
IIFE (and the analogous spot around line 1739) to strip query and hash from the
raw string before returning (e.g., remove everything from the first '?' or '#'
onward) so only origin+pathname or sanitized path is logged; ensure the
sanitized value is what gets passed to auditLog/resource.
In `@lib/background-jobs.ts`:
- Around line 54-57: getDelayMs currently uses a deterministic exponential
backoff which causes worker synchronization; modify getDelayMs in
lib/background-jobs.ts to compute the capped exponential base delay (as it does
now) then apply a randomized jitter of about ±20% (e.g., multiply by 1 +
randomBetween(-0.2, 0.2)) and return an integer ms to avoid synchronized retries
on 429/EBUSY. Keep the existing cap logic and ensure jitter is applied after
capping; update tests in test/background-jobs.test.ts to include a
concurrent-retry assertion that verifies returned delays for the same attempt
can differ (i.e., show jitter is present) while preserving existing behavior for
EPERM/other cases.
- Around line 39-47: The token-redaction regex in sanitizeErrorMessage currently
uses [A-Z0-9._-]+ which misses base64 chars like /, +, =; update the
token-capturing patterns in sanitizeErrorMessage (the
/\b(?:access|refresh|id)?_?token(?:=|:)?\s*([A-Z0-9._-]+)/gi and the Bearer
pattern) to use a non-whitespace matcher such as \S+ so tokens are consumed
until whitespace (e.g., /\b(?:access|refresh|id)?_?token(?:=|:)?\s*\S+/gi and
/\b(Bearer)\s+\S+/gi), and add a unit test in the "redacts sensitive error text"
test to assert that tokens with base64-like suffixes (containing /, +, =) are
fully redacted.
In `@lib/data-retention.ts`:
- Line 21: The RETRYABLE_RETENTION_CODES set is missing "ENOTEMPTY", causing
production code to skip retries while tests expect it; update the
RETRYABLE_RETENTION_CODES constant to include "ENOTEMPTY" (matching the test's
RETRYABLE_REMOVE_CODES) so the retry logic that checks this set (used around the
fs.rmdir call) will treat ENOTEMPTY as transient and retry accordingly.
In `@lib/file-lock.ts`:
- Around line 69-89: When write or close fails while acquiring the lock in
lib/file-lock.ts (the block using handle.writeFile and handle.close that sets
writeError/closeError), perform a best-effort unlink of the lock file before
rethrowing the error to avoid leaving a stale lock on disk; do the same for the
analogous block around lines 154-175, and add a vitest regression that simulates
both a write failure and a close failure during acquisition to assert the lock
file is removed (no orphan lock) even when errors are thrown.
In `@lib/unified-settings.ts`:
- Around line 265-272: The current pattern around
acquireFileLockSync/lock.release in unified-settings (used with
UNIFIED_SETTINGS_LOCK_PATH and SETTINGS_LOCK_OPTIONS and wrapping
readSettingsRecordSync/writeSettingsRecordSync) can propagate lock.release
errors (EPERM/EBUSY) and fail after a successful write; update all call sites
(the blocks around the lock.release calls) to catch errors from lock.release(),
treat EPERM/EBUSY as non-fatal by logging a warning and swallowing them while
rethrowing unexpected errors, ensuring the writeSettingsRecordSync result is not
rolled back; also add vitest tests in test/unified-settings.test.ts that
stub/mock lock.release to throw EPERM and EBUSY and assert the write succeeds
and the lock file does not leave the process in a failing state.
In `@scripts/license-policy-check.js`:
- Around line 48-54: The exact-token denylist check is missing common SPDX
variants (e.g., "gpl-2.0-only", "gpl-2.0-or-later") causing blocked licenses to
slip through; update the matching logic that uses extractLicenseTokens so that
for each token you also derive and check a normalized base SPDX id (strip
suffixes like "-only" and "-or-later", and any trailing "+") against the
denylist (i.e., when evaluating in the function that performs the denylist
lookup near the token match), ensuring you test both the original token and the
stripped/base form before allowing it.
In `@test/background-jobs.test.ts`:
- Around line 141-160: The test currently asserts that the thrown error contains
PII ("person@example.com"); change the assertion to avoid checking for secrets
by expecting a generic failure instead—e.g., assert that
runBackgroundJobWithRetry (from runBackgroundJobWithRetry /
getBackgroundJobDlqPath) rejects (using .rejects.toThrow() with no PII-specific
message, or .rejects.toThrowError()/toBeRejected) and keep any checks about
redaction focused on logs/DLQ contents rather than the thrown error text; update
the assertion in the failing test block so it no longer requires the email/token
to appear.
In `@test/data-retention.test.ts`:
- Around line 101-156: Add deterministic Vitest cases exercising the remaining
retry branches in enforceDataRetention: simulate transient EPERM/EACCES/EAGAIN
errors on directory operations (fs.stat/fs.rmdir) and a transient
EBUSY/EACCES-like error on single-file removal (fs.unlink) to verify the retry
logic in the functions implemented in lib/data-retention.ts (refer to
enforceDataRetention and its internal directory-entry pruning and single-file
unlink paths). For each test, spy/mock the specific fs method (stat, rmdir,
unlink) to throw the transient error once for the target path then succeed,
assert that the retry flag was hit, assert the file/dir is ultimately removed
(expect ENOENT) and restore spies in finally blocks to keep tests deterministic
and isolated.
---
Outside diff comments:
In `@lib/storage.ts`:
- Around line 930-947: The code builds baseStorage using fromVersion and
migrateV1ToV3 but then iterates rawAccounts, bypassing any
migration/normalization; change the loop to iterate baseStorage.accounts (or the
normalized accounts array on AccountStorageV3) when populating validAccounts so
decryptAccountSensitiveFields runs on migrated data, keeping the existing
try/catch that converts errors via toStorageDecryptError and preserving
validAccounts accumulation.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d85dbde0-bd54-490e-b59f-2b0c10f91215
⛔ Files ignored due to path filters (1)

package-lock.json is excluded by !**/package-lock.json

📒 Files selected for processing (20)

.github/workflows/ci.yml, .github/workflows/supply-chain.yml, docs/development/TESTING.md, docs/runbooks/operations.md, index.ts, lib/background-jobs.ts, lib/codex-manager.ts, lib/data-retention.ts, lib/file-lock.ts, lib/storage.ts, lib/unified-settings.ts, package.json, scripts/license-policy-check.js, test/authorization.test.ts, test/background-jobs.test.ts, test/codex-manager-cli.test.ts, test/data-retention.test.ts, test/file-lock.test.ts, test/storage.test.ts, test/unified-settings.test.ts
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Greptile Review
🧰 Additional context used
📓 Path-based instructions (3)
docs/**
⚙️ CodeRabbit configuration file
keep README, SECURITY, and docs consistent with actual CLI flags and workflows. whenever behavior changes, require updated upgrade notes and mention new npm scripts.
Files:
docs/runbooks/operations.md, docs/development/TESTING.md
test/**
⚙️ CodeRabbit configuration file
tests must stay deterministic and use vitest. demand regression cases that reproduce concurrency bugs, token refresh races, and windows filesystem behavior. reject changes that mock real secrets or skip assertions.
Files:
test/authorization.test.ts, test/codex-manager-cli.test.ts, test/data-retention.test.ts, test/background-jobs.test.ts, test/storage.test.ts, test/file-lock.test.ts, test/unified-settings.test.ts
lib/**
⚙️ CodeRabbit configuration file
focus on auth rotation, windows filesystem IO, and concurrency. verify every change cites affected tests (vitest) and that new queues handle EBUSY/429 scenarios. check for logging that leaks tokens or emails.
Files:
lib/background-jobs.ts, lib/storage.ts, lib/unified-settings.ts, lib/file-lock.ts, lib/codex-manager.ts, lib/data-retention.ts
🧬 Code graph analysis (8)
index.ts (2)
  lib/logger.ts (1): logWarn (341-346)
  lib/audit.ts (1): auditLog (123-153)
test/codex-manager-cli.test.ts (2)
  lib/codex-manager.ts (1): runCodexMultiAuthCli (4529-4610)
  scripts/codex.js (1): runCodexMultiAuthCli (501-501)
test/data-retention.test.ts (1)
  lib/data-retention.ts (2): RetentionPolicy (6-12), enforceDataRetention (119-146)
lib/storage.ts (3)
  lib/file-lock.ts (1): acquireFileLock (57-123)
  lib/storage/migrations.ts (1): AccountMetadataV3 (40-57)
  lib/utils.ts (1): isRecord (11-13)
test/background-jobs.test.ts (1)
  lib/background-jobs.ts (2): runBackgroundJobWithRetry (85-129), getBackgroundJobDlqPath (81-83)
test/storage.test.ts (1)
  lib/storage.ts (2): saveAccounts (1326-1331), loadAccounts (1006-1008)
lib/data-retention.ts (1)
  lib/utils.ts (1): sleep (65-67)
test/unified-settings.test.ts (1)
  lib/unified-settings.ts (2): getUnifiedSettingsPath (229-231), saveUnifiedPluginConfig (285-296)
🪛 LanguageTool
docs/runbooks/operations.md
[uncategorized] ~28-~28: The official name of this software platform is spelled with a capital “H”.
Context: ...ce (for example weekly-YYYYMMDD or CI ${{ github.run_id }}) - remove previous key af...
(GITHUB)
🔇 Additional comments (17)
docs/runbooks/operations.md (1)
27-28: good update for rotation idempotency workflow. this command and stable run-id guidance are clear and consistent with the weekly ops flow in docs/runbooks/operations.md (lines 27-28). as per coding guidelines, docs/**: keep README, SECURITY, and docs consistent with actual CLI flags and workflows. whenever behavior changes, require updated upgrade notes and mention new npm scripts.

lib/data-retention.ts (1)

67-69: good retry coverage across retention io callsites. wrapping `readdir`/`stat`/`unlink`/`rmdir` and single-file pruning paths is consistent in lib/data-retention.ts:67, lib/data-retention.ts:81, and lib/data-retention.ts:106. this meaningfully hardens transient io failures.

Also applies to: 81-92, 106-110

test/data-retention.test.ts (1)

7-27: cleanup retry helper looks solid and deterministic. test/data-retention.test.ts:7 and test/data-retention.test.ts:45 improve teardown stability under transient windows fs lock errors while preserving deterministic assertions.

Also applies to: 45-45

lib/background-jobs.ts (1)

104-104: good hardening on error sanitization wiring. lib/background-jobs.ts:104 and lib/background-jobs.ts:119 consistently sanitize error strings before dlq/log output.

Also applies to: 119-119

package.json (1)

69-69: no issue: workflow correctly installs dev deps before the sbom script. the sbom workflow in .github/workflows/supply-chain.yml:72 runs `npm ci` without `--omit=dev`, so devDependencies (including cyclonedx-npm from package.json:107) are installed before the sbom script executes at line 75. the setup is correct.

lib/codex-manager.ts (1)

1521-1521: the guard at lib/codex-manager.ts:1521 is solid and already has regression coverage. test/codex-manager-cli.test.ts:2123 creates a cursor by encoding "12junk" as base64, which decodes to non-digits and correctly triggers the "Invalid --cursor value" error at test/codex-manager-cli.test.ts:2132. this covers the exact scenario the regex guard protects against.

docs/development/TESTING.md (1)

53-60: docs update is aligned with the new local gate. this section clearly documents the ordering change and the added `npm run audit:ci` / `npm run license:check` checks. as per coding guidelines, docs/**: "whenever behavior changes, require updated upgrade notes and mention new npm scripts."

.github/workflows/supply-chain.yml (1)

11-13: nice cleanup on supply-chain workflow config. centralizing the denylist and using `npm run sbom` improves maintainability and keeps policy settings in one place.

Also applies to: 32-32, 75-75

.github/workflows/ci.yml (1)

55-60: ci flow update looks good. artifact handoff plus `needs: test` makes the coverage gate ordering explicit, and `npm run test:coverage` is clearer for coverage intent.

Also applies to: 68-68, 74-78, 90-90

test/authorization.test.ts (1)

59-68: good test isolation hardening. using full auth-env snapshot/restore makes these vitest cases deterministic and prevents cross-test env bleed in test/authorization.test.ts:59.

Also applies to: 72-84, 88-100, 127-127

test/codex-manager-cli.test.ts (2)

2053-2064: good pagination contract assertions. test/codex-manager-cli.test.ts:2053 and test/codex-manager-cli.test.ts:2082 now pin `schemaVersion`, `pagination.cursor`, and `pagination.pageSize`, which tightens the json api contract for cursor paging.

Also applies to: 2082-2093

2103-2147: good malformed cursor regression coverage. test/codex-manager-cli.test.ts:2103 validates both malformed decoded payloads and invalid base64 input with explicit exit-code and error assertions.

test/file-lock.test.ts (1)

88-122: worker contention path now exercises the real lock api cleanly. test/file-lock.test.ts:88-122 and test/file-lock.test.ts:134 validate lock acquisition/release through the transpiled module in separate processes, then assert no lost/duplicate writes under contention.

Also applies to: 134-166

test/unified-settings.test.ts (2)

6-25: cleanup retry helper is a good stability addition. test/unified-settings.test.ts:6 and test/unified-settings.test.ts:44 improve teardown reliability for transient windows remove errors (EBUSY/EPERM/ENOTEMPTY).

Also applies to: 44-44

249-302: good lock-before-read regression test. test/unified-settings.test.ts:249 verifies `saveUnifiedPluginConfig` reads settings while the file lock is held, which directly covers the new lock ordering in lib/unified-settings.ts.

lib/storage.ts (1)

350-359: lock release fallback handling is cleaner now. lib/storage.ts:350 and lib/storage.ts:368 now keep best-effort cleanup observable and only trigger forced fallback on non-ENOENT release failures.

Also applies to: 368-378

test/storage.test.ts (1)

680-714: strong regression coverage for decrypt failures and lock edge cases. test/storage.test.ts:680 plus test/storage.test.ts:1579-1676 cover fail-fast decrypt behavior, `lock.release()` fallback on windows-style errors, and concurrent save lock serialization.

Also applies to: 1579-1676
```typescript
const safeAuditResource = (() => {
  try {
    const parsed = new URL(url);
    return `${parsed.origin}${parsed.pathname}`;
  } catch {
    return url;
  }
})();
```
sanitize fallback resource when url parsing fails
line 1730 falls back to raw url. if parsing fails, query params can be written to audit logs because auditLog persists resource verbatim (lib/audit.ts:122). strip query/hash in the fallback path too.
proposed fix

```diff
- const safeAuditResource = (() => {
+ const safeAuditResource = (() => {
    try {
      const parsed = new URL(url);
      return `${parsed.origin}${parsed.pathname}`;
    } catch {
-     return url;
+     const raw = String(url);
+     return raw.split("?")[0]?.split("#")[0] ?? "unknown";
    }
  })();
```

As per coding guidelines, lib/**: "check for logging that leaks tokens or emails."

Also applies to: 1739-1739
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@index.ts` around lines 1725 - 1732, The fallback for safeAuditResource
currently returns the raw url when new URL(url) throws, which can leak
query/hash into auditLog (see auditLog in lib/audit.ts); update the fallback in
the safeAuditResource IIFE (and the analogous spot around line 1739) to strip
query and hash from the raw string before returning (e.g., remove everything
from the first '?' or '#' onward) so only origin+pathname or sanitized path is
logged; ensure the sanitized value is what gets passed to auditLog/resource.
```typescript
function sanitizeErrorMessage(message: string): string {
  return message
    .replace(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi, "***REDACTED***")
    .replace(
      /\b(?:access|refresh|id)?_?token(?:=|:)?\s*([A-Z0-9._-]+)/gi,
      "token=***REDACTED***",
    )
    .replace(/\b(Bearer)\s+[A-Z0-9._-]+\b/gi, "$1 ***REDACTED***");
}
```
🧩 Analysis chain

🏁 Script executed:

```bash
head -60 lib/background-jobs.ts | tail -30
```

Repository: ndycode/codex-multi-auth
Length of output: 1146

🏁 Script executed:

```bash
wc -l lib/background-jobs.ts
```

Repository: ndycode/codex-multi-auth
Length of output: 94

🏁 Script executed:

```bash
sed -n '35,120p' lib/background-jobs.ts
```

Repository: ndycode/codex-multi-auth
Length of output: 2887

🏁 Script executed:

```bash
fd -e test.ts -e test.js -e spec.ts | head -20
```

Repository: ndycode/codex-multi-auth
Length of output: 607

🏁 Script executed:

```bash
rg -l "sanitizeErrorMessage" --type ts
```

Repository: ndycode/codex-multi-auth
Length of output: 90

🏁 Script executed:

```bash
cat test/background-jobs.test.ts
```

Repository: ndycode/codex-multi-auth
Length of output: 5755

🏁 Script executed:

```bash
python3 << 'PY'
import re
# Current patterns
current_token = re.compile(r'\b(?:access|refresh|id)?_?token(?:=|:)?\s*([A-Z0-9._-]+)', re.I)
current_bearer = re.compile(r'\b(Bearer)\s+[A-Z0-9._-]+\b', re.I)
# Proposed patterns
proposed_token = re.compile(r'\b(?:access|refresh|id)?_?token(?:=|:)?\s*([^\s,;]+)', re.I)
proposed_bearer = re.compile(r'\b(Bearer)\s+[^\s,;]+\b', re.I)
# Test cases with base64-like tokens (containing / + =)
test_cases = [
    "refresh_token=abc/def+ghi==",
    "access_token: xyz+123/abc=",
    "id_token=something+more/data==",
    "Authorization: Bearer abc/xyz+123==",
    "error with token=valid-value_123",
    "Bearer token_abc.def-ghi",
]
print("CURRENT PATTERNS:")
for test in test_cases:
    result = current_token.sub("token=***REDACTED***", test)
    result = current_bearer.sub(r"\1 ***REDACTED***", result)
    leaked = "✗ LEAKED" if any(c in result for c in ['/', '+', '=']) and "token" in test else "✓ OK"
    print(f"{leaked}: {test} → {result}")
print("\nPROPOSED PATTERNS:")
for test in test_cases:
    result = proposed_token.sub("token=***REDACTED***", test)
    result = proposed_bearer.sub(r"\1 ***REDACTED***", result)
    leaked = "✗ LEAKED" if any(c in result for c in ['/', '+', '=']) and ("token" in test or "Bearer" in test) else "✓ OK"
    print(f"{leaked}: {test} → {result}")
PY
```

Repository: ndycode/codex-multi-auth
Length of output: 970
token redaction regex leaks base64-like secret suffixes.
lib/background-jobs.ts:43-45 uses [A-Z0-9._-]+ to match tokens, which stops at /, +, and = characters. this means tokens like refresh_token=abc/def+ghi== would only redact the prefix, leaking the base64 tail.
the test in test/background-jobs.test.ts:106 covers email and token redaction but uses simple tokens (sk_test_123, rt_456) without base64 chars, so it doesn't catch this. add a test case with base64-like tokens to the "redacts sensitive error text" test, then apply the fix to match everything until whitespace:
```diff
 function sanitizeErrorMessage(message: string): string {
   return message
     .replace(/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/gi, "***REDACTED***")
     .replace(
-      /\b(?:access|refresh|id)?_?token(?:=|:)?\s*([A-Z0-9._-]+)/gi,
+      /\b(?:access|refresh|id)?_?token(?:=|:)?\s*([^\s]+)/gi,
       "token=***REDACTED***",
     )
-    .replace(/\b(Bearer)\s+[A-Z0-9._-]+\b/gi, "$1 ***REDACTED***");
+    .replace(/\b(Bearer)\s+([^\s]+)/gi, "$1 ***REDACTED***");
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@lib/background-jobs.ts` around lines 39 - 47, The token-redaction regex in
sanitizeErrorMessage currently uses [A-Z0-9._-]+ which misses base64 chars like
/, +, =; update the token-capturing patterns in sanitizeErrorMessage (the
/\b(?:access|refresh|id)?_?token(?:=|:)?\s*([A-Z0-9._-]+)/gi and the Bearer
pattern) to use a non-whitespace matcher such as \S+ so tokens are consumed
until whitespace (e.g., /\b(?:access|refresh|id)?_?token(?:=|:)?\s*\S+/gi and
/\b(Bearer)\s+\S+/gi), and add a unit test in the "redacts sensitive error text"
test to assert that tokens with base64-like suffixes (containing /, +, =) are
fully redacted.
```typescript
const statusCode = (error as { statusCode?: unknown } | undefined)?.statusCode;
if (typeof statusCode === "number" && statusCode === 429) {
  return true;
}
```
🧩 Analysis chain

🏁 Script executed:

```bash
find . -type f -name "*.ts" | grep -E "(background|job)" | head -20
```

Repository: ndycode/codex-multi-auth
Length of output: 123

🏁 Script executed:

```bash
cat -n lib/background-jobs.ts | head -100
```

Repository: ndycode/codex-multi-auth
Length of output: 3990

🏁 Script executed:

```bash
rg -n "429|EBUSY|EPERM|getDelayMs|runBackgroundJobWithRetry" --type=ts lib/
```

Repository: ndycode/codex-multi-auth
Length of output: 3935

🏁 Script executed:

```bash
fd -e test.ts -e spec.ts -o -e test.js -o -e spec.js | xargs rg -l "429|EBUSY|EPERM|background.*job|getDelayMs" 2>/dev/null
```

Repository: ndycode/codex-multi-auth
Length of output: 2227

🏁 Script executed:

```bash
cat -n test/background-jobs.test.ts
```

Repository: ndycode/codex-multi-auth
Length of output: 7015

🏁 Script executed:

```bash
rg -n "delay|jitter|timing|sleep" test/background-jobs.test.ts
```

Repository: ndycode/codex-multi-auth
Length of output: 50

🏁 Script executed:

```bash
rg -n "getDelayMs" test/background-jobs.test.ts
```

Repository: ndycode/codex-multi-auth
Length of output: 50
add jitter to exponential backoff in lib/background-jobs.ts:49-50 to avoid worker synchronization on 429/EBUSY.
getDelayMs uses deterministic 2^attempt backoff, which causes parallel workers retrying after 429/EBUSY to synchronize and amplify throttling. regression tests exist (test/background-jobs.test.ts:111-139 for 429, test/background-jobs.test.ts:47-70 for EBUSY, test/background-jobs.test.ts:72-109 for EPERM), but they don't verify delay timing. add randomized jitter (e.g., ±20% of capped delay) to getDelayMs and add a test that verifies jitter is present for concurrent retry scenarios.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@lib/background-jobs.ts` around lines 54 - 57, getDelayMs currently uses a
deterministic exponential backoff which causes worker synchronization; modify
getDelayMs in lib/background-jobs.ts to compute the capped exponential base
delay (as it does now) then apply a randomized jitter of about ±20% (e.g.,
multiply by 1 + randomBetween(-0.2, 0.2)) and return an integer ms to avoid
synchronized retries on 429/EBUSY. Keep the existing cap logic and ensure jitter
is applied after capping; update tests in test/background-jobs.test.ts to
include a concurrent-retry assertion that verifies returned delays for the same
attempt can differ (i.e., show jitter is present) while preserving existing
behavior for EPERM/other cases.
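As a concrete sketch of that suggestion: compute the capped exponential delay as before, then scale by a random factor. The constants below are illustrative, not the module's real values; only the post-cap ±20% jitter is the point.

```typescript
const BASE_DELAY_MS = 250; // illustrative, not the module's real constant
const MAX_DELAY_MS = 30_000; // illustrative cap

function getDelayMs(attempt: number): number {
  const capped = Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
  const jitter = 1 + (Math.random() * 0.4 - 0.2); // uniform in [0.8, 1.2)
  return Math.round(capped * jitter);
}

// two workers hitting a 429 on the same attempt now back off at different
// times: repeated getDelayMs(3) calls will usually return different values.
```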
```typescript
  quotaCacheDays: 14,
  dlqDays: 30,
};
const RETRYABLE_RETENTION_CODES = new Set(["EBUSY", "EPERM", "EACCES", "EAGAIN"]);
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
# Get the relevant code sections
echo "=== lib/data-retention.ts ==="
sed -n '15,90p' lib/data-retention.ts || echo "file not found"
echo -e "\n=== test/data-retention.test.ts existence and error code mentions ==="
if [ -f test/data-retention.test.ts ]; then
  wc -l test/data-retention.test.ts
  rg -n 'ENOTEMPTY|EBUSY|EPERM|EACCES|EAGAIN|rmdir|RETRYABLE' test/data-retention.test.ts || echo "no matches found"
else
  echo "test file not found"
fi
echo -e "\n=== Searching all lib files for RETRYABLE_RETENTION_CODES usage ==="
rg -n 'RETRYABLE_RETENTION_CODES' lib/ || echo "no matches"
```

Repository: ndycode/codex-multi-auth
Length of output: 3614
add enotempty to retryable retention codes; tests already expect this behavior
lib/data-retention.ts:21 excludes ENOTEMPTY from RETRYABLE_RETENTION_CODES, but test/data-retention.test.ts:7 already defines RETRYABLE_REMOVE_CODES with ENOTEMPTY included. fs.rmdir at lib/data-retention.ts:84 can transiently fail with ENOTEMPTY under concurrent writers on windows; production code skips the retry while tests expect it to be handled, creating a test/impl parity gap.
proposed fix

```diff
-const RETRYABLE_RETENTION_CODES = new Set(["EBUSY", "EPERM", "EACCES", "EAGAIN"]);
+const RETRYABLE_RETENTION_CODES = new Set([
+  "EBUSY",
+  "EPERM",
+  "EACCES",
+  "EAGAIN",
+  "ENOTEMPTY",
+]);
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```typescript
const RETRYABLE_RETENTION_CODES = new Set([
  "EBUSY",
  "EPERM",
  "EACCES",
  "EAGAIN",
  "ENOTEMPTY",
]);
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@lib/data-retention.ts` at line 21, The RETRYABLE_RETENTION_CODES set is
missing "ENOTEMPTY", causing production code to skip retries while tests expect
it; update the RETRYABLE_RETENTION_CODES constant to include "ENOTEMPTY"
(matching the test's RETRYABLE_REMOVE_CODES) so the retry logic that checks this
set (used around the fs.rmdir call) will treat ENOTEMPTY as transient and retry
accordingly.
```typescript
let writeError: unknown;
let closeError: unknown;
try {
  await handle.writeFile(
    `${JSON.stringify({ pid: process.pid, acquiredAt: Date.now() })}\n`,
    "utf8",
  );
} catch (error) {
  writeError = error;
}
try {
  await handle.close();
} catch (error) {
  closeError = error;
}
if (writeError !== undefined) {
  throw writeError;
}
if (closeError !== undefined) {
  throw closeError;
}
```
cleanup lock file when write/close fails during acquisition
line 85 and line 171 can throw after a failed write/close while leaving the lock file on disk. that creates a stale lock and blocks other writers (concurrency risk), especially noisy on windows EPERM/EBUSY paths in lib/file-lock.ts:69 and lib/file-lock.ts:154.

please unlink best-effort before rethrowing, and add a vitest regression that forces write failure + close failure to verify no orphan lock remains.
proposed fix

```diff
-  if (writeError !== undefined) {
-    throw writeError;
-  }
-  if (closeError !== undefined) {
-    throw closeError;
-  }
+  if (writeError !== undefined || closeError !== undefined) {
+    try {
+      await fs.unlink(path);
+    } catch (cleanupError) {
+      const cleanupCode = (cleanupError as NodeJS.ErrnoException).code;
+      if (cleanupCode !== "ENOENT") {
+        // keep original acquisition error precedence
+      }
+    }
+    throw writeError ?? closeError;
+  }
@@
-  if (writeError !== undefined) {
-    throw writeError;
-  }
-  if (closeError !== undefined) {
-    throw closeError;
-  }
+  if (writeError !== undefined || closeError !== undefined) {
+    try {
+      unlinkSync(path);
+    } catch (cleanupError) {
+      const cleanupCode = (cleanupError as NodeJS.ErrnoException).code;
+      if (cleanupCode !== "ENOENT") {
+        // keep original acquisition error precedence
+      }
+    }
+    throw writeError ?? closeError;
+  }
```

As per coding guidelines, lib/**: "focus on auth rotation, windows filesystem IO, and concurrency."
Also applies to: 154-175
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@lib/file-lock.ts` around lines 69 - 89, When write or close fails while
acquiring the lock in lib/file-lock.ts (the block using handle.writeFile and
handle.close that sets writeError/closeError), perform a best-effort unlink of
the lock file before rethrowing the error to avoid leaving a stale lock on disk;
do the same for the analogous block around lines 154-175, and add a vitest
regression that simulates both a write failure and a close failure during
acquisition to assert the lock file is removed (no orphan lock) even when errors
are thrown.
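A sketch of the requested regression: force the pid write to fail and assert no orphan lock remains. It assumes `acquireFileLock` can be called with just a path and that spying on `fs.open` is enough to intercept the handle; adjust both to the real API in lib/file-lock.ts.

```typescript
import { it, expect, vi, afterEach } from "vitest";
import { promises as fs } from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";

afterEach(() => {
  vi.restoreAllMocks();
});

it("leaves no orphan lock file when the pid write fails", async () => {
  const dir = await fs.mkdtemp(join(tmpdir(), "file-lock-"));
  const lockPath = join(dir, "test.lock");
  const realOpen = fs.open.bind(fs);
  vi.spyOn(fs, "open").mockImplementation(async (...args) => {
    const handle = await realOpen(...(args as Parameters<typeof fs.open>));
    // force the pid write to fail after the lock file already exists.
    vi.spyOn(handle, "writeFile").mockRejectedValue(
      Object.assign(new Error("disk full"), { code: "ENOSPC" }),
    );
    return handle;
  });
  const { acquireFileLock } = await import("../lib/file-lock.js");
  await expect(acquireFileLock(lockPath)).rejects.toMatchObject({ code: "ENOSPC" });
  // with the proposed cleanup, the half-written lock must be gone.
  await expect(fs.stat(lockPath)).rejects.toMatchObject({ code: "ENOENT" });
});
```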
```typescript
const lock = acquireFileLockSync(UNIFIED_SETTINGS_LOCK_PATH, SETTINGS_LOCK_OPTIONS);
try {
  const record = readSettingsRecordSync() ?? {};
  record.pluginConfig = { ...pluginConfig };
  writeSettingsRecordSync(record);
} finally {
  lock.release();
}
```
handle lock release failures without failing successful writes.
lib/unified-settings.ts:271, lib/unified-settings.ts:293, and lib/unified-settings.ts:341 propagate lock.release() errors directly. on windows, eperm/ebusy here can fail the command after data is already persisted and strand settings.json.lock.
proposed fix

```diff
+function releaseUnifiedSettingsLockSync(lockPath: string, release: () => void): void {
+  try {
+    release();
+  } catch (error) {
+    const code = (error as NodeJS.ErrnoException).code;
+    if (code !== "ENOENT") {
+      try {
+        unlinkSync(lockPath);
+      } catch {
+        // best-effort lock cleanup.
+      }
+    }
+  }
+}
+
+async function releaseUnifiedSettingsLockAsync(
+  lockPath: string,
+  release: () => Promise<void>,
+): Promise<void> {
+  try {
+    await release();
+  } catch (error) {
+    const code = (error as NodeJS.ErrnoException).code;
+    if (code !== "ENOENT") {
+      try {
+        await fs.rm(lockPath, { force: true });
+      } catch {
+        // best-effort lock cleanup.
+      }
+    }
+  }
+}
+
 export function saveUnifiedPluginConfigSync(pluginConfig: JsonRecord): void {
   const lock = acquireFileLockSync(UNIFIED_SETTINGS_LOCK_PATH, SETTINGS_LOCK_OPTIONS);
   try {
     const record = readSettingsRecordSync() ?? {};
     record.pluginConfig = { ...pluginConfig };
     writeSettingsRecordSync(record);
   } finally {
-    lock.release();
+    releaseUnifiedSettingsLockSync(UNIFIED_SETTINGS_LOCK_PATH, () => lock.release());
   }
 }
@@
   try {
     const record = (await readSettingsRecordAsync()) ?? {};
     record.pluginConfig = { ...pluginConfig };
     await writeSettingsRecordAsync(record);
   } finally {
-    await lock.release();
+    await releaseUnifiedSettingsLockAsync(UNIFIED_SETTINGS_LOCK_PATH, () => lock.release());
   }
@@
   try {
     const record = (await readSettingsRecordAsync()) ?? {};
     record.dashboardDisplaySettings = { ...dashboardDisplaySettings };
     await writeSettingsRecordAsync(record);
   } finally {
-    await lock.release();
+    await releaseUnifiedSettingsLockAsync(UNIFIED_SETTINGS_LOCK_PATH, () => lock.release());
   }
```

also add vitest regressions for lock.release() throwing EBUSY/EPERM in test/unified-settings.test.ts.

As per coding guidelines lib/**: focus on auth rotation, windows filesystem IO, and concurrency. verify every change cites affected tests (vitest) and that new queues handle EBUSY/429 scenarios.

Also applies to: 287-294, 335-342
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@lib/unified-settings.ts` around lines 265 - 272, The current pattern around
acquireFileLockSync/lock.release in unified-settings (used with
UNIFIED_SETTINGS_LOCK_PATH and SETTINGS_LOCK_OPTIONS and wrapping
readSettingsRecordSync/writeSettingsRecordSync) can propagate lock.release
errors (EPERM/EBUSY) and fail after a successful write; update all call sites
(the blocks around the lock.release calls) to catch errors from lock.release(),
treat EPERM/EBUSY as non-fatal by logging a warning and swallowing them while
rethrowing unexpected errors, ensuring the writeSettingsRecordSync result is not
rolled back; also add vitest tests in test/unified-settings.test.ts that
stub/mock lock.release to throw EPERM and EBUSY and assert the write succeeds
and the lock file does not leave the process in a failing state.
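A sketch of one such regression. The mocked module path, the sync-only mock surface, and the lack of settings-path redirection are all assumptions; a real test would also point `getUnifiedSettingsPath` at a temp directory first and probably mock the async lock variant too.

```typescript
import { it, expect, vi } from "vitest";

it("saveUnifiedPluginConfigSync tolerates EPERM from lock.release()", async () => {
  vi.resetModules();
  vi.doMock("../lib/file-lock.js", () => ({
    acquireFileLockSync: vi.fn(() => ({
      release: () => {
        throw Object.assign(new Error("locked"), { code: "EPERM" });
      },
    })),
  }));
  const settings = await import("../lib/unified-settings.js");
  // in a real test, redirect the settings path to a temp dir before this call.
  expect(() =>
    settings.saveUnifiedPluginConfigSync({ theme: "dark" }),
  ).not.toThrow();
});
```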
```javascript
function extractLicenseTokens(rawLicense) {
  return rawLicense
    .toUpperCase()
    .split(/[^A-Z0-9.-]+/)
    .map((value) => value.trim())
    .filter((value) => value.length > 0);
}
```
denylist matching regressed for spdx variants
the new exact token match misses denied variants like gpl-2.0-only and gpl-2.0-or-later in scripts/license-policy-check.js:70. this can let blocked licenses pass.
proposed fix

```diff
-  const tokens = new Set(extractLicenseTokens(normalized));
+  const tokens = extractLicenseTokens(normalized);
   for (const denied of denyList) {
-    if (tokens.has(denied)) {
+    if (tokens.some((token) => token === denied || token.startsWith(`${denied}-`))) {
       violations.push(`${name}@${version} (${rawLicense})`);
       break;
     }
   }
```

Also applies to: 68-71
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/license-policy-check.js` around lines 48 - 54, The exact-token
denylist check is missing common SPDX variants (e.g., "gpl-2.0-only",
"gpl-2.0-or-later") causing blocked licenses to slip through; update the
matching logic that uses extractLicenseTokens so that for each token you also
derive and check a normalized base SPDX id (strip suffixes like "-only" and
"-or-later", and any trailing "+") against the denylist (i.e., when evaluating
in the function that performs the denylist lookup near the token match),
ensuring you test both the original token and the stripped/base form before
allowing it.
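A compact sketch of the stripped-base check the prompt describes, written in TypeScript for illustration (scripts/license-policy-check.js itself is plain JS, and the helper names here are hypothetical):

```typescript
// derive the base SPDX id by stripping "-only" / "-or-later" / trailing "+".
function spdxBaseId(token: string): string {
  return token
    .replace(/-ONLY$/i, "")
    .replace(/-OR-LATER$/i, "")
    .replace(/\+$/, "");
}

function isDenied(tokens: string[], denyList: Set<string>): boolean {
  return tokens.some(
    (token) => denyList.has(token) || denyList.has(spdxBaseId(token)),
  );
}

// e.g. with denyList = new Set(["GPL-2.0"]), the token "GPL-2.0-ONLY"
// normalizes to "GPL-2.0" and is correctly blocked.
```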
| it("redacts sensitive error text in dead-letter entries and warning logs", async () => { | ||
| vi.resetModules(); | ||
| const warnMock = vi.fn(); | ||
| vi.doMock("../lib/logger.js", () => ({ | ||
| logWarn: warnMock, | ||
| })); | ||
| try { | ||
| const { runBackgroundJobWithRetry, getBackgroundJobDlqPath } = | ||
| await import("../lib/background-jobs.js"); | ||
| await expect( | ||
| runBackgroundJobWithRetry({ | ||
| name: "test.retry-sensitive-error", | ||
| task: async () => { | ||
| throw new Error( | ||
| "network failed for person@example.com Bearer sk_test_123 refresh_token=rt_456", | ||
| ); | ||
| }, | ||
| maxAttempts: 1, | ||
| }), | ||
| ).rejects.toThrow("person@example.com"); |
avoid asserting that thrown errors contain pii/secrets
line 160 currently requires person@example.com to be present in the thrown error. that bakes in a sensitive-data leak contract in test/background-jobs.test.ts:160. this should assert a generic failure instead, and keep secret-bearing strings out of externally surfaced errors.
proposed test adjustment

```diff
-    ).rejects.toThrow("person@example.com");
+    ).rejects.toThrow("network failed");
```

As per coding guidelines, lib/**: "check for logging that leaks tokens or emails."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/background-jobs.test.ts` around lines 141 - 160, The test currently
asserts that the thrown error contains PII ("person@example.com"); change the
assertion to avoid checking for secrets by expecting a generic failure
instead—e.g., assert that runBackgroundJobWithRetry (from
runBackgroundJobWithRetry / getBackgroundJobDlqPath) rejects (using
.rejects.toThrow() with no PII-specific message, or
.rejects.toThrowError()/toBeRejected) and keep any checks about redaction
focused on logs/DLQ contents rather than the thrown error text; update the
assertion in the failing test block so it no longer requires the email/token to
appear.
| it("retries transient EBUSY during directory entry retention pruning", async () => { | ||
| const { enforceDataRetention } = await import("../lib/data-retention.js"); | ||
| const oldDate = new Date(Date.now() - 3 * 24 * 60 * 60_000); | ||
| const logsDir = join(tempDir, "logs"); | ||
| const nestedDir = join(logsDir, "nested"); | ||
| const staleLog = join(nestedDir, "stale.log"); | ||
|
|
||
| await fs.mkdir(nestedDir, { recursive: true }); | ||
| await fs.writeFile(staleLog, "old", "utf8"); | ||
| await fs.utimes(staleLog, oldDate, oldDate); | ||
|
|
||
| const originalStat = fs.stat.bind(fs); | ||
| const statSpy = vi.spyOn(fs, "stat"); | ||
| let statBusyInjected = false; | ||
| statSpy.mockImplementation(async (path, options) => { | ||
| if (!statBusyInjected && path === staleLog) { | ||
| statBusyInjected = true; | ||
| const error = new Error("busy") as NodeJS.ErrnoException; | ||
| error.code = "EBUSY"; | ||
| throw error; | ||
| } | ||
| return originalStat(path, options as { bigint?: boolean }); | ||
| }); | ||
|
|
||
| const originalRmdir = fs.rmdir.bind(fs); | ||
| const rmdirSpy = vi.spyOn(fs, "rmdir"); | ||
| let rmdirBusyInjected = false; | ||
| rmdirSpy.mockImplementation(async (path) => { | ||
| if (!rmdirBusyInjected && path === nestedDir) { | ||
| rmdirBusyInjected = true; | ||
| const error = new Error("busy") as NodeJS.ErrnoException; | ||
| error.code = "EBUSY"; | ||
| throw error; | ||
| } | ||
| return originalRmdir(path); | ||
| }); | ||
|
|
||
| try { | ||
| const policy: RetentionPolicy = { | ||
| logDays: 1, | ||
| cacheDays: 90, | ||
| flaggedDays: 90, | ||
| quotaCacheDays: 90, | ||
| dlqDays: 90, | ||
| }; | ||
| const result = await enforceDataRetention(policy); | ||
| expect(result.removedLogs).toBe(1); | ||
| expect(statBusyInjected).toBe(true); | ||
| expect(rmdirBusyInjected).toBe(true); | ||
| await expect(fs.stat(staleLog)).rejects.toMatchObject({ code: "ENOENT" }); | ||
| await expect(fs.stat(nestedDir)).rejects.toMatchObject({ code: "ENOENT" }); | ||
| } finally { | ||
| statSpy.mockRestore(); | ||
| rmdirSpy.mockRestore(); | ||
| } | ||
| }); |
expand regression coverage to all new retry branches
test/data-retention.test.ts:101 only validates transient EBUSY in directory pruning. the new branches at lib/data-retention.ts:21 and lib/data-retention.ts:104 are still untested (EPERM/EACCES/EAGAIN, plus the single-file unlink retry path), leaving a windows edge-case and concurrency-race regression gap.
suggested test additions
+it.each(["EPERM", "EACCES", "EAGAIN"] as const)(
+ "retries transient %s during state-file pruning",
+ async (code) => {
+ // create stale quota/flagged file
+ // inject one-time error on fs.unlink for that file
+ // assert enforceDataRetention succeeds and file is removed
+ },
+);
+
+it("throws after max retries for persistent EBUSY", async () => {
+ // inject persistent EBUSY on fs.stat or fs.rmdir
+ // assert enforceDataRetention rejects with code EBUSY
+});as per coding guidelines test/**: "tests must stay deterministic and use vitest. demand regression cases that reproduce concurrency bugs, token refresh races, and windows filesystem behavior. reject changes that mock real secrets or skip assertions."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@test/data-retention.test.ts` around lines 101 - 156, add deterministic
Vitest cases exercising the remaining retry branches in enforceDataRetention:
simulate transient EPERM/EACCES/EAGAIN errors on directory operations
(fs.stat/fs.rmdir) and a transient EBUSY/EACCES-like error on single-file
removal (fs.unlink) to verify the retry logic in lib/data-retention.ts (see
enforceDataRetention and its internal directory-entry pruning and single-file
unlink paths). For each test, spy on the specific fs method (stat, rmdir,
unlink) to throw the transient error once for the target path and then succeed;
assert that the retry flag was hit, assert the file or directory is ultimately
removed (expect ENOENT), and restore spies in finally blocks to keep tests
deterministic and isolated.
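For concreteness, a minimal sketch of one such case, assuming the same tempDir/fs/vi setup as the EBUSY test above; the "flagged" directory layout and policy shape are assumptions:

```ts
it("retries transient EPERM during single-file pruning", async () => {
  const { enforceDataRetention } = await import("../lib/data-retention.js");
  const flaggedDir = join(tempDir, "flagged"); // assumed layout
  const staleFile = join(flaggedDir, "stale.json");
  await fs.mkdir(flaggedDir, { recursive: true });
  await fs.writeFile(staleFile, "{}", "utf8");
  const oldDate = new Date(Date.now() - 3 * 24 * 60 * 60_000);
  await fs.utimes(staleFile, oldDate, oldDate);

  const originalUnlink = fs.unlink.bind(fs);
  const unlinkSpy = vi.spyOn(fs, "unlink");
  let unlinkDeniedInjected = false;
  unlinkSpy.mockImplementation(async (path) => {
    if (!unlinkDeniedInjected && path === staleFile) {
      unlinkDeniedInjected = true;
      const error = new Error("denied") as NodeJS.ErrnoException;
      error.code = "EPERM"; // swap for EACCES/EAGAIN via it.each to cover all branches
      throw error;
    }
    return originalUnlink(path);
  });

  try {
    const policy: RetentionPolicy = {
      logDays: 90,
      cacheDays: 90,
      flaggedDays: 1,
      quotaCacheDays: 90,
      dlqDays: 90,
    };
    await enforceDataRetention(policy);
    expect(unlinkDeniedInjected).toBe(true);
    await expect(fs.stat(staleFile)).rejects.toMatchObject({ code: "ENOENT" });
  } finally {
    unlinkSpy.mockRestore();
  }
});
```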
Summary
Validation
note: greptile review for oc-chatgpt-multi-auth. cite files like
lib/foo.ts:123. confirm regression tests + windows concurrency/token redaction coverage.

Greptile Summary
this pr applies a broad enterprise hardening baseline across storage concurrency, token redaction, data retention resilience, supply-chain gating, and operational runbooks. the core storage change introduces a three-layer lock ordering (accountFileMutex → file lock → storageMutex) via withStorageSerializedFileLock, which makes cross-process save ordering deterministic and explicit. accompanying changes add EDECRYPT fast-fail, releaseStorageLockFallback, url-stripping in audit logs, error-message sanitization in background jobs, and a token-set license check fix.
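to make that ordering concrete, here is a hedged sketch under assumed primitives (an async FIFO mutex plus the lib/file-lock.ts acquire helper); the real withStorageSerializedFileLock signature in lib/storage.ts may differ:

```ts
import { Mutex } from "async-mutex"; // assumption: any FIFO async mutex works here
import { rm } from "node:fs/promises";

// assumed shape of the cross-process lock helper from lib/file-lock.ts
declare function acquireFileLock(path: string): Promise<{ release(): Promise<void> }>;

const accountFileMutex = new Mutex();
const storageMutex = new Mutex();

async function withStorageSerializedFileLock<T>(
  lockPath: string,
  fn: () => Promise<T>,
): Promise<T> {
  // layer 1: in-process account mutex keeps same-process callers in FIFO order
  return accountFileMutex.runExclusive(async () => {
    // layer 2: cross-process file lock (.lock file carrying the holder's pid)
    const lock = await acquireFileLock(lockPath);
    try {
      // layer 3: storage mutex guards the actual read-modify-write
      return await storageMutex.runExclusive(fn);
    } finally {
      try {
        await lock.release();
      } catch (error) {
        if ((error as NodeJS.ErrnoException).code !== "ENOENT") {
          // best-effort fallback so a failed unlink cannot wedge later acquisitions
          await rm(lockPath, { force: true }).catch(() => {});
        }
      }
    }
  });
}
```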
key issues found:

- lib/file-lock.ts — when handle.writeFile throws (e.g. windows EPERM from an av-scanner), the lock file is left on disk empty or partial. stale-detection cannot parse its pid and skips the unlink, blocking all further acquisitions for up to 120 s. both the async and sync paths are affected. no vitest coverage for this branch.
- lib/data-retention.ts — withRetentionIoRetry wraps the entire recursive pruneDirectoryByAge call rather than individual leaf i/o ops, causing exponential retry amplification on deep trees and under-counting removals across aborted passes. the new throw error on non-enoent errors also changes behaviour from skip-and-continue to abort-entire-cycle, which can stop a full retention run if one file is persistently av-locked on windows.
- lib/unified-settings.ts — the sync temp path uses no random suffix (${pid}.${Date.now()}.tmp), creating a cross-process collision risk for same-millisecond writes; the async path already uses a random suffix.
- lib/background-jobs.ts — the sanitizeErrorMessage token regex character class [A-Z0-9._-] misses the standard base64 chars + and /, allowing partial token leakage for non-url-safe tokens (a sketch of a broader pattern follows below).

Confidence Score: 2/5
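on the last bullet above, a hedged sketch of a broader token pattern; the length threshold, ordering, and placeholder text are illustrative, not the real sanitizeErrorMessage implementation:

```ts
// illustrative pattern: adds + and / (plus = padding) so classic base64
// tokens are caught; 20 is an assumed minimum token length
const TOKEN_PATTERN = /\b[A-Za-z0-9+\/._-]{20,}={0,2}/g;

function sanitizeErrorMessage(message: string): string {
  return message
    .replace(/[^\s@]+@[^\s@]+\.[^\s@]+/g, "[redacted-email]") // emails first
    .replace(TOKEN_PATTERN, "[redacted-token]");
}
```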
Important Files Changed
Sequence Diagram
```mermaid
sequenceDiagram
    participant C as Caller
    participant QM as accountFileMutex<br/>(in-process queue)
    participant FL as acquireFileLock<br/>(cross-process .lock file)
    participant SM as storageMutex<br/>(in-process queue)
    participant FS as Filesystem
    C->>QM: withAccountFileMutex()
    QM-->>C: queued (waits for prev)
    C->>FL: acquireFileLock(path.lock)
    Note over FL,FS: wx open → write PID → close
    FL-->>C: lock handle
    C->>SM: withStorageLock()
    SM-->>C: queued (waits for prev)
    C->>FS: loadAccountsInternal / saveAccountsUnlocked
    FS-->>C: result
    C->>SM: release storageMutex
    C->>FL: lock.release() [unlink .lock]
    alt release throws non-ENOENT
        FL-->>C: error
        C->>FS: releaseStorageLockFallback (fs.rm --force)
    end
    C->>QM: release accountFileMutex
```

Comments Outside Diff (2)
General comment
orphaned lock file on write failure — windows filesystem deadlock risk
when handle.writeFile throws (common on windows due to an av-scanner holding the fd briefly — EPERM), the file was already created by fs.open("wx"). the error path closes the handle and throws, but never unlinks the file. subsequent acquireFileLock calls hit EEXIST and fall into the stale-detection path, which tries to JSON.parse the lock file. an empty or partially-written file causes the parse to throw, and the stale cleanup skips the unlink — leaving the lock stuck for the full staleAfterMs (120 s).

same problem exists in acquireFileLockSync at the equivalent block.

no vitest coverage for this branch — add a test that injects EPERM on handle.writeFile and asserts the lock file is removed before the next attempt.
General comment
write happens outside the file lock in the refactored writeSettingsRecordSync

the old code held the file lock for the entire temp-write → rename sequence. the refactored writeSettingsRecordSync writes the temp file and renames before any lock is acquired — the lock is now only acquired in the callers (saveUnifiedPluginConfigSync, etc.). since writeSettingsRecordSync is called inside the locked block in the callers, the write is actually protected today.

however, the rename loop in writeSettingsRecordSync happens while the lock is held, which is correct. but the temp path for the sync variant is ${path}.${process.pid}.${Date.now()}.tmp — no random suffix. two same-process sync writes occurring at the same millisecond timestamp would overwrite each other's temp file. the file lock serialises same-process callers, but a cross-process scenario (multiple terminals starting at the same ms) could clobber the temp file before rename. the async variant already uses a random suffix — apply the same pattern here for consistency and safety.
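a minimal sketch of that fix, assuming node's crypto for the suffix; the helper name is hypothetical:

```ts
import { randomBytes } from "node:crypto";

// hypothetical helper: pid + timestamp alone collide across processes that
// start in the same millisecond; a random suffix makes the temp path unique
function tempPathFor(path: string): string {
  return `${path}.${process.pid}.${Date.now()}.${randomBytes(6).toString("hex")}.tmp`;
}
```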
Last reviewed commit: 0c48823
Context used:
dashboard - What: Every code change must explain how it defends against Windows filesystem concurrency bugs and ... (source)