Comparing changes
base repository: couchbase/indexing
base: master
head repository: couchbase/indexing
compare: morpheus
- 15 commits
- 20 files changed
- 8 contributors
Commits on Nov 19, 2025
- [BP to 8.0.1] MB-69145 Fix alternate shardId population for alter index

BP from MB-68572. When ALTER INDEX is performed on an index with alternate shardIds, the replica's alternate shardIds are derived from the alternate shardIds in the definition passed to replica repair. For example, if replica repair is driven by the definition containing partitions 1 and 2, then the definition containing partitions 3 and 4 will also use the alternate shardIds of partitions 1 and 2. This leads to the affected partitions being built with empty alternate shardIds.

Change-Id: I2aa446632fbcf88cc94c0615c3c85e399a8e42f4
Commit: 350d885
Commits on Nov 21, 2025
- [BP 8.0.1] MB-69234: Add mechanism to retry security context initialisation on error

BP MB-68943. Currently, if the function executed inside a sync.Once fails, the Once is still marked as done. So if any error occurs while initialising the client's security context, the context is left unset and is never retried, because the sync.Once is already marked done.

Fix: introduce a new type, very similar to sync.Once, with the key difference that it is aware of the result (error) returned by the wrapped function and marks itself done only when the function succeeds. On any error it is not marked done and can be invoked again.

Change-Id: I12bf1395187515edb65917bd4e7fedfb1797eeac
Commit: 80ca8c9
- [BP 8.0.1] MB-69232: Don't transition TT to Ready when cancel is called during Merge

BP MB-68558. During startShardRecovery, after the indexes are recovered, the function waits for the index state to become ACTIVE. In that loop all instances are iterated; if an instance reaches ACTIVE but its instId has already been deleted from the processedInsts list by the time finishSuccess() runs, there is no guarantee that the merge of destTokenToMergeOrReady went through. The transfer token could then be transitioned to the next state even though the merge will not happen, and in that case cleanup will not be done by the destination node.

To prevent the transfer token from moving to the next state, an errgroup is used to capture any errors returned by the merge-observer goroutines. An error from any goroutine signifies that the rebalance was either cancelled or done. The errgroup waits for all goroutines to finish and checks for a returned error; if any error is present, the token does not proceed further.

Change-Id: Ie13a85a4096a025a1e34c5770a24f46801a494d9
Commit: 185d95a
- [BP 8.0.1] MB-69175: Improve debuggability in RestoreShardDone

BP MB-68906. Currently, responses and errors returned by the storage engine are not logged by GSI, so information about the storage layer's bookkeeping can be missing from the logs when debugging.

Change-Id: I741dc074a57a586043f22f0157256d3cd6d2e4ee
Commit: ce6fb9d
Commits on Nov 25, 2025
- [BP to 8.0.1] MB-69391 update graph status during merge/prune

BP from MB-69229. During rebalance, Bhive proxies were merging into their real instances without copying the in-memory BhiveGraphStatus, and pruned partitions left behind stale entries. This led to mismatches where checkDDLInProgress would see stale or missing graph status.

Changes: copy each partition's graph-ready flag from the source proxy when merging, and delete the entry when pruning, so the in-memory map always matches the local partition set.

Change-Id: I79e04944a46950aefbbcc857d9e13ba8fd858fdb
Commit: 9b86108
- [BP to 8.0.1] MB-69390 return DDL in progress for bhive graph build

BP from MB-68576. Bhive graph build is a CPU-intensive task, so it can saturate the CPU when run in conjunction with rebalance. For this reason the rebalancer already makes rebalance wait until the Bhive graph build is completed. Instead of rebalance being stuck for a long time (several hours for large indexes), it is better to reject rebalance by treating the graph build as DDL in progress.

Since there isn't an instance-wide GraphBuildPhase like there is a TrainingPhase, introducing one was considered, as it would have made this check cleaner and could be used elsewhere. But adding a new field and all of its transitions was not worth it for just this one check.

Change-Id: If115aefa84f2040f417c1889e792edaaf71aa595
Commit: afdf91c
Commits on Dec 11, 2025
- MB-69465 [BP 8.0.1]: Add config to toggle skiplist node padding

Toggling this config requires an indexer restart to apply it to existing indexes, since the skiplist node layout cannot be changed easily at runtime.

Change-Id: Ib7a9bce8967b6f0298685159da99ca0a8326d7b4 (cherry picked from commit 7a60aa9)
Commit: 3168596
Commits on Dec 22, 2025
- MB-69944: [BP 8.0.1] Monitor indexer failover in projector and close conn

Backport of MB-65760. Projector was timing out around 15 minutes after indexer failover because the dataport client held a stale TCP connection. When the indexer node failed over or restarted uncleanly, no FIN/RST was sent, leaving the connection half-open. This can also happen when nodes come back with a new IP after failover.

The client continued to use the old TCP stream until the OS TCP stack exhausted its retransmission attempts, which delayed error detection and prolonged recovery. This change adds proper detection and cleanup of half-open TCP connections so the projector reconnects immediately after failover or server restart.

Change-Id: Ice7ea750636ea59b4b49a3d98fec0a3e6e25b50f (cherry picked from commit 84b3963)
Commit: f7ba7ef
Commits on Dec 24, 2025
- MB-69925 Add Missing # TYPE Annotations for Metrics in handleMetricsHigh

The populateMetrics() function was outputting Prometheus metrics without the required # TYPE declarations, causing inconsistency with handleMetrics(), which correctly outputs both the metric type and value.

Fix: added # TYPE declarations to the populateMetrics() and populateIsDivergingReplicaStat() functions.

Change-Id: I62a1d30886a0f305b5cd36dca48485c10115db91
Commit: 184d752
Commits on Dec 25, 2025
- MB-69953 [BP 8.0.1] Implement dynamic timer update during stream merge

Backport of MB-64242. This change allows the timekeeper to slow down the MAINT_STREAM relative to the INIT_STREAM and facilitate the stream merge. Generally, the MAINT_STREAM is expected to run slower than the INIT_STREAM during the catchup phase, as it processes more indexes. But with collection-based data modelling it is possible that the MAINT_STREAM is handling indexes of a collection with a very low workload while the INIT_STREAM has index(es) with a high workload, and the INIT_STREAM may not be able to catch up.

This patch adds the ability for the timekeeper to identify long-running stream merges and slow down the MAINT_STREAM. As the MAINT_STREAM actively serves scans, slowing it down adds to scan latency, so this action is taken incrementally and only up to a configurable maximum interval.

New configs:
- timekeeper.mergePhase.maxTimerInterval (default: 100ms, 0 disables)
- timekeeper.mergePhase.tsQueueThreshold (default: 500)
- timekeeper.maintStream.forcedDelay (default: 0)

The last config allows external throttling of the MAINT_STREAM in case automatic throttling is not sufficient.

Change-Id: I71bca2810de85277516d9c4e229f1f5318f3ad80
Commit: 66b619a
Commits on Jan 8, 2026
- [BP to 8.0.1] MB-69908 Skip vector sampling for indexes with recovered codebooks

BP from MB-68640. During shard-based rebalance, vector index codebooks are transferred along with the index data, and the destination node recovers the codebook from disk. However, the training flow still fetched sample vectors from KV even though they were never used, wasting resources.

Changes:
- Add instHasTrainedCodebook() to check if any slice has a trained codebook using slice.IsTrained() (avoids memory allocation)
- Skip sampling when defn-level reuse or a trained slice can supply a codebook; only call FetchSampleVectorsForIndexes for instances that actually need it
- Fix: set successMap to nil on early return to avoid duplicate reporting in retry scenarios

Cases handled:
- Rebalance/Resume: a trained slice on this instance, its real inst, or other instances of the defn lets all defn instances skip sampling
- ALTER INDEX: trained active replicas of the defn, or trained slices on this instance, let the defn skip sampling
- Mixed batch: if sampling fails, codebook-backed instances continue; if all skip sampling, no KV fetch occurs

Change-Id: I8ca06178d3b4d718525cc52850e0b5657a89d925
Commit: 63951c3
Commits on Jan 12, 2026
- [BP to 8.0.1] MB-69909 Reject merge if source/target instances are in training phase

BP from MB-68544. If the source or target index instances are in the training phase, reject the merge. The indexer will trigger the merge after training is done for those instances (if the keyspace is idle, the indexer needs to force a merge after training is done to properly merge the erroneous instances).

After training, if source and target do not agree on the same training state, fail the rebalance. This can happen in the following case:
1. Real instance got a build request and training initiated
2. A proxy (p1) moved to the node and started its build
3. Real instance did not find any documents and was marked for training error
4. While proxy p1's training is in progress, another proxy p2 started training, so both realInst and p2 have started to train
5. p1 moves to TRAINING_NOT_STARTED, but training for realInst and p2 now succeeds, so they move to the TRAINING_COMPLETED state
6. In this case, realInst and proxy p1 cannot be merged, and rebalance would fail

Although this is a rare race condition, it is better to handle it. A retry of rebalance should fix the issue, as the documents are added in the middle of training the different proxies.

Change-Id: I01c704a9eb0f2431304d426536dab84fc8ca941a
Commit: 1e8e45a
Commits on Jan 14, 2026
- MB-69935 call restoreShardDone immediately post recovery

[BP to 8.0.1] Background: the expectation with restoreShardDone is that it will be called after recovery of all indexes; this signals plasma that the required indexes are done with restore and it can proceed with cleaning up dead instances, starting LSS cleaners, etc. Indexes whose shards have not had restoreShardDone called should not be used if the indexer crashes and bootstrap recovery happens, primarily because such indexes are expected to be not rebalance-active and to be cleaned up. This is not true for non-empty-node batched indexes, leading to the shard-corruption bug.

Fix: to resolve the semantic issues in calling the restoreShardDone API, call it once recovery of all indexes is complete, before transitioning to the next state. This guarantees that the indexes which can be recovered after a crash are either from a shard which was marked done, or will be deleted.

Assumption: plasma assumes that GSI will not recover indexes whose shards have not had restoreShardDone called before crashes.

Change-Id: I1b75a34e62ce529d407faee301a2cb12cdcbc873
Signed-off-by: Dhruvil Shah <dsdruvil8@gmail.com>
Commit: d5e84ae
Commits on Jan 21, 2026
- MB-69935 don't call restoreShardDone out of rebalance context

[BP to 8.0.1] Background: as described in the ticket, restoreShardDone should not be called outside of rebalance context. The last remaining such call is in RestoreAndUnlockShard, which is invoked during rebalance cleanup to restore shards that are locked for recovery and pending ready. It can no longer happen that we have shards which are ready/pending ready without restoreShardDone having been called; all shards which have not undergone restoreShardDone are expected to be dropped.

Fix: restoreShardDone is not required in cleanup; we can have shards locked for recovery, but not shards which have not undergone restoreShardDone, so only unlock the shards in such cases.

Tests: the existing functional test in CI, TestVectorIndexShardRebalance/TestRebalanceCancelIndexerAfterRecovery, already tests for the behaviour we are aiming for.

Change-Id: I366942977417c5f58d80d8e2bfc1b43a15bae3fe
Signed-off-by: Dhruvil Shah <dsdruvil8@gmail.com>
Commit: 23c172c
- [BP 8.0.1] MB-69906: Log a warning and suppress error to storage manager

Whenever the Bhive shard manager is not initialised (which can happen when no Bhive index has been created on the node), an error log was printed. Fix: for the storage-not-initialised error, print a warning and suppress the error returned to the storage manager. Suppressing the error should have no side effect, since the bhiveShards slice returned would be nil.

Change-Id: I46d29d66a402aed57086cd76992221959e14ab75 (cherry picked from commit 7366285)
Commit: 1a5950b
This comparison is taking too long to generate, and GitHub can't render it right now; it might be too big. You can run this command locally to see the comparison on your machine:
git diff master...morpheus