
Comparing changes

base repository: couchbase/indexing
base: master
head repository: couchbase/indexing
compare: morpheus
  • 15 commits
  • 20 files changed
  • 8 contributors

Commits on Nov 19, 2025

  1. [BP to 8.0.1] MB-69145 Fix alternate shardId population for alter index

    BP from MB-68572
    
    When alter index is performed on an index with alternate shardIds,
    the replica alternate shardIds are formed from the alternate
    shardIds in the defn passed to replica repair.

    E.g., if replica repair is performed based on the definition
    containing partitions 1,2, then for the definition containing
    partitions 3,4 the alternate shardIds for partitions 1,2 will be
    used. This leads to those partitions being built with empty
    alternate shardIds.
    
    Change-Id: I2aa446632fbcf88cc94c0615c3c85e399a8e42f4
    varunv-cb authored and Nischal1729 committed Nov 19, 2025
    350d885

Commits on Nov 21, 2025

  1. [BP 8.0.1] MB-69234: Add mechanism to retry security context initialisation on error
    
    BP MB-68943
    
    Currently, if any error is encountered while executing the function
    inside the sync.Once primitive, the primitive is marked as done even
    though the function failed. So if any error occurs while the security
    context for the client is being initialised, the security context is
    not set and initialisation is never retried, because sync.Once is
    already marked as done.
    
    Fix: introduce a new type very similar to sync.Once, with the key
    difference that it is aware of the result (error) of the function it
    runs and marks itself done only when the function succeeds. On any
    error it is not marked done and can be invoked again.
    
    Change-Id: I12bf1395187515edb65917bd4e7fedfb1797eeac
    shivanshrustagi committed Nov 21, 2025
    80ca8c9
  2. [BP 8.0.1] MB-69232: Don't transition TT to Ready when cancel is called during Merge
    
    BP MB-68558
    
    During startShardRecovery, after the indexes are recovered, the
    function waits for the index state to become ACTIVE. The default loop
    iterates over all instances; if an instance reaches ACTIVE state but
    its instId has already been deleted from the processedInsts list by
    the time finishSuccess() runs, there is no guarantee that the merge of
    destTokenToMergeOrReady went through. This can cause the TT to be
    transitioned to the next state even though the merge will not happen.
    If it transitions, the cleanup will not be done by the destination node.

    To prevent the transfer token from moving to the next state, errgroup
    is used to capture any errors returned by the merge-observer
    goroutines. An error from any one of the goroutines signifies that the
    rebalance was either cancelled or done. The errgroup waits for all
    goroutines to finish and checks for any returned error; if one is
    returned, processing does not proceed further.
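
    The wait-for-all-and-surface-the-first-error pattern described above can
    be sketched as follows, using only the standard library to stay
    self-contained. waitObservers and the sentinel error are hypothetical
    names; the actual patch uses the errgroup package.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errRebalCancelled = errors.New("rebalance cancelled") // hypothetical sentinel

// waitObservers mimics errgroup.Wait with the standard library: it runs
// every observer concurrently, waits for all of them, and reports the
// first error seen.
func waitObservers(observers []func() error) error {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		first error
	)
	for _, obs := range observers {
		wg.Add(1)
		go func(f func() error) {
			defer wg.Done()
			if err := f(); err != nil {
				mu.Lock()
				if first == nil {
					first = err
				}
				mu.Unlock()
			}
		}(obs)
	}
	wg.Wait()
	return first
}

func main() {
	observers := []func() error{
		func() error { return nil },
		func() error { return errRebalCancelled }, // one observer sees a cancel
		func() error { return nil },
	}
	if err := waitObservers(observers); err != nil {
		fmt.Println("not transitioning TT:", err)
		return
	}
	fmt.Println("all merges observed; safe to transition TT")
}
```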
    
    Change-Id: Ie13a85a4096a025a1e34c5770a24f46801a494d9
    shivanshrustagi committed Nov 21, 2025
    185d95a
  3. [BP 8.0.1] MB-69175: Improve debuggability in RestoreShardDone

    BP MB-68906
    
    Currently, any response/error returned by the storage engine is not
    logged by GSI. This means that certain information about the
    bookkeeping done by the storage layers can be missed in the logs
    when debugging.
    
    Change-Id: I741dc074a57a586043f22f0157256d3cd6d2e4ee
    shivanshrustagi committed Nov 21, 2025
    ce6fb9d

Commits on Nov 25, 2025

  1. [BP to 8.0.1] MB-69391 update graph status during merge/prune

    BP from MB-69229
    
    During rebalance, Bhive proxies were merging into their real instances without copying the in-memory BhiveGraphStatus, and pruned partitions left behind stale entries. This led to mismatches where checkDDLInProgress would see stale or missing graph status.
    
    Changes: copy each partition’s graph-ready flag from the source proxy when merging, and delete the entry when pruning, so the in-memory map always matches the local partition set.
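
    A minimal sketch of that bookkeeping follows; graphReady, mergeProxy, and
    prunePartition are hypothetical names standing in for the in-memory
    BhiveGraphStatus handling.

```go
package main

import "fmt"

// graphReady is a hypothetical stand-in for the in-memory
// BhiveGraphStatus map keyed by partition id.
type graphReady map[int]bool

// mergeProxy copies each merged partition's graph-ready flag from the
// source proxy map into the real instance's map.
func mergeProxy(realInst, proxy graphReady, partns []int) {
	for _, p := range partns {
		realInst[p] = proxy[p]
	}
}

// prunePartition drops the entry so the map always matches the local
// partition set and checkDDLInProgress never sees a stale entry.
func prunePartition(realInst graphReady, partn int) {
	delete(realInst, partn)
}

func main() {
	realInst := graphReady{1: true}
	proxy := graphReady{2: true, 3: false}
	mergeProxy(realInst, proxy, []int{2, 3}) // proxy merges into real instance
	prunePartition(realInst, 1)              // partition 1 pruned away
	fmt.Println(realInst[2], realInst[3])
	_, stale := realInst[1]
	fmt.Println(stale)
}
```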
    
    Change-Id: I79e04944a46950aefbbcc857d9e13ba8fd858fdb
    Nischal1729 committed Nov 25, 2025
    9b86108
  2. [BP to 8.0.1] MB-69390 return DDL in progress for bhive graph build

    BP from MB-68576
    
    Since a Bhive graph build is a CPU-intensive task, it can cause CPU saturation when run in conjunction with rebalance. The rebalancer already makes rebalance wait until the Bhive graph build is completed. Instead of leaving rebalance stuck for a long time (several hours for large indexes), it is better to reject it by treating the graph build as DDL in progress.

    Since there isn't an instance-wide GraphBuildPhase like there is a TrainingPhase, I did consider introducing one, as it would have made this check cleaner and could be used elsewhere. But I didn't want to add a new field and all the state transitions for just this one case.
    
    Change-Id: If115aefa84f2040f417c1889e792edaaf71aa595
    Nischal1729 committed Nov 25, 2025
    afdf91c

Commits on Dec 11, 2025

  1. MB-69465 [BP 8.0.1]: Add config to toggle skiplist node padding

    Toggling this config requires indexer restart
    to apply it to existing indexes since skiplist
    node layout cannot be changed easily at runtime.
    
    Change-Id: Ib7a9bce8967b6f0298685159da99ca0a8326d7b4
    (cherry picked from commit 7a60aa9)
    akhilmd committed Dec 11, 2025
    3168596

Commits on Dec 22, 2025

  1. MB-69944:[BP 8.0.1] Monitor indexer failover in projector and close conn

    * Backport of MB-65760
    * Projector is timing out around 15 minutes after indexer failover as
      dataport client held a stale TCP connection. When the indexer node
      failed over or restarted uncleanly, no FIN/RST was sent, leaving the
      connection half-open.
    * This can happen when nodes come back with a new IP after failover
    * The client continued to use the old TCP stream until the OS TCP stack
      exhausted its retransmission attempts. This caused delayed
      error detection and prolonged recovery.
    * This change adds proper detection and cleanup of half-open TCP
      connections so projector reconnects immediately after failover or server
      restart.
    
    Change-Id: Ice7ea750636ea59b4b49a3d98fec0a3e6e25b50f
    (cherry picked from commit 84b3963)
    ksaikrishnateja committed Dec 22, 2025
    f7ba7ef

Commits on Dec 24, 2025

  1. MB-69925 Add Missing # TYPE Annotations for Metrics in handleMetricsHigh

    The populateMetrics() function was outputting Prometheus metrics
    without the required # TYPE declarations, causing an inconsistency
    with handleMetrics(), which correctly outputs both the metric type
    and value.

    Fix: added # TYPE declarations to the populateMetrics() and
    populateIsDivergingReplicaStat() functions.
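
    For reference, the Prometheus exposition format pairs each metric with a
    # TYPE declaration as in the fragment below; the metric names here are
    illustrative, not the indexer's actual stats.

```
# TYPE index_memory_used gauge
index_memory_used 104857600
# TYPE index_num_requests counter
index_num_requests 42
```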
    
    Change-Id: I62a1d30886a0f305b5cd36dca48485c10115db91
    varuni7 committed Dec 24, 2025
    184d752

Commits on Dec 25, 2025

  1. MB-69953 [BP 8.0.1] Implement dynamic timer update during stream merge

    Backport of MB-64242
    
    This change allows the timekeeper to slow down the MAINT_STREAM
    relative to the INIT_STREAM and facilitate the stream merge.
    Generally, the MAINT_STREAM is expected to run slower than the
    INIT_STREAM during the catchup phase, as it is processing more
    indexes. But with collection-based data modelling it is possible
    that the MAINT_STREAM is handling indexes of a collection with a
    very low workload, while the INIT_STREAM has index(es) with a high
    workload. In such a case, the INIT_STREAM may not be able to catch
    up during the catchup phase.

    This patch adds the ability for the timekeeper to identify
    long-running stream merges and slow down the MAINT_STREAM. As the
    MAINT_STREAM actively serves scans, slowing it down adds to scan
    latency, so this action is taken incrementally and only up to a
    configurable maximum interval.
    
    New configs:
    timekeeper.mergePhase.maxTimerInterval (default: 100ms, 0 disables)
    timekeeper.mergePhase.tsQueueThreshold (default: 500)
    timekeeper.maintStream.forcedDelay (default: 0)
    
    The last config allows setting an external throttling delay for the
    MAINT_STREAM in case automatic throttling is not sufficient.
    
    Change-Id: I71bca2810de85277516d9c4e229f1f5318f3ad80
    deepkaran committed Dec 25, 2025
    66b619a

Commits on Jan 8, 2026

  1. [BP to 8.0.1] MB-69908 Skip vector sampling for indexes with recovered codebooks
    
    BP from MB-68640
    
    During shard-based rebalance, vector index codebooks are transferred
    along with index data. The destination node recovers the codebook from
    disk. However, the training flow still fetched sample vectors from KV
    even though they were never used - wasting resources.
    
    Changes:
    - Add instHasTrainedCodebook() to check if any slice has a trained
      codebook using slice.IsTrained() (avoids memory allocation)
    - Skip sampling when defn-level reuse or a trained slice can supply
      a codebook; only call FetchSampleVectorsForIndexes for instances
      that actually need it
    - Fix: Set successMap to nil on early return to avoid duplicate
      reporting in retry scenarios
    
    Cases handled:
    - Rebalance/Resume: trained slice on this instance, its real inst,
      or other instances of the defn lets all defn instances skip sampling
    - ALTER INDEX: trained active replicas of the defn, or trained slices
      on this instance, let the defn skip sampling
    - Mixed batch: if sampling fails, codebook-backed instances continue;
      if all skip sampling, no KV fetch occurs
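
    A sketch of the codebook check described above; slice here is a
    hypothetical stand-in for the real slice type exposing IsTrained().

```go
package main

import "fmt"

// slice is a hypothetical stand-in for an index slice exposing the
// IsTrained check mentioned in the commit.
type slice struct{ trained bool }

func (s slice) IsTrained() bool { return s.trained }

// instHasTrainedCodebook mirrors the helper described above: it scans
// the instance's slices and returns true if any already holds a trained
// codebook, so sample vectors need not be fetched from KV.
func instHasTrainedCodebook(slices []slice) bool {
	for _, s := range slices {
		if s.IsTrained() {
			return true
		}
	}
	return false
}

func main() {
	recovered := []slice{{trained: false}, {trained: true}}
	fresh := []slice{{trained: false}}
	fmt.Println(instHasTrainedCodebook(recovered)) // trained slice: skip sampling
	fmt.Println(instHasTrainedCodebook(fresh))     // no codebook: fetch sample vectors
}
```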
    
    Change-Id: I8ca06178d3b4d718525cc52850e0b5657a89d925
    Nischal1729 committed Jan 8, 2026
    63951c3

Commits on Jan 12, 2026

  1. [BP to 8.0.1] MB-69909 Reject merge if source/target instances are in training phase
    
    BP from MB-68544
    
    If source or target index instances are in the training phase,
    reject the merge. The indexer will trigger a merge after
    training is done for those instances (if the keyspace is idle,
    the indexer needs to force a merge after training is done
    to properly merge the erroneous instances).
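
    The merge decision described above can be sketched as follows; the
    TrainingState values and canMerge are hypothetical names based on the
    states mentioned in the commit.

```go
package main

import (
	"errors"
	"fmt"
)

// TrainingState is a hypothetical enum for the states named in the commit.
type TrainingState int

const (
	TrainingNotStarted TrainingState = iota
	TrainingInProgress
	TrainingCompleted
)

var errRetryMerge = errors.New("merge rejected: instance in training phase")

// canMerge applies the two rules above: reject (to retry later) while
// either side is still training, and fail rebalance if the final
// training states of source and target disagree.
func canMerge(src, tgt TrainingState) error {
	if src == TrainingInProgress || tgt == TrainingInProgress {
		return errRetryMerge // indexer re-triggers merge after training
	}
	if src != tgt {
		return errors.New("merge failed: source/target training states disagree")
	}
	return nil
}

func main() {
	fmt.Println(canMerge(TrainingInProgress, TrainingCompleted))
	fmt.Println(canMerge(TrainingCompleted, TrainingCompleted))
	fmt.Println(canMerge(TrainingNotStarted, TrainingCompleted))
}
```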
    
    After training, if source and target do not agree on the same
    training state, then fail rebalance. This can happen in the
    following case:
    
    1. Real instance got build request and training initiated
    2. A proxy (p1) moved to the node and it started build
    3. Real instance did not find any documents and marked
       for training error
    4. While proxy p1 training is in progress, another proxy
       p2 started training where both realInst and p2 have
       started to train
    5. p1 moves to TRAINING_NOT_STARTED but the training for
       realInst and p2 succeeds now. So, they will be moved
       to TRAINING_COMPLETED state
    6. In this case, we can not merge realInst and proxy (p1)
       and rebalance would fail. Although this is a rare race
       condition, it is better to handle it.
    
    A retry of rebalance should fix the issue, as the documents were
    added in the middle of training the different proxies.
    
    Change-Id: I01c704a9eb0f2431304d426536dab84fc8ca941a
    varunv-cb authored and Nischal1729 committed Jan 12, 2026
    1e8e45a

Commits on Jan 14, 2026

  1. MB-69935 call restoreShardDone immediately post recovery

    [BP to 8.0.1]
    
    bg:
    the expectation with restoreShardDone is that it will be called
    after recovery of all indexes; this signals plasma that the
    required indexes are done with restore and it can proceed
    with cleaning up dead instances, starting LSS cleaners, etc.;
    indexes whose shards have not had restoreShardDone called
    should not be used if the indexer crashes and bootstrap recovery
    happens, primarily because such indexes are expected not to be
    rebalance-active and will be cleaned up; this is not true for
    non-empty-node batched indexes, leading to the bug of shard
    corruption
    
    exp:
    to fix the semantic issues in calling the restoreShardDone API,
    call it once recovery of all indexes is complete, before we
    transition to the next state; this guarantees that the
    indexes which could be recovered post-crash are either from a
    shard which was marked done or will get deleted
    
    asmpt:
    plasma assumes that GSI will not recover indexes whose shards
    have not had restoreShardDone called before a crash;
    
    Change-Id: I1b75a34e62ce529d407faee301a2cb12cdcbc873
    Signed-off-by: Dhruvil Shah <dsdruvil8@gmail.com>
    NightWing1998 committed Jan 14, 2026
    d5e84ae

Commits on Jan 21, 2026

  1. MB-69935 dont call restoreShardDone out of rebalance context

    [BP to 8.0.1]
    
    bg: as described in the ticket, we should not be calling
    restoreShardDone outside of rebalance context; the last such
    call remaining is RestoreAndUnlockShard, which is called during
    rebalance cleanup to restore shards that are locked for recovery
    and pending ready; it can no longer happen that we have shards
    which are ready/pending ready with restoreShardDone not called;
    all shards that have not undergone restoreShardDone are expected
    to be dropped;

    exp:
    restoreShardDone is not required to be called in cleanup; we can
    have shards locked for recovery, but we cannot have shards which
    have not undergone `restoreShardDone`; hence, only unlock shards
    in such cases;
    
    tests:
    existing functional test in CI
    `TestVectorIndexShardRebalance/TestRebalanceCancelIndexerAfterRecovery`
    already tests for the behaviour we are aiming for;
    
    Change-Id: I366942977417c5f58d80d8e2bfc1b43a15bae3fe
    Signed-off-by: Dhruvil Shah <dsdruvil8@gmail.com>
    NightWing1998 committed Jan 21, 2026
    23c172c
  2. [BP 8.0.1] MB-69906: Log a warning and suppress error to storage manager

    Whenever the Bhive shard manager is not initialised (which can
    happen when no Bhive index has been created on the node), an error
    log is printed.

    Fix: for the storage-not-initialised error, print a warning and
    suppress the error when returning to the storage manager.
    Suppressing the error should not have any side effect, since the
    bhiveShards slice returned would be nil.
    
    Change-Id: I46d29d66a402aed57086cd76992221959e14ab75
    (cherry picked from commit 7366285)
    shivanshrustagi committed Jan 21, 2026
    1a5950b