@jscalev jscalev commented Dec 8, 2025

Two separate issues caused RunCommand scripts to be re-executed, both during the upgrade from 1.3.18 to 1.3.26 and during the downgrade from 1.3.26 to 1.3.18.

For background, RC prevents reruns through a file named mrseq that contains the last sequence number executed. Version 1.3.17, which was never placed in production, introduced a regression that deleted that file during the extension upgrade.
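The rerun guard described above can be sketched roughly as follows. This is a minimal illustration, not the extension's actual code: the function names, the plain-integer file format, and the "run only if the sequence number is greater" rule are all assumptions made for the example.

```python
from pathlib import Path

MRSEQ_NAME = "mrseq"  # file name taken from the description; its format is assumed


def read_last_seq(state_dir: Path) -> int:
    """Return the last executed sequence number, or -1 if none is recorded."""
    try:
        return int((state_dir / MRSEQ_NAME).read_text().strip())
    except (FileNotFoundError, ValueError):
        # A missing (or unreadable) mrseq file looks like "nothing has run yet".
        return -1


def should_run(state_dir: Path, seq: int) -> bool:
    """Assumed rule: run only when seq is newer than the recorded one."""
    return seq > read_last_seq(state_dir)


def record_run(state_dir: Path, seq: int) -> None:
    """Persist seq as the last executed sequence number."""
    state_dir.mkdir(parents=True, exist_ok=True)
    (state_dir / MRSEQ_NAME).write_text(str(seq))
```

Under these assumptions, deleting the mrseq file makes `should_run` return true again for a sequence number that has already been executed, which is the rerun behavior this PR addresses.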

Normally during extension upgrade, these mrseq files must be migrated from one extension version to the next. However, the deletion occurred during the disable call, which executed before the upgrade. Therefore, the mrseq files were not migrated.

This issue was detected before 1.3.17 was released to production, but unfortunately the code that was actually released as 1.3.18 was the same as 1.3.17.

Besides no longer deleting the mrseq files, 1.3.18 should also have included a fix that "rehydrated" them. Because we cannot prevent legacy versions from deleting these files, the extension uses the .status files, which are still present, to recreate them.
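As a rough sketch of what rehydration could look like: the example below assumes status files are named `<sequence number>.status` and that the highest number present is the last one executed. Both are assumptions for illustration, and `rehydrate_mrseq` is a hypothetical helper, not the extension's real function.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed naming convention: "<sequence number>.status"
STATUS_RE = re.compile(r"^(\d+)\.status$")


def highest_status_seq(status_dir: Path) -> Optional[int]:
    """Return the largest sequence number among the .status files, if any."""
    seqs = []
    for entry in status_dir.iterdir():
        match = STATUS_RE.match(entry.name)
        if match:
            seqs.append(int(match.group(1)))
    return max(seqs, default=None)


def rehydrate_mrseq(status_dir: Path, mrseq_path: Path) -> bool:
    """Recreate a missing mrseq file from .status files.

    Returns True only when a new mrseq file was written.
    """
    if mrseq_path.exists() or not status_dir.is_dir():
        return False  # nothing to repair, or nothing to repair it from
    seq = highest_status_seq(status_dir)
    if seq is None:
        return False
    mrseq_path.write_text(str(seq))
    return True
```

An existing mrseq file is never overwritten in this sketch, so running it on a healthy installation is a no-op.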

As mentioned, the bits for 1.3.18 were the same as the faulty 1.3.17. Version 1.3.26 did contain the fixes meant for 1.3.18, but it also had code that restricted this rehydration only to versions it knew were faulty. Since it believed that 1.3.18 was correct, it did not apply the rehydration to this version.
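The effect of that version gate can be illustrated with a small sketch. The description does not enumerate the versions 1.3.26 treated as faulty, so the sets below are hypothetical: 1.3.17 is used only because it is the version named as faulty above, and `should_rehydrate` is an illustrative name.

```python
from typing import FrozenSet

# Hypothetical lists; the PR description does not enumerate the real ones.
KNOWN_FAULTY_BEFORE = frozenset({"1.3.17"})             # gate as shipped in 1.3.26 (assumed)
KNOWN_FAULTY_AFTER = KNOWN_FAULTY_BEFORE | {"1.3.18"}   # gate widened to cover 1.3.18


def should_rehydrate(from_version: str, known_faulty: FrozenSet[str]) -> bool:
    """Rehydrate only when upgrading from a version known to delete mrseq."""
    return from_version.strip() in known_faulty
```

With the narrower set, an upgrade from 1.3.18 skips rehydration even though 1.3.18 deletes the files; with the widened set it does not.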

As a result, when the extension was upgraded from 1.3.18 to 1.3.26, the mrseq files were deleted and never rehydrated. The extension therefore believed the commands had never run, and re-ran them.

The downgrade from 1.3.26 to 1.3.18 failed for a different reason. Version 1.3.26 correctly did not delete the mrseq files; the problem occurred later.

In the upgrade method, the RunCommand extension depends on the Guest Agent to tell it the following:

  • The extension version to which we're upgrading
  • The extension version from which we're upgrading

Unfortunately, a bug in the Linux agent always reports the higher extension version as the "to" version. The extension also had a bug of its own: it was reading the wrong environment variable (a separate variable that indicates which version the agent is calling, which for upgrades was always the higher one). Ultimately, though, the main issue is that the extension was told it was upgrading from 1.3.18 to 1.3.26, when the opposite was true.

As mentioned, version 1.3.26 correctly did not delete the mrseq files. However, because the extension believed it was upgrading from 1.3.18, it looked for the files under 1.3.18, where they had already been deleted during the upgrade described previously. Since nothing was found there, the mrseq files were not migrated.

The fixes are thus the following:

  • Widen the versions covered for rehydration to include 1.3.18
  • Since a fix to the Guest Agent is too costly and will take a long time to roll out, change the migration logic to look for mrseq files in both version directories and use what it finds to determine the to/from extension versions for the upgrade.
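The second fix could be sketched along these lines. This is an illustration under stated assumptions rather than the merged change: the helper names are hypothetical, the `*mrseq` glob is a guess at how the files are named, and treating the "both directories have files" and "neither has files" cases as undecidable is a choice made for the example.

```python
import shutil
from pathlib import Path
from typing import List, Optional, Tuple


def mrseq_files(version_dir: Path) -> List[Path]:
    """List mrseq files in a version directory (assumed pattern: '*mrseq')."""
    return sorted(version_dir.glob("*mrseq")) if version_dir.is_dir() else []


def resolve_direction(reported_from: Path, reported_to: Path) -> Optional[Tuple[Path, Path]]:
    """Pick (source, destination) from where the mrseq files actually are,
    instead of trusting the versions reported by the agent."""
    in_from = mrseq_files(reported_from)
    in_to = mrseq_files(reported_to)
    if in_from and not in_to:
        return reported_from, reported_to
    if in_to and not in_from:
        return reported_to, reported_from  # agent reported the versions swapped
    return None  # ambiguous or nothing to migrate


def migrate_mrseq(reported_from: Path, reported_to: Path) -> int:
    """Copy mrseq files to the directory that lacks them; return the count."""
    direction = resolve_direction(reported_from, reported_to)
    if direction is None:
        return 0
    source, destination = direction
    destination.mkdir(parents=True, exist_ok=True)
    files = mrseq_files(source)
    for path in files:
        shutil.copy2(path, destination / path.name)
    return len(files)
```

In the downgrade scenario described above, the agent reports "from 1.3.18 to 1.3.26" while the files live under 1.3.26; this sketch would copy them into the 1.3.18 directory regardless of the reported direction.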
