docs(hep): harvester upgrade v2 #9161
Conversation
Hey folks,
Thanks for the big and well-written HEP.
A few concerns for open discussion:
(1) The alternative solution of the upgrade-repo.
(2) The decoupling from Rancher; some hidden dependencies might only be observed during the development/testing stage.
(3) Given Harvester's 4-month release cadence, as stated in the HEP, the stage-by-stage feature delivery needs more thought.
#### Stage 1 - Transition Period (version N-1 to N upgrades, where N is the debut minor version of Harvester Upgrade V2)

The inner workings of Upgrade Manager in this stage will be essentially the same as before; the significant difference is that it is built and packaged separately from the primary Harvester artifact. It will also have its own Deployment in contrast to the main Harvester controller manager Deployment. The main concerns will be:
It seems to introduce another requirement: upgrade-matrix management for Harvester/UpgradeShim <-> Upgrade Manager, to ensure their versions match.
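For illustration, a minimal sketch of what a standalone Upgrade Manager Deployment could look like, carrying explicit version metadata that such an upgrade-matrix check could consume. The names, labels, annotation key, and image are assumptions for this sketch, not the HEP's actual manifest:

```yaml
# Hypothetical standalone Upgrade Manager Deployment, separate from the main
# Harvester controller manager. The version label/annotation is an assumed
# convention a compatibility (upgrade-matrix) check could read.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: harvester-upgrade-manager
  namespace: harvester-system
  labels:
    app.kubernetes.io/name: harvester-upgrade-manager
    app.kubernetes.io/version: "v1.6.0"          # assumed version label
  annotations:
    # Assumed annotation declaring which Harvester versions this manager supports.
    upgrade.harvesterhci.io/supported-harvester-versions: ">=1.5.0 <1.7.0"
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: harvester-upgrade-manager
  template:
    metadata:
      labels:
        app.kubernetes.io/name: harvester-upgrade-manager
    spec:
      serviceAccountName: harvester-upgrade-manager
      containers:
        - name: upgrade-manager
          image: rancher/harvester-upgrade-manager:v1.6.0   # placeholder image
```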
#### Plan 1 - Static Pods with BackingImage Disk Files

The main idea is to leverage static pods, Longhorn BackingImage disk files, and hostPath volumes. The `minNumberOfCopies` field of the BackingImage for the ISO image is set to the number of nodes of the cluster (subject to change) to distribute the disk files to all nodes. By mounting the BackingImage disk file directly to a specific path on the host filesystem for each node, static pods on each node can access the ISO image content using the hostPath volume.
> The `minNumberOfCopies` field

Not sure it works effectively on a massive cluster with, say, 100 nodes.
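To make the discussion concrete, a rough sketch of the moving parts in Plan 1, assuming a downloaded ISO BackingImage and a static pod reading the exported disk file through a hostPath volume. The URL, host path, namespaces, and image names are placeholders, not the HEP's actual values:

```yaml
# Longhorn BackingImage distributing the ISO disk file to every node.
apiVersion: longhorn.io/v1beta2
kind: BackingImage
metadata:
  name: harvester-upgrade-iso
  namespace: longhorn-system
spec:
  sourceType: download
  sourceParameters:
    url: "https://releases.example.com/harvester-vN.iso"   # placeholder URL
  # Set to the cluster's node count so a copy lands on every node (per the HEP).
  minNumberOfCopies: 3
---
# Static pod (dropped into the node's static-pod manifest directory) that
# serves the ISO content mounted from the host filesystem.
apiVersion: v1
kind: Pod
metadata:
  name: upgrade-repo
  namespace: kube-system
spec:
  containers:
    - name: repo
      image: registry.example.com/upgrade-repo:dev   # placeholder image
      volumeMounts:
        - name: iso
          mountPath: /srv/iso
          readOnly: true
  volumes:
    - name: iso
      hostPath:
        # Assumed host path where the BackingImage disk file is exposed.
        path: /var/lib/harvester/upgrade/harvester-vN.iso
        type: File
```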
1. (Optional) Preloading the Upgrade Manager container image
1. Deploying Upgrade Manager

### Upgrade Repository
Neither Plan 1 nor Plan 2 seems robust enough.
1. Node drain and RKE2 upgrade, corresponding to the **drain** stage
1. OS upgrade, corresponding to the **post-drain** stage

Since the hook mechanism in Rancher v2prov is no longer usable, Upgrade Manager relies on Plan CRs backed by System Upgrade Controller to execute node-upgrade tasks. One hidden advantage of this change is that it allows the operating system upgrade to be separated from the overall upgrade, thus enabling true zero-downtime (for nodes) upgrades. Another is that it opens the way for users to have granular control over the node-upgrade order through node selectors. The new node-upgrade phase looks like the following:
> it allows the operating system upgrade to be separated from the overall upgrade

How would this work: does Harvester or the user make the decision? Is it simply skipped when not necessary, or selectively done in another upgrade?
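For reference, a sketch of how the node-upgrade phase could be expressed as a System Upgrade Controller Plan, with a node selector for user-controlled ordering and drain settings covering the drain stage. The label key, image, command, namespace, and version string are assumptions for illustration, not the HEP's actual Plan:

```yaml
# Illustrative SUC Plan mapping the HEP's drain / post-drain split onto a
# one-node-at-a-time upgrade task.
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: harvester-node-upgrade
  namespace: cattle-system
spec:
  concurrency: 1                  # one node at a time, toward zero node downtime
  version: "vN"                   # placeholder target version
  serviceAccountName: system-upgrade
  nodeSelector:
    matchLabels:
      # Assumed label users could set to control node-upgrade ordering.
      harvesterhci.io/upgrade-group: "batch-1"
  drain:                          # drain stage: cordon and evict workloads
    force: true
    ignoreDaemonSets: true
  upgrade:                        # post-drain stage: RKE2/OS upgrade work
    image: registry.example.com/node-upgrade:dev   # placeholder image
    command: ["/run-post-drain.sh"]                # placeholder script
```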
The challenge of switching to SUC Plans is that the node-upgrade phase is no longer executed as a control loop, but as a one-time task. Upgrade Manager needs to reconcile Plans instead of machine-plan Secrets, and there will be no other entity, such as the embedded Rancher (which abstracts the lifecycle management of downstream clusters), to organize the upgrade nuances.

> [!IMPORTANT]
> The infinite-retry vs. fail-fast question will be a key topic for discussion.
It does not seem like an easy/straightforward decision; we might need to copy those hacks or patches from Rancher to cover various cases.
Another significant aspect is installation. Do we rely on Rancher System Agent to bootstrap clusters? What are the impacts if we drop it entirely? Do we need to develop a new installation method to fill the gap left by the decommissioning of the Rancher System Agent?

It appears that we do rely on the Rancher System Agent to bootstrap nodes except for the initial one. If we remove the Rancher System Agent, there might not be issues in day 2 management; however, we cannot live without it when it comes to cluster bootstrapping. If we leave it as is, the Rancher System Agent generates numerous logs containing error messages (because it is no longer able to communicate with the Rancher Manager) and does nothing.
> because it is no longer able to communicate with the Rancher Manager

The embedded Rancher deployment is still inside Harvester; what is the reason it can't communicate? Is it due to the following?

> Decouple the cluster itself from the embedded Rancher Manager by removing the kubernetesVersion and rkeConfig fields from the local provisioning cluster CR
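For context, a sketch of what the decoupled local provisioning cluster CR could look like after removing those fields. This is illustrative only; whether an otherwise-empty spec is sufficient (or accepted by the webhooks) would need verification:

```yaml
# Local provisioning cluster CR with kubernetesVersion and rkeConfig removed,
# so the embedded Rancher Manager no longer drives RKE2/OS upgrades for it.
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: local
  namespace: fleet-local
spec: {}   # kubernetesVersion and rkeConfig removed per the decoupling step
```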
Problem:
Solution:
Related Issue(s):
#7101
#7112
Test plan:
Additional documentation or context