September 2023 Datacenter Switchover
Closed, Resolved, Public

Description

This is the meta task for the September 2023 Datacenter switchover (eqiad -> codfw).

Schedule

Switchover

Repooling: Wednesday, September 27th, 2023

Event Timeline

kamila triaged this task as Medium priority. Aug 30 2023, 3:31 PM

kamila@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all services in eqiad: Datacenter Switchover: eqiad repool - T345263 started.

Mentioned in SAL (#wikimedia-operations) [2023-09-27T14:08:01Z] <kamila@cumin1001> START - Cookbook sre.discovery.datacenter pool all services in eqiad: Datacenter Switchover: eqiad repool - T345263

Mentioned in SAL (#wikimedia-operations) [2023-09-27T14:22:31Z] <claime> Repooling eqiad services in progress - T345263

kamila@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all services in eqiad: Datacenter Switchover: eqiad repool - T345263 completed.

Mentioned in SAL (#wikimedia-operations) [2023-09-27T14:29:15Z] <kamila@cumin1001> END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) pool all services in eqiad: Datacenter Switchover: eqiad repool - T345263

Mentioned in SAL (#wikimedia-operations) [2023-09-27T16:09:27Z] <kamila_> Pooled back eqiad for traffic after the DC switchover (T345263)

All disruptive switchover-related work is finished and things are stable. The switchover went smoothly and had minimal user impact, while also uncovering issues that we need to know about in order to improve our infrastructure and processes. The read-only period lasted 2 min 22 s.

Thank you to everyone involved in the preparation, as well as everyone who was present during the switchover helping to monitor or fix issues. A special shoutout goes to everyone who contributed to MultiDC, which massively reduces switchover complexity, and to our DBAs, who do the really scary part. Thanks as well to Community Relations, who notified communities of the read-only window.

One of the problems we discovered during the switchover was a MW-on-k8s capacity issue in codfw, which my colleagues were able to address very quickly. This is the sort of issue that the switchover process is meant to find, so it worked as intended :-)