My team (Data Platform SRE) is responsible for a large amount of bare-metal hardware (~20% of all infra at WMF). As such, hardware errors are fairly common (see T361286 and T367598, among others). These failures can't be reported from the host itself, as the problem usually makes the host unbootable. The typical workflow might be something like this:
- [SRE] team gets ping alert
- [SRE] X amount of time triaging
- [SRE] Y amount of time manually creating a ticket and sending to DC Ops
- [DC Ops] internal routing
- [DC Ops] coordination with service owners
- [DC Ops] implementation
If we leverage the out-of-band (OOB) management interfaces for alerting, we could cut out the first four steps above and simply create a ticket for DC Ops with the correct tags.
Dell (and presumably HP/Supermicro) have SNMP, email, syslog, and Redfish integrations for alerting.
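To make the idea a bit more concrete, here is a rough sketch of how a Redfish health reading could be turned into a ticket. It is not a proposed implementation. It assumes we already have the JSON body of a Redfish `ComputerSystem` resource (the standard `Status.Health` / `Status.HealthRollup`, `Model`, and `SerialNumber` properties). The function name, the ticket fields, the hostname, and the tag are all made up for illustration, and the actual ticket-creation call is left out entirely.

```python
# Redfish health values that should result in a ticket (assumed policy).
ALERTING_HEALTH = ("Critical", "Warning")  # ordered worst-first


def ticket_from_redfish_system(hostname, system):
    """Build a ticket dict from a Redfish ComputerSystem payload.

    Returns None when neither Health nor HealthRollup reports a problem.
    `hostname` and the returned dict layout are hypothetical.
    """
    status = system.get("Status", {})
    reported = {status.get("Health"), status.get("HealthRollup")}

    # Pick the worst health value that was reported, if any.
    severity = next((h for h in ALERTING_HEALTH if h in reported), None)
    if severity is None:
        return None

    return {
        "title": f"Hardware health {severity} on {hostname}",
        "description": (
            f"Model: {system.get('Model', 'unknown')}\n"
            f"Serial: {system.get('SerialNumber', 'unknown')}\n"
            f"State: {status.get('State', 'unknown')}"
        ),
        # Placeholder tag; the real routing tags would come from DC Ops.
        "tags": ["hypothetical-dc-ops-tag"],
    }


if __name__ == "__main__":
    # Hand-written sample payload, not captured from a real host.
    sample = {
        "Model": "ExampleModel",
        "SerialNumber": "EXAMPLE123",
        "Status": {"State": "Enabled", "Health": "OK", "HealthRollup": "Critical"},
    }
    print(ticket_from_redfish_system("example1001", sample))
```

The same mapping would apply if the event arrived via SNMP trap, syslog, or email instead. Only the parsing step in front of it would change.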
Creating this ticket to:
- Gauge interest with DC Ops and the larger SRE teams
- If there's enough interest, evaluate OOB alerting features and create an implementation plan.