
Detect hardware failures/automatically create tickets for DC Ops
Open, Needs Triage, Public

Description

My team (Data Platform SRE) is responsible for a large amount of bare-metal hardware (~20% of all infra at WMF). As such, hardware errors are fairly common (see T361286 and T367598, among others). These failures can't be reported from the host itself, as the problem usually makes the host unbootable. The typical workflow might be something like this:

  1. [SRE] team gets ping alert
  2. [SRE] X amount of time triaging
  3. [SRE] Y amount of time manually creating a ticket and sending to DC Ops
  4. [DC Ops] internal routing
  5. [DC Ops] coordination with service owners
  6. [DC Ops] implementation

If we leverage the OOB interfaces for alerting, we could cut out steps 1-4 and simply create a ticket for DC Ops with the correct tags.

Dell (and presumably HP and Supermicro) has SNMP, email, syslog, and Redfish integrations for alerting.
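Of those integrations, Redfish is the one that can be polled rather than pushed. A minimal sketch of the filtering side, assuming a Dell iDRAC-style log-service path and the standard Redfish `LogEntry` `Severity` field (both are assumptions to verify per vendor, and this is not anyone's production code):

```python
# Hedged sketch: poll a BMC's Redfish SEL and keep only entries worth
# ticketing. Field names follow the DMTF Redfish LogEntry schema; the
# Dell iDRAC log-service path below is an assumption and varies by vendor.
import json
import urllib.request

SEL_PATH = "/redfish/v1/Managers/iDRAC.Embedded.1/LogServices/Sel/Entries"
ACTIONABLE = {"Critical", "Warning"}  # Redfish LogEntry Severity values

def actionable_entries(entries):
    """Keep only log entries whose Severity suggests a hardware fault."""
    return [e for e in entries if e.get("Severity") in ACTIONABLE]

def fetch_sel(bmc_host, opener):
    """Fetch SEL entries from one BMC over its mgmt interface.

    The opener should carry credentials from a secret store; per this
    thread, the mgmt password must never be saved on a production host.
    """
    with opener.open(f"https://{bmc_host}{SEL_PATH}", timeout=10) as resp:
        return json.load(resp).get("Members", [])
```

An opener built with `urllib.request.HTTPBasicAuthHandler` (fed from a secret store at runtime) would supply the credentials; the `actionable_entries` half is the part that stays vendor-neutral.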

Creating this ticket to:

  • Gauge interest from DC Ops and the larger SRE teams.
  • If there's enough interest, evaluate OOB alerting features and create an implementation plan.

Event Timeline

bking renamed this task from Hardware failures: consider alerting via SEL messages to Hardware failures: consider alerting based off SEL messages. Jun 17 2024, 5:42 PM

Per IRC conversation with @herron, there is a Prometheus redfish exporter that might be worth evaluating.

What's the end goal here?
If the host is up and running, it could alert on any error using local commands/logs; if it goes down, you get alerted because it's down.

The BMCs are on a separate network that is reachable only from very specific hosts (cumin and puppetmaster/server) and the management password MUST NOT be saved on any production host.

The end goal is to send hardware failures directly to DC Ops in a ticket instead of having them go through multiple teams for triage, manual ticket creation, etc. I'll rewrite the ticket to make it more clear.

Ideally, each failure would automatically create a ticket for DC Ops in the correct datacenter. This is the experience I'm accustomed to from prior jobs. Based on the number of hardware failures I've experienced personally, I believe this would save a lot of time, but I'll leave it up to the larger SRE teams (particularly DC Ops) to decide whether it's worth the effort.
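For the per-datacenter routing above, the site name is already encoded in WMF management FQDNs, so mapping a failing host to the right DC Ops queue can be a pure string lookup. A minimal sketch; the site list and the `ops-<site>` Phabricator project naming are assumptions to confirm:

```python
# Hedged sketch: derive a per-datacenter DC Ops tag from a management
# FQDN like host1001.mgmt.eqiad.wmnet. Both KNOWN_SITES and the
# "ops-<site>" tag format are assumptions, not verified project names.
KNOWN_SITES = {"eqiad", "codfw", "esams", "ulsfo", "eqsin", "drmrs"}

def dcops_tag(fqdn):
    """Return the DC Ops project tag for a mgmt FQDN, or None if unknown."""
    for label in fqdn.lower().split("."):
        if label in KNOWN_SITES:
            return f"ops-{label}"
    return None
```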

As far as implementation goes, I agree that we don't want to store the OOB passwords anywhere. I don't think that will be necessary; this page lists some possible options we could implement.

Looking longer term, I think it'd be generally worthwhile to support more than baseline blackbox checks on the mgmt interfaces, and I'm personally open to exploring something like the redfish exporter. But I think this would be a medium-to-large-sized project, since AIUI it will involve a decent-sized chunk of setup effort on the hardware itself, and offhand I'm not sure what percentage of hardware in the fleet supports the approach today. I'm also assuming we would run into some hardware vendor oddities/bugs along the way.

In terms of near-term actions, I think the thing to do right now is to focus on bringing the host ipmi exporter based alerting approach in T253810 towards completion. Like @Volans mentioned, it should capture the majority of hardware failure cases, and ipmi-sel/ipmi_exporter are deployed in production today.
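For reference, the SEL-based approach boils down to scanning `ipmi-sel` output for fault-looking events. A rough sketch, assuming freeipmi's pipe-separated default layout and a hypothetical keyword list (both would need tuning against real SEL data):

```python
# Hedged sketch: extract ticket-worthy events from `ipmi-sel` text output.
# The pipe-separated column layout mirrors freeipmi's default formatting,
# and FAULT_KEYWORDS is a made-up starting list, not a vetted one.
FAULT_KEYWORDS = ("fatal", "uncorrectable", "failure", "critical")

def faulty_events(sel_output):
    """Return (record_id, event) tuples for lines that look like hw faults."""
    hits = []
    for line in sel_output.splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) < 6 or not cols[0].isdigit():
            continue  # skip the header row and malformed lines
        event = cols[-1]
        if any(k in event.lower() for k in FAULT_KEYWORDS):
            hits.append((cols[0], event))
    return hits
```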

bking renamed this task from Hardware failures: consider alerting based off SEL messages to Detect hardware failures/automatically create tickets for DC Ops. Jun 18 2024, 2:15 PM
bking updated the task description.

@herron 100% agree that using the current software is more practical/achievable in the short term. If I can help on implementation let me know.

To make a pragmatic suggestion: how about, for right now, we don't worry about the long term, new exporters, or making it perfect, and just run the command from the cumin hosts, where it already works, without having to worry about the password and all that?

It seems like we could just run the command on the cumin hosts via NRPE and copy/paste the existing Icinga eventhandler logic that we already use for automatic RAID tickets, and we'd effectively have achieved the end goal.

Then we can always make it better in the future.
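Sketching the NRPE idea above: the check on the cumin host only needs to follow the Nagios plugin exit-code convention so the existing eventhandler machinery can react to it. A hypothetical shape, with the keyword matching and invocation details as placeholders rather than a worked design:

```python
# Hedged sketch of the NRPE-side check: map SEL text to Nagios plugin
# semantics (exit 0 = OK, 2 = CRITICAL) so an Icinga eventhandler, like
# the existing automatic-RAID-ticket one, can take over from there.
OK, CRITICAL = 0, 2

def check_sel(sel_output):
    """Turn raw SEL text into a (exit_code, status_message) pair."""
    bad = [line for line in sel_output.splitlines()
           if "critical" in line.lower() or "uncorrectable" in line.lower()]
    if bad:
        return CRITICAL, f"CRITICAL: {len(bad)} SEL event(s), first: {bad[0].strip()}"
    return OK, "OK: no critical SEL events"

# In the real check, sel_output would come from running ipmi-sel against
# the target BMC from a cumin host, where credentials are already handled:
#   code, msg = check_sel(subprocess.run([...], ...).stdout)
#   print(msg); sys.exit(code)
```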

> Ideally, each failure would automatically create a ticket for DC Ops in the correct datacenter. This is the experience I'm accustomed to from prior jobs. Based on the number of hardware failures I've experienced personally, I believe this would save a lot of time, but I'll leave it up to the larger SRE teams (particularly DC Ops) to decide whether it's worth the effort.

I would not be too keen to implement this, personally. Perhaps the Elasticsearch cluster would be a relatively good candidate, but I would certainly be hesitant to take myself out of the loop for automatic maintenance of Hadoop or Blazegraph or Ceph systems, for example. I've already been bitten by an automatically created RAID ticket on a Hadoop worker (T358691#9590169), where a RAID0 volume was hot-swapped without my knowing about it. That automation didn't save any time.

My experience is that when hardware does fail, it does so in many different ways, and it is unusual to see exactly the same type of condition twice. Therefore, while I would be happy to see automatic tickets created from SEL entries, I believe these tickets should be for the service owner to investigate in the first instance, rather than DC Ops.