
BTullis (Ben)
Staff SRE

User Details

User Since
Jun 29 2021, 9:56 AM (178 w, 4 d)
Availability
Available
IRC Nick
btullis
LDAP User
Btullis
MediaWiki User
BTullis (WMF)

Recent Activity

Thu, Nov 28

BTullis renamed T379385: Upgrade Hadoop to version 3.3.6 and Hive to version 4.0.1 from Upgrade Hadoop to version 3.3.x and Hive to version 3.1.x to Upgrade Hadoop to version 3.3.6 and Hive to version 4.0.1.
Thu, Nov 28, 11:04 AM · Epic, Data-Engineering, Data-Platform-SRE
BTullis triaged T381087: Resurrect the Hadoop cluster in the analytics project in WMCS as High priority.
Thu, Nov 28, 10:51 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis moved T381087: Resurrect the Hadoop cluster in the analytics project in WMCS from Backlog - project to In Progress on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.
Thu, Nov 28, 10:50 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis edited projects for T381087: Resurrect the Hadoop cluster in the analytics project in WMCS, added: Data-Platform-SRE (2024.11.09 - 2024.11.29); removed Data-Platform-SRE.
Thu, Nov 28, 10:50 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis created T381087: Resurrect the Hadoop cluster in the analytics project in WMCS.
Thu, Nov 28, 10:50 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis added a comment to T379385: Upgrade Hadoop to version 3.3.6 and Hive to version 4.0.1.

Hive version 4.0.1 has been added to bigtop in this patch: https://github.com/apache/bigtop/commit/d8459aabfcc8d64bc3121bedc1b2c55ca3788270
We can probably cherry-pick this into our build.
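
A sketch of that cherry-pick, assuming our build works from a local clone of bigtop with the upstream repository added as a remote:

git remote add apache https://github.com/apache/bigtop.git
git fetch apache
git cherry-pick d8459aabfcc8d64bc3121bedc1b2c55ca3788270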

Thu, Nov 28, 9:45 AM · Epic, Data-Engineering, Data-Platform-SRE

Wed, Nov 27

BTullis added a comment to T380866: Build bigtop 3.3 packages for bullseye and bookworm.

I have built and packaged bigtop 3.3.0 packages for bullseye.

Wed, Nov 27, 1:53 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)

Tue, Nov 26

BTullis closed T379303: Requesting access to analytics-privatedata-users group, sql_lab role, Kerberos Principal for Khantstop as Resolved.

Great! Glad to hear that it's working. If you think our docs could be clearer, or if there is anything missing from them, please feel free to let me know, or simply edit them.

Tue, Nov 26, 7:52 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis triaged T380866: Build bigtop 3.3 packages for bullseye and bookworm as High priority.
Tue, Nov 26, 2:21 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis moved T379303: Requesting access to analytics-privatedata-users group, sql_lab role, Kerberos Principal for Khantstop from Backlog - operations to In Progress on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.
Tue, Nov 26, 1:59 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis claimed T379303: Requesting access to analytics-privatedata-users group, sql_lab role, Kerberos Principal for Khantstop.
Tue, Nov 26, 1:58 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis removed a project from T379303: Requesting access to analytics-privatedata-users group, sql_lab role, Kerberos Principal for Khantstop: Data-Engineering.
Tue, Nov 26, 1:58 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis added a comment to T379303: Requesting access to analytics-privatedata-users group, sql_lab role, Kerberos Principal for Khantstop.

@Khantstop - could you possibly paste some command output or a screenshot, please?
This is what I see when I go to stat1008 and run kinit

(base) btullis@marlin:~$ ssh stat1008.eqiad.wmnet 
Linux stat1008 5.10.0-30-amd64 #1 SMP Debian 5.10.218-1 (2024-06-01) x86_64
Debian GNU/Linux 11 (bullseye)
stat1008 is a Statistics & Analytics cluster explorer (private data access, no local compute) (statistics::explorer)
stat1008 is statistics::explorer
Bare Metal host on site eqiad and rack A6
This host is capable of Kerberos authentication in the WIKIMEDIA realm.
For more info: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide
The last Puppet run was at Tue Nov 26 13:24:59 UTC 2024 (26 minutes ago). 
Last Puppet commit: (b9a0aff6c5) David Caro - cloudcephmon1004: provision as mon
Debian GNU/Linux 11 auto-installed on Thu May 23 11:03:16 UTC 2024.
Last login: Tue Nov 26 13:51:14 2024 from 2a02:ec80:600:1:185:15:58:6
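
The kinit step itself (elided above) is just the following; this is a sketch from memory rather than a verbatim paste, assuming the WIKIMEDIA realm shown in the host banner:

btullis@stat1008:~$ kinit
Password for btullis@WIKIMEDIA:
btullis@stat1008:~$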
Tue, Nov 26, 1:58 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis updated the task description for T380866: Build bigtop 3.3 packages for bullseye and bookworm.
Tue, Nov 26, 1:36 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis claimed T380866: Build bigtop 3.3 packages for bullseye and bookworm.
Tue, Nov 26, 1:36 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis created T380866: Build bigtop 3.3 packages for bullseye and bookworm.
Tue, Nov 26, 1:35 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis added a comment to T379385: Upgrade Hadoop to version 3.3.6 and Hive to version 4.0.1.

A point of note is that Hive 3.x has now been classified as EOL as of 2024/10/08 - https://hive.apache.org/general/downloads/

The 4.0.1 release is now considered the stable release.

Tue, Nov 26, 12:04 PM · Epic, Data-Engineering, Data-Platform-SRE
BTullis raised the priority of T379748: Draft a project plan for the Hadoop version 3 upgrade from Medium to High.
Tue, Nov 26, 11:12 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Data-Engineering
BTullis raised the priority of T379385: Upgrade Hadoop to version 3.3.6 and Hive to version 4.0.1 from Medium to High.
Tue, Nov 26, 11:12 AM · Epic, Data-Engineering, Data-Platform-SRE
BTullis moved T380835: Alert in need of triage: SmartNotHealthy (instance stat1011:9100) from Backlog - project to Backlog - operations on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.
Tue, Nov 26, 10:12 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20), sre-alert-triage
BTullis edited projects for T380835: Alert in need of triage: SmartNotHealthy (instance stat1011:9100), added: Data-Platform-SRE (2024.11.09 - 2024.11.29); removed Data-Platform-SRE.
Tue, Nov 26, 10:11 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20), sre-alert-triage

Mon, Nov 25

BTullis added a comment to T380417: Test if an existing conda environment with Spark 3.1.2 clients works fine with Spark 3.5.3.

I added a commit to the airflow-dags pyspark upgrade MR here: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/commit/a30add30d519697c59431d940985f135a8586f3c

Mon, Nov 25, 3:03 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Dumps 2.0, Data-Engineering (Q2 2024 October 1st - December 31th)
BTullis closed T380477: Jupyter/Conda: spawn new server with 'create and use new cloned env' times out as Resolved.

Tentatively resolving the issue, but please do feel free to reopen it @mpopov if things are not improved by this.

Mon, Nov 25, 2:56 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis closed T365878: Test whether or not CPU performance governor helps Hadoop Performance as Resolved.

Looks like this is OK. The graphs here are a bit too spiky to see any immediate effect, but I'll be on the lookout for any more evidence of performance gains in the near future.

Mon, Nov 25, 2:56 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis closed T365878: Test whether or not CPU performance governor helps Hadoop Performance, a subtask of T362922: Audit/consider enabling CPU performance governor on DPE SRE-owned hosts, as Resolved.
Mon, Nov 25, 2:55 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis moved T380040: Deploy the spark shuffler version 3.5.3 to the test cluster from Needs Review to To Be Deployed on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.
Mon, Nov 25, 1:58 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Patch-For-Review
BTullis moved T380477: Jupyter/Conda: spawn new server with 'create and use new cloned env' times out from Blocked/Waiting to Done on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.
Mon, Nov 25, 1:58 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis moved T365878: Test whether or not CPU performance governor helps Hadoop Performance from In Progress to Done on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.
Mon, Nov 25, 1:56 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis moved T365878: Test whether or not CPU performance governor helps Hadoop Performance from Blocked/Waiting to In Progress on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.

I have received approval from @Jclark-ctr out of band, so I'll go ahead and deploy this change.

Mon, Nov 25, 1:54 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis updated the task description for T380731: Reboots of Bookworm systems which use 6.1.115.
Mon, Nov 25, 1:47 PM · Infrastructure-Foundations, SRE
BTullis moved T380674: Log aggregation is failing for the analytics user due to too many files in /var/log/hadoop-yarn/apps/analytics/logs on HDFS from In Progress to Done on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.
Mon, Nov 25, 1:45 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Data-Engineering
BTullis closed T380674: Log aggregation is failing for the analytics user due to too many files in /var/log/hadoop-yarn/apps/analytics/logs on HDFS as Resolved.

I have also failed the namenode services back from an-master1004 to an-master1003.

btullis@an-master1003:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1004-eqiad-wmnet an-master1003-eqiad-wmnet
Failover to NameNode at an-master1003.eqiad.wmnet/10.64.36.15:8040 successful

So I think we can close this task.

Mon, Nov 25, 1:45 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Data-Engineering
BTullis closed T379571: Kernel error Server an-redacteddb1001 may have kernel errors as Resolved.

I have removed the remaining resources manually. Some had been removed anyway by purge => true, but I think everything is OK now.

btullis@an-redacteddb1001:~$ sudo rm /etc/ssh/userkeys/root.d/cloud_cumin /lib/systemd/system/prometheus-node-kernel-panic.timer /etc/ferm/conf.d/10_ssh_from_cloudcumin_masters /etc/logrotate.d/prometheus-node-kernel-panic /lib/systemd/system/prometheus-node-kernel-panic.service /etc/rsyslog.d/40-prometheus-node-kernel-panic.conf /usr/local/bin/prometheus-node-kernel-panic 
rm: cannot remove '/etc/ssh/userkeys/root.d/cloud_cumin': No such file or directory
rm: cannot remove '/etc/ferm/conf.d/10_ssh_from_cloudcumin_masters': No such file or directory
rm: cannot remove '/etc/rsyslog.d/40-prometheus-node-kernel-panic.conf': No such file or directory
btullis@an-redacteddb1001:~$ sudo rmdir /var/log/prometheus-node-kernel-panic
btullis@an-redacteddb1001:~$ sudo systemctl daemon-reload 
btullis@an-redacteddb1001:~$
Mon, Nov 25, 1:39 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), cloud-services-team
BTullis claimed T379571: Kernel error Server an-redacteddb1001 may have kernel errors.
Mon, Nov 25, 1:35 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), cloud-services-team
BTullis added a comment to T380417: Test if an existing conda environment with Spark 3.1.2 clients works fine with Spark 3.5.3.

Over at conda-analytics, I don't see anything in the changelog that suggests we officially upgraded Iceberg to 1.3.1, so perhaps this test cluster behaviour was a manual change?

CC @BTullis

Mon, Nov 25, 12:24 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Dumps 2.0, Data-Engineering (Q2 2024 October 1st - December 31th)
BTullis renamed T380733: Change the Airflow email From address so that it refers to the instance name instead of the k8s cluster from Change the Airflow email From address so that it refers to the instance nameinstead of the k8s cluster to Change the Airflow email From address so that it refers to the instance name instead of the k8s cluster.
Mon, Nov 25, 12:10 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis created T380733: Change the Airflow email From address so that it refers to the instance name instead of the k8s cluster.
Mon, Nov 25, 12:09 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis added a comment to T380674: Log aggregation is failing for the analytics user due to too many files in /var/log/hadoop-yarn/apps/analytics/logs on HDFS.

This morning the number of directories within the analytics user's log directory on HDFS has risen above the previous threshold again, to 1069767.

btullis@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics hdfs dfs -count /var/log/hadoop-yarn/apps/analytics/logs
     1069767      3112858      2624551317505 /var/log/hadoop-yarn/apps/analytics/logs

The fact that log aggregation is still working shows that the new setting must be taking effect, which is a great sign.

Mon, Nov 25, 11:06 AM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Data-Engineering

Sat, Nov 23

BTullis added a comment to T380674: Log aggregation is failing for the analytics user due to too many files in /var/log/hadoop-yarn/apps/analytics/logs on HDFS.

In the meantime, the number of directory entries in /var/log/hadoop-yarn/apps/analytics/logs dropped below the magic number of 1048576, to 1038184, so I don't yet know whether or not the configuration change works.

btullis@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics hdfs dfs -count /var/log/hadoop-yarn/apps/analytics/logs
     1038184      3037830      1959450645456 /var/log/hadoop-yarn/apps/analytics/logs

I've cleaned up the failed DAGs as best I could and I will follow up with puppet changes on Monday. Currently, puppet is disabled on an-test-master100[1-2] and an-master100[3-4].

Sat, Nov 23, 4:57 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Data-Engineering
BTullis lowered the priority of T380674: Log aggregation is failing for the analytics user due to too many files in /var/log/hadoop-yarn/apps/analytics/logs on HDFS from Unbreak Now! to High.

I had to bump the Java heap on the namenode process from 110 GB to 164 GB before I could reliably fail over to an-master100[3-4].
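
For the record, the namenode heap is set via the JVM options for the NameNode process; in a stock Hadoop install this would live in hadoop-env.sh, roughly as follows (a sketch only; in our case the value is templated by puppet):

export HADOOP_NAMENODE_OPTS="-Xms164g -Xmx164g ${HADOOP_NAMENODE_OPTS}"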

Sat, Nov 23, 4:54 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Data-Engineering
BTullis edited projects for T380674: Log aggregation is failing for the analytics user due to too many files in /var/log/hadoop-yarn/apps/analytics/logs on HDFS, added: Data-Platform-SRE (2024.11.09 - 2024.11.29); removed Data-Platform-SRE.
Sat, Nov 23, 12:41 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Data-Engineering
BTullis added a comment to T380674: Log aggregation is failing for the analytics user due to too many files in /var/log/hadoop-yarn/apps/analytics/logs on HDFS.

This didn't work. I've now removed the /etc/hadoop/conf/hdfs-default.xml file on all servers and manually added the configuration property to the end of /etc/hadoop/conf/hdfs-site.xml - then disabled puppet.

Sat, Nov 23, 12:40 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Data-Engineering
BTullis added a comment to T380674: Log aggregation is failing for the analytics user due to too many files in /var/log/hadoop-yarn/apps/analytics/logs on HDFS.

I have looked into the options for increasing the limits on the number of files. We don't currently have anything configured, but there is a configuration parameter we can use.
It seems to be defined in the /etc/hadoop/hdfs-default.xml file, which we don't currently manage with puppet. It is available for our version of Hadoop.
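
For reference, the parameter in question is presumably dfs.namenode.fs-limits.max-directory-items, whose default of 1048576 matches the magic number mentioned above. A sketch of the override for hdfs-site.xml (the new value here is illustrative, not final):

<property>
  <name>dfs.namenode.fs-limits.max-directory-items</name>
  <value>3200000</value>
</property>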

Sat, Nov 23, 12:03 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Data-Engineering
BTullis triaged T380674: Log aggregation is failing for the analytics user due to too many files in /var/log/hadoop-yarn/apps/analytics/logs on HDFS as Unbreak Now! priority.
Sat, Nov 23, 11:58 AM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Data-Engineering
BTullis created T380674: Log aggregation is failing for the analytics user due to too many files in /var/log/hadoop-yarn/apps/analytics/logs on HDFS.
Sat, Nov 23, 11:58 AM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Data-Engineering

Fri, Nov 22

BTullis moved T380477: Jupyter/Conda: spawn new server with 'create and use new cloned env' times out from In Progress to Blocked/Waiting on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.
Fri, Nov 22, 6:17 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis added a comment to T380477: Jupyter/Conda: spawn new server with 'create and use new cloned env' times out.

Are you OK with this solution of removing or backing up ~/.conda/pkgs on hosts where you are experiencing this issue?
I can ask the same question to @tchin - if you're still experiencing this issue, too.
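
For clarity, the workaround amounts to something like this (a sketch; the backup suffix is arbitrary):

$ mv ~/.conda/pkgs ~/.conda/pkgs.bak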

Fri, Nov 22, 6:17 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis added a comment to T380477: Jupyter/Conda: spawn new server with 'create and use new cloned env' times out.

@mpopov - It seems to be something to do with the .conda/pkgs directory in your home directory.

Fri, Nov 22, 6:12 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis added a comment to T380477: Jupyter/Conda: spawn new server with 'create and use new cloned env' times out.

I can reproduce this on the terminal.

btullis@stat1010:~$ sudo -u bearloga bash
bearloga@stat1010:/srv/home/btullis$ conda-analytics-clone btullis-T380477
<snip snip>
Installing the conda and conda-libmamba-solver packages to the newly cloned environment: btullis-T380477
Channels:
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done
Fri, Nov 22, 6:03 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis closed T380494: java.io.IOException: Permission denied when trying to access the hadoop cluster as Resolved.
Fri, Nov 22, 6:02 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis moved T380566: Increase the number of partitions for the webrequest_frontend topics. from Backlog - project to Done on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.
Fri, Nov 22, 5:53 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Data-Engineering (Q2 2024 October 1st - December 31th)
BTullis moved T379023: Create WDQS split endpoints for internal traffic and reconfigure clients to use the new endpoints from Backlog - project to In Progress on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.
Fri, Nov 22, 3:20 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Wikidata
BTullis added a comment to T380494: java.io.IOException: Permission denied when trying to access the hadoop cluster.

Added some guidance to Wikitech here: https://wikitech.wikimedia.org/w/index.php?title=Data_Platform/Systems/Airflow/Developer_guide&oldid=2246773

Fri, Nov 22, 1:28 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis added a comment to T380494: java.io.IOException: Permission denied when trying to access the hadoop cluster.

It's also worth noting that the hive.exec.scratchdir is configurable: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27362070#AdminManualConfiguration-HiveConfigurationVariables
...so we could change that as well.
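
If we went that route, it would be a hive-site.xml change along these lines (a sketch; the path shown is the upstream default, used here only as a placeholder):

<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive</value>
</property>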

Fri, Nov 22, 1:18 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis added a comment to T380494: java.io.IOException: Permission denied when trying to access the hadoop cluster.

I've found out what is causing this. There is an easy enough short-term fix, but a few options about how to stop it happening again.

Fri, Nov 22, 1:09 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)

Thu, Nov 21

BTullis added a comment to T380494: java.io.IOException: Permission denied when trying to access the hadoop cluster.

Weird. Once again, it works for me on stat1008. I think that this must be something to do with the contents of your home directory.

btullis@stat1008:~$ spark3-sql
Running /opt/conda-analytics/bin/spark-sql $@
SPARK_HOME: /opt/conda-analytics/lib/python3.10/site-packages/pyspark
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics/bin/python3
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/21 18:10:08 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12000. Attempting port 12001.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12001. Attempting port 12002.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12002. Attempting port 12003.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12003. Attempting port 12004.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12004. Attempting port 12005.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12005. Attempting port 12006.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12006. Attempting port 12007.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12007. Attempting port 12008.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12008. Attempting port 12009.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12009. Attempting port 12010.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12010. Attempting port 12011.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4045. Attempting port 4046.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4046. Attempting port 4047.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4047. Attempting port 4048.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4048. Attempting port 4049.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4049. Attempting port 4050.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4050. Attempting port 4051.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13000. Attempting port 13001.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13001. Attempting port 13002.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13002. Attempting port 13003.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13003. Attempting port 13004.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13004. Attempting port 13005.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13005. Attempting port 13006.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13006. Attempting port 13007.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13007. Attempting port 13008.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13008. Attempting port 13009.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13009. Attempting port 13010.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13010. Attempting port 13011.
ADD JAR file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar
24/11/21 18:10:12 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Added [file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar] to class path
Added resources: [file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar]
Spark master: local[*], Application Id: local-1732212609631
spark-sql (default)>
Thu, Nov 21, 6:11 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis added a comment to T380477: Jupyter/Conda: spawn new server with 'create and use new cloned env' times out.

So the difference would seem to be that in @mpopov's environment we see:

Nov 21 14:53:50 stat1010 jupyterhub-conda-singleuser[493333]: Collecting package metadata (repodata.json): ...working... done
Nov 21 14:53:51 stat1010 jupyterhub-conda-singleuser[493333]: Solving environment: ...working... done
Nov 21 14:53:52 stat1010 jupyterhub-conda-singleuser[493333]: RuntimeError('EnforceUnusedAdapter called with url https://repo.anaconda.com/pkgs/main/noarch/conda-libmamba-solver-24.7.0-pyhd3eb1b0_0.conda\nThis >
Nov 21 14:53:52 stat1010 jupyterhub-conda-singleuser[493333]: RuntimeError('EnforceUnusedAdapter called with url https://repo.anaconda.com/pkgs/main/linux-64/certifi-2024.8.30-py310h06a4308_0.conda\nThis comman>
Nov 21 14:53:52 stat1010 jupyterhub-conda-singleuser[493333]: RuntimeError('EnforceUnusedAdapter called with url https://repo.anaconda.com/pkgs/main/linux-64/ca-certificates-2024.9.24-h06a4308_0.conda\nThis com>

Whereas in mine we see:

Nov 21 16:53:01 stat1010 jupyterhub-conda-singleuser[540785]: Collecting package metadata (repodata.json): ...working... WARNING conda.conda_libmamba_solver.index:_json_path_to_repo_info(281): No repodata found>
Nov 21 16:53:01 stat1010 jupyterhub-conda-singleuser[540785]: WARNING conda.conda_libmamba_solver.index:_json_path_to_repo_info(281): No repodata found for channel defaults. Solve will fail.
Nov 21 16:53:01 stat1010 jupyterhub-conda-singleuser[540785]: WARNING conda.conda_libmamba_solver.index:_json_path_to_repo_info(281): No repodata found for channel defaults. Solve will fail.
Nov 21 16:53:01 stat1010 jupyterhub-conda-singleuser[540785]: WARNING conda.conda_libmamba_solver.index:_json_path_to_repo_info(281): No repodata found for channel defaults. Solve will fail.
Thu, Nov 21, 5:03 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis added a comment to T380477: Jupyter/Conda: spawn new server with 'create and use new cloned env' times out.

This succeeded for me.

Thu, Nov 21, 4:57 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis added a comment to T380477: Jupyter/Conda: spawn new server with 'create and use new cloned env' times out.

Currently trying this out on stat1010, creating a new cloned environment and tailing the systemd unit's journal.

btullis@stat1010:~$ journalctl -u jupyter-btullis-singleuser-conda-analytics.service -f
-- Journal begins at Thu 2024-11-21 05:09:27 UTC. --
Nov 21 16:52:35 stat1010 systemd[1]: Started /bin/bash -c cd /home/btullis && exec /etc/jupyterhub-conda/jupyterhub-singleuser-conda-env.sh __NEW__ --port=48013 --SingleUserNotebookApp.default_url=/lab.
Nov 21 16:52:35 stat1010 jupyterhub-conda-singleuser[540711]: Creating new cloned conda env 2024-11-21T16.52.35_btullis...
Nov 21 16:52:39 stat1010 jupyterhub-conda-singleuser[540732]: Source:      /opt/conda-analytics
Nov 21 16:52:39 stat1010 jupyterhub-conda-singleuser[540732]: Destination: /home/btullis/.conda/envs/2024-11-21T16.52.35_btullis
Nov 21 16:52:39 stat1010 jupyterhub-conda-singleuser[540732]: The following packages cannot be cloned out of the root environment:
Nov 21 16:52:39 stat1010 jupyterhub-conda-singleuser[540732]:  - conda-forge/linux-64::conda-23.10.0-py310hff52083_1
Nov 21 16:52:39 stat1010 jupyterhub-conda-singleuser[540732]:  - conda-forge/noarch::conda-libmamba-solver-24.7.0-pyhd8ed1ab_0
Nov 21 16:52:39 stat1010 jupyterhub-conda-singleuser[540732]: Packages: 225
Nov 21 16:52:39 stat1010 jupyterhub-conda-singleuser[540732]: Files: 1255
Nov 21 16:52:39 stat1010 jupyterhub-conda-singleuser[540732]: Downloading and Extracting Packages: ...working... done
Nov 21 16:52:39 stat1010 jupyterhub-conda-singleuser[540732]: Downloading and Extracting Packages: ...working... done
Nov 21 16:52:40 stat1010 jupyterhub-conda-singleuser[540732]: Preparing transaction: ...working... done
Thu, Nov 21, 4:53 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis added a comment to T380494: java.io.IOException: Permission denied when trying to access the hadoop cluster.

I'm currently unable to reproduce this on stat1008.

Thu, Nov 21, 4:50 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis moved T380494: java.io.IOException: Permission denied when trying to access the hadoop cluster from Backlog - project to In Progress on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.
Thu, Nov 21, 4:31 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis moved T380477: Jupyter/Conda: spawn new server with 'create and use new cloned env' times out from Backlog - project to In Progress on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.
Thu, Nov 21, 4:31 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis claimed T380477: Jupyter/Conda: spawn new server with 'create and use new cloned env' times out.
Thu, Nov 21, 4:31 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis claimed T380494: java.io.IOException: Permission denied when trying to access the hadoop cluster.
Thu, Nov 21, 4:30 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis added a comment to T378735: Jupyter and Analytics Client Enhancements Phase 1: enable shared home directories on the stat servers umbrella task.

We have created a new file system for the dumps project, so there are some relevant guides here: T352650#10344155

Thu, Nov 21, 3:02 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Ceph
BTullis added a comment to T352650: Migrate current-generation dumps to run from our containerized images.

Currently, they are both assigned the hdd CRUSH rule.

btullis@cephosd1001:~$ sudo ceph osd pool get cephfs.dumps.meta crush_rule
crush_rule: hdd
btullis@cephosd1001:~$ sudo ceph osd pool get cephfs.dumps.data crush_rule
crush_rule: hdd

We can change the CRUSH rule associated with the cephfs.dumps.meta pool like this.

btullis@cephosd1001:~$ sudo ceph osd pool set cephfs.dumps.meta crush_rule ssd
set pool 18 crush_rule to ssd
btullis@cephosd1001:~$ sudo ceph osd pool get cephfs.dumps.meta crush_rule
crush_rule: ssd
Thu, Nov 21, 2:58 PM · Data-Platform-SRE, Epic, Data Products, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
BTullis added a comment to T352650: Migrate current-generation dumps to run from our containerized images.

We have now created a cephfs file system for dumps.

btullis@cephosd1001:~$ sudo ceph fs volume create dumps
Volume created successfully (no MDS daemons created)

We can see that one of the existing standby MDS servers has now been assigned to the active role for this file system.

btullis@cephosd1001:~$ sudo ceph -s
  cluster:
    id:     6d4278e1-ea45-4d29-86fe-85b44c150813
    health: HEALTH_OK
Thu, Nov 21, 2:53 PM · Data-Platform-SRE, Epic, Data Products, MW-on-K8s, Dumps-Generation, Release-Engineering-Team, serviceops
BTullis updated the task description for T374922: Bring an-conf100[4-6] into service to replace an-conf100[1-3].
Thu, Nov 21, 2:42 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis added a project to T371501: Configure DSCP marking for cloudceph* hosts: Ceph.
Thu, Nov 21, 11:41 AM · Ceph, Patch-For-Review, netops, Infrastructure-Foundations, SRE

Wed, Nov 20

BTullis added a comment to T378735: Jupyter and Analytics Client Enhancements Phase 1: enable shared home directories on the stat servers umbrella task.

We need to wait until ml-lab1002 has been reimaged and moved to the analytics VLAN, but we can get on with making the file system and the cephx user in the meantime.
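
Those two steps should look roughly like this (a sketch; the file system and client names are hypothetical at this stage):

btullis@cephosd1001:~$ sudo ceph fs volume create homedirs
btullis@cephosd1001:~$ sudo ceph fs authorize homedirs client.homedirs / rw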

Wed, Nov 20, 6:19 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Ceph
BTullis updated the task description for T378735: Jupyter and Analytics Client Enhancements Phase 1: enable shared home directories on the stat servers umbrella task.
Wed, Nov 20, 6:18 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Ceph
BTullis added a comment to T378735: Jupyter and Analytics Client Enhancements Phase 1: enable shared home directories on the stat servers umbrella task.

As per T380279: Move Lab machines into analytics net for DL access and switch to homedirs on Ceph - we have decided to extend the scope of this from just the stat servers to the ml-lab servers as well.

Wed, Nov 20, 6:17 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Ceph
BTullis triaged T380398: Update airflow-dags so that spark 3.5.3 shuffler is used by default, with an override to use the v 3.1.2 shuffler if required as High priority.
Wed, Nov 20, 6:12 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis moved T380398: Update airflow-dags so that spark 3.5.3 shuffler is used by default, with an override to use the v 3.1.2 shuffler if required from Backlog - project to In Progress on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.
Wed, Nov 20, 6:12 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis created T380398: Update airflow-dags so that spark 3.5.3 shuffler is used by default, with an override to use the v 3.1.2 shuffler if required.
Wed, Nov 20, 6:12 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis moved T380040: Deploy the spark shuffler version 3.5.3 to the test cluster from In Progress to Needs Review on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.
Wed, Nov 20, 5:57 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Patch-For-Review
BTullis claimed T380040: Deploy the spark shuffler version 3.5.3 to the test cluster.
Wed, Nov 20, 5:38 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Patch-For-Review
BTullis added a comment to T354733: Create Conda Analytics environment including spark version 3.5.3.

We have a debian package of conda-analytics 0.0.37 available here: https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/package_files/5821/download
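
To try it out on a single host before it lands in apt, something like this should work (a sketch; the local filename is arbitrary):

btullis@stat1010:~$ wget -O conda-analytics_0.0.37.deb https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/package_files/5821/download
btullis@stat1010:~$ sudo dpkg -i conda-analytics_0.0.37.deb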

Wed, Nov 20, 5:03 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis updated subscribers of T369845: [Refine Refactoring] Refine jobs should be scheduled by Airflow: deployment.
Wed, Nov 20, 10:44 AM · Patch-For-Review, Data-Engineering (Q2 2024 October 1st - December 31th)
BTullis added a comment to T379748: Draft a project plan for the Hadoop version 3 upgrade.

This task may be a useful reference. It was an investigation into what we considered unrecoverable, and therefore in need of a backup, before the last major operation on HDFS. We ended up with a figure of ~400 TB at the time.
T260409: Establish what data must be backed up before the HDFS upgrade

Wed, Nov 20, 10:41 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Data-Engineering

Tue, Nov 19

BTullis added a project to T380279: Move Lab machines into analytics net for DL access and switch to homedirs on Ceph: Data-Platform-SRE (2024.11.09 - 2024.11.29).
Tue, Nov 19, 6:44 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Machine-Learning-Team
BTullis updated subscribers of T380279: Move Lab machines into analytics net for DL access and switch to homedirs on Ceph.
Tue, Nov 19, 6:42 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Machine-Learning-Team
BTullis added a comment to T380039: Create and distribute as assembly file for spark version 3.5.3.

Great. I think I prefer this approach of adding Iceberg to the spark assembly to what we had previously discussed, which was adding Iceberg to the docker image: T336012: Add support for Iceberg to the Spark Docker Image
Am I right in thinking that if we have the iceberg jar in the assembly file, we no longer need it in the container image as well, @xcollazo ?

Tue, Nov 19, 3:55 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis added a project to T378735: Jupyter and Analytics Client Enhancements Phase 1: enable shared home directories on the stat servers umbrella task: Ceph.
Tue, Nov 19, 12:20 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Ceph
BTullis added a project to T380258: Create an Airflow instance for ML: Data-Platform-SRE.

Thanks @isarantopoulos - I'm sure that we can help you here.

We would like our own instance on the dse-k8s cluster similar to what is set up for other teams (assuming that dse-k8s is the suggested way moving forward).

Tue, Nov 19, 11:44 AM · Data-Platform-SRE (2024.11.30 - 2024.12.20), Machine-Learning-Team, Data-Platform
BTullis moved T379203: Add CI_RELEASE_TOKEN secret for {name}-maven-release jobs in jenkins from Backlog - operations to Done on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.
Tue, Nov 19, 11:27 AM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Release-Engineering-Team (Priority Backlog 📥), collaboration-services, GitLab (Administration, Settings & Policy), Java-Scala-Standardization
BTullis added a comment to T375729: Create LDAP groups to use for OIDC permission mapping with corresponding airflow DAG Authors groups .

We did originally discuss a sync and/or cross-checking mechanism, to make sure that users of the POSIX and LDAP groups don't diverge from each other.

Ideally, all non-system users of the POSIX groups should be synchronised with the LDAP groups, with some kind of notification in place if there is a discrepancy.
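
A minimal sketch of such a cross-check, diffing the POSIX group membership against the LDAP group (the group name and LDAP attribute here are illustrative, not the final scheme):

$ diff <(getent group analytics-privatedata-users | cut -d: -f4 | tr ',' '\n' | sort) \
       <(ldapsearch -x -LLL cn=analytics-privatedata-users memberUid | awk '/^memberUid:/ {print $2}' | sort)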

Tue, Nov 19, 10:36 AM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Infrastructure-Foundations
BTullis added a comment to T367790: Detect hardware failures/automatically create tickets for DC Ops.

Ideally, each failure would automatically create a ticket for DC Ops in the correct datacenter. This is the experience I'm accustomed to from prior jobs. Based on the number of hardware failures I've experienced personally, I believe this would save a lot of time, but I'll leave it up to the larger SRE teams (particularly DC Ops) to decide whether it's worth the effort.

Tue, Nov 19, 10:07 AM · DC-Ops, Data-Platform
BTullis triaged T379990: 403 on http://dumps.wikimedia.org as Medium priority.
Tue, Nov 19, 9:42 AM · Data-Platform-SRE, Data-Platform
BTullis moved T379990: 403 on http://dumps.wikimedia.org from Backlog to SRE on the Data-Platform board.
Tue, Nov 19, 9:40 AM · Data-Platform-SRE, Data-Platform

Mon, Nov 18

BTullis moved T354733: Create Conda Analytics environment including spark version 3.5.3 from In Progress to To Be Deployed on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.
Mon, Nov 18, 1:36 PM · Data-Platform-SRE (2024.11.30 - 2024.12.20)
BTullis closed T378954: Build bigtop 1.5 packages for bookworm as Resolved.
Mon, Nov 18, 1:35 PM · Patch-For-Review, Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis added a comment to T380035: Create Spark docker images for version 3.5.3.

Oh, the build failed on build2001 with the following error:

2024-11-18 13:26:01 [docker-pkg-build] INFO - ++ /usr/src/spark/build/mvn help:evaluate -Dexpression=project.version -Phive -Phive-thriftserver -Pyarn -Pkubernetes -Dhadoop.version=3.3.6
 (drivers.py:106)
2024-11-18 13:26:01 [docker-pkg-build] INFO - ++ grep -v WARNING
 (drivers.py:106)
2024-11-18 13:26:01 [docker-pkg-build] INFO - ++ tail -n 1
++ grep -v INFO
 (drivers.py:106)
2024-11-18 13:26:01 [docker-pkg-build] INFO - exec: curl --silent --show-error -L https://downloads.lightbend.com/scala/2.12.18/scala-2.12.18.tgz
 (drivers.py:106)
2024-11-18 13:26:02 [docker-pkg-build] INFO - exec: curl --silent --show-error -L https://www.apache.org/dyn/closer.lua/maven/maven-3/3.9.6/binaries/apache-maven-3.9.6-bin.tar.gz?action=download
 (drivers.py:106)
2024-11-18 13:26:02 [docker-pkg-build] INFO - exec: curl --silent --show-error -L https://archive.apache.org/dist/maven/maven-3/3.9.6/binaries/apache-maven-3.9.6-bin.tar.gz.sha512
 (drivers.py:106)
2024-11-18 13:26:03 [docker-pkg-build] INFO - Verifying checksum from /usr/src/spark/build/apache-maven-3.9.6-bin.tar.gz.sha512
 (drivers.py:106)
2024-11-18 13:26:03 [docker-pkg-build] INFO - Using `mvn` from path: /usr/src/spark/build/apache-maven-3.9.6/bin/mvn
 (drivers.py:106)
2024-11-18 13:27:05 [docker-pkg-build] INFO - + VERSION='[ERROR] [Help 2] http://cwiki.apache.org/confluence/display/MAVEN/UnresolvableModelException'
 (drivers.py:106)
2024-11-18 13:27:06 [docker-pkg-build] ERROR - Build command failed with exit code 1: The command '/bin/sh -c ./dev/make-distribution.sh --name wmf --pip --r     -Phive     -Phive-thriftserver     -Pyarn     -Pkubernetes     -Dhadoop.version=3.3.6' returned a non-zero code: 1 (drivers.py:97)
2024-11-18 13:27:06 [docker-pkg-build] ERROR - Building image docker-registry.discovery.wmnet/spark3.5-build:3.5.3-1 failed - check your Dockerfile: Building image docker-registry.discovery.wmnet/spark3.5-build:3.5.3-1 failed (image.py:210)

But it works on my workstation.

Mon, Nov 18, 1:30 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis added a comment to T378854: an-presto1018.eqiad.wmnet: DRAC is down.

That worked, so we're all good.

Mon, Nov 18, 1:26 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), SRE, ops-eqiad, DC-Ops
BTullis added a comment to T380035: Create Spark docker images for version 3.5.3.

I've set off a build with:

root@build2001:/srv/images/production-images# /srv/deployment/docker-pkg/venv/bin/docker-pkg -c /etc/production-images/config.yaml build images/ --select '*spark3.5*'
Mon, Nov 18, 1:25 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis moved T378954: Build bigtop 1.5 packages for bookworm from In Progress to Done on the Data-Platform-SRE (2024.11.09 - 2024.11.29) board.

I'll call this done for now. We can do the testing at a later date.

Mon, Nov 18, 1:16 PM · Patch-For-Review, Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis added a comment to T378954: Build bigtop 1.5 packages for bookworm.

Added the packages to reprepro.

btullis@apt1002:~/deb12$ sudo -i reprepro -C thirdparty/bigtop15 includedeb bookworm-wikimedia `pwd`/*.deb
Exporting indices...
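
We can then confirm that the packages are visible in that component with something like this (a sketch):

btullis@apt1002:~$ sudo -i reprepro -C thirdparty/bigtop15 list bookworm-wikimedia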
Mon, Nov 18, 1:12 PM · Patch-For-Review, Data-Platform-SRE (2024.11.09 - 2024.11.29)
BTullis closed T376118: Update druid config to automatically drop unused segments as Resolved.

I think that this is all working now. I restarted all daemons with the new settings and everything started cleanly.

Mon, Nov 18, 12:47 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), Data-Engineering (Q2 2024 October 1st - December 31th)
BTullis added a comment to T378854: an-presto1018.eqiad.wmnet: DRAC is down.

Trying the reimage again, following the note from https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Urgent_Firmware_Revision_Notices and adding the --force-dhcp-tftp flag.
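
For reference, the invocation would be something like this (a sketch; I'm assuming the standard sre.hosts.reimage cookbook, and the cumin host and OS argument shown are illustrative):

btullis@cumin1002:~$ sudo cookbook sre.hosts.reimage --force-dhcp-tftp --os bullseye -t T378854 an-presto1018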

Mon, Nov 18, 12:40 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), SRE, ops-eqiad, DC-Ops
BTullis added a comment to T378854: an-presto1018.eqiad.wmnet: DRAC is down.

Maybe I spoke too soon. I've had this error twice now, suggesting a failure to pull the boot image with TFTP, or similar.

I'll check the NIC firmware version.

Mon, Nov 18, 12:38 PM · Data-Platform-SRE (2024.11.09 - 2024.11.29), SRE, ops-eqiad, DC-Ops