User Details
- User Since
- Jun 29 2021, 9:56 AM (178 w, 4 d)
- Availability
- Available
- IRC Nick
- btullis
- LDAP User
- Btullis
- MediaWiki User
- BTullis (WMF)
Thu, Nov 28
Hive version 4.0.1 has been added to bigtop in this patch: https://github.com/apache/bigtop/commit/d8459aabfcc8d64bc3121bedc1b2c55ca3788270
We can probably cherry-pick this into our build.
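For reference, cherry-picking that upstream commit into our build might look something like this (the branch and base tag names here are assumptions, not our actual build setup):

git clone https://github.com/apache/bigtop.git && cd bigtop
git checkout -b wmf-hive-4.0.1 release-3.3.0    # base tag assumed; adjust to whatever we actually build from
git cherry-pick d8459aabfcc8d64bc3121bedc1b2c55ca3788270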
Wed, Nov 27
I have built and packaged bigtop 3.3.0 packages for bullseye.
Tue, Nov 26
Great! Glad to hear that it's working. If you think our docs could be clearer, or if there is anything missing from them, please feel free to let me know, or simply edit them.
@Khantstop - could you possibly paste some command output or a screenshot, please?
This is what I see when I go to stat1008 and run kinit
(base) btullis@marlin:~$ ssh stat1008.eqiad.wmnet
Linux stat1008 5.10.0-30-amd64 #1 SMP Debian 5.10.218-1 (2024-06-01) x86_64
Debian GNU/Linux 11 (bullseye)
stat1008 is a Statistics & Analytics cluster explorer (private data access, no local compute) (statistics::explorer)
stat1008 is statistics::explorer Bare Metal host on site eqiad and rack A6
This host is capable of Kerberos authentication in the WIKIMEDIA realm.
For more info: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide
The last Puppet run was at Tue Nov 26 13:24:59 UTC 2024 (26 minutes ago).
Last Puppet commit: (b9a0aff6c5) David Caro - cloudcephmon1004: provision as mon
Debian GNU/Linux 11 auto-installed on Thu May 23 11:03:16 UTC 2024.
Last login: Tue Nov 26 13:51:14 2024 from 2a02:ec80:600:1:185:15:58:6
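For reference, a successful kinit is silent apart from the password prompt, and klist should then show a valid ticket along these lines (principal and cache path here are illustrative):

btullis@stat1008:~$ kinit
Password for btullis@WIKIMEDIA:
btullis@stat1008:~$ klist
Ticket cache: FILE:/tmp/krb5cc_12345
Default principal: btullis@WIKIMEDIA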
A point of note is that Hive 3.x has now been classified as EOL as from 2024/10/08 - https://hive.apache.org/general/downloads/
The 4.0.1 release is now considered the stable release.
Mon, Nov 25
I added a commit to the airflow-dags pyspark upgrade MR here: https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/commit/a30add30d519697c59431d940985f135a8586f3c
Tentatively resolving the issue, but please do feel free to reopen it @mpopov if things are not improved by this.
Looks like this is OK. The graphs here are a bit too spiky to see any immediate effect, but I'll be on the lookout for any more evidence of performance gains in the near future.
I have received approval from @Jclark-ctr out of band, so I'll go ahead and deploy this change.
I have also failed back the namenode services from an-master1004 back to an-master1003.
btullis@an-master1003:~$ sudo -u hdfs /usr/bin/hdfs haadmin -failover an-master1004-eqiad-wmnet an-master1003-eqiad-wmnet
Failover to NameNode at an-master1003.eqiad.wmnet/10.64.36.15:8040 successful
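For the record, the roles can be cross-checked with haadmin -getServiceState using the same service IDs; after the failover I'd expect something like:

btullis@an-master1003:~$ sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1003-eqiad-wmnet
active
btullis@an-master1003:~$ sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState an-master1004-eqiad-wmnet
standby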
So I think we can close this task.
I have removed the remaining resources manually. Some had been removed anyway by purge => true, but I think everything is OK now.
btullis@an-redacteddb1001:~$ sudo rm /etc/ssh/userkeys/root.d/cloud_cumin /lib/systemd/system/prometheus-node-kernel-panic.timer /etc/ferm/conf.d/10_ssh_from_cloudcumin_masters /etc/logrotate.d/prometheus-node-kernel-panic /lib/systemd/system/prometheus-node-kernel-panic.service /etc/rsyslog.d/40-prometheus-node-kernel-panic.conf /usr/local/bin/prometheus-node-kernel-panic
rm: cannot remove '/etc/ssh/userkeys/root.d/cloud_cumin': No such file or directory
rm: cannot remove '/etc/ferm/conf.d/10_ssh_from_cloudcumin_masters': No such file or directory
rm: cannot remove '/etc/rsyslog.d/40-prometheus-node-kernel-panic.conf': No such file or directory
btullis@an-redacteddb1001:~$ sudo rmdir /var/log/prometheus-node-kernel-panic
btullis@an-redacteddb1001:~$ sudo systemctl daemon-reload
btullis@an-redacteddb1001:~$
This morning the number of directories within the analytics user's log directory on HDFS has risen above the previous threshold again, to 1069767.
btullis@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics hdfs dfs -count /var/log/hadoop-yarn/apps/analytics/logs
     1069767      3112858      2624551317505 /var/log/hadoop-yarn/apps/analytics/logs
The fact that log aggregation is still working shows that the new setting must be taking effect, which is a great sign.
Sat, Nov 23
In the meantime, the number of directory entries in /var/log/hadoop-yarn/apps/analytics/logs dropped below the magic number of 1048576 to 1038184, so I don't yet know whether or not the configuration change works.
btullis@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics hdfs dfs -count /var/log/hadoop-yarn/apps/analytics/logs
     1038184      3037830      1959450645456 /var/log/hadoop-yarn/apps/analytics/logs
I've cleaned up the failed DAGs as best I could and I will follow up with puppet changes on Monday. Currently, puppet is disabled on an-test-master100[1-2] and an-master100[3-4].
I had to bump the Java heap on the namenode process from 110 GB to 164 GB before I could reliably fail over to an-master100[3-4].
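For context, a sketch of the kind of change this involves, assuming the heap is set via HADOOP_NAMENODE_OPTS in hadoop-env.sh (our puppet template may express it differently):

# Illustrative only: raise the NameNode JVM heap to 164 GB in /etc/hadoop/conf/hadoop-env.sh
export HADOOP_NAMENODE_OPTS="-Xms164g -Xmx164g ${HADOOP_NAMENODE_OPTS}"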
This didn't work. I've now removed the /etc/hadoop/conf/hdfs-default.xml file on all servers and manually added the configuration property to the end of /etc/hadoop/conf/hdfs-site.xml - then disabled puppet.
I have looked into the options for increasing the limits on the number of files. We don't currently have anything configured, but there is a configuration parameter we can use.
It seems to be defined in the /etc/hadoop/hdfs-default.xml file, which we don't currently manage with puppet. It is available for our version of Hadoop.
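As far as I can tell, the relevant parameter is dfs.namenode.fs-limits.max-directory-items (default 1048576), so the manual addition to hdfs-site.xml would look roughly like this; the value shown here is only illustrative, not a decided limit:

<!-- Illustrative snippet for /etc/hadoop/conf/hdfs-site.xml -->
<property>
  <name>dfs.namenode.fs-limits.max-directory-items</name>
  <value>3200000</value>
</property>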
Fri, Nov 22
Are you OK with this solution of removing or backing up ~/.conda/pkgs on hosts where you are experiencing this issue?
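Concretely, the short-term fix I have in mind is just moving the directory aside, so it can be restored if needed; something like:

mv ~/.conda/pkgs ~/.conda/pkgs.bak    # or remove it outright with: rm -rf ~/.conda/pkgs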
I can ask the same question to @tchin - if you're still experiencing this issue, too.
@mpopov - It seems to be something to do with the .conda/pkgs directory in your home directory.
I can reproduce this on the terminal.
btullis@stat1010:~$ sudo -u bearloga bash
bearloga@stat1010:/srv/home/btullis$ conda-analytics-clone btullis-T380477
<snip snip>
Installing the conda and conda-libmamba-solver packages to the newly cloned environment: btullis-T380477
Channels:
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done
Added some guidance to Wikitech here: https://wikitech.wikimedia.org/w/index.php?title=Data_Platform/Systems/Airflow/Developer_guide&oldid=2246773
It's also worth noting that the hive.exec.scratchdir is configurable: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=27362070#AdminManualConfiguration-HiveConfigurationVariables
...so we could change that as well.
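For illustration, overriding it in hive-site.xml would look something like this; the path shown is a placeholder, not a proposal:

<!-- Illustrative only: alternative Hive scratch directory -->
<property>
  <name>hive.exec.scratchdir</name>
  <value>/tmp/hive-scratch</value>
</property>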
I've found out what is causing this. There is an easy enough short-term fix, and a few options for how to stop it happening again.
Thu, Nov 21
Weird. Once again, it works for me on stat1008. I think that this must be something to do with the contents of your home directory.
btullis@stat1008:~$ spark3-sql
Running /opt/conda-analytics/bin/spark-sql $@
SPARK_HOME: /opt/conda-analytics/lib/python3.10/site-packages/pyspark
Using Hadoop client lib jars at 3.2.0, provided by Spark.
PYSPARK_PYTHON=/opt/conda-analytics/bin/python3
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/21 18:10:08 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12000. Attempting port 12001.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12001. Attempting port 12002.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12002. Attempting port 12003.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12003. Attempting port 12004.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12004. Attempting port 12005.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12005. Attempting port 12006.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12006. Attempting port 12007.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12007. Attempting port 12008.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12008. Attempting port 12009.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12009. Attempting port 12010.
24/11/21 18:10:08 WARN Utils: Service 'sparkDriver' could not bind on port 12010. Attempting port 12011.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4045. Attempting port 4046.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4046. Attempting port 4047.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4047. Attempting port 4048.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4048. Attempting port 4049.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4049. Attempting port 4050.
24/11/21 18:10:09 WARN Utils: Service 'SparkUI' could not bind on port 4050. Attempting port 4051.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13000. Attempting port 13001.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13001. Attempting port 13002.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13002. Attempting port 13003.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13003. Attempting port 13004.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13004. Attempting port 13005.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13005. Attempting port 13006.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13006. Attempting port 13007.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13007. Attempting port 13008.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13008. Attempting port 13009.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13009. Attempting port 13010.
24/11/21 18:10:09 WARN Utils: Service 'org.apache.spark.network.netty.NettyBlockTransferService' could not bind on port 13010. Attempting port 13011.
ADD JAR file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar
24/11/21 18:10:12 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Added [file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar] to class path
Added resources: [file:///usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar]
Spark master: local[*], Application Id: local-1732212609631
spark-sql (default)>
So the difference would seem to be that in @mpopov's environment we see:
Nov 21 14:53:50 stat1010 jupyterhub-conda-singleuser[493333]: Collecting package metadata (repodata.json): ...working... done
Nov 21 14:53:51 stat1010 jupyterhub-conda-singleuser[493333]: Solving environment: ...working... done
Nov 21 14:53:52 stat1010 jupyterhub-conda-singleuser[493333]: RuntimeError('EnforceUnusedAdapter called with url https://repo.anaconda.com/pkgs/main/noarch/conda-libmamba-solver-24.7.0-pyhd3eb1b0_0.conda\nThis >
Nov 21 14:53:52 stat1010 jupyterhub-conda-singleuser[493333]: RuntimeError('EnforceUnusedAdapter called with url https://repo.anaconda.com/pkgs/main/linux-64/certifi-2024.8.30-py310h06a4308_0.conda\nThis comman>
Nov 21 14:53:52 stat1010 jupyterhub-conda-singleuser[493333]: RuntimeError('EnforceUnusedAdapter called with url https://repo.anaconda.com/pkgs/main/linux-64/ca-certificates-2024.9.24-h06a4308_0.conda\nThis com>
Whereas in mine we see:
Nov 21 16:53:01 stat1010 jupyterhub-conda-singleuser[540785]: Collecting package metadata (repodata.json): ...working... WARNING conda.conda_libmamba_solver.index:_json_path_to_repo_info(281): No repodata found>
Nov 21 16:53:01 stat1010 jupyterhub-conda-singleuser[540785]: WARNING conda.conda_libmamba_solver.index:_json_path_to_repo_info(281): No repodata found for channel defaults. Solve will fail.
Nov 21 16:53:01 stat1010 jupyterhub-conda-singleuser[540785]: WARNING conda.conda_libmamba_solver.index:_json_path_to_repo_info(281): No repodata found for channel defaults. Solve will fail.
Nov 21 16:53:01 stat1010 jupyterhub-conda-singleuser[540785]: WARNING conda.conda_libmamba_solver.index:_json_path_to_repo_info(281): No repodata found for channel defaults. Solve will fail.
This succeeded for me.
Currently trying this out on stat1010, creating a new cloned environment and tailing the systemd unit's log.
btullis@stat1010:~$ journalctl -u jupyter-btullis-singleuser-conda-analytics.service -f
-- Journal begins at Thu 2024-11-21 05:09:27 UTC. --
Nov 21 16:52:35 stat1010 systemd[1]: Started /bin/bash -c cd /home/btullis && exec /etc/jupyterhub-conda/jupyterhub-singleuser-conda-env.sh __NEW__ --port=48013 --SingleUserNotebookApp.default_url=/lab.
Nov 21 16:52:35 stat1010 jupyterhub-conda-singleuser[540711]: Creating new cloned conda env 2024-11-21T16.52.35_btullis...
Nov 21 16:52:39 stat1010 jupyterhub-conda-singleuser[540732]: Source: /opt/conda-analytics
Nov 21 16:52:39 stat1010 jupyterhub-conda-singleuser[540732]: Destination: /home/btullis/.conda/envs/2024-11-21T16.52.35_btullis
Nov 21 16:52:39 stat1010 jupyterhub-conda-singleuser[540732]: The following packages cannot be cloned out of the root environment:
Nov 21 16:52:39 stat1010 jupyterhub-conda-singleuser[540732]:  - conda-forge/linux-64::conda-23.10.0-py310hff52083_1
Nov 21 16:52:39 stat1010 jupyterhub-conda-singleuser[540732]:  - conda-forge/noarch::conda-libmamba-solver-24.7.0-pyhd8ed1ab_0
Nov 21 16:52:39 stat1010 jupyterhub-conda-singleuser[540732]: Packages: 225
Nov 21 16:52:39 stat1010 jupyterhub-conda-singleuser[540732]: Files: 1255
Nov 21 16:52:39 stat1010 jupyterhub-conda-singleuser[540732]: Downloading and Extracting Packages: ...working... done
Nov 21 16:52:39 stat1010 jupyterhub-conda-singleuser[540732]: Downloading and Extracting Packages: ...working... done
Nov 21 16:52:40 stat1010 jupyterhub-conda-singleuser[540732]: Preparing transaction: ...working... done
I'm currently unable to reproduce this on stat1008.
We have created a new file system for the dumps project, so there are some relevant guides here: T352650#10344155
Currently, they are both assigned the hdd CRUSH rule.
btullis@cephosd1001:~$ sudo ceph osd pool get cephfs.dumps.meta crush_rule
crush_rule: hdd
btullis@cephosd1001:~$ sudo ceph osd pool get cephfs.dumps.data crush_rule
crush_rule: hdd
We can change the CRUSH rule associated with the cephfs.dumps.meta pool like this.
btullis@cephosd1001:~$ sudo ceph osd pool set cephfs.dumps.meta crush_rule ssd
set pool 18 crush_rule to ssd
btullis@cephosd1001:~$ sudo ceph osd pool get cephfs.dumps.meta crush_rule
crush_rule: ssd
We have now created a cephfs file system for dumps.
btullis@cephosd1001:~$ sudo ceph fs volume create dumps
We can see that one of the existing standby MDS servers has now been assigned to the active role for this file system.
Volume created successfully (no MDS daemons created)
btullis@cephosd1001:~$ sudo ceph -s
  cluster:
    id:     6d4278e1-ea45-4d29-86fe-85b44c150813
    health: HEALTH_OK
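The MDS assignment itself can be checked with something like the following (output elided here; it lists the active MDS and any standbys for the file system):

btullis@cephosd1001:~$ sudo ceph fs status dumps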
Wed, Nov 20
We need to wait until ml-lab1002 has been reimaged and moved to the analytics VLAN, but we can get on with making the file system and the cephx user in the meantime.
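A rough sketch of the commands involved, with the client name and mount path as assumptions rather than final choices:

# Illustrative only: create the file system and a cephx client that can mount it read-write
sudo ceph fs volume create dumps
sudo ceph fs authorize dumps client.dumps / rw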
As per T380279: Move Lab machines into analytics net for DL access and switch to homedirs on Ceph - we have decided to extend the scope of this from just the stat servers to the ml-lab servers as well.
We have a debian package of conda-analytics 0.0.37 available here: https://gitlab.wikimedia.org/repos/data-engineering/conda-analytics/-/package_files/5821/download
This task may be a useful reference. It was an investigation of what we considered to be unrecoverable and requiring a backup, before the last major operation to HDFS. We ended up with a figure of ~ 400 TB at the time.
T260409: Establish what data must be backed up before the HDFS upgrade
Tue, Nov 19
Great. I think I prefer this approach of adding Iceberg to the spark assembly to what we had previously discussed, which was adding Iceberg to the docker image: T336012: Add support for Iceberg to the Spark Docker Image
Am I right in thinking that if we have the iceberg jar in the assembly file, we no longer need it in the container image as well, @xcollazo ?
Thanks @isarantopoulos - I'm sure that we can help you, here.
We would like our own instance on the dse-k8s cluster similar to what is set up for other teams (assuming that dse-k8s is the suggested way moving forward).
We did originally discuss a sync and/or cross-checking mechanism, to make sure that users of the POSIX and LDAP groups don't diverge from each other.
Ideally, all non-system users of the POSIX groups should be synchronised with the LDAP groups, with some kind of notification in place if there is a discrepancy.
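As a very rough sketch of what such a cross-check could look like (the group name is hypothetical, and this assumes ldapsearch picks up the right base DN from the host's configuration):

GROUP=analytics-privatedata-users    # hypothetical group name
getent group "$GROUP" | cut -d: -f4 | tr ',' '\n' | sort > /tmp/posix_members
ldapsearch -x -LLL "(cn=$GROUP)" member | awk -F'[=,]' '/^member:/ {print $2}' | sort > /tmp/ldap_members
diff /tmp/posix_members /tmp/ldap_members    # any output indicates a discrepancy to notify about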
Mon, Nov 18
Oh, the build failed on build2001 with the following error:
2024-11-18 13:26:01 [docker-pkg-build] INFO - ++ /usr/src/spark/build/mvn help:evaluate -Dexpression=project.version -Phive -Phive-thriftserver -Pyarn -Pkubernetes -Dhadoop.version=3.3.6 (drivers.py:106)
2024-11-18 13:26:01 [docker-pkg-build] INFO - ++ grep -v WARNING (drivers.py:106)
2024-11-18 13:26:01 [docker-pkg-build] INFO - ++ tail -n 1 ++ grep -v INFO (drivers.py:106)
2024-11-18 13:26:01 [docker-pkg-build] INFO - exec: curl --silent --show-error -L https://downloads.lightbend.com/scala/2.12.18/scala-2.12.18.tgz (drivers.py:106)
2024-11-18 13:26:02 [docker-pkg-build] INFO - exec: curl --silent --show-error -L https://www.apache.org/dyn/closer.lua/maven/maven-3/3.9.6/binaries/apache-maven-3.9.6-bin.tar.gz?action=download (drivers.py:106)
2024-11-18 13:26:02 [docker-pkg-build] INFO - exec: curl --silent --show-error -L https://archive.apache.org/dist/maven/maven-3/3.9.6/binaries/apache-maven-3.9.6-bin.tar.gz.sha512 (drivers.py:106)
2024-11-18 13:26:03 [docker-pkg-build] INFO - Verifying checksum from /usr/src/spark/build/apache-maven-3.9.6-bin.tar.gz.sha512 (drivers.py:106)
2024-11-18 13:26:03 [docker-pkg-build] INFO - Using `mvn` from path: /usr/src/spark/build/apache-maven-3.9.6/bin/mvn (drivers.py:106)
2024-11-18 13:27:05 [docker-pkg-build] INFO - + VERSION='[ERROR] [Help 2] http://cwiki.apache.org/confluence/display/MAVEN/UnresolvableModelException' (drivers.py:106)
2024-11-18 13:27:06 [docker-pkg-build] ERROR - Build command failed with exit code 1: The command '/bin/sh -c ./dev/make-distribution.sh --name wmf --pip --r -Phive -Phive-thriftserver -Pyarn -Pkubernetes -Dhadoop.version=3.3.6' returned a non-zero code: 1 (drivers.py:97)
2024-11-18 13:27:06 [docker-pkg-build] ERROR - Building image docker-registry.discovery.wmnet/spark3.5-build:3.5.3-1 failed - check your Dockerfile: Building image docker-registry.discovery.wmnet/spark3.5-build:3.5.3-1 failed (image.py:210)
But it works on my workstation.
That worked, so we're all good.
I've set off a build with:
root@build2001:/srv/images/production-images# /srv/deployment/docker-pkg/venv/bin/docker-pkg -c /etc/production-images/config.yaml build images/ --select '*spark3.5*'
I'll call this done for now. We can do the testing at a later date.
Added the packages to reprepro.
btullis@apt1002:~/deb12$ sudo -i reprepro -C thirdparty/bigtop15 includedeb bookworm-wikimedia `pwd`/*.deb
Exporting indices...
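To confirm that the packages landed, something like this should list them in the same component and distribution:

btullis@apt1002:~$ sudo -i reprepro -C thirdparty/bigtop15 list bookworm-wikimedia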
I think that this is all working now. I restarted all daemons with the new settings and everything started cleanly.
Trying the reimage again, following the note from https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation#Urgent_Firmware_Revision_Notices and adding the --force-dhcp-tftp flag.
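For the record, the reimage invocation is along these lines; treat the exact options as approximate, apart from the --force-dhcp-tftp flag mentioned above (host name omitted here):

sudo cookbook sre.hosts.reimage --os bullseye --force-dhcp-tftp <host>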
Maybe I spoke too soon. I've had this error twice now, suggesting a failure to pull the boot image with TFTP, or similar.
I'll check the NIC firmware version.