User Details
- User Since: Jun 7 2021, 2:32 AM (179 w, 10 h)
- Availability: Available
- LDAP User: Vivian Rook
- MediaWiki User: VRook (WMF) [ Global Accounts ]
Fri, Nov 8
I've tried to spread out cluster rebuilds somewhat since my last comment. I haven't heard of similar issues since then, so that may well have been the cause. Please reopen if it's seen again.
Tue, Nov 5
I don't see that file on either cloudcontrol1005.eqiad.wmnet or cloudcontrol1007.eqiad.wmnet
Looks like it took
openstack server show 40560d4a-6b06-49be-bfcd-2565666ef95d
No Server found for 40560d4a-6b06-49be-bfcd-2565666ef95d
Do we feel that running openstack server delete 40560d4a-6b06-49be-bfcd-2565666ef95d would be safe?
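For what it's worth, here is a minimal sketch (my own assumption about the workflow, using the standard OpenStack CLI and assuming admin credentials are already sourced) of the checks one might run before and after such a delete:

UUID=40560d4a-6b06-49be-bfcd-2565666ef95d
# Confirm what nova currently reports for the instance.
openstack server show "$UUID"
# Make sure it is not a live instance in some other project.
openstack server list --all-projects | grep "$UUID" || echo "not listed anywhere"
# If it is clearly orphaned, remove it and verify it is gone.
openstack server delete "$UUID"
openstack server show "$UUID"   # should now report "No Server found"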
I believe 40560d4a-6b06-49be-bfcd-2565666ef95d is our system:
Things that are good to know. I'll see what I can find
Wed, Oct 30
Large files have been removed.
Confirmed on nfs host (paws-nfs-1.paws.eqiad1.wikimedia.cloud):
rook@paws-nfs-1:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           394M  484K  393M   1% /run
/dev/sda1        20G  3.9G   15G  21% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           1.0G     0  1.0G   0% /var/lib/nginx
/dev/sda15      124M   11M  114M   9% /boot/efi
/dev/sdb        1.5T  1.4T     0 100% /srv/paws
tmpfs           394M     0  394M   0% /run/user/0
tmpfs           394M     0  394M   0% /run/user/38011
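As a side note, a rough sketch (not taken from the ticket) of how one could locate the large files on the full volume, assuming it is run as root on paws-nfs-1:

# Largest directories two levels below the mount point.
du -h --max-depth=2 /srv/paws 2>/dev/null | sort -h | tail -n 20
# Individual files over 1 GiB.
find /srv/paws -xdev -type f -size +1G -exec ls -lh {} + 2>/dev/null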
Mon, Oct 28
I believe PII such as email addresses, password hashes, and IPs is scrubbed by the replicas? Quarry isn't a system I think of as having PII in it. All the data it queries is public, I think.
I appreciate the commentary, though none of it gets at the central issue of PII, or the reality that quarry is not designed to keep data in perpetuity. Data persistence is an expensive process, and since it is not being applied to quarry, we're one system crash away from the data being gone regardless. There are the current download options to export query results; if additional export options are desired, patches are welcome.
Mon, Oct 21
It is possible that you were hitting the three-hour time limit for analytics queries. If there was replica lag, it could have pushed your query time well past what looks like an hour. I'm unsure how additional data could be provided, though it may be possible. It is likely easier to check https://replag.toolforge.org/ for lag; if there is much, long-running queries may not complete.
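If it helps, a hedged sketch of checking the lag directly against the replicas; it assumes Toolforge-style replica credentials in ~/replica.my.cnf and the analytics replica hostname, so adjust to your own setup:

# Query the heartbeat view to see the current lag per shard.
mysql --defaults-file="$HOME/replica.my.cnf" \
  -h enwiki.analytics.db.svc.wikimedia.cloud enwiki_p \
  -e 'SELECT * FROM heartbeat_p.heartbeat;'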
Oct 11 2024
It's probably better to generate it using the deploy.sh script from the bastion. The only tofu thing it should attempt is creating the kube.config file for the current cluster. That way we shouldn't have to keep track of the various files in /opt.
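For reference, a sketch of one way the kube.config could be produced for a magnum cluster outside of tofu; this is an assumption on my part, and <cluster-name> is a placeholder:

# Fetch the kubeconfig for the current cluster from the bastion, assuming
# OpenStack credentials are already sourced.
openstack coe cluster config <cluster-name> --dir "$HOME/.kube"
export KUBECONFIG="$HOME/.kube/config"
kubectl get nodes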
It's been a while since I've looked at that code. When I worked on it, the intent was to have the stopped status appear when someone manually presses the "stop" button; I thought I added it just for that, but maybe it existed for something else as well. So, ideally, the status doesn't indicate that the query failed or otherwise couldn't run, just that the user stopped it while it was running. Are you seeing your queries end with this status without pressing the stop button?
Sep 30 2024
Quarry is working again, though I didn't have time to investigate what was happening, so this may happen again. Opening T375997 to investigate the underlying issue.
Looks like k8s is having trouble with garbage collection
Warning FreeDiskSpaceFailed 118s (x159 over 13h) kubelet Failed to garbage collect required amount of images. Attempted to free 4281512755 bytes, but only found 0 bytes eligible to free.
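A rough sketch (assumptions on my part, not commands from the ticket) of how one might confirm and manually relieve the image disk pressure on the affected node:

# Run on the affected node; assumes a CRI runtime with crictl available.
df -h /var/lib/containerd /var/lib/docker 2>/dev/null   # see which filesystem is full
crictl imagefsinfo        # image filesystem usage as the kubelet sees it
crictl rmi --prune        # remove unused images, since the kubelet GC could not free space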
Sep 27 2024
Could you try this again? I seem to be able to log in on my account. It may be the case that quay.io was offline when you tried to connect to paws.
Sep 11 2024
Oh some good answers here https://etherpad.wikimedia.org/p/rooks-questions-to-alex
Oh sorry "Oh some good answers here https://etherpad.wikimedia.org/p/rooks-questions-to-alex" was meant for another ticket. Please disregard.
Trying to improve my understanding of how deploys are done in prod. I'm asking around in a few places, mostly related to what is happening here, and @bd808 knows a lot of things, so I figured I would ask here too.
My current understanding of deployments is: code is updated, then the helm chart is updated in https://gerrit.wikimedia.org/r/operations/deployment-charts. After an update to the chart is merged, deploys can be done from a deployment server (either through deployment.eqiad.wmnet or directly) by moving to the desired service and running helmfile:
cd /srv/deployment-charts/helmfile.d/services/<service>
helmfile -e <env> -i apply
Helmfiles seem to reference /etc/helmfile-defaults/general-{{ .Environment.Name }}.yaml files, which appear to contain environment variables for each environment, though I'm not sure where these files themselves are generated.
Is my basic understanding of how helm is deployed correct?
Can any deployment server deploy to any environment? For example, could deploy1003.eqiad.wmnet deploy to codfw?
Is there a way to deploy all the projects to an environment?
I'm also not sure where cluster access is granted for helmfile to make updates to k8s. Where is that managed?
Would it be feasible to add environment variables and k8s connection information for beta cluster, and then have the deployment server be able to deploy to beta cluster? The hope is that beta cluster could benefit from any updates made to production, without having to track them independently of how production updates are tracked.
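To make that last question concrete, a purely hypothetical example: if a "beta" environment were defined for a service in deployment-charts (it isn't today, as far as I know), the deploy would presumably mirror the production workflow above:

# Hypothetical: "beta" is not an existing environment; this only illustrates
# what the workflow would look like if it were added.
cd /srv/deployment-charts/helmfile.d/services/<service>
helmfile -e beta -i apply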
Sep 9 2024
I apologize, I have yet to understand the interest in old data. The above seems to suggest that if the data were retained for 90 days, it would be copied all over the web to be read later. Results that are over 5 years old are probably safe to remove. I mostly only understand the desire to have a result set that is easy to access with curl or other tools, so that one doesn't have to put a query inside their project. But all data diminishes in value as it ages, which runs contrary to what is being described above. Rerunning a query 4 times a year so that it has fresh (and now more valuable) data seems a minor effort compared to copying the data somewhere so that the decaying data can be referenced forever. I do feel that I'm missing something: why would anyone want old data when they can have new data?
The issue is not one of size, or a suspicion that people may think the data is fresh, but the data itself. Periodically, tickets are opened regarding data that has been removed from the wikis but remains in results from quarry. The information was removed because it is, to varying degrees, harmful to the people who had it removed, involving things like identifying information for people who have edited about particular conflicts or other topics that may make them the target of investigation in parts of the world. We don't want this data living in quarry, and presumably for each request we get to scrub a particular result, there are many more that go unnoticed by those they might identify (who are likely not aware that quarry exists).
Sep 6 2024
I've created a T372498 branch on the https://gitlab.wikimedia.org/bd808/deployment-prep-opentofu/ repo. The primary changes were to introduce multiple-datacenter logic, mostly to be able to test without fear of upsetting anything that may be happening in the eqiad1 deployment-prep, though with a view that in the distant future a "prod" datacenter could be added. I took out the ssh keys for the cluster, as I haven't found them useful; the times I have used them were to dig around in the control node when it was deploying but the worker nodes were not, and even then all the useful logs in the control node appear to live in parts of openstack. Additionally, if access to a control or worker node is needed post cluster deploy, it is possible with:
kubectl debug node/<node name> -it --image=ubuntu
chroot /host/
If you're finding other uses of the ssh key please let me know.
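One small follow-up, based on my understanding of kubectl's default behaviour rather than anything in the repo: the pod that kubectl debug creates is left behind after you exit, so it is worth deleting it when done.

# The node debug pod is typically named node-debugger-<node name>-<suffix>.
kubectl get pods | grep node-debugger
kubectl delete pod <node-debugger-pod-name>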
Sep 5 2024
I agree that moving from a docker/systemd approach to a k8s approach is better, both to better resemble prod and to be a more expected framework, and that the first step should be to just update what is already in deployment-prep to look more like production, rather than add anything new to deployment-prep. The general conversation of whether deployment-prep is the right place for things that are not already in it, or whether elsewhere would be a better option, is a second-day conversation.
My hoped for end result is that we can apply the helm charts from https://gerrit.wikimedia.org/g/operations/deployment-charts to this cluster.
Is ingress separate from the HAProxy setup in your case? If not, what is the desired ingress setup?
While higher availability is nice, is it valuable in this case? PAWS, quarry, and superset all run without HAProxy nodes, and I have yet to identify an instance where the service was not accessible because the instance that DNS was pointing to had failed. In addition, regardless of how redundant the networking is, magnum in our current setup only has one control node. Perhaps waiting until octavia is available is the better method? It allows for cloud-native load balancing/redundancy, and should allow multiple control nodes.
Sep 3 2024
Created tofuinfratest project. Added to metrics infra (https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Monitoring_for_Cloud_VPS):
INSERT INTO alerts VALUES (NULL, 240, 'ApplyFailed', 'tofu_apply{instance="tf-bastion", job="node", project="tofuinfratest"} != 0', '0m', 'critical', '{"summary": "Tofu failed to apply/create the resources on {{ $labels.instance }}", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed"}');
INSERT INTO alerts VALUES (NULL, 240, 'DestroyFailed', 'tofu_destroy{instance="tf-bastion", job="node", project="tofuinfratest"} != 0', '0m', 'critical', '{"summary": "Tofu failed to destroy the resources on {{ $labels.instance }}", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed"}');
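For context on where tofu_apply/tofu_destroy come from, a hedged sketch of how such a metric might be exported on tf-bastion via the node exporter's textfile collector; the directory path and the exit-status semantics are assumptions about the local setup:

TEXTFILE_DIR=/var/lib/prometheus/node.d   # assumed textfile collector directory
tofu apply -auto-approve
rc=$?
# Write atomically so the exporter never reads a half-written file.
printf 'tofu_apply %d\n' "$rc" > "${TEXTFILE_DIR}/tofu.prom.$$"
mv "${TEXTFILE_DIR}/tofu.prom.$$" "${TEXTFILE_DIR}/tofu.prom"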
Test apply alert successful
As magnum is further understood, a default template seems somewhat ill-advised. Some parts of a template would be the same, namely the bits from the magnum documentation describing how to set up a cluster of a given k8s version. Others will be different, such as image and volume size, and potentially network type. This could be reopened if views change.
Aug 29 2024
Since 2024-04-02 it would appear that superset has had 103 unique users and quarry has had 917 unique users. Between the two there is an overlap of 70 users.
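In case it's useful, a hypothetical sketch of how such an overlap count could be produced, assuming one username per line has already been exported from each service into the two files named below (producing those exports is not shown here):

# superset_users.txt and quarry_users.txt are placeholder file names.
sort -u superset_users.txt > superset_sorted.txt
sort -u quarry_users.txt > quarry_sorted.txt
wc -l superset_sorted.txt quarry_sorted.txt               # unique users per service
comm -12 superset_sorted.txt quarry_sorted.txt | wc -l    # overlap between the two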
Aug 28 2024
May be caused by T362401
In another log:
If you report this error to the Wikimedia System Administrators, please include the details below.
Request from 172.183.131.107 via cp1108 cp1108, Varnish XID 305039307
Upstream caches: cp1108 int
Error: 429, Too many requests at Wed, 28 Aug 2024 20:24:50 GMT