rook (Vivian Rook)
User

User Details

User Since
Jun 7 2021, 2:32 AM (179 w, 10 h)
Availability
Available
LDAP User
Vivian Rook
MediaWiki User
VRook (WMF) [ Global Accounts ]

Recent Activity

Fri, Nov 8

rook changed the status of T379400: Upgrade jupyter chart from Open to Stalled.
Fri, Nov 8, 6:15 PM · PAWS
rook created T379400: Upgrade jupyter chart.
Fri, Nov 8, 6:15 PM · PAWS
rook closed T188684: PAWS kills active users servers that are not connected to a user session as Resolved.
Fri, Nov 8, 6:07 PM · PAWS, Upstream
rook added a comment to T188684: PAWS kills active users servers that are not connected to a user session.

I've spread out cluster rebuilds somewhat since my last comment. I haven't heard of similar issues since then, so that may well have been the cause. Please reopen if it is seen again.

Fri, Nov 8, 6:07 PM · PAWS, Upstream

Thu, Nov 7

rook closed T378978: update build-and-push as Resolved.
Thu, Nov 7, 3:11 PM · Quarry
rook added a comment to T378978: update build-and-push.

https://github.com/toolforge/quarry/pull/71

Thu, Nov 7, 3:10 PM · Quarry
rook closed T373528: unused dns proxies? as Resolved.
Thu, Nov 7, 2:56 PM · Quarry
rook closed T373134: PR usually not posting to phabricator as Declined.
Thu, Nov 7, 1:04 PM · Quarry, PAWS

Wed, Nov 6

rook added a comment to T379076: Remove tf-infra-test project.

@rook We also have tf-infra-dev in codfw, should that one be deleted as well?

Wed, Nov 6, 11:53 AM · cloud-services-team, Cloud-VPS

Tue, Nov 5

rook added a comment to T379076: Remove tf-infra-test project.

I don't see that file on either cloudcontrol1005.eqiad.wmnet or cloudcontrol1007.eqiad.wmnet

Tue, Nov 5, 7:17 PM · cloud-services-team, Cloud-VPS
rook added a comment to T379076: Remove tf-infra-test project.

Looks like it took

openstack server show 40560d4a-6b06-49be-bfcd-2565666ef95d
No Server found for 40560d4a-6b06-49be-bfcd-2565666ef95d
Tue, Nov 5, 7:09 PM · cloud-services-team, Cloud-VPS
rook added a comment to T379076: Remove tf-infra-test project.

Do we feel that running openstack server delete 40560d4a-6b06-49be-bfcd-2565666ef95d would be safe?

Tue, Nov 5, 6:39 PM · cloud-services-team, Cloud-VPS
rook added a comment to T379076: Remove tf-infra-test project.

I believe 40560d4a-6b06-49be-bfcd-2565666ef95d is our system:

Tue, Nov 5, 6:38 PM · cloud-services-team, Cloud-VPS
rook added a comment to T379076: Remove tf-infra-test project.

Things that are good to know. I'll see what I can find

Tue, Nov 5, 6:15 PM · cloud-services-team, Cloud-VPS
rook renamed T379076: Remove tf-infra-test project from Remove tofu-infra-test project to Remove tf-infra-test project.
Tue, Nov 5, 2:19 PM · cloud-services-team, Cloud-VPS
rook created T379076: Remove tf-infra-test project.
Tue, Nov 5, 2:10 PM · cloud-services-team, Cloud-VPS

Mon, Nov 4

rook closed T378977: Update build-and-push as Resolved.
Mon, Nov 4, 3:02 PM · PAWS
rook closed T348873: update github action, a subtask of T378977: Update build-and-push, as Resolved.
Mon, Nov 4, 2:27 PM · PAWS
rook closed T348873: update github action, a subtask of T378978: update build-and-push, as Resolved.
Mon, Nov 4, 2:27 PM · Quarry
rook closed T348873: update github action as Resolved.
Mon, Nov 4, 2:27 PM · PAWS, Quarry
rook added a parent task for T348873: update github action: T378978: update build-and-push.
Mon, Nov 4, 2:26 PM · PAWS, Quarry
rook added a subtask for T378978: update build-and-push: T348873: update github action.
Mon, Nov 4, 2:26 PM · Quarry
rook added a parent task for T348873: update github action: T378977: Update build-and-push.
Mon, Nov 4, 2:26 PM · PAWS, Quarry
rook added a subtask for T378977: Update build-and-push: T348873: update github action.
Mon, Nov 4, 2:26 PM · PAWS
rook created T378978: update build-and-push.
Mon, Nov 4, 2:26 PM · Quarry
rook created T378977: Update build-and-push.
Mon, Nov 4, 2:25 PM · PAWS

Thu, Oct 31

rook closed T378674: New upstream release for OpenRefine as Resolved.
Thu, Oct 31, 4:57 PM · PAWS
rook closed T378719: jupyterlab to 4.3.0 as Resolved.
Thu, Oct 31, 4:38 PM · PAWS
rook closed T378675: New upstream release for Pywikibot as Resolved.
Thu, Oct 31, 4:06 PM · PAWS
rook reopened T378674: New upstream release for OpenRefine as "Open".
Thu, Oct 31, 4:05 PM · PAWS
rook closed T378674: New upstream release for OpenRefine as Resolved.
Thu, Oct 31, 4:04 PM · PAWS
rook closed T378718: remove 126a cluster as Resolved.
Thu, Oct 31, 2:25 PM · PAWS
rook created T378719: jupyterlab to 4.3.0.
Thu, Oct 31, 2:06 PM · PAWS
rook created T378718: remove 126a cluster.
Thu, Oct 31, 2:02 PM · PAWS
rook updated subscribers of T373896: Can gitlab build docker images?.

@Jelto In T357612 is it being suggested that one can build docker images in a runner with Dockerfile at this point? Or is it describing that images built with Dockerfile can be run and it is more about where they are allowed to be pulled from?

Thu, Oct 31, 1:58 PM · PAWS

Wed, Oct 30

rook closed T378643: paws nfs full as Resolved.
Wed, Oct 30, 6:48 PM · PAWS
rook added a comment to T378643: paws nfs full.

Large files have been removed.

Wed, Oct 30, 6:48 PM · PAWS
rook added a comment to T378643: paws nfs full.

Confirmed on nfs host (paws-nfs-1.paws.eqiad1.wikimedia.cloud):

rook@paws-nfs-1:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            2.0G     0  2.0G   0% /dev
tmpfs           394M  484K  393M   1% /run
/dev/sda1        20G  3.9G   15G  21% /
tmpfs           2.0G     0  2.0G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           1.0G     0  1.0G   0% /var/lib/nginx
/dev/sda15      124M   11M  114M   9% /boot/efi
/dev/sdb        1.5T  1.4T     0 100% /srv/paws
tmpfs           394M     0  394M   0% /run/user/0
tmpfs           394M     0  394M   0% /run/user/38011
Wed, Oct 30, 6:48 PM · PAWS
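
The check done by eye on the `df -h` output above can be sketched in code. This is a minimal illustration, not something taken from the task: the function name `full_filesystems` and the 90% threshold are assumptions, and the sample text is a shortened copy of the output in the comment. It assumes mount points contain no spaces.

```python
def full_filesystems(df_output: str, threshold: int = 90) -> list[tuple[str, int]]:
    """Return (mount point, use%) pairs at or above `threshold` from `df -h` text."""
    results = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Expect: Filesystem, Size, Used, Avail, Use%, Mounted on
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue
        use_pct = int(fields[4].rstrip("%"))
        if use_pct >= threshold:
            results.append((fields[5], use_pct))
    return results


# Shortened copy of the output quoted in the comment above.
SAMPLE = """Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        20G  3.9G   15G  21% /
/dev/sdb        1.5T  1.4T     0 100% /srv/paws
"""

print(full_filesystems(SAMPLE))  # [('/srv/paws', 100)]
```
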
rook created T378643: paws nfs full.
Wed, Oct 30, 6:42 PM · PAWS

Mon, Oct 28

rook added a comment to T360041: Set query result retention time.

I believe PII such as email addresses, password hashes, and IPs is scrubbed by the replicas? Quarry isn't a system I think of as having PII in it. All the data it queries is public, I think.

Mon, Oct 28, 5:46 PM · Quarry
Pppery awarded T360041: Set query result retention time a Dislike token.
Mon, Oct 28, 4:42 PM · Quarry
rook closed T360041: Set query result retention time as Resolved.
Mon, Oct 28, 4:40 PM · Quarry
rook added a comment to T360041: Set query result retention time.

I appreciate the commentary, though none of it gets at the central issue of PII or the reality that quarry is not designed to keep data in perpetuity. Data persistence is an expensive process and is not being applied to quarry; as such, we're one system crash away from the data being gone regardless. The current download options can be used to export query results. If additional export options are desired, patches are welcome.

Mon, Oct 28, 3:08 PM · Quarry

Fri, Oct 25

rook closed T378158: New upstream release for OpenRefine as Resolved.
Fri, Oct 25, 7:21 PM · PAWS

Mon, Oct 21

rook closed T377010: [bug] Quarry queries are stopped as Declined.
Mon, Oct 21, 8:18 PM · Quarry
rook added a comment to T377010: [bug] Quarry queries are stopped.

It is possible that you were encountering the three-hour time limit for analytics searches. If there was some lag, it could have increased your query time from what looks like an hour to something longer. I'm unsure of how additional data could be provided, though it may be possible. It is likely easier, though, to check https://replag.toolforge.org/ for lag; a large amount of lag would suggest that long-running queries may not complete.

Mon, Oct 21, 8:17 PM · Quarry

Fri, Oct 18

rook closed T376556: New upstream release for Pywikibot as Resolved.
Fri, Oct 18, 2:48 PM · PAWS
rook added a comment to T375988: Quarry shows error: This web service cannot be reached.

Quarry is working again, though I didn't have time to investigate what is happening, so this may happen again. Opening T375997 to investigate the underlying issue.

Indeed, the same error is showing up at Quarry again right now.

Fri, Oct 18, 10:31 AM · Quarry

Oct 11 2024

rook added a comment to T336586: magnum: kubectl fails to connect after time.

It's probably better to generate it using the deploy.sh script from the bastion. The only tofu thing that it should attempt doing is creating the kube.config file for the current cluster. That way we shouldn't have to keep track of the various files in /opt

Oct 11 2024, 7:49 PM · Openstack-Magnum
rook added a comment to T377010: [bug] Quarry queries are stopped.

It's been a while since I've looked at that code. When I worked on it, the goal was to have the stopped status appear when someone manually presses the "stop" button. I thought I added it just for that, but maybe it existed for something else as well. So, ideally, the status isn't meant to indicate that the query failed or otherwise couldn't run, just that it was stopped by the user while it was running. Are you seeing your queries end with this status without pressing the stop button?

Oct 11 2024, 6:01 PM · Quarry

Sep 30 2024

rook closed T375988: Quarry shows error: This web service cannot be reached as Resolved.
Sep 30 2024, 7:32 AM · Quarry
rook added a comment to T375988: Quarry shows error: This web service cannot be reached.

Quarry is working again, though I didn't have time to investigate what is happening, so this may happen again. Opening T375997 to investigate the underlying issue.

Sep 30 2024, 7:30 AM · Quarry
rook created T375997: worker nodes issue with garbage collection.
Sep 30 2024, 7:29 AM · Quarry
rook added a comment to T375988: Quarry shows error: This web service cannot be reached.

Looks like k8s is having trouble with garbage collection

Warning  FreeDiskSpaceFailed  118s (x159 over 13h)  kubelet  Failed to garbage collect required amount of images. Attempted to free 4281512755 bytes, but only found 0 bytes eligible to free.
Sep 30 2024, 7:04 AM · Quarry

Sep 27 2024

rook added a comment to T375831: Unable to start.

Could you try this again? I seem to be able to log in on my account. It may be the case that quay.io was offline when you tried to connect to paws.

Sep 27 2024, 9:19 AM · PAWS

Sep 12 2024

rook added a comment to T372498: Figure out how to provision a Kubernetes cluster using Magnum and OpenTofu.

The hope being that beta cluster could benefit from any updates that are made to production, without having to track them independently of how production updates are being tracked.

I think I understand why you would want this, but it also seems like the exact inverse of the intent of a pre-production integration environment.

Sep 12 2024, 10:35 AM · Patch-For-Review, User-bd808, Beta-Cluster-Infrastructure

Sep 11 2024

rook added a comment to T372498: Figure out how to provision a Kubernetes cluster using Magnum and OpenTofu.

Oh some good answers here https://etherpad.wikimedia.org/p/rooks-questions-to-alex

Sep 11 2024, 6:02 PM · Patch-For-Review, User-bd808, Beta-Cluster-Infrastructure
rook added a comment to T360041: Set query result retention time.

Oh sorry "Oh some good answers here https://etherpad.wikimedia.org/p/rooks-questions-to-alex" was meant for another ticket. Please disregard.

Sep 11 2024, 3:39 PM · Quarry
rook added a comment to T372498: Figure out how to provision a Kubernetes cluster using Magnum and OpenTofu.

I'm trying to improve my understanding of how deploys are done in prod. I'm asking around in a few places; this is mostly related to what is happening here, and @bd808 knows a lot of things, so I figured I would ask here too.
My current understanding of deployments is: code is updated, then helm is updated in the https://gerrit.wikimedia.org/r/operations/deployment-charts repository. After an update to the chart is merged, deploys can be done from a deployment server, either through deployment.eqiad.wmnet or directly, by moving to the desired service with cd /srv/deployment-charts/helmfile.d/services/<service> and running helmfile -e <env> -i apply
Helmfiles seem to reference /etc/helmfile-defaults/general-{{ .Environment.Name }}.yaml files, which appear to contain environment variables for each environment, though I'm not sure where these files themselves are generated.
Is my basic understanding of how helm is deployed correct?
Can any deployment server deploy to any environment? For example, could deploy1003.eqiad.wmnet deploy to codfw?
Is there a way to deploy all the projects to an environment?
I'm also not sure where cluster access is granted for helmfile to make updates to k8s. Where is that managed?
Would it be feasible to add environment variables and k8s connection information for beta cluster, and then have the deployment server be able to deploy to beta cluster? The hope being that beta cluster could benefit from any updates that are made to production, without having to track them independently of how production updates are being tracked.

Sep 11 2024, 1:11 PM · Patch-For-Review, User-bd808, Beta-Cluster-Infrastructure
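
The deploy step described in the comment above (change into the service directory, then run helmfile with an environment) can be written down as a small command builder. This is a sketch of that description only: the service name `example-service` and environment `staging` are hypothetical placeholders, the function name is made up, and nothing is executed.

```python
import shlex
from pathlib import PurePosixPath

# Directory layout as described in the comment above.
SERVICES_ROOT = PurePosixPath("/srv/deployment-charts/helmfile.d/services")


def helmfile_apply_command(service: str, env: str) -> tuple[PurePosixPath, list[str]]:
    """Return the working directory and argv for an interactive helmfile apply."""
    if not service or "/" in service:
        raise ValueError(f"unexpected service name: {service!r}")
    if not env:
        raise ValueError("environment name must not be empty")
    workdir = SERVICES_ROOT / service
    argv = ["helmfile", "-e", env, "-i", "apply"]
    return workdir, argv


# Hypothetical service and environment names, for illustration only.
workdir, argv = helmfile_apply_command("example-service", "staging")
print(f"cd {workdir} && {shlex.join(argv)}")
```

Returning the directory and argument list separately, instead of a single shell string, would let a caller pass them to `subprocess.run(argv, cwd=workdir)` without shell quoting concerns.
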

Sep 9 2024

rook added a comment to T360041: Set query result retention time.

I apologize, I have yet to understand the interest in old data. The above seems to be suggesting that if the data is retained for 90 days, it would be copied all over the web to be read later. Results that are over 5 years old are probably safe to remove. I mostly only understand the desire to have a result set that is easy to access with curl or other tools, such that one doesn't have to put a query inside of their project. But all data diminishes in value as it ages, which runs contrary to what is being described above. Rerunning a query 4 times a year so that it has (now more valuable, fresh) data seems a minor effort by comparison to copying the data somewhere so that the decaying data can be referenced forever. I do feel that I'm missing something: why would anyone want old data when they can have new data?

Sep 9 2024, 11:39 PM · Quarry
rook added a comment to T360041: Set query result retention time.

The issue is not one of size, or suspicion that people may think the data is fresh, but the data itself. Periodically there are tickets opened regarding data that has been removed from the wikis but remains in results from quarry. The information was removed because it was destructive, in varying degrees, to the people who had it removed. It involves things like identifying information for people who have edited about particular conflicts, or other things that may lead to them being the target of investigation in parts of the world. We don't want this data living in quarry, and presumably for each request we get to scrub a particular result there are likely many more that go unnoticed by those they might identify (who are likely not aware that quarry exists).

Sep 9 2024, 3:50 PM · Quarry
rook added a comment to T360041: Set query result retention time.

The query itself will remain, so getting fresh results should be nothing more than a submit query away.

That's not quite accurate when the purpose of the query is to get trends, for example in the number of links.

Sep 9 2024, 3:26 PM · Quarry
rook renamed T374349: Upgrade to Ansible 10.3.0 from upgrade ansible to Upgrade to Ansible 10.3.0.
Sep 9 2024, 3:22 PM · PAWS
rook closed T374362: Upgrade to Ansible 10.3.0 as Resolved.
Sep 9 2024, 2:56 PM · Quarry
rook created T374362: Upgrade to Ansible 10.3.0.
Sep 9 2024, 2:44 PM · Quarry
rook added a comment to T360041: Set query result retention time.

https://lists.wikimedia.org/hyperkitty/list/cloud-announce@lists.wikimedia.org/thread/JA4F2K4EBEC3CMS54JDTJBMRAPKND2NN/

Sep 9 2024, 2:11 PM · Quarry
rook closed T374349: Upgrade to Ansible 10.3.0 as Resolved.
Sep 9 2024, 11:16 AM · PAWS
rook created T374349: Upgrade to Ansible 10.3.0.
Sep 9 2024, 11:10 AM · PAWS

Sep 6 2024

rook added a comment to T372498: Figure out how to provision a Kubernetes cluster using Magnum and OpenTofu.

I've created a T372498 branch on the https://gitlab.wikimedia.org/bd808/deployment-prep-opentofu/ repo. The primary changes introduce multiple-datacenter logic, mostly to be able to test without fear of upsetting anything that may be happening in eqiad1 deployment-prep, though with a view that in the distant future a "prod" datacenter could be added. I took out the ssh keys for the cluster, as I haven't found them useful; the only times I have used them were to dig around in the control node in a situation where it is deploying but the worker nodes are not, though all the logs in the control node appear to be in parts of openstack. Additionally, if access to a control or worker node is needed after the cluster is deployed, it is possible with:

kubectl debug node/<node name> -it --image=ubuntu
chroot /host/

If you're finding other uses of the ssh key please let me know.

Sep 6 2024, 9:16 PM · Patch-For-Review, User-bd808, Beta-Cluster-Infrastructure

Sep 5 2024

rook added a comment to T372498: Figure out how to provision a Kubernetes cluster using Magnum and OpenTofu.

I agree moving from a docker/systemd approach to a k8s approach is better, both to better resemble prod and to be a more expected framework. I also agree that the first step should be to just update what is already in deployment-prep to look more like production, rather than add anything new to deployment-prep. The general conversation about whether deployment-prep is the right place for things that are not already in it, or whether elsewhere would be a better option, is a second-day conversation.

Sep 5 2024, 11:34 PM · Patch-For-Review, User-bd808, Beta-Cluster-Infrastructure
rook added a comment to T372498: Figure out how to provision a Kubernetes cluster using Magnum and OpenTofu.

My hoped-for end result is that we can apply the helm charts from https://gerrit.wikimedia.org/g/operations/deployment-charts to this cluster.

Sep 5 2024, 6:48 PM · Patch-For-Review, User-bd808, Beta-Cluster-Infrastructure
rook added a comment to T372498: Figure out how to provision a Kubernetes cluster using Magnum and OpenTofu.

Is ingress separate from the HAProxy setup in your case? If not what is the desired ingress setup?

Sep 5 2024, 11:57 AM · Patch-For-Review, User-bd808, Beta-Cluster-Infrastructure
rook added a comment to T372498: Figure out how to provision a Kubernetes cluster using Magnum and OpenTofu.

While higher availability is nice, is it valuable in this case? PAWS, quarry and superset all run without HAProxy nodes, and I have yet to identify an instance where the service was not accessible because the instance that dns was connecting to had failed. In addition, regardless of how redundant the networking is, magnum in our current setup only has one control node. Perhaps waiting until octavia is available is the better method, as it allows for cloud native load balancing/redundancy and should allow multiple control nodes.

Sep 5 2024, 11:56 AM · Patch-For-Review, User-bd808, Beta-Cluster-Infrastructure

Sep 3 2024

rook closed T365820: object store for tf-infra-test as Resolved.
Sep 3 2024, 7:14 PM · Cloud-VPS
rook closed T365830: move project to tofuinfratest as Resolved.
Sep 3 2024, 5:36 PM · Cloud-VPS
rook closed T365830: move project to tofuinfratest, a subtask of T365820: object store for tf-infra-test, as Resolved.
Sep 3 2024, 5:36 PM · Cloud-VPS
rook added a comment to T365830: move project to tofuinfratest.

Created tofuinfratest project. Added to metrics infra (https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Monitoring#Monitoring_for_Cloud_VPS):

INSERT INTO alerts VALUES (NULL, 240, 'ApplyFailed', 'tofu_apply{instance="tf-bastion", job="node", project="tofuinfratest"} != 0', '0m', 'critical', '{"summary": "Tofu failed to apply/create the resources on {{ $labels.instance }}", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestApplyFailed"}');
INSERT INTO alerts VALUES (NULL, 240, 'DestroyFailed', 'tofu_destroy{instance="tf-bastion", job="node", project="tofuinfratest"} != 0', '0m', 'critical', '{"summary": "Tofu failed to destroy the resources on {{ $labels.instance }}", "runbook": "https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Runbooks/TfInfraTestDestroyFailed"}');

Test apply alert successful

Sep 3 2024, 5:35 PM · Cloud-VPS
rook renamed T365830: move project to tofuinfratest from move project to TfInfraTest to move project to tofuinfratest.
Sep 3 2024, 5:35 PM · Cloud-VPS
rook closed T340762: Upgrade quarry os as Resolved.
Sep 3 2024, 5:02 PM · Quarry
rook edited projects for T328712: Create a community offering of OpenStack Magnum, added: Openstack-Magnum; removed Cloud-VPS.
Sep 3 2024, 4:54 PM · Openstack-Magnum
rook closed T328713: Create default magnum template, a subtask of T328712: Create a community offering of OpenStack Magnum, as Declined.
Sep 3 2024, 4:53 PM · Openstack-Magnum
rook closed T328713: Create default magnum template as Declined.
Sep 3 2024, 4:53 PM · Openstack-Magnum
rook added a comment to T328713: Create default magnum template.

As magnum becomes better understood, a default template seems somewhat ill-advised. Some parts of a template would be the same, namely the bits out of the magnum documentation describing how to set up a cluster of a given k8s version. Others will be different, such as image and volume size, and potentially network type. This could be reopened if views change.

Sep 3 2024, 4:53 PM · Openstack-Magnum
rook closed T364753: remove buster systems as Resolved.
Sep 3 2024, 4:47 PM · Quarry
rook added a comment to T365820: object store for tf-infra-test.
Sep 3 2024, 4:43 PM · Cloud-VPS
rook created T373896: Can gitlab build docker images?.
Sep 3 2024, 3:57 PM · PAWS

Aug 30 2024

rook closed T373094: Redeploy bastion for tf-infra-test in codfw1dev as Resolved.
Aug 30 2024, 8:55 PM · VPS-Projects
rook closed T373544: jupyterlab to 4.2.5 as Resolved.
Aug 30 2024, 12:20 PM · PAWS

Aug 29 2024

rook added a comment to T369150: Analysis and metrics collection for quarry and superset adoption.

Since 2024-04-02 it would appear that superset has had 103 unique users and quarry has had 917 unique users. Between the two there is an overlap of 70 users.

Aug 29 2024, 7:57 PM · Quarry, superset.wmcloud.org
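
The figures in the comment above also give the combined user count by inclusion-exclusion. A minimal sketch of that arithmetic follows; the function name is an assumption, and the only inputs are the three numbers quoted in the comment.

```python
def combined_unique_users(users_a: int, users_b: int, overlap: int) -> int:
    """Size of the union of two user sets, given each size and their overlap."""
    if overlap < 0 or overlap > min(users_a, users_b):
        raise ValueError("overlap must be between 0 and the smaller set size")
    return users_a + users_b - overlap


# Figures quoted in the comment above (since 2024-04-02).
superset_users, quarry_users, both = 103, 917, 70

print(combined_unique_users(superset_users, quarry_users, both))  # 950
print(quarry_users - both)               # 847 users seen only in quarry
print(superset_users - both)             # 33 users seen only in superset
print(f"{both / superset_users:.0%}")    # 68% of superset users also used quarry
```
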
rook closed T373372: Upgrade to k8s 1.27 as Resolved.
Aug 29 2024, 4:31 PM · PAWS

Aug 28 2024

rook added a comment to T373134: PR usually not posting to phabricator.

May be caused by T362401

Aug 28 2024, 8:56 PM · Quarry, PAWS
rook added a project to T373134: PR usually not posting to phabricator: Quarry.
Aug 28 2024, 8:49 PM · Quarry, PAWS
rook added a comment to T373134: PR usually not posting to phabricator.

In another log:
<div class="footer"><p>If you report this error to the Wikimedia System Administrators, please include the details below.</p><p class="text-muted"><code>Request from 172.183.131.107 via cp1108 cp1108, Varnish XID 305039307<br>Upstream caches: cp1108 int<br>Error: 429, Too many requests at Wed, 28 Aug 2024 20:24:50 GMT</code></p>

Aug 28 2024, 8:48 PM · Quarry, PAWS
rook renamed T373134: PR usually not posting to phabricator from PR not posting to phabricator to PR usually not posting to phabricator.
Aug 28 2024, 8:47 PM · Quarry, PAWS
rook created T373544: jupyterlab to 4.2.5.
Aug 28 2024, 5:01 PM · PAWS
rook updated the task description for T373528: unused dns proxies?.
Aug 28 2024, 2:47 PM · Quarry
rook created T373528: unused dns proxies?.
Aug 28 2024, 2:45 PM · Quarry
rook closed T373375: Remove quarry-124 cluster as Resolved.
Aug 28 2024, 2:44 PM · Quarry