Phabricator

From Wikitech

Phabricator is an open-source software development platform. In Wikimedia, Phabricator is used for project management, software bug reporting, and feature requests. See mw:Phabricator for more details on end user usage.

phabricator.wikimedia.org runs on phab1004 in eqiad.

The Phabricator install relies on db1183 (m3 eqiad master), with several replicas. Database access is routed through dbproxy1003, a.k.a. m3-master.

A disaster recovery plan for phabricator.wikimedia.org is at Phabricator/Disaster Recovery.

Metrics are on https://grafana.wikimedia.org/d/000000587/phabricator.

Since 2023-08-23, we actually use the Phorge fork of Phabricator (T333885), but we have not (yet?) started to update references to the old software name.

Operations Projects Workflows

The operations specific projects on Phabricator[1] include:

Project                          Description
SRE                              General SRE Team Project
Labs                             Labs Team Project
DC-Ops                           Data center Team Project
domains                          Domain support/changing/issues
hardware requests                Server Allocation Requests
procurement                      Vendor & Procurement Tasks. Direct ordering of SSL certificates.
network                          Network Requests
Ops Access Requests              Access requests to any Operations systems
ops-codfw                        Onsite queue for codfw
ops-eqdfw                        Onsite queue for eqdfw
ops-eqiad                        Onsite queue for eqiad
ops-eqord                        Onsite queue for eqord
ops-esams                        Onsite queue for esams
ops-ulsfo                        Onsite queue for ulsfo
DBA                              Database administration requests
Operations Software Development  Software development projects

Hardware Request Stage

  • Tasks assigned to others are not reviewed as often, as they are awaiting input from the assignee. If they are left neglected by the assignee long term, they will likely be rejected, or have the hardware-requests project removed from the task.
  • If the system specification meets an on-site spare, system allocation may proceed.
  • This allocation step is typically processed by Rob and approved by Mark. (It involves a general overview of the roadmap and system procurement planning.)
  • If the system specifications require an order of hardware, the following occurs:
  • An RT procurement queue ticket is created for each set of vendor quotes.
  • Example: if a caching system could come from either Dell or HP, we create two RT tickets, one per vendor, so that each vendor can provide quotes for the system specification in question.
  • Quotes are generated and reviewed by Rob, Mark, and the requestors for the hardware.
  • Quotes are approved for purchase by Mark/Damon/Lila (escalation dependent on overall cost) and are typically placed by Rob (for US ordering) or Mark (for EU ordering).
  • The hardware-requests task will have the system details noted (hostname/asset tag) and the task will be linked to the system setup task.
  • These are kept separate so that hardware allocations are easy to search for later; it is therefore helpful to leave a task with the hardware request in said project.

Hardware/Server Setup / Deployment Stage Workflow

  • This task is the primary tracking task for the setup and deployment of the server.
  • Task should include the following (base template):
  • System Deployment Steps:
   [] - mgmt dns entries created/updated (both asset tag & hostname) [link sub-task for on-site work here, sub-task should include the ops-datacenter project]
   [] - system bios and mgmt setup and tested [link sub-task for on-site work here, sub-task should include the ops-datacenter project]
   [] - network switch setup (port description & vlan) [link sub-task for network configuration here, sub-task should include the network project]
   [] - production dns entries created/updated (just hostname, no asset tag entry) [link sub-task for on-site work here, sub-task should include the ops-datacenter project]
   [] - install-server module updated (dhcp and netboot/partitioning) [done via this task when on-site subtasks complete]
   [] - install OS (note jessie or trusty) [done via this task when network sub-task(s) complete]
   [] - service implementation [done via this task post puppet acceptance]
  • The main task is basically for all the software setup, and the sub-tasks are for the specific on-site or networking tasks.
  • Many times, the network task isn't created, as the person doing the software work can also do the network configuration.

Misc. Production Virtual Machine Requests Workflow

  • Tasks assigned to others are not reviewed as often, as they are awaiting input from the assignee. If they are left neglected by the assignee long term, they will likely be rejected, or have the vm-requests project removed from the task.
  • If you are the SRE reviewing the request, or you are evaluating your own request, the documentation on what to look for is at Ganeti#Verify_cluster_resource_availability.
  • If the system specifications meet all requirements for approval/allocation of a production virtual machine, the machine can be created. The creation should be undertaken by the SRE that filed the request to increase familiarity with the platform.

Administrative Commands

  • All Phabricator documentation refers to scripts in the phabricator bin directory. On our setup, that is: /srv/phab/phabricator/bin/

Dump the entire database

Write the entire contents of phabricator's databases to disk, compressed:

FIXME!

/srv/dumps is not the right path to use - it is synced to public.

cd /srv/phab/phabricator
sudo ./bin/storage dump --output /srv/dumps/phabricator_db_$(date +%Y%m%d%H%M%S).sql.gz --compress
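
Until the FIXME above is resolved, one way to avoid writing a dump into the publicly synced directory by accident is to build the output path with a small guard. This is a minimal sketch: the helper name dump_target and the placeholder directory are hypothetical, and only the filename pattern and the storage dump flags come from the command above.

```shell
#!/bin/sh
# dump_target DIR: print a timestamped dump path inside DIR.
# Refuses /srv/dumps (and anything below it), because that
# directory is synced to public.
dump_target() {
  if [ -z "$1" ]; then
    echo "usage: dump_target DIR" >&2
    return 1
  fi
  dir=${1%/}   # drop a single trailing slash, if any
  case "$dir" in
    /srv/dumps|/srv/dumps/*)
      echo "refusing to write to $dir: synced to public" >&2
      return 1
      ;;
  esac
  echo "$dir/phabricator_db_$(date +%Y%m%d%H%M%S).sql.gz"
}

# Usage (commented out; /path/to/private/dumps is a placeholder):
# cd /srv/phab/phabricator
# sudo ./bin/storage dump --output "$(dump_target /path/to/private/dumps)" --compress
```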

Remove a repo

First you need the repo's callsign. This is an all-uppercase identifier with 'r' prefixed that is used in urls and such in Phabricator for the repo. For example, Puppet's is OPUP. First SSH to phab100N. Then:

cd /srv/phab/phabricator
sudo ./bin/remove destroy rFOO

Remove a file

First you need the file's ID prefixed with 'F'. First SSH to phab100N. Then:

cd /srv/phab/phabricator
sudo ./bin/remove destroy Fxxxxxxxx

Ban a user

Members of the #acl*userdisable Phabricator project can ban a user via https://phab-ban.toolforge.org/

Delete a user

This is not recommended if the account has already been active! Deleting a user can be needed when a user entered a wrong email address in the registration form and now cannot verify their address to finish account creation. First SSH to phab100N. Then:

cd /srv/phab/phabricator
sudo ./bin/remove destroy @AccountNameOfThatUser

Removing Two Factor Authentication

  • Please note that removal of 2FA is a serious request, and all too easily socially engineered. All requests of this nature should be treated with the same degree of security and confirmation as ssh key changes. The user guidelines require one month between the paste of the user committed identity hash on the wiki user page and the reset request, or verification via a video call.
  • When copying the text phrase from a Phabricator Paste, make sure to use View Raw File and save the file, to avoid issues with line breaks via copy&paste. (Potentially also check with a hex editor that no additional byte such as 0x0A has been added.) Afterwards, run cat file | sha512sum (or whatever algorithm was used, e.g. could also be sha3sum -a 512 or such).
  • Once confirmed, the actual command is quite simple, run on the phabricator host:
  sudo /srv/phab/phabricator/bin/auth strip --all-types --user <username>
  • You will be prompted with a yes or no to remove the multi-authentication types on the user.
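
The hash comparison described above can be sketched as a small helper. This is only an illustration, assuming SHA-512 was the algorithm used and that sha512sum (GNU coreutils) is available; the function name verify_identity_hash is hypothetical. It also warns when the saved file ends in a 0x0a byte, which is the copy&paste problem mentioned above.

```shell
#!/bin/sh
# verify_identity_hash FILE EXPECTED_HEX
# Prints "match" and returns 0 if the SHA-512 of FILE equals EXPECTED_HEX,
# otherwise prints "MISMATCH" and returns 1.
verify_identity_hash() {
  file=$1
  # Normalise the published hash to lowercase hex before comparing.
  expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
  actual=$(sha512sum "$file" | cut -d' ' -f1)

  # Warn (on stderr) if the last byte of the file is a newline.
  last_byte=$(tail -c 1 "$file" | od -An -tx1 | tr -d ' \n')
  if [ "$last_byte" = "0a" ]; then
    echo "note: $file ends with a newline byte (0x0a)" >&2
  fi

  if [ "$actual" = "$expected" ]; then
    echo "match"
  else
    echo "MISMATCH"
    return 1
  fi
}

# Usage (file name and hash are placeholders):
# verify_identity_hash phrase.txt "<hash from the wiki user page>"
```

A mismatch is not proof of a bad request on its own, since a stray newline changes the hash; that is why the helper reports the trailing byte separately.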

Revoking a Conduit token

Users can do this themselves with the big red "Terminate Tokens" button in Settings > Conduit API Tokens. If it needs to be forced for some reason, you can do it from a phabricator server:

ssh phab1004.eqiad.wmnet
    sudo /srv/phab/phabricator/bin/auth revoke --type conduit --from @<username> 

Revoking a user's sessions

This invalidates any active sessions and forces the user to log in again.

ssh phab1004
    sudo /srv/phab/phabricator/bin/auth revoke --type session --from @<username>

Revoking a user's ssh keys

This invalidates any authorized ssh keys that the user has configured in phabricator.

ssh phab1004
    sudo /srv/phab/phabricator/bin/auth revoke --type ssh --from @<username>

Rebuild phabricator search index

Warning: This takes a really long time, probably more than 8 hours. The service stays online during the reindex; however, search quality will be degraded.

ssh phab1004
   sudo /srv/phab/phabricator/bin/search init
   sudo /srv/phab/phabricator/bin/search index --all --force --background

Revert all activity of a given user

Caution: This removes most of the user's activity from Phabricator and it is a destructive operation. This should only be done when cleaning up vandalism from an account which has no legitimate activity. If the account had real contributions prior to being compromised, then another solution is needed to avoid deleting the legitimate contributions along with the spam.

This procedure will attempt to undo all edits made by a given user. If you add the --delete argument, it will also remove all traces of the corresponding transactions from the Phabricator activity log. This should succeed in all cases, with one limitation: any field that has been edited by someone else after the vandal's edit is treated as an edit conflict and left alone, to avoid potentially overwriting useful edits by other users.

How it works: The rollback script simply replays the edit transactions in reverse, from newest to oldest. Each transaction in Phabricator stores the field name, the old value and the new value. To revert a user's activity, the script does the following:

  1. For each task edited by the vandal user:
    1. For each transaction made by the vandal user (newest to oldest):
      1. If the transaction's "new" value matches the field's current value, then the transaction's "old" value is applied to the field.
    2. After all transactions have been replayed, if any field was changed then the record is saved back to the database.
    3. Finally, if --delete was also specified, then all the replayed transactions are also deleted to clean up the history of activity.
ssh phab1004
    sudo /srv/phab/libext/misc/bin/rollback execute --delete --user <username>
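
The replay rule described above can be illustrated for a single field with a toy model. This is not the actual rollback script: it works on plain strings instead of the database, and the function name rollback_field and the "OLD|NEW" argument format are made up for the example.

```shell
#!/bin/sh
# rollback_field CURRENT "OLD|NEW" ["OLD|NEW" ...]
# CURRENT is the field's present value; the remaining arguments are the
# vandal's transactions on that field, ordered newest first.
# Prints the field value after the replay.
rollback_field() {
  current=$1
  shift
  for txn in "$@"; do
    old=${txn%%|*}   # text before the first "|"
    new=${txn#*|}    # text after the first "|"
    # Only revert when nobody else changed the field since this edit;
    # otherwise it is an edit conflict and the value is left alone.
    if [ "$current" = "$new" ]; then
      current=$old
    fi
  done
  printf '%s\n' "$current"
}

# Two vandal edits to a title, nobody else touched it: fully reverted.
rollback_field "SPAM" "spam1|SPAM" "Fix bug|spam1"
# Vandal set priority low -> unbreak, then someone else set it to high:
# the current value no longer matches, so it is kept.
rollback_field "high" "low|unbreak"
```

The first call prints "Fix bug" and the second prints "high", which is the edit-conflict case from the limitation noted above.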

Converting a parent project into a subproject

There is no such script anymore as it led to database corruption; see phab:T342275. Thus this is manual work now.

Run a bulk job silently (suppressing notification spam)

First set up a bulk job in phabricator's GUI, then get the bulk job id and run the make-silent command below, specifying your bulk job id. Finally, start the job in the GUI and it will run without sending notifications.

ssh phab1004
    sudo /srv/phab/phabricator/bin/bulk make-silent  --id <bulkid>

See also mw:Phabricator/Help#Batch edits for more information and guidance.

read-only mode / restarting mariadb

To put phabricator into read-only mode, which allows it to continue serving requests during a master database restart, do the following on the active phabricator server:

ssh phab1004
    sudo /srv/phab/phabricator/bin/config set cluster.read-only true
    # restart database server
    sudo /srv/phab/phabricator/bin/config set cluster.read-only false
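
If the steps above are scripted, it helps to make sure read-only mode is switched off again even when the middle step fails. The following is a minimal sketch under stated assumptions: the helper names (run, set_read_only, with_read_only) and the DRY_RUN and PHAB_BIN variables are hypothetical, and the database restart itself is represented by whatever command is passed in. By default it only prints the commands.

```shell
#!/bin/sh
PHAB_BIN=${PHAB_BIN:-/srv/phab/phabricator/bin}
# DRY_RUN=1 (the default in this sketch) prints commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

# run CMD...: execute CMD, or just print it in dry-run mode.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# set_read_only true|false
set_read_only() {
  run sudo "$PHAB_BIN/config" set cluster.read-only "$1"
}

# with_read_only CMD...: enable read-only mode, run CMD, then always
# disable read-only mode again and return CMD's exit status.
with_read_only() {
  set_read_only true || return 1
  "$@"
  rc=$?
  set_read_only false
  return "$rc"
}

# Usage (wait_for_db_restart is a placeholder for the manual restart step):
# DRY_RUN=0 with_read_only wait_for_db_restart
```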

Disabling a Herald rule

Herald rules can be disabled via

ssh phab1004
    sudo /srv/phab/phabricator/bin/herald rule --disable --rule <rulenumber>

Check on a Phabricator user

To check whether a Phabricator user is who they say they are, there is a script that retrieves their email address and whether it is verified from the SQL database:

chk_phuser <Phabricator username>

ssh phab1004
    sudo chk_phuser Dzahn

Unlocking edit permissions on a task

ssh phab1004
    sudo /srv/phab/phabricator/bin/policy unlock --edit YourPhabUserName T12345678

Unlocking edit permissions on random objects

First get the internal PHID of the object to unlock, for example via Conduit by passing {"ids":[12345678]} as constraints.

ssh phab1004
    sudo /srv/phab/phabricator/bin/policy unlock --edit YourPhabUserName PHIDofObject

Mail debugging

See Phabricator/Mail debugging.

Ban an IP address

See Phabricator/Ban IP address

Rate Limiting

Access to Phabricator is restricted by rate limiting rules in requestctl. This rate limiting was enabled in May 2024 due to a high level of scraping and abusive traffic (see T362401). Users affected by the rate limiting will temporarily see an "HTTP 429 - Too Many Requests" response.

Normal traffic from legitimate users shouldn't be affected in most cases. To avoid triggering the rate limit, the following can help:

  • Keep the number of requests per second low, especially when using the API, scripts, or curl.
  • Use a unique, non-shared IP address (avoid cloud networks, VPNs or proxies).
  • Set a proper, application-specific User-Agent header.
  • Try again later once the limit resets.
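
For script authors, the advice above could look roughly like the following client-side sketch. It is not an official client: the User-Agent string, the attempt limit, the 60-second cap, and the function names are all assumptions chosen for illustration, and the actual reset interval of the rate limit is not documented here.

```shell
#!/bin/sh
# Hypothetical values; use your own tool name and contact address.
USER_AGENT="example-phab-client/0.1 (contact: you@example.org)"
MAX_ATTEMPTS=5

# backoff_delay N: seconds to wait after failed attempt N (1-based).
# Doubles each time (1, 2, 4, ...) and is capped at 60 seconds.
backoff_delay() {
  delay=1
  i=1
  while [ "$i" -lt "$1" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt 60 ]; then
    delay=60
  fi
  echo "$delay"
}

# http_status URL: print only the HTTP status code of a GET request,
# sending the application-specific User-Agent.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' -A "$USER_AGENT" "$1"
}

# fetch_politely URL: retry while the server answers 429, waiting longer
# each time. Prints the final status; returns 1 if still rate limited.
fetch_politely() {
  url=$1
  attempt=1
  while :; do
    status=$(http_status "$url")
    if [ "$status" != "429" ]; then
      echo "$status"
      return 0
    fi
    if [ "$attempt" -ge "$MAX_ATTEMPTS" ]; then
      echo "$status"
      return 1
    fi
    sleep "$(backoff_delay "$attempt")"
    attempt=$((attempt + 1))
  done
}

# Usage:
# fetch_politely "https://phabricator.wikimedia.org/"
```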

SREs can adjust the rate limiting settings in private-puppet/requestctl. The relevant configuration for Phabricator can be found under cache-text/phabricator*.

Network Architecture

Phabricator is currently hosted on phab1004.eqiad.wmnet / phab2002.codfw.wmnet.

The full path of traffic from the public internet through to the database is as follows:

cache_text esams -> cache_text codfw -> cache_text eqiad -> phab1004 -> dbproxy1003 -> db1043

Fixing Common Problems

PhutilMissingSymbolException

Some Phabricator applications throw exceptions like Failed to load class or interface "Phabricator*". This can sometimes be resolved by running arc liberate inside /srv/phab/phabricator, which updates the library map, as in this commit.

Phabricator is intermittently down or slow

Check the Phabricator dashboard on Grafana.

Check the logs in /var/log/apache2/phabricator_error.log (or in Logstash application logs and Logstash apache logs for a more readable format).

Check the host in Icinga for more failed checks (eg. PHD should be running).

Check the status of the phd process (sudo systemctl status phd).

Do not run the aphlict server with websockets proxied through the same Apache instance that serves the main Phabricator application.

See Phabricator/Slowness for more info.

Pybal alerts for git-ssh after rebooting Phabricator servers

If a Phabricator server had to be rebooted for any reason you might get Icinga pybal alerts. Pybal will alert that backends for the git-ssh.wikimedia.org service are down but pooled. Example:

<+icinga-wm> PROBLEM - PyBal backends health check on lvs1018 is CRITICAL: PYBAL CRITICAL - CRITICAL - git-ssh6_22: Servers phab1001-vcs.eqiad.wmnet are marked down but pooled

The reason for this is a race condition where the ssh-phab service is started before the additional IPv6 IP is added to the interface. The service WILL be running so it might not be obvious why it's considered down. This is because it will be listening only on IPv4 while pybal / lvs servers are going to try and use IPv6.

The server has multiple IPs, 2 on loopback for LVS and 4 on en01.

The fix in this case is to manually restart the service with

systemctl restart ssh-phab

You'll have to wait a few minutes and then the pybal Icinga alerts will recover.
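
To confirm the diagnosis before restarting, one option is to check whether anything is listening on the service's IPv6 address. The sketch below assumes that ss (from iproute2) prints IPv6 listeners in the bracketed [address]:port form; the helper name has_listener and the 2001:db8:: address are placeholders, not the real service address.

```shell
#!/bin/sh
# has_listener ADDR_PORT: read `ss -tln` output on stdin and succeed
# (exit 0) if any whitespace-separated field equals ADDR_PORT exactly.
has_listener() {
  awk -v want="$1" '
    { for (i = 1; i <= NF; i++) if ($i == want) found = 1 }
    END { exit !found }
  '
}

# Usage (address is a placeholder; substitute the real IPv6 service address):
# if ! ss -tln | has_listener '[2001:db8::22]:22'; then
#   sudo systemctl restart ssh-phab
# fi
```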

Failure Scenarios / Failover

Simple failure of the phabricator server

A simple failure of the phabricator server, e.g. a disk failure or other hardware failure on phab1001.

Take a look at a previous fail-over ticket at T238956.

Code changes needed for the actual fail-over can be seen at the topic branch phab-buster. Decommissioning of the previous server can be seen at the topic branch phab1003-decom.

Additionally the etherpad Phabricator-migration-20191203 was used.

Steps to fail-over an existing Phabricator server to a new server

If there are 2 existing servers, just follow the steps. If the existing prod server died, assume "old_server" means the warm standby in the other data center. If the standby server died see the section below.

  1. install a new server and add the role::phabricator puppet class on it, run puppet agent
  2. rsync /srv/repos from old_server to new_server, run it with --delete as well and ensure both sides have the same size. (rsyncd / ferm rules for this are already puppetized on all servers)
  3. verify code in /srv/phab is up to date and both servers are on the same git tag (if not use scap to deploy to new server / run 'scap pull' on it)
  4. switch the "phabricator dumps host" to the new server. code change
  5. (optional) put phab on new_server in maintenance mode (phab admin action)
  6. set downtimes for both servers in Icinga
  7. change the "phabricator_server" setting to the new server name. code change
  8. (changing the "active server" setting is not needed anymore, setup has been simplified)
  9. switch the discovery record in DNS to the new server. The TTL is 300 seconds by default for all discovery records. It does not need to be changed but be aware there might be a 5 minute window where clients could get the old server. code change
  10. switch the config for varnish to the new server code change
  11. switch the mail destination on mx to the new server code change
  12. using systemctl, restart the "ssh-phab" service on the new server to make it listen on IPv6
  13. using conftool, depool the "vcs" service on the old server, change conftool data to use the new server code change and pool it
  14. (if reimage script failed in the past and you have ongoing Icinga alerts about pybal and the vcs server): delete stale confd files on puppetmaster to clear Icinga alerts about confd template compilation failing
  15. make the "phd" service run on the new server to avoid breakage of repos code change
  16. verify things work and remove Icinga downtimes
  17. (a few days later) decom the old server following the usual decom steps and as outlined in the phab1003-decom branch linked above

Steps to re-create a warm standby server

If the non-active server died and you want to re-create it under a new host name:

  1. install a new server and add the role::phabricator puppet class on it, run puppet agent
  2. rsync /srv/repos from the prod server to the new_server, run it with --delete as well and ensure both sides have the same size. (rsyncd / ferm rules for this are already puppetized on all servers)
  3. verify code in /srv/phab is up to date and both servers are on the same git tag (if not use scap to deploy to new server / run 'scap pull' on it)
  4. Add the new host name to the list of "phabricator_servers" in Hiera in hieradata/role/common/phabricator.yaml.
  5. using systemctl, restart the "ssh-phab" service on the new server to make it listen on IPv6
  6. using conftool, depool the "vcs" service on the old server, change conftool data to use the new server code change and pool it
  7. You do NOT have to worry about the phd service running, it's only needed on the active server.

Complete data center failover

Complete data center failover, e.g. some major event takes down eqiad and we need to fail over to codfw.

How to make codfw master writable

root@cumin1001:~# mysql --skip-ssl -hm3-master.codfw.wmnet

Master database failure

If the master database fails, we need to fail over to a slave and promote that slave to become the master.

If the master goes down, the proxy would automatically failover to the existing slave (which is read-only) and would need to be set up as read_only=OFF by an admin.

References

  1. The Operations specific Phabricator projects were discussed in T119944 in early 2016.