Dumps
Documentation about Dumps is stored in a few different wikis:
- These Wikitech docs are for maintainers of the various dumps.
- Information about the clouddumps servers serving mirrors to various clients can be found on Portal:Data Services/Admin/Dumps.
- Information for users of the dumps can be found at Meta-wiki's m:Data dumps page.
- Information for developers can be found at MediaWiki-wiki's mw:SQL/XML Dumps page.
Daily checks
Dumps maintainers should watch or check a few things every day:
- emails to the ops-dumps internal mailing list
- the xmldatadumps-l public mailing list
- the Phabricator Dumps-Generation workboard
- https://dumps.wikimedia.org/ (mentions the current run, unless idle)
- icinga for dumps hosts: snapshots, dumpsdata
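One of these checks, the status of the current run on dumps.wikimedia.org, lends itself to scripting. The sketch below assumes the `dumpstatus.json` file that the SQL/XML dumps publish in each run directory, which maps job names to per-job status entries; the helper name and the sample blob are illustrative, not real dump output.

```python
import json

def unfinished_jobs(status_text: str) -> list[str]:
    """Return the names of jobs in a dumpstatus.json blob not marked done."""
    status = json.loads(status_text)
    return [name for name, job in status.get("jobs", {}).items()
            if job.get("status") != "done"]

# Illustrative blob mimicking the dumpstatus.json layout (not real output).
sample = json.dumps({
    "jobs": {
        "xmlstubsdump": {"status": "done"},
        "metahistorybz2dump": {"status": "in-progress"},
    }
})
print(unfinished_jobs(sample))  # ['metahistorybz2dump']
```

In practice you would fetch the blob from a run directory on dumps.wikimedia.org and alert on any non-empty result.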
Dump types
We produce several types of dumps. For information about deployment of updates, architecture of the dumps, and troubleshooting each dump type, check the appropriate entry below.
- xml/sql dumps which contain revision metadata and content for public Wikimedia projects, along with contents of select sql tables
- adds/changes dumps which contain a daily xml dump of new pages or pages with new revisions since the previous run, for public Wikimedia projects
- Wikidata entity dumps which contain dumps of 'entities' (Qxxx) in various formats, and a dump of lexemes, run once a week.
- category dumps which contain weekly full and daily incremental category lists for public Wikimedia projects, in rdf format
- other miscellaneous dumps including content translation dumps, cirrus search dumps, and global block information.
Other datasets are also provided for download, such as page view counts; these datasets are managed by other folks and are not documented here.
Service
Hardware
- Dumps snapshot hosts that run scripts to generate the dumps
- Dumps datastores where the snapshot hosts write intermediate and final dump output files, which are later published to our web servers
- Dumps servers that provide the dumps to the public, to our mirrors, and via NFS to Wikimedia Cloud Services and stats host users
Adding new dumps
If you are interested in adding a new dumpset, please check the guidelines (still in draft form).
If you are working with wikibase dumps of some sort, you might want to look at a code walkthrough; see Dumps/Wikibase dumps overview.
Not an SLO but...
Dumps have never had an SLO, but the current dumps maintainers hold themselves to a set of unofficial standards for responsiveness and reliability.
We try to reply to newly filed tasks, emails from folks interested in hosting mirrors, and requests for information within two business days. This may be extended if the dumps maintainers are ill or otherwise out of the office.
When the SQL/XML dumps for one or more wikis are broken, we do our best to respond to the breakage within 24 hours; this usually includes filing a task in Phabricator and some investigation of the problem. If changes to MediaWiki code are required, we coordinate that work even when we do not write the patch ourselves, including arranging a timely backport and deployment of the patch.
We do our best to ensure that all jobs for the SQL/XML dumps for every wiki are complete before the start of the next run. So for the run starting on the 1st of the month, all jobs on all wikis must be complete before the 20th of the month, and for the run starting on the 20th, all jobs for all wikis must be complete before the end of the month. This sometimes requires work on days off, or beyond the regular workday, in which case future workdays might be shortened to compensate.
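The cadence above is simple date arithmetic, and can be sketched as follows; the function name is illustrative, not part of any dumps tooling.

```python
import calendar
from datetime import date

def run_deadline(run_start: date) -> date:
    """Date by which all jobs of a dump run should be complete.

    A run starting on the 1st must finish before the 20th of the same
    month; a run starting on the 20th must finish before month's end.
    """
    if run_start.day == 1:
        return run_start.replace(day=20)
    if run_start.day == 20:
        last_day = calendar.monthrange(run_start.year, run_start.month)[1]
        return run_start.replace(day=last_day)
    raise ValueError("dump runs start on the 1st or the 20th")

print(run_deadline(date(2023, 2, 1)))   # 2023-02-20
print(run_deadline(date(2023, 2, 20)))  # 2023-02-28
```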
Testing changes to the dumps or new scripts
See Dumps/Testing for more about this.
Dealing with problems
See Dumps/Troubleshooting for more about this.
Mirrors
If you are adding a mirror, see Dumps Mirror setup.