s4 is now the largest section by a wide margin:
It has roughly doubled since 2019, while many other sections got smaller thanks to the optimizations in place. Many of the optimizations we have done reduced storage (e.g. the image metadata storage change in late 2021), but at this rate of growth no optimization can prevent major issues within a year or two.
It's just three tables that are growing really fast: categorylinks, templatelinks, externallinks (in total, they are responsible for 800GB). The rest don't seem to be too problematic:
From a quick look, I think a few easy fixes should drastically reduce the database growth:
For externallinks:
- Use interwiki links/pagelinks instead of raw https links.
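As a rough illustration of the externallinks suggestion, here is a minimal sketch that turns a raw Wikipedia URL into an interwiki link. The function name `to_interwiki` is hypothetical, and the `w:<lang>:` prefix form is an assumption about how the wiki's interwiki table is set up. It only handles plain `/wiki/` article URLs and is not an existing tool:

```python
import re
from urllib.parse import unquote

# Assumption: only plain article URLs on language editions of Wikipedia are
# handled, e.g. https://en.wikipedia.org/wiki/Some_title
_WIKIPEDIA_URL = re.compile(r"^https?://([a-z\-]+)\.wikipedia\.org/wiki/([^?#]+)$")


def to_interwiki(url):
    """Return an interwiki link for a raw Wikipedia URL, or None if unsupported."""
    match = _WIKIPEDIA_URL.match(url)
    if match is None:
        # Not a URL this sketch knows how to convert; leave it as an external link.
        return None
    lang, title = match.groups()
    # Decode percent-escapes and show underscores as spaces, as in page titles.
    title = unquote(title).replace("_", " ")
    # Assumption: the "w:<lang>:" prefix resolves to that Wikipedia edition.
    return f"[[w:{lang}:{title}]]"
```

A link written this way would no longer need a row in externallinks for that URL. URLs with query strings or on other domains are left alone (the function returns `None`).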
For templatelinks (Most used templates):
- Merge some templates that are both heavily used and only used by the same users
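- Use the redirect target directly in heavily used templates. For example, Template:Location_dec is used 5 million times; a page that transcludes a redirect gets a templatelinks row for both the redirect and its target, so using the target directly removes 5m rows from templatelinks (I can provide the list of heavily used redirect templates)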
- Migrate some functionality into the software itself (parser functions, etc.) to avoid having a dedicated template used on basically every page (e.g. https://commons.wikimedia.org/wiki/Template:Dir)
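The redirect-target fix above could be applied to wikitext along these lines. This is a sketch only: `use_redirect_target` is a hypothetical helper, and the target name `Location` in the usage example is a placeholder, not a confirmed redirect target. It assumes template names are matched with a case-insensitive first letter and with spaces and underscores treated as equivalent:

```python
import re


def use_redirect_target(wikitext, redirect_name, target_name):
    """Replace transclusions of a redirect template with its target template."""
    # Spaces and underscores are interchangeable in titles, so split on both
    # and rejoin with a pattern that accepts either.
    parts = re.split(r"[ _]+", redirect_name.strip())
    first_char = parts[0][0]
    # Assumption: the first letter of a template name is case-insensitive.
    first_class = "[" + re.escape(first_char.upper()) + re.escape(first_char.lower()) + "]"
    escaped = [first_class + re.escape(parts[0][1:])] + [re.escape(p) for p in parts[1:]]
    name_pattern = "[ _]+".join(escaped)
    # Match "{{Name" only when followed by "|" or "}}", so longer names that
    # merely start with the redirect name are left untouched.
    pattern = re.compile(r"\{\{\s*" + name_pattern + r"\s*(?=\||\}\})")
    # A function replacement avoids backslash handling in target_name.
    return pattern.sub(lambda _m: "{{" + target_name, wikitext)
```

For example, `use_redirect_target("{{Location dec|52.5|13.4}}", "Location dec", "Location")` rewrites the call to use the placeholder target, while a template such as `{{Location decimal|1}}` is not touched. Parameters are passed through unchanged, so this is only safe when the redirect and its target accept the same arguments.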
For categorylinks (Most used categories):
- There are a lot, really a lot, of hidden categories being added basically everywhere, and I don't think many of them are really needed:
  - https://commons.wikimedia.org/wiki/Category:Flickr_images_reviewed_by_FlickreviewR_2 (7m rows)
  - https://commons.wikimedia.org/wiki/Category:Uses_of_Wikidata_Infobox (4.5m rows)
  - https://commons.wikimedia.org/wiki/Category:Files_from_NASA_with_known_IDs
  - https://commons.wikimedia.org/wiki/Category:Information_field_template_with_formatting
  - Massive set of categories like https://commons.wikimedia.org/wiki/Category:M%C3%A9rim%C3%A9e_ID_same_as_Wikidata that might not be really needed?
Templates and categories added by https://commons.wikimedia.org/wiki/Module:SDC_tracking might be contributing to the issue.