Create a parser function to get the direction of a language or script
Closed, Resolved / Public / Feature

Description

Feature summary (what you would like to be able to do and where):

There should be a way for pages (typically templates) to easily get the direction for a language code or script code.

For example, there could be a parser function such as {{#dir:...}}:

  • {{#dir:en}} would produce "ltr".
  • {{#dir:ar}} would produce "rtl".
  • {{#dir:Arab}} would produce "rtl".
  • {{#dir:und-arab}} would produce "rtl".

If the input is a language code without a script code, it would return the direction MediaWiki has for that language.

If the input is a language code with a script code, or just a script code, it would return the direction for that script code.

MediaWiki does not currently have data about scripts, but it could get it from CLDR, which provides data about scripts generated from Unicode data (main repository, JSON repository).

They currently list 35 scripts as rtl: Adlm Arab Armi Avst Chrs Cprt Elym Hatr Hebr Hung Khar Lydi Mand Mani Mend Merc Mero Narb Nbat Nkoo Orkh Ougr Palm Phli Phlp Phnx Prti Rohg Samr Sarb Sogd Sogo Syrc Thaa Yezi

There are also a few variants of those scripts which don't get included in CLDR's data: Aran, Syre, Syrj, Syrn
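The lookup described above could be sketched roughly as follows. This is only an illustration in Python, not the MediaWiki implementation: the script set is copied from the lists in this task, while `RTL_LANGUAGES`, `find_script` and `direction` are hypothetical names, and the language set is a tiny stand-in for MediaWiki's own per-language data. Script detection is simplified to "the first or second subtag is four ASCII letters".

```python
# The 35 rtl scripts listed by CLDR, plus the four variants it does not include.
RTL_SCRIPTS = frozenset(
    "Adlm Arab Armi Avst Chrs Cprt Elym Hatr Hebr Hung Khar Lydi Mand Mani "
    "Mend Merc Mero Narb Nbat Nkoo Orkh Ougr Palm Phli Phlp Phnx Prti Rohg "
    "Samr Sarb Sogd Sogo Syrc Thaa Yezi Aran Syre Syrj Syrn".split()
)

# Hypothetical stand-in for the direction data MediaWiki has per language.
RTL_LANGUAGES = frozenset({"ar", "fa", "he"})


def find_script(code):
    """Return the script subtag in title case, or None if there is none.

    Simplification (assumption): only the first two subtags are inspected,
    and any four-letter ASCII subtag is treated as a script code.
    """
    subtags = code.replace("_", "-").split("-")
    for subtag in subtags[:2]:
        if len(subtag) == 4 and subtag.isascii() and subtag.isalpha():
            return subtag.title()
    return None


def direction(code):
    """Return "rtl" or "ltr" for a language code, script code, or both."""
    script = find_script(code)
    if script is not None:
        # A script code, alone or after a language code, decides the direction.
        return "rtl" if script in RTL_SCRIPTS else "ltr"
    # Otherwise fall back to the per-language data.
    language = code.replace("_", "-").split("-")[0].lower()
    return "rtl" if language in RTL_LANGUAGES else "ltr"


for example in ("en", "ar", "Arab", "und-arab"):
    print(example, direction(example))
```

With this sketch, the four examples from the description give "ltr", "rtl", "rtl" and "rtl" respectively.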

Use case(s) (list the steps that you performed to discover that problem, and describe the actual underlying problem which you want to solve. Do not describe only a solution):

Wiki pages often want to include words in other languages. Multilingual wikis often have translatable elements on pages and need to make sure the direction is set correctly.

There are many wikis with templates like Template:Dir (https://www.wikidata.org/wiki/Q14412446), most of which hardcode an outdated list of language codes.

Lots of wikis have modules with lists of rtl scripts (global search) and functions to return the direction (global search).

mw.language:getDir() in Lua is often not suitable because it's not easily accessible from a template without first writing a module, it only supports the languages included in MediaWiki, and it's easy to run into the limit on how many times you can use it on a single page.

Benefits (why should this be implemented?):

It would reduce the amount of maintenance needed and improve consistency across wikis: wikis would not need their own lists of rtl scripts, and if a new rtl script is added to Unicode, it would only need to be added in one place for the data to become available to all wikis.

Supporting script codes and languages with script codes would improve support for languages not yet included in MediaWiki.

It might also be more efficient to fetch the direction from the script (when provided) by looking up scripts in a relatively short list of script codes.

Note {{Dir}} is currently the most used template in Commons. See also: T343131: Commons database is growing way too fast

Event Timeline

mw.language:getDir() in Lua is often not suitable because […] it's easy to run into the limit on how many times you can use it on a single page.

Every language counts only once, so as long as the same language is queried again and again, the Lua solution won’t run into the limit either. On the other hand, the limit is there for a reason, so a parser function solution will probably also have a limit. This is not to say that there should be no parser function, but this particular argument isn’t very strong.

(By the way, since dd74abb853ba56aef99b7c9d09dd02bdcb88129b the limit on Wikimedia is 200, so it’s not really likely that one accidentally hits the limit, unless one wants to load all languages on a page.)
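To illustrate why repeated queries for the same language do not eat into the limit, here is a toy model in Python. It is not Scribunto's code: `LanguageCache`, `LanguageLimitExceeded` and `toy_lookup` are hypothetical names, and the only facts taken from this thread are that each language counts once and that the Wikimedia limit is 200.

```python
class LanguageLimitExceeded(Exception):
    """Raised when too many distinct languages are requested on one page."""


class LanguageCache:
    """Toy model: only distinct language codes count against the limit."""

    def __init__(self, limit=200):
        self.limit = limit
        self._loaded = {}  # language code -> direction

    def get_dir(self, code, lookup):
        if code not in self._loaded:
            # Only a language that has not been seen before uses up a slot.
            if len(self._loaded) >= self.limit:
                raise LanguageLimitExceeded(
                    f"more than {self.limit} distinct languages requested"
                )
            self._loaded[code] = lookup(code)
        return self._loaded[code]


def toy_lookup(code):
    # Hypothetical direction data, just enough for the demonstration.
    return "rtl" if code in {"ar", "fa", "he"} else "ltr"


cache = LanguageCache(limit=2)
for _ in range(1000):
    cache.get_dir("ar", toy_lookup)  # same language: counted once
cache.get_dir("en", toy_lookup)      # second distinct language: still fine
```

In this model a third distinct language would raise `LanguageLimitExceeded`, no matter how many times the first two were queried.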

mw.language:getDir() in Lua is often not suitable because […] it only supports the languages included in MediaWiki […].

What is the use case for getting directionality of languages not included in MediaWiki? Multilingual wikis’ contents are usually available only in languages included in MediaWiki.

mw.language:getDir() in Lua is often not suitable because it's not easily accessible from a template without first writing a module […].

This is true; using a module would only worsen T343131.

Multilingual wikis’ contents are usually available only in languages included in MediaWiki.

See also: T202794: Many more languages need to be added to Multilingual Wikisource (mul.ws)

Indeed, T202794 uses a different definition of “languages included in MediaWiki” than what I was thinking of:

  • Commons usually uses languages that can be selected in the preferences (have MediaWiki translations), since it displays the appropriate translation based on the language selected in the preferences.
  • Multilingual Wikisource wants to also use languages that are long extinct and thus don’t make much sense in the preferences (don’t have MediaWiki translations). However, they still need to be included in MediaWiki in one way or another: for example, to be able to display text in that language with the right directionality.

I looked up the source code, and Scribunto is actually extremely permissive as to what languages it accepts: for example, mw.language.new('fklflmwlmfkmf'):isRTL() happily returns false without throwing any error. So if a language is included in MediaWiki by any definition (including the definition used by mulwikisource), Scribunto will handle it and return its directionality. If it’s not included at all, a magic word won’t work either.

The implementation can't replace {{dir}} as that needs to be invoked like {{dir|fa}} instead of {{dir:fa}} (it's not possible to have {{dir|fa}} as a parser function) ...

See: T204371: Replace initial colon in (hash-prefixed) parser function invocation with vertical bar

Change #1032542 had a related patch set uploaded (by Ebrahim; author: Ebrahim):

[mediawiki/core@master] Add dir parser function

https://gerrit.wikimedia.org/r/1032542

The implementation can't replace {{dir}} as that needs to be invoked like {{dir|fa}} instead of {{dir:fa}} (it's not possible to have {{dir|fa}} as a parser function) ...

See: T204371: Replace initial colon in (hash-prefixed) parser function invocation with vertical bar

I didn't know about this, thanks, but even that won't help with having {{dir|fa}} as a parser function, IIUC. So the decision here is to use either {{dir:fa}} or {{#dir:en}} (which, per T204371, could be written as {{#dir|fa}} in the future), and I think {{#dir:fa}} matches better with the currently available {{#language:fa}} (among other decisions, including whether we want this at all).


{{dir:fa}} or {{#dir:en}} would both be fine. I would go with the {{#dir:en}} format so it looks like other parser functions. We can then do a bunch of replacements on Commons.

Agreed that replacement shouldn't be that much work. We can check how https://en.wikipedia.org/wiki/Template:! was migrated.

{{dir:fa}} or {{#dir:en}} would both be fine. I would go with the {{#dir:en}} format so it looks like other parser functions. We can then do a bunch of replacements on Commons.

Thanks, just applied in https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1032542

Also renamed it to {{#direction}} and made {{#dir}} an alias; maybe we can even remove {{#dir}}, or keep it for the uses on Commons? Following @cscott's review, the second and third parameters were removed, so we should take care of those with an {{#ifeq}} in some template, using subst:.

If the input is a language code with a script code, or just a script code, it would return the direction for that script code.

Also applied this, but perhaps we should see what @cscott thinks about it. Perhaps we can go with one version of it, maybe the simplest one, and decide about the details later. Or, going the other way, we can start with this implementation, add a tracking category for corner cases on Commons, clean up the input using it, and then simplify MediaWiki's implementation.

Turned the implementation into what @cscott asked for in the review system:

This doesn't seem correct here. It should be part of Language::isRTL() or fixed in the language definition. The parser function should return exactly the same as Language::isRTL.

I guess if we can have a bare-minimum implementation first, we can tweak getLanguage later for more feature parity with {{dir}}.

What is currently missing:

The list of rtl languages derived from likelySubtags is as follows (note that it contains deprecated language codes such as ji):
aao, abh, abv, acm, acq, acw, acx, adf, ae, aeb, aec, aee, aeq, afb, aib, aij, aiq, amw, apc, apd, ar, arc, arq, ars, ary, arz, ask, atn, auj, auz, avd, avl, ayh, ayl, ayn, ayp, azb, bal, bdz, bej, bft, bgn, bgp, bhe, bhm, bhn, bjf, bjm, bqi, brh, brk, bsh, bsk, chg, cja, ckb, clh, czk, dcc, def, deh, dmk, dml, dv, ecy, esh, fa, fay, faz, fia, fub, gbz, ggg, gha, ghr, gig, gjk, gju, glh, glk, grc, gwc, gwf, gwt, gzi, hac, haz, hbo, he, hkh, hnd, hno, hoh, hrt, hrz, hss, huy, isk, itk, iw, jad, jat, jbe, jbn, jdg, ji, jnd, jog, jpa, jpr, jrb, jye, kbu, kby, kfm, khw, klj, kmz, kqd, ks, ktl, kvx, kxp, lad, lah, lhs, lki, lrc, lrk, lrl, lsa, lsd, lss, luv, luz, mby, mde, mfa, mfi, mhj, mid, mki, mnj, mvy, myz, mzn, nli, nlm, nqo, ntz, nyq, oar, obm, odk, oru, ota, otk, pal, pbt, pgd, phl, phn, phr, phv, plk, pra, prc, prd, prx, ps, psh, psi, pst, qxq, rdb, rhg, rmt, sam, sbn, scl, sd, sdb, sdf, sdg, sdh, sds, sgr, sgy, shd, shm, shu, shv, siy, siz, skr, smp, smy, sog, sqo, sqt, srh, srz, ssh, sts, swb, syc, syn, syr, tjo, tks, tmr, tov, tra, trg, trm, trw, ug, ur, ush, uzs, vaf, vgr, vmh, wbk, wlo, wne, wni, wsv, xco, xhe, xka, xkc, xkj, xkp, xld, xly, xmn, xmr, xna, xpr, xsa, xvi, ydg, yhd, yi, yih, yud, zba, zdj, zrp, zum

The following languages are defined in Commons as rtl but are not included in likelySubtags at all:
aic, ajp, ara, arb, bbz, bcc, bqp, gda, kcn, kfr, mve, mzb, pbu, pga, pnb, prs, sqr, swh, tly, wbl, xpu, ydd
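A comparison like the one above is essentially a set difference. The following Python sketch shows the idea on a small subset of the codes from the two lists; the variable names are hypothetical, and it is an assumption made for illustration that Commons also lists ar, fa and he as rtl.

```python
# Small subset of the rtl languages derived from likelySubtags (see above).
likely_subtags_rtl = {"ar", "fa", "he", "ur", "yi"}

# Illustrative subset of what Commons defines as rtl. The first three entries
# are assumed to be present; pnb and prs come from the "missing" list above.
commons_rtl = {"ar", "fa", "he", "pnb", "prs"}

# Codes that Commons treats as rtl but that likelySubtags does not cover.
missing = sorted(commons_rtl - likely_subtags_rtl)
print(missing)  # ['pnb', 'prs']
```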

On Commons, the main use of the Dir template is to return the direction of text in the language used by the user. In pseudocode, this would be {{dir | {{{lang | {{int:lang}} }}} }}. That way, any template displaying content uses HTML tags indicating the text direction of the user's language. Most of the time templates do not use the {{{lang}}} parameter, but for testing purposes we can pass it to the template to see it rendered in the other text direction. That means that in the great majority of cases on Commons the input to {{dir}} is the output of {{int:lang}}, and the languages returned by {{int:lang}} are the ones we care about. The current template returns ltr for {{dir|Arab}} or for any other random string which is not recognized as a language.
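The fallback chain described here can be modelled as below. This is a Python sketch of the behaviour, not the template's actual code: `template_direction` and `RTL_LANGUAGES` are hypothetical names, the language set is a small illustrative subset, and `user_language` stands in for the output of {{int:lang}}.

```python
# Illustrative subset of rtl language codes (an assumption for this sketch).
RTL_LANGUAGES = frozenset({"ar", "fa", "he", "ur"})


def template_direction(lang_param, user_language):
    """Model of {{dir | {{{lang | {{int:lang}} }}} }}.

    lang_param is the optional lang parameter (None or empty when not
    passed); user_language stands in for the output of {{int:lang}}.
    """
    # Use the explicit parameter if given, else the user's interface language.
    code = (lang_param or "").strip() or user_language
    # Anything not recognized as an rtl language falls through to ltr,
    # which is why a script code such as "Arab" yields "ltr" here.
    return "rtl" if code in RTL_LANGUAGES else "ltr"


print(template_direction(None, "fa"))    # rtl
print(template_direction("en", "fa"))    # ltr
print(template_direction("Arab", "en"))  # ltr
```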

Change #1032542 merged by jenkins-bot:

[mediawiki/core@master] Add {{#dir}} parser function

https://gerrit.wikimedia.org/r/1032542

Working with @Ebrahim, we got most of the uses of c:Template:Dir replaced with the #dir parser function, at least in the template namespace. The database still shows 123,999,814 transclusions, so it will be interesting to see how long it takes for this number to drop.

Does commons have a list of templates to be automatically substituted by a bot?

Does commons have a list of templates to be automatically substituted by a bot?

I think we substituted all calls to {{dir}} templates for all pages in template namespace, so there should not be anything else to do other than wait for the database to catch up, which might be a while.