Determine appropriate buckets for annual edit count
Closed, Resolved; Public

Description

There are three main requirements for the buckets:

  • When these buckets are combined with our project groups, the resulting bins must be large enough to minimize re-identification risk (we don't plan to release raw answers, but this is an additional safeguard).
  • According to @JAnstee_WMF, the number of users per bin should follow a roughly normal distribution.
  • There should be bin boundaries at 30 and 600 edits to preserve comparability with last year's data.

There are two bucket proposals right now. One creates relatively even-sized bins (e_binned_edits), which prioritizes the first criterion. The other creates bins whose sizes are roughly normally distributed (n_binned_edits), which prioritizes the second.

E bins

BIN              EDITORS
[10, 30)            2792
[30, 150)          14299
[150, 600)         14578
[600, 1350)         6953
[1350, 3800)        6873
[3800, 1100000)     6734

N bins

BIN               EDITORS
[10, 30)             2792
[30, 100)            9971
[100, 600)          18906
[600, 6000)         16096
[6000, 12000)        2374
[12000, 1100000)     2090

Further comparison information is in this notebook.
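
As a rough illustration of how the two proposals could be applied, here is a minimal sketch that assigns annual edit counts to half-open bins and checks the required boundaries at 30 and 600. The bin edges are copied from the tables above; the function names and the sample edit counts are hypothetical and are not taken from the notebook.

```python
from bisect import bisect_right
from collections import Counter

# Bin edges copied from the E and N tables; 1_100_000 is the tables' ceiling.
E_EDGES = [10, 30, 150, 600, 1350, 3800, 1_100_000]
N_EDGES = [10, 30, 100, 600, 6000, 12000, 1_100_000]

# Third requirement: boundaries at 30 and 600 edits.
REQUIRED_BOUNDARIES = {30, 600}


def has_required_boundaries(edges):
    """True if every required boundary appears among the bin edges."""
    return REQUIRED_BOUNDARIES <= set(edges)


def bin_label(edit_count, edges):
    """Return the half-open bin '[lo, hi)' containing edit_count, or None
    if the count falls outside the covered range."""
    if edit_count < edges[0] or edit_count >= edges[-1]:
        return None
    i = bisect_right(edges, edit_count) - 1
    return f"[{edges[i]}, {edges[i + 1]})"


def count_by_bin(edit_counts, edges):
    """Count how many editors fall into each bin."""
    return Counter(bin_label(c, edges) for c in edit_counts)


# Hypothetical edit counts, for illustration only.
sample = [12, 30, 149, 150, 599, 600, 5000]
e_counts = count_by_bin(sample, E_EDGES)
n_counts = count_by_bin(sample, N_EDGES)
```

Because the bins are half-open, an editor with exactly 30 edits lands in the bin that starts at 30, and one with exactly 600 edits lands in the bin that starts at 600, which is what keeps the split at those two boundaries comparable across years.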

Event Timeline

nshahquinn-wmf renamed this task from "Analyze distribution of annual edits to determine appropriate buckets" to "Determine appropriate buckets for annual edit count". Mar 13 2018, 11:43 AM
nshahquinn-wmf triaged this task as High priority.
nshahquinn-wmf updated the task description.
nshahquinn-wmf moved this task from Backlog to Neil's in progress on the Contributors-Analysis board.

@JAnstee_WMF, @egalvezwmf Our main problem right now with the edit bins is that the initial set I proposed (the "N bins") has small buckets on the high and low ends, which leads some of the final bins to be too small (you can see the sizes in this notebook).

To fix this, I'm suggesting we use a set of more evenly-sized bins (the "E bins") and combine the Sub-Saharan Africa Wikipedia group (which is very small) with the Middle East and North Africa group.

The E bins would still preserve comparability with last year. The only downsides are:

  • The smallest Wikidata bin remains small (8 users)
  • The bin sizes are less normal-looking. However, I don't really understand that requirement so I can't judge; if you explain which statistical technique we need it for, I could make a more informed recommendation :)

Using the E bins seems like a good solution. We should perhaps exclude the smallest bin in populations with less than 20 in that group.
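
The check discussed here could be sketched roughly as follows: merge the small project groups, then list every (project group, edit bin) cell that still has fewer than 20 editors. This is only an illustration under stated assumptions. The helper names, the merged group name, and all counts except the 8 Wikidata users quoted earlier in the thread are hypothetical.

```python
MIN_CELL_SIZE = 20  # threshold suggested in the comment above


def merge_groups(cell_counts, groups_to_merge, merged_name):
    """Sum the counts of several project groups into one group, bin by bin."""
    merged = {}
    for (group, edit_bin), n in cell_counts.items():
        key = (merged_name if group in groups_to_merge else group, edit_bin)
        merged[key] = merged.get(key, 0) + n
    return merged


def small_cells(cell_counts, min_size=MIN_CELL_SIZE):
    """Return the (group, bin) cells with fewer than min_size editors."""
    return {cell: n for cell, n in cell_counts.items() if n < min_size}


counts = {
    ("Wikidata", "[10, 30)"): 8,                       # quoted in the thread
    ("Sub-Saharan Africa", "[10, 30)"): 5,             # hypothetical
    ("Middle East and North Africa", "[10, 30)"): 40,  # hypothetical
}

merged = merge_groups(
    counts,
    {"Sub-Saharan Africa", "Middle East and North Africa"},
    "Middle East and Africa",  # hypothetical name for the combined group
)
flagged = small_cells(merged)
print(flagged)  # {('Wikidata', '[10, 30)'): 8}
```

With these illustrative numbers, merging lifts the combined group above the threshold, so only the Wikidata cell remains flagged for exclusion.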

That sounds good to me! I'm assuming that also includes combining the Sub-Saharan Africa group, since even with the E bins all of its groups have less than 20 members.

With that done, the only group with less than 20 members will be Wikidata's [10, 30) edit bucket. Since it's a whole 0.0002% of the population, I think we can safely pretend it doesn't exist :)

If there's no objection to that, I'm unblocked.

Also, thanks for commenting on Phabricator!

@Neil_P._Quinn_WMF Great! Thanks for confirming you are unblocked :)

Okay, I think this is all taken care of!

I slightly tweaked the E bins to make the boundaries a little rounder, but they still effectively reduce the small bin problem.

@egalvezwmf, I set 1 650 as the sample target for the combined group of Middle Eastern and African language Wikipedias (the same as the previous Sub-Saharan Africa target). I'm happy to increase it, but it won't make much difference since there are only 1 703 editors in the group :)

Updated counts and graphs available in my notebook. GitHub has decided this would be a good time to stop displaying it, but you can see this version on jupyter.org.

nshahquinn-wmf raised the priority of this task from High to Needs Triage. Mar 30 2018, 10:17 AM
nshahquinn-wmf moved this task from Next up to Done on the Contributors-Analysis board.