
RFC: How should we store longer revision comments?
Closed, Resolved · Public · 5 Estimated Story Points

Description

We need to be able to store longer revision comments in the database in order to accommodate non-Latin languages. (See T6715 for background.) Two approaches have been proposed that seem more or less viable. We need to decide which approach is best.

Approach #1 is to just increase the existing rev_comment field from tinyblob (255 bytes) to varbinary(767), but keep a 255-character limit on the client side. The 255-character limit means that this should have little to no effect on Latin-language wikis, but it could increase table bloat in the long term for non-Latin wikis. This solution is relatively easy to implement.

Approach #2 is to gradually move revision comments to a separate look-up table. This approach is described at T6715#1735048 and T6715#1735257. It would help reduce bloat in the revision table and put us in a better position for the long term. However, it will require dedicated work over many months to implement and will not be as backwards compatible as approach #1 (for example, it may break some Tool Labs tools).

It is also possible to pursue both approaches: #1 in the short term and #2 in the long term.

What do people think is the best way forward?

See also T161671: Compacting the revision table


Event Timeline


Just confirming that we're looking at this, already planning to break out the comment field as part of the revision table updates. Need to look over the structured data implications for anything fancier than just supporting a larger length.

I don't think community tech needs to be involved in this task, Brion is handling it. We already discussed it in the previous RFC meeting: https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-02-15-22.01.html , so I don't think we need another RFC meeting about it.

Since it sounds like moving the comments to another table has pretty broad consensus, my only outstanding questions are:

  • What is the rough timeframe for making this change? Weeks? Months? Years?
  • If it's a long timeframe, would it be feasible to also increase the size of rev_comment in the meantime as a short-term solution for non-Latin wikis?

@daniel: Any chance we could get feedback from ArchCom regarding the two questions above (T153333#3085722)?

Had an in-person discussion with daniel. Apparently the timeline on the long-term solution depends on what the new MediaWiki platform team decides. As to whether an interim solution makes sense, daniel suggested that we get the opinion of @jcrespo on how difficult it would actually be to implement https://gerrit.wikimedia.org/r/#/c/191811/ in production. Would this require a data-center switch-over to implement or would it be relatively easy?

Given the size and the traffic of the revision table, and the type of change, we would need read-only mode/master failover to implement this. The schema change would take 3-5 days per server, plus the time to pool, depool, and stabilize the query cache (we have around 60 per datacenter), and a huge operational overhead- it would require a single person just dedicated to do this for a few months, blocking all other schema changes and maintenance.

This could be considered seriously during the datacenter failover (while eqiad is passive), but as it would take 3-5 days on the master, a similar time on the slave, plus later catchup with replication, I cannot guarantee that the 2 weeks of the proposed failover happening soon would be enough time to do it (plus there are other alters that were scheduled already to happen then).

Why isn't a separate table considered first?- it would require no schema changes, and it would not be blocked on me or on MCR implementation, and it would make the revision table lighter, not heavier (steps towards a better design), and the overall footprint of the data smaller- which means more data in memory, so faster access. Note that making the table larger makes the problem worse- later alters require more time, not less (while a smaller revision would make following alters easier). I understand that code has to be written, and that it is not obvious, due to complex filtering, and that someone has to do that, but at our size and traffic, it takes less time to code than to physically run an alter over 400GB tables on hundreds of 5-year-old HD servers with 10K queries per second.

@jcrespo thank you for your assessment! This is even worse than I feared. But there is a positive take-away:

It would of course be possible to start storing comments in a new separate table without touching the revision table! This should of course be planned as part of T161671: Compacting the revision table, but it can potentially be implemented standalone. It would essentially be a table mapping revision ID to a blob containing the comment. This has the disadvantage that comments are not fully normalized / de-duplicated.

@jcrespo do you have some idea of how much de-duplication of comments would actually help? Do you know what percentage of comments is used 2 times, 3 times, etc? My expectation is that the vast majority of comments is used only once. Well, perhaps not for projects that heavily rely on bots, like wikidata and commons.

but it can potentially be implemented standalone

Yes, that would be my suggestion. It would even make MCR and other refactoring later much faster, but no need to be blocked- only coordinated so things are not 100% incompatible.

@jcrespo do you have some idea of how much de-duplication of comments would actually help?

I don't right now but a) I believe a non-significant part of the total size is not due to duplication, but to just the comments column itself being in a single monolithic entity- that would be my #1 concern (deduplication would be a nice thing to be done at the same time, if reasonable). b) I can calculate how many distinct comments there are compared to total comments in a long-running query on a passive slave. Let me create a ticket for that and give you an estimation in a few hours' or a day's time.

Why isn't a separate table considered first?

A separate table is definitely the preferred solution. We were only interested in pursuing the ALTER solution as a short-term fix if it was relatively easy, which isn't the case according to your comments. Thanks for providing a detailed assessment.

Yeah, revision, page and image (on commons) "are complicated". Sorry about that :-/ It will only get better.

Side note, just to spread this little brain-bug of mine: I'd love to be able to associate additional data with an edit. E.g. paragraphs added/removed, or a machine readable / translatable version of the edit summary given by bots. A JSON blob would be good. That could be an additional field in the same table, or we could just always encode the comment in a JSON blob, with the option to add more fields. Or we could use an n:m relation... but then we are getting close to making the edit summary a content slot after all.

I think the findings on:

More than justify a separate table.

@jcrespo you mean, they justify full normalization? We were planning a separate table anyway.

The two options are:

  1. a revision_comment table, with rev_id and rev_comment fields. No deduplication. We can do that right away.
  2. a comment_text table, with comment_id (or comment_hash) and comment_text fields, plus a new rev_comment_id field in the revision table. This allows deduplication, but we'd need to change the revision table. That would probably have to wait until we have figured out all the updates we want to do to that table.

So, we can do this now, independent of the revision table update, with no deduplication; or later, when we overhaul the revision table, with deduplication...
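Roughly, the two layouts being discussed would look something like this (a sketch only; the table and column definitions are placeholders, not an agreed schema):

-- Option 1: keyed by revision ID, no deduplication, revision table untouched.
CREATE TABLE /*_*/revision_comment (
  rev_id bigint unsigned NOT NULL PRIMARY KEY,
  rev_comment blob NOT NULL
) /*$wgDBTableOptions*/;

-- Option 2: deduplicated comment storage, referenced from a new column
-- in revision (which requires altering the revision table).
CREATE TABLE /*_*/comment_text (
  comment_id bigint unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
  comment_text blob NOT NULL
) /*$wgDBTableOptions*/;
-- ALTER TABLE /*_*/revision ADD rev_comment_id bigint unsigned NOT NULL DEFAULT 0;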

@jcrespo which do you prefer?

Both are a win. If we cannot do 2 now, we can do 1 now, and maybe, depending on how it works, 2 later. Anything that will split the table will be a win, no matter the details.

How about a hybrid solution:

  1. make a comment table with three columns for now: rev_id, comment_id (hash), and comment_blob. MW should write comments for new revisions there.
  2. when refactoring the revision table, make it reference the comment_id.
  3. remove duplicate rows with the same comment_id, make comment_id the primary key, and drop rev_id.
  4. import old comments into the comment table, and drop rev_comment from the revision table.

Hashes are bad primary keys or candidate keys: T158724#3063877 Check the structure I used instead: T162138#3159663

@jcrespo Hashes have the advantage that they can be computed, and do not require a lookup. I still believe that this is a major advantage in many scenarios.

Can you explain why a hash makes a bad primary key? I'd even be ready to go for a 64-bit hash stored as a BIGINT. IIRC, the chance of collision is roughly 1 in a billion. That seems acceptable if the consequences of a collision are not catastrophic.
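Just to illustrate what "computed" means here: a 64-bit key can be derived from the comment text directly in SQL, so no lookup is needed to find it (a hypothetical sketch, not a concrete schema proposal):

-- Take the first 16 hex digits (64 bits) of a SHA-256 of the comment
-- and convert them into an unsigned 64-bit integer.
SELECT CAST(CONV(LEFT(SHA2('fix typo in infobox', 256), 16), 16, 10) AS UNSIGNED)
  AS comment_key_64;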

And you can have your hashes if you need them, but not as primary keys.

And you can have your hashes if you need them, but not as primary keys.

Heh, you are too quick ;)

Care to explain why? And, if not primary, can I have a unique key?

I commented already, explaining why. Again: T153333#3167643

@jcrespo Hm. I see your point, but based on our access patterns, I still think the advantages of hashes outweigh the problems - and some are actually features, not bugs. For instance, the "rename" example you give would be catastrophic if we want deduplication. I'd *really* not want people to do that. And I don't see why calculating the foreign key is a problem.

I think we will have to examine the pros and cons for each individual use case to decide for or against hashes. I'm afraid relying on auto-increment IDs will mean a LOT of extra lookups or joins.

@daniel Please do not fall into the loop of blocking progress on the request @kaldari is leading by blocking future hypothetical developments for which there is yet no code. You said we can have (rev_id, larger_comment) now, so let's get that ASAP. I do not agree with your statements, but that is irrelevant and off-topic to having something done *now*.

@daniel Please do not fall into the loop of blocking progress on the request @kaldari is leading by blocking future hypothetical developments for which there is yet no code.

@jcrespo How am I blocking it? Why can't we have (id, hash, comment) now? Even if we have no need for the hash now, we could introduce it now, and start using it later. Btw, code will need to be written for the (id, comment) solution, too - and it will break stuff on labs.

Of course, if we end up not using the hash, then this field is useless, and would have to be dropped later. But that's still better than the opposite - introducing the hash later will also need a schema change, plus the work for populating the column.

@brion, any thoughts on this? Should we decouple this from T161671?

Why can't we have (id, hash, comment) now?
Of course, if we end up not using the hash, then this is field is useless, and would have to be dropped later.

A hash on a comments table (including an index) would triple its size:

$ ls -la
-rw-rw----   1 mysql mysql  67108864 Apr 11 09:14 comments.ibd

mysql> ALTER TABLE comments DROP COLUMN sha256;
Query OK, 0 rows affected (12.07 sec)               
Records: 0  Duplicates: 0  Warnings: 0

$ ls -la
-rw-rw----   1 mysql mysql  24117248 Apr 11 09:27 comments.ibd

If it becomes necessary at some point (I do not see why) we can alter the table online or add a supporting hash table. The largest problem with revision and its comments is their size, and your proposal adds too much overhead with no current use. In fact, an index on comment(100) would be as effective, 100% transparent to the application, and more reliable (no collisions):

mysql> SELECT count(*) as unique_comments, count(DISTINCT LEFT(comment, 100)) as unique_100_comments FROM comments;
+-----------------+---------------------+
| unique_comments | unique_100_comments |
+-----------------+---------------------+
|          306485 |              306453 |
+-----------------+---------------------+
1 row in set (0.63 sec)

mysql> EXPLAIN SELECT * FROM comments WHERE comment = 'now that the engine doesn\'t have explicit subpage support, Topic/Subtopic will probably just become a naming convention'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: comments
         type: ref
possible_keys: comment
          key: comment
      key_len: 103
          ref: const
         rows: 1
        Extra: Using where
1 row in set (0.00 sec)

root@dbstore2002[ops]> EXPLAIN SELECT * FROM comments WHERE comment = 'Biography'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: comments
         type: ref
possible_keys: comment
          key: comment
      key_len: 103
          ref: const
         rows: 1
        Extra: Using where
1 row in set (0.00 sec)

Setting up a HASH on top of a B+Tree with InnoDB quirks is not a good idea. It is larger to store, and breaks locality and temporal caching performance. We can add it later if it is useful.

How am I blocking it?

Because you are resurfacing a conversation that was already discussed at https://tools.wmflabs.org/meetbot/wikimedia-office/2017/wikimedia-office.2017-02-15-22.01.html and you are the only one calling to change it, as tstarling said above. Reviewing taken decisions is not bad by itself, but you should propose it (Side note, just to spread this little brain-bug of mine) on a separate ticket to avoid textbook bikeshedding.

kaldari- let's go with the simplest of approaches, the one that can get us closer and faster to T6715. I can start creating a comment table right now, and we can fill it up now, even without depending on any code deploy.

I don't see how we can efficiently de-duplicate without the hash. But my intention isn't to block this, just to make sure we consider options and implications.

After discussing Brion's proposal, the decision was to wait with the long comment storage until we do the full revision table refactor. But seeing that this can, at least partially, also be done without touching the revision table, I'm also fine with revising that decision.

However, we'd need a plan for what code needs to be written, and who writes it, and when. It should be fairly simple, but it doesn't happen on its own.

I'm not sure about the best way forward, and I'm sorry for the back-and-forth on this ticket. We can have an RFC-style discussion about this, or try to fit it into the new platform team's backlog. In either case, I'd want feedback from @brion, as he is the one working on refactoring the revision table.

@kaldari We discussed this briefly at the ArchCom meeting. We can have a public IRC meeting on this next Wednesday, May 3. During the meeting, the following options could be considered:

  • a new table with the columns (com_rev_id, com_text). No de-duplication, revision table not touched.
  • a new table with the columns (com_id, com_text, com_hash). Full de-duplication, revision table needs a new column rev_com_id.
    • com_id and com_hash could be the same thing. We could use a 64 bit hash as an ID. The Cognate extension is already using this approach.
  • a new table with the columns (com_rev_id, com_text, com_hash). No de-duplication, revision table not touched for now, but can be upgraded to support de-duplication and be referenced from rev_com_id, by putting a unique index on com_hash, and dropping com_rev_id.

Kaldari, will you be around next Wednesday?

PS: note that there is an experimental patch up for the revision table refactoring: https://gerrit.wikimedia.org/r/#/c/350097/

Yes, I'm available for the ArchCom meeting next week. I've added myself to the calendar invite.

I doubt I'll be around for the meeting, so I'll leave some comments here.

  • a new table with the columns (com_rev_id, com_text). No de-duplication, revision table not touched.

It could be deduplicated later on by adding a prefix index on com_text as mentioned in T153333#3170756, and renaming com_rev_id to just com_id.

OTOH, if we're deduplicating then using a rev_id as the com_id might be an issue for RevDel, in which case we might want a separate com_id column and plan to drop com_rev_id at some point.
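For illustration, that later upgrade could look roughly like this (assuming the table is called comment and has the (com_rev_id, com_text) layout; names hypothetical):

-- Enable deduplication later without creating a new table:
ALTER TABLE /*_*/comment
  CHANGE com_rev_id com_id bigint unsigned NOT NULL;
CREATE INDEX /*i*/com_text ON /*_*/comment (com_text(100));
-- Duplicate rows could then be merged by a maintenance script that
-- repoints references to a single surviving com_id.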

  • a new table with the columns (com_id, com_text, com_hash). Full de-duplication, revision table needs a new column rev_com_id.
    • com_id and com_hash could be the same thing. We could use a 64 bit hash as an ID. The Cognate extension is already using this approach.

I'd guess this is unlikely to happen separately from the other planned schema changes to the revision table, but I'd be happy to be wrong.

While a hash truncated to 64 bits and represented as a BIGINT is probably reasonable, I wonder whether the risk of hash collisions is enough that we should figure out some way to reduce it. For example, maybe we could do something like this

function storeComment( $dbw, $comment ) {
    $comment = (string)$comment; // Sanity

    // Split the 512-bit sha512 hash into 8 bigints. We'll try them in order.
    $hash = hash( 'sha512', $comment );
    $hashes = [];
    for ( $i = 0; $i < 128; $i += 16 ) {
        $hashes[] = Wikimedia\base_convert( substr( $hash, $i, 16 ), 16, 10 );
    }

    // Select them all
    $res = $dbw->select(
        'comments',
        [
            'com_id',
            'is_match' => 'com_text = ' . $dbw->addQuotes( $comment ),
        ],
        [ 'com_id' => $hashes ],
        __METHOD__
    );
    $data = [];
    foreach ( $res as $row ) {
        $data[$row->com_id] = $row->is_match;
    }

    // Find a match
    foreach ( $hashes as $id ) {
        if ( !isset( $data[$id] ) ) {
            $dbw->insert(
                'comments',
                [ 'com_id' => $id, 'com_text' => $comment ],
                __METHOD__,
                [ 'IGNORE' ]
            );
            if ( $dbw->affectedRows() ) {
                return $id;
            }

            // Race. Use it if it's a match for our comment, otherwise just continue the loop.
            $value = $dbw->selectField( 'comments', 'com_text', [ 'com_id' => $id ], __METHOD__ );
            if ( $value === false ) {
                throw new LogicException( 'Insert failed but row doesn\'t exist' );
            }
            if ( $value === $comment ) {
                return $id;
            }
        } elseif ( $data[$id] ) {
            return $id;
        }
    }

    throw new UnexpectedValueException( 'Too many hash collisions' );
}

@jcrespo: One question that came up during today's RFC discussion is whether or not it would be easy to add a new rev_comment_id field to the existing revision table (with no other immediate changes to that table). Some folks speculated that this might be a relatively trivial change, but no one was sure. Could you comment on that?

@jcrespo: One question that came up during today's RFC discussion is whether or not it would be easy to add a new rev_comment_id field to the existing revision table (with no other immediate changes to that table). Some folks speculated that this might be a relatively trivial change, but no one was sure. Could you comment on that?

Given the current size of the table on enwiki (almost 300G), it won't be easy on that shard. Some servers will probably not have enough disk space to alter this table (truth be told, those servers will go away "soon", as they are part of the decommission process we are in: T134476)
This is especially true on some masters, where we have "metadata lock" issues due to concurrency in some shards. We have been running ALTER TABLEs on the revision table for the PK unification and it has taken several months (and we are still not done, as we have S2 pending), and we had to leave the enwiki master aside, for example, and get it done once we did the switchover.

To sum up, it is a really long process and so far it has been taking months. Some masters need to be switched over in order to get it done, as if they don't get depooled the ALTER will not go through.

I think that IF this is to be done "fast", we can have a non-deduplicated, rev_id-indexed comments table now. That is a new table, so it will take no time to be created, and we can fill the comments with NULLs or empty strings afterwards. Alternatively, we can fill the current comment table with identifiers and use that. Later we can have it done "good", with the full deduplicated id-based model Brion suggested and a deep alter table change. I do not like hashes at all, for performance and logical model reasons, but a 64-bit truncated hash is the worst idea I have yet heard. Meaningless, monotonically increasing indexes are much better for performance and maintenance purposes (and I am not talking about DBA work, I am talking about mediawiki maintenance).

This RFC was discussed in a public meeting on IRC on May 3. Rough consensus was reached on some critical design questions. Minutes, full log.

A brief summary of the outcome, to my best understanding:

  • Introduce a new table called comments. The table structure should be ( comm_id, comm_text, comm_data ). comm_data would be a blob optionally containing a machine readable (and particularly, internationalizable) version of the comment. This is particularly useful for system generated edits (reverts, wikibase edits, etc) on multilingual wikis.
  • The comments table should probably not have a deletion flag. Comments should be append-only and stateless.
  • A relation (aka bridge) table shall be used to link comments to revisions, at least temporarily. With this table in place, comments can be transferred from rev_comment to comm_text, and rev_comment can be set to NULL. Later, rev_comment can be dropped, and the relation table could be replaced by a rev_comment_id field.
    • It may be possible to use the relation table only temporarily, as a stepping stone during the migration itself. MediaWiki would never know about it.
  • Deduplication should not (yet) be applied. There are open questions about hashing and indexing, and it's unclear how comment suppression would be handled. Deduplication can be applied later without the need for structural changes.
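As a rough sketch of that outcome (names are illustrative; the relation-table names below follow the view example later in this discussion, and the exact DDL was not fixed in the meeting):

-- Append-only comment storage, no deletion flag, no deduplication yet.
CREATE TABLE /*_*/comments (
  comm_id bigint unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
  comm_text blob NOT NULL,
  comm_data blob                      -- optional machine-readable form of the comment
) /*$wgDBTableOptions*/;

-- Temporary relation table, so revision itself is not altered yet.
CREATE TABLE /*_*/revision_comments (
  comment_rev bigint unsigned NOT NULL PRIMARY KEY,   -- revision.rev_id
  rev_comment_id bigint unsigned NOT NULL             -- comments.comm_id
) /*$wgDBTableOptions*/;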

Some open questions:

  • How to handle comments for archived revisions? Do we need another relation table? Note that there is a tentative plan to leave revisions of deleted pages in the revision table, and handle deletion on the page level.
  • How to handle log comments? Yet another relation table?
  • How to filter rows for suppressed revisions comments from the replica on labs?

How to handle log comments?

In theory log should not be as difficult to alter [add new column] as revision, as it is much smaller and less read, but I have not actually tested it.

Deduplication should not (yet) be applied

I understand not needing deduplication, or not doing it in a first phase, that is cool with me (and I have no technical issue with it), but why do we need an intermediate table in that case? Is it to avoid rev_id on the final table?

How to filter rows for suppressed revisions comments from the replica on labs?

The same way it is done currently for other per-row filtering- either triggers or views. The problem is not how to do this- the issue is to take it into account at the same time the schema change is done- that means involving labs changes as part of the WMF migration, and not leaving just the DBAs and the labs team with the responsibility to figure that out for ourselves on code we do not fully understand 0:-)

How to handle comments for archived revisions?

A decision should be taken there, but from the schema point of view I see no issue, no matter the actual direction chosen. The comment table should (eventually) be the place where all comments are gathered, be it for revisions or anything else. In the same way, I will ask for titles+namespaces to be aggregated in a separate, append-only table in an unrelated proposal.

  • It may be possible to use the relation table only temporarily, as a stepping stone during the migration itself. MediaWiki would never know about it.

I still haven't gotten an explanation of how that would work. Is the plan to do some query-rewriting trickery in the DB so every time MediaWiki tries to access revision.rev_comment_id (for both reads and writes) it's really joining the intermediate table and accessing the field there?

I understand not needing deduplication, or not doing it in a first phase, that is cool with me (and I have no technical issue with it), but why do we need an intermediate table in that case? Is it to avoid rev_id on the final table?

Yes. The plan is that the intermediate table would be temporary, so we can get the longer comments for revisions quickly without having to wait for the schema change to add revision.rev_comment_id.

We don't want a rev_id in the comment table because ideally this table could replace ar_comment, rc_comment, log_comment, ipb_reason, img_description, oi_description, fa_deleted_reason, fa_description, and pt_reason as well. There may be extension tables that could use it too.

The deployment plan, as I see it, might look something like this. If you have a better plan for this sort of thing, I'd be happy to hear about it.

For each table involved (these could be done in parallel),

  1. Schema change: Add the new comment_id column or intermediate table.
  2. Adjust MW to write the new column/table, falling back to the old column if the new column/table is empty.
  3. Run a maintenance script to migrate the old column into the comments table and fill in the new column/table.
  4. Remove MW's fallback code, so it no longer accesses the old column at all.
  5. Schema change: Remove the old column.
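Using the table names from the meeting summary above, step 3 for revision might boil down to something like this per row (illustrative only; a real maintenance script would batch this and wrap it in transactions):

-- Copy the old comment into the comments table...
INSERT INTO /*_*/comments (comm_text)
  SELECT rev_comment FROM /*_*/revision WHERE rev_id = ?;
-- ...link it to the revision through the intermediate table...
INSERT INTO /*_*/revision_comments (comment_rev, rev_comment_id)
  VALUES (?, LAST_INSERT_ID());
-- ...and blank the old column so it can eventually be dropped (step 5).
UPDATE /*_*/revision SET rev_comment = '' WHERE rev_id = ?;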

How to filter rows for suppressed revisions comments from the replica on labs?

The same way it is done currently for other per-row filtering- either triggers or views. The problem is not how to do this- the issue is to take it into account at the same time the schema change is done- that means involving labs changes as part of the WMF migration, and not leaving just the DBAs and the labs team with the responsibility to figure that out for ourselves on code we do not fully understand 0:-)

At first glance, a view might look something like

SELECT comm_id, comm_text, comm_data
FROM comments
  LEFT JOIN revision_comments ON (rev_comment_id = comm_id)
  LEFT JOIN revision ON (comment_rev = rev_id)
  LEFT JOIN logging ON (log_comment_id = comm_id)
  -- ... and so on for the other tables
WHERE
  (rev_deleted & 2) = 0 OR
  (log_deleted & 2) = 0 OR
  -- ... and so on for the other tables

I don't know what the performance on that would look like, or if there's a better way to do it. It'd be similar if we did deduplicate, we'd just need to make sure that the joins don't duplicate the comments table rows.

I don't know whether triggers would work or not, you'd have to both remove the row when something is revision-deleted and re-add it if the revision is un-revision-deleted.

I've asked @ZhouZ whether we really need to do all that, but he hasn't had a chance to get back to me yet.

Deduplication should not (yet) be applied

I understand not needing deduplication, or not doing it in a first phase, that is cool with me (and I have no technical issue with it), but why do we need an intermediate table in that case? Is it to avoid rev_id on the final table?

Yes, to avoid having comm_rev_id, which would later need to be removed - but also because comm_rev_id could refer to either the revision OR the archive table. And we may want to use comment rows in other places too, as @Anomie mentioned.

How to handle comments for archived revisions?

A decision should be taken there, but from the schema point of view I see no issue, no matter the actual direction chosen. The comment table should (eventually) be the place where all comments are gathered, be it for revisions or anything else. In the same way, I will ask for titles+namespaces to be aggregated in a separate, append-only table in an unrelated proposal.

My personal opinion is that we should have a second relation table for archive for now. If the archive table is refactored or removed later, this may go away, too. In my mind, there would just be deletion flags on the page and the revision table.

Support for log entries can be added later, either via another relation table, or by introducing log_comment_id.

  • It may be possible to use the relation table only temporarily, as a stepping stone during the migration itself. MediaWiki would never know about it.

I still haven't gotten an explanation of how that would work. Is the plan to do some query-rewriting trickery in the DB so every time MediaWiki tries to access revision.rev_comment_id (for both reads and writes) it's really joining the intermediate table and accessing the field there?

I see three options:

  1. Use it just during the schema change, and drop it again before MW looks at the migrated db schema.
  2. Make MW code aware of the relation table. This means we have to touch all the relevant code twice. This may not be a bad thing, as it may force us to clean up a bit before doing the "big" refactor.
  3. Create a view that emulates the rev_comment_id field - or the rev_comment field. We'd need to somehow tell MW to use the view instead of the real table, though. Or we can make this transparent, by renaming the actual table - or by renaming the entire database, and have a database with only views, like the _p databases on labs.
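For reads, option 3 could look roughly like the following (hypothetical names matching the relation-table sketch above; as noted below, a plain view does not solve the insert side):

-- A view that makes the relation table look like a rev_comment_id column.
CREATE VIEW revision_with_comment_id AS
SELECT revision.*, rc.rev_comment_id
FROM revision
  LEFT JOIN revision_comments rc ON rc.comment_rev = revision.rev_id;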

I see three options:

  1. Use it just during the schema change, and drop it again before MW looks at the migrated db schema.

That makes no sense. There's nothing in the schema change itself that needs the table, the problem is that doing the schema change to revision would take a very long time and we don't want the sites to be down while that happens. Thus we make the intermediate table now and do the schema change at leisure.

  1. Make MW code aware of the relation table. This means we have to touch all the relevant code twice. This may not be a bad thing, as it may force us to clean up a bit before doing the "big" refactor.

That's the opposite of "MediaWiki would never know about it."

  1. Create a view that emulates the rev_comment_id field - or the rev_comment field. We'd need to somehow tell MW to use the view instead of the real table, though. Or we can make this transparent, by renaming the actual table - or by renaming the entire database, and have a database with only views, like the _p databases on labs.

A view isn't sufficient, we'd also need some sort of query rewriting so inserts to the view turn into inserts to both tables. Or else we'd need MW to be aware of the relation table after all.

I see MariaDB (at least 10.1.22) has views that are updatable in some cases, but in some quick testing I couldn't get it to allow an insert of the kind we'd need so something more advanced would seem to be needed.

I still haven't gotten an explanation of how that would work. Is the plan to do some query-rewriting trickery in the DB so every time MediaWiki tries to access revision.rev_comment_id (for both reads and writes) it's really joining the intermediate table and accessing the field there?

Yeah, I mentioned views in some cases, but I was thinking of cases like the revision + revision_user_index tricks on labs, not for production usage- unless they are trivial cases (with trivial filtering), performance will not be good.

Yes. The plan is that the intermediate table would be temporary,

Thanks for the explanation, I get it and have no problem with it.

I don't know whether triggers would work or not, you'd have to both remove the row when something is revision-deleted and re-add it if the revision is un-revision-deleted.

The concerns are more than reasonable. Normally not doable, but with the new slave-based triggers, that may* (* no guarantee) actually happen. I am not that worried about labs (I think there are worse or similar views right now), and I do not think it should compromise a production design, but we can create some mockups and test them if we are not sure. Also there is still the option of having "public comments" and "private_comments" tables, and many other possibilities...

I see three options:

  1. Use it just during the schema change, and drop it again before MW looks at the migrated db schema.

That makes no sense. There's nothing in the schema change itself that needs the table, the problem is that doing the schema change to revision would take a very long time and we don't want the sites to be down while that happens. Thus we make the intermediate table now and do the schema change at leisure.

Ask Tim, he proposed it. I imagine the idea is to do this during a DC switch. The relation table would allow us to move comments out of the revision table before the ALTER table, reducing the amount of data to be shoved around.

  1. Make MW code aware of the relation table. This means we have to touch all the relevant code twice. This may not be a bad thing, as it may force us to clean up a bit before doing the "big" refactor.

That's the opposite of "MediaWiki would never know about it."

Yes, indeed.

  1. Create a view that emulates the rev_comment_id field - or the rev_comment field. We'd need to somehow tell MW to use the view instead of the real table, though. Or we can make this transparent, by renaming the actual table - or by renaming the entire database, and have a database with only views, like the _p databases on labs.

A view isn't sufficient, we'd also need some sort of query rewriting so inserts to the view turn into inserts to both tables. Or else we'd need MW to be aware of the relation table after all.

It's an option that came up, I don't know if it's feasible. This option is basically "let's see if we can do it with some database trickery".

I personally think we will probably end up making MW code aware of the new table. This will be a nice incentive to reduce the number of places in the code that need to know about this.

Will this transition affect bots, or will the API hide the whole change?
Is there any chance to increase the 255-character limit?

Will this transition affect bots, or will the API hide the whole change?

This change will be completely transparent to bots.

Is there any chance to increase the 255-character limit?

That's the goal of this, yes.

Is there any chance to increase the 255-character limit?

That's the goal of this, yes.

Well, the goal of this task is just to decide how to make it possible to store longer comments in the database, although it'll probably not be closed until the change is actually made. The currently proposed schema change would allow the DB to store comments of up to 16383 characters (65535 bytes), since that's the next reasonable size up from the current 255 bytes (only 63 characters if they're all outside the BMP).

Making the limit actually be 200 characters rather than 255 bytes will very likely be done soon after the schema change is completed. I'm not sure whether there's an actual task for that. T6715: Allow comments longer than 255 bytes could serve, although the current description on that task is completely obsolete as it calls for a long-since-rejected solution to this task.

Increasing the limit beyond 200 characters would need further discussion, I see some opposition to that idea on T6714: Epic: Increasing the length of the edit summary.

Setting the limit to 200 characters instead of 255 bytes would be catastrophic, because it would effectively reduce the possible summary length in most languages and situations. For example, you couldn't have useful summaries of IPv6 reverts anymore. In many other cases the summary length even today is too small. I strongly discourage a transitional switch from 255 bytes to 200 characters and think that, if you want to do something like this, you should set at least 250 characters, or problems in many Wikipedias will follow.

Did we decide for sure on whether to use an association table bridging rev_id to comment_id, or whether to just use rev_id=comment_id until changing the revision table structure?

Using a separate table would look like:

CREATE TABLE /*_*/rev_comment(
  -- Association between revisions and comments.
  -- This is present so we can migrate comments before we do a full
  -- rebuild of revision on large Wikimedia production sites.
  rev_comment_rev bigint unsigned NOT NULL,
  rev_comment_comment bigint unsigned NOT NULL
) /*$wgDBTableOptions*/;

CREATE UNIQUE INDEX /*i*/rev_comment_unique
  ON /*_*/rev_comment (rev_comment_rev, rev_comment_comment);

and queries would just LEFT JOIN through that to get comment_text, with the PHP side using either revision.rev_comment or comment.comment_text, whichever is present.
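A page-history read through that association table might look roughly like this (assuming a comment table with comment_id and comment_text, as in the schemas discussed below):

SELECT rev_id, rev_timestamp, rev_comment, comment_text
FROM /*_*/revision
  LEFT JOIN /*_*/rev_comment ON rev_comment_rev = rev_id
  LEFT JOIN /*_*/comment ON comment_id = rev_comment_comment
WHERE rev_page = ?
ORDER BY rev_timestamp DESC;
-- The PHP side would then use comment_text when it is non-NULL,
-- falling back to rev_comment otherwise.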

Avoiding a table and initially using rev_id=comment_id means:

  • no deduplication initially
    • adding later would either have to allow dupe rows to exist, or would have to migrate them first!
  • to use for an archive row, must ensure that ar_rev_id is assigned
  • can't use for non-revisiony things yet (logging etc)
    • on later changes, we could add revision.rev_comment_id and logging.log_comment_id etc and use for multiple things

I don't know that anyone actually made the decision, but IMO it'd be better to do the table so we can go ahead and fix the comment fields in all tables rather than only in revision.

Please no:

CREATE UNIQUE INDEX /*i*/rev_comment_unique

From https://www.mediawiki.org/wiki/Development_policy#Database_policy:

All tables must have a primary key

Associative tables are created with a primary key in the order of the JOINs happening:

,
PRIMARY KEY (rev_comment_rev, rev_comment_comment)
)

With an optional unique index in the opposite direction if the reverse join is needed (but often it is not).

Note I am not saying this specific decision should be the one to implement, but IF done, that should be the "proper" way to do it. --Jaime, primary key fan
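Putting that convention together with the earlier sketch, the association table would presumably end up as something like this (illustrative only):

CREATE TABLE /*_*/rev_comment(
  rev_comment_rev bigint unsigned NOT NULL,      -- revision.rev_id
  rev_comment_comment bigint unsigned NOT NULL,  -- comment.comment_id
  PRIMARY KEY (rev_comment_rev, rev_comment_comment)
) /*$wgDBTableOptions*/;

-- Optional, only if the reverse join (comment to revisions) is ever needed:
-- CREATE UNIQUE INDEX /*i*/rev_comment_comment_rev
--   ON /*_*/rev_comment (rev_comment_comment, rev_comment_rev);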

We're close to a decision here, let's get this RFC closed so I can know what code I need to write. I'll file appropriate tasks with the finalized versions of the stuff below once we have answers.

Decided

  • We're going to do this. We're not going to wait on the rest of the MCR changes to do it.
  • We're going to use a revision_comment-style table as mentioned in T153333#3278519 to avoid having to alter revision right away.
    • At some point in the future we will alter revision and merge that table in.
  • We're going to use this comment table to replace other tables' comment fields too.
  • We're going to include a field in the comment table that holds a JSON blob of structured data. The intended use for this is to be able to use something like LogFormatter to localize auto-generated comments. comment_text will still be populated to avoid the problem the LogFormatter system has where old log entries suddenly can't be formatted if the extension providing the LogFormatter is undeployed.

Open questions

Two questions, both for @jcrespo and/or @Marostegui.

Deduplication

From T162138: How many revision comments are exactly the same? Get some stats., it seems that deduplication of comments is worthwhile (correct me if I'm wrong). We need to decide how it will be done. I believe there are two main contenders:

CREATE TABLE /*_*/comment(
  comment_id bigint unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
  comment_text BLOB NOT NULL,
  comment_data BLOB NOT NULL
) /*$wgDBTableOptions*/;
CREATE INDEX /*i*/comment_text ON comment (comment_text(100)); -- Is 100 good, or should we use a different number?

-- Inserts would look like
-- -- First, look up the existing ID for the comment
-- SELECT comment_id FROM comment WHERE comment_text = ? AND comment_data = ? LIMIT 1 FOR UPDATE;
-- -- If none was found, insert one.
-- INSERT INTO comment (comment_text, comment_data) VALUES (?, ?);

CREATE TABLE /*_*/comment(
  comment_id bigint unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
  comment_hash VARBINARY(32), -- Probably a base-36 SHA1
  comment_text BLOB NOT NULL,
  comment_data BLOB NOT NULL
) /*$wgDBTableOptions*/;
CREATE INDEX /*i*/comment_hash ON comment (comment_hash);

-- Inserts would look like
-- -- First, look up the existing ID for the comment
-- SELECT comment_id FROM comment WHERE comment_hash = ? AND comment_text = ? AND comment_data = ? LIMIT 1 FOR UPDATE;
-- -- If none was found, insert one.
-- INSERT INTO comment (comment_hash, comment_text, comment_data) VALUES (?, ?, ?);

Or we could do the second with comment_hash being unique, if we're willing to accept the risk of hash collisions.

CREATE TABLE /*_*/comment(
  comment_id bigint unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
  comment_hash VARBINARY(32), -- Probably a base-36 SHA1
  comment_text BLOB NOT NULL,
  comment_data BLOB NOT NULL
) /*$wgDBTableOptions*/;
CREATE UNIQUE INDEX /*i*/comment_hash ON comment (comment_hash);

-- Inserts would look like
-- -- First, try to insert the row
-- INSERT IGNORE INTO comment (comment_hash, comment_text, comment_data) VALUES (?, ?, ?);
-- -- If the insert was ignored, fetch the existing row.
-- SELECT comment_id FROM comment WHERE comment_hash = ?;

On the MediaWiki side, any of these is easily doable. Or anything similar. What would you prefer, DBAs?

Temporary auxiliary tables

We've already decided that we're going to use an auxiliary table so we don't have to alter revision right away to add the new column. Which other tables need this done, and which can we alter directly? Or, what's the cutoff for a table being too big to alter in a timely manner?

These are the tables that are affected by this task, along with some data that might be helpful in making the decision. The data comes from the "Rows", "Data_length", and "Index_length" columns of SHOW TABLE STATUS.

| Table | Rows (enwiki) | Data (enwiki) | Indexes (enwiki) | Rows (meta) | Data (meta) | Indexes (meta) | Rows (commons) | Data (commons) | Indexes (commons) | Rows (wikidata) | Data (wikidata) | Indexes (wikidata) |
| revision | 7e8 | 1e11 | 1e11 | 2e7 | 3e9 | 3e9 | 2e8 | 5e10 | 6e10 | 5e8 | 8e10 | 1e11 |
| logging | 7e7 | 1e10 | 4e10 | 2e7 | 3e9 | 1e10 | 2e8 | 5e10 | 1e11 | 5e8 | 8e10 | 2e11 |
| image² | 9e5 | 7e8 | 4e8 | 2e3 | 5e6 | 1e6 | 5e7 | 8e10 | 2e10 | 0 | 2e4 | 8e4 |
| recentchanges | 9e6 | 4e9 | 5e9 | 2e5 | 9e7 | 1e8 | 5e7 | 4e10 | 2e10 | 1e7 | 4e9 | 4e9 |
| archive | 5e7 | 1e10 | 9e9 | 3e5 | 7e7 | 6e7 | 2e7 | 4e9 | 3e9 | 3e6 | 6e8 | 5e8 |
| cu_changes¹ | 2e7 | 6e9 | 5e9 | 1e6 | 5e8 | 4e8 | 1e7 | 5e9 | 3e9 | 3e7 | 1e10 | 5e9 |
| oldimage² | 2e5 | 1e8 | 6e7 | 8e2 | 2e6 | 3e5 | 6e6 | 7e9 | 1e9 | 0 | 2e4 | 7e4 |
| filearchive² | 3e6 | 2e9 | 9e8 | 7e3 | 8e6 | 4e6 | 5e6 | 6e9 | 1e9 | 0 | 2e4 | 8e4 |
| ipblocks | 1e6 | 2e8 | 2e8 | 3e4 | 6e6 | 7e6 | 6e4 | 1e7 | 1e7 | 3e3 | 4e5 | 6e5 |
| cu_log¹ | 4e5 | 5e7 | 8e7 | 4e4 | 5e6 | 1e7 | 5e4 | 7e6 | 1e7 | 1e3 | 1e5 | 3e5 |
| protected_titles | 5e4 | 1e7 | 4e6 | 5e2 | 1e5 | 5e4 | 8e2 | 2e5 | 7e4 | 2e1 | 2e4 | 2e4 |
| global_block_whitelist¹ | 5e0 | 2e4 | 2e4 | 0 | 2e4 | 2e4 | 1e0 | 2e4 | 2e4 | 0 | 2e4 | 2e4 |

| Table | Rows | Data | Indexes |
| flowdb.flow_revision¹ | 6e5 | 1e8 | 1e8 |
| centralauth.globalblocks¹ | 8e3 | 2e6 | 1e6 |

¹ These tables come from extensions. If they need auxiliary tables, I'll probably just skip converting them for now.
² Or should we wait on these until T28741: Migrate file tables to a modern layout (image/oldimage; file/file_revision; add primary keys) happens?


Deployment plan

This is how I plan to implement and deploy this, unless people have better ideas.

  1. (gerrit) Merge a patch that includes the following:
    1. Schema changes to add the new columns and tables.
    2. A feature flag to control whether they're used ("old only", "read+write both", "write new only, read both", "new only").
      1. (config) Also, a config change to set the feature flag to "old only" for WMF.
    3. Code changes to use the new columns/tables, depending on the feature flag.
      1. There will probably be extension patches involved here too, if only to clean up things that are directly accessing the old columns.
    4. A maintenance script for migration of old stuff to new.
  2. (db) Make the task for doing that schema change, and wait for it to be done.
  3. (config) Turn the feature flag to "read+write both". See if stuff breaks.
  4. (config) Turn the feature flag to "write new only, read both". See if stuff breaks.
    1. At this point, if problems show up then a revert to an earlier step would be unlikely to fix them. It would also need manual intervention to populate whatever old columns the breaking thing depends on.
  5. (shell) Run the maintenance script(s) to migrate all the old stuff to new stuff, blanking the old stuff in the process.
  6. (config) Turn the feature flag to "new only".
  7. (gerrit) Merge a patch that includes:
    1. A schema change to drop the old columns where that makes sense, and to fix up "not null" flags for the new columns.
    2. Remove the feature flag.
  8. (db) Add a task for deploying that schema change.

Steps 2–6 could be done per wiki, thanks to the feature flag, for a more careful rollout.

Answering the first question, as long as access and the primary key go through the auto-increment, I do not see much of a difference. Add the hash if you believe you are going to use it, don't add it if you don't. I personally do not see a reason to add the hash, as I demonstrated that a prefix index on the comments would be much more efficient and useful (this is not a url where there is low selectivity on the first X characters), nor is it an image where hash and content are so different in size. But the important part is to use a small auto_increment PK for access and sorting. And I can always drop the hash later :-)

For the second, by experience, revision, page (on large wikis), and image (on commons) are the ones causing metadata locking and are difficult to change without requiring read only (size is annoying, but not the main problem- access is). All others are doable without large problems, I think. While logging is large (and recentchanges accessed a lot), I think they are mostly only appended at the end and there are not many long-running selects on them (correct me if I am wrong), and the few there are get concentrated on the special slaves. Other tables that are indeed problematic are the *links tables, but I think these are not affected. Revision is the worst offender because it is used by almost every query. Note that difficult here means "they will be slow because they have to be done one server at a time and the master will have to be failed over" (that would be step #2). For example, altering the revision table on one server (we have 120 of those) takes 1-5 days. With exceptions, changing the 800 least accessed wikis completely online is very easy and fast.

Ok, we'll go with the prefix index. Unless I hear differently, I'll go with 100 as the prefix length.

It's possible for there to be long queries on the logging table (e.g. T149076), but it's probably not that common.

I filed T166732: Refactor comment storage in the database and abstract access in MediaWiki and T166733: Deploy refactored comment storage to track the work to implement this RFC, so I'm going to close this task.

tstarling added a subscriber: Domas.

I'm very skeptical about deduplication via the comment_text(100) index, for the following reasons:

  • Inserting into the comment table will lock the comment_text index in the region of the inserted text until the transaction is committed, effectively limiting the rate at which revisions (or log entries, etc.) can be inserted with any given comment.
  • The enormous index size, and the performance consequences if this index cannot be stored in memory. The lock time will be extended even further if disk reads are required.
  • Significant row scanning might be triggered by bots using long fixed comment prefixes followed by a variable portion.

I spoke with @Domas about it on IRC, and he agreed that comment_text(100) is not the best option, preferring the following options over it:

  • Deduplication by content hash. The hash can be truncated to a very small size if collisions are handled in the same way as collisions with comment_text(100). For example, we could use a 32-bit hash, stored as an UNSIGNED INT. The deduplication query would check for a match in the hash as well as comment_text and comment_data, but unlike comment_text(100), there is no chance of bots accidentally triggering scanning of many rows. The non-locality of content hashes avoids lock contention due to gap locking.
  • Clustering by page_id and using InnoDB page compression. The compression ratio would not be as good as deduplication.
  • Clustering by page_id and using some alternative storage engine like MyRocks.
  • Generating a hashtable of most common comments, distributed with scap e.g. as a CDB file. I think the maintenance required would be excessive, and it doesn't help non-WMF users very much.

How would clustering by page_id work when page_id is not present in the table? There isn't necessarily a page_id even associated with rows for the logging, ipblocks, and protected_titles tables.

I found the comments technically correct, but solutions are being given to problems we don't have. page_id clustering is the best way to store wikitext- clustering pages that are likely to be similar together gives better compression. Comments from the same page do not have similar contents, nor are they usually accessed close in time - recentchanges (which is timestamp-ordered) is normally a more frequent source of comment access than page history. Suggesting MyRocks is ludicrous and shows how disconnected one can be from regular operations at Wikimedia right now- it only takes reading https://github.com/facebook/mysql-5.6/wiki/MyRocks-limitations . Note that this doesn't mean it is not interesting, just not the moment or place to suggest it.

Inserting into the comment table will lock the comment_text index in the region of the inserted text until the transaction is committed, effectively limiting the rate at which revisions (or log entries, etc.) can be inserted with any given comment.

The first insert will create that- the following ones will insert less data because they will reuse the same comment (unlike now). Our average comment length is 60 bytes and we have 3-4 edits per second, per shard. If we keep transactions open for more than 300 milliseconds, we deserve the outage. You can always insert comments in a separate transaction- an append-only table with unused comments is not that bad.

I only suggested the comment_text(100) index because some people are obsessed with hashes- and I objected to that as a PK because it affects locality (inserts do not happen at the end in that case). I do not think either the size or the collisions are a problem -please show me where a bot has 90+ bytes of comments where only the last bits are different. I did my analysis, and said that a) there were very limited collisions with size 100, b) the total size of the deduplicated comments table would be 30GB, including index size T162138#3162593 Without the index, it would take 15GB. Our databases have between 128 and 512GB of memory.

I don't mind about the hash as a secondary index- I just want something done. I think a comment(100) or comment(50) index will be useful for searching duplicates, but also for searching comments -which labsdb users do extensively- while simplifying the usage. Use a crc if you prefer it- which will require an additional column and a crc calculation on every retrieval-, but do something! I think index discussions to save 15GB are nonsense while we still have a monolithic 400GB table that gets slower and slower every day- I can see 3-second revision inserts in monitoring, but see no one complaining about that :-). This is my analysis of comment collisions (using enwiki recentchanges to make it fast):

root@neodymium[enwiki]> SELECT count(rc_comment), count(DISTINCT rc_comment), count(DISTINCT LEFT(rc_comment, 100)), count(DISTINCT LEFT(rc_comment, 50)), count(DISTINCT crc32(rc_comment)) FROM recentchanges\G
*************************** 1. row ***************************
                    count(rc_comment): 9537783
           count(DISTINCT rc_comment): 3190686
count(DISTINCT LEFT(rc_comment, 100)): 3145021
 count(DISTINCT LEFT(rc_comment, 50)): 2541258
    count(DISTINCT crc32(rc_comment)): 3189536
1 row in set (54.66 sec)

Doing SELECT ... FROM comments WHERE comment_crc32 = crc32(new_comment_text) AND comment_text = new_comment_text versus SELECT ... FROM comments WHERE comment_text = new_comment_text only makes a few GB of difference.

I do not know why we are at this point- the initial proposal was not to deduplicate comments- which would make an index/hash on comments unnecessary and remove the concerns about gap locking. Please let's stick to the original idea- fixing the concerns about the revision table size and allowing for longer comments by splitting the revision table into 2, a rev_id (or similar autoincrement PK) and a comment, unindexed, and we can later take on other, more complex goals. I think you are overthinking this and putting too many obstacles in your own way. I understand that you do not want to drive yourselves into a bad design, but splitting the table will achieve both initial goals and will make further schema changes easier. Even if you think that deduplication is important, just avoid locking on write, use a LIMIT 1 on retrieval, and allow for duplicates (hell, we can always run a compactComments job later asynchronously if you think you cannot pay that in real time).

Let's do the simplest design that satisfies the requirements, and we can later add indexes, and hashes, and all those things that you seem to love :-)

I only suggested the comment_text(100) index because some people is obsessed with hashes

I got the impression that you were not really serious about it. I probably should have put my foot down before it got turned into code. I'm putting my foot down now, anyway. It's suboptimal in terms of performance, and it's actually not that much simpler than using a hash, if you look at how hashes would be implemented in the framework of https://gerrit.wikimedia.org/r/#/c/357892 .

I do not know why we are at this point- the initial proposal was not to deduplicate comments- which will make an index/hash on comments unnecessary and no concerns on gap locking.

I supported the idea of not deduplicating, but @Marostegui enthusiastically supported deduplication in T162138#3162527, and I believe @daniel and @Anomie always preferred deduplication also, and saw T162138 as support for their position.

Let's do the simplest design that satisfies the requirements, and we can later add indexes, and hashes, and all those things that you seem to love :-)

Well, the code is basically there already, I'm not going to reject it on the basis that too much effort went into writing it. Maybe @Anomie could split the existing deduplication feature out into a separate patch, but that would be at his discretion.

Ok, a hash column can be added to the existing patch easily enough. And best-effort deduplication is still a win as long as we document that it's best-effort.

For example, we could use a 32-bit hash, stored as an UNSIGNED INT.

Since PostgreSQL doesn't actually have UNSIGNED INT, I'll just make it signed.

So, someone (@tstarling?) pick a hash and decide whether we should truncate it to 32 or 64 bits (i.e. storage as an int or as a bigint). Or else I'm just going to go with CRC32.