- What use case is the model going to support/resolve?
Related to https://phabricator.wikimedia.org/T371902
- Do you have a '''model card'''?
https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language-agnostic_reference_risk
- What team created/trained/etc.. the model? What tools and frameworks have you used?
- What kind of data was the model trained with, and what kind of data is the model going to need in production (for example, calls to internal/external services, special data sources for features, etc.)?
This model requires access to precomputed domain metadata for inference. This metadata is expected to be refreshed monthly by an Airflow DAG maintained by Research Engineering, which invokes the reference-quality pipeline to generate a new snapshot of domain features. The pipeline depends on the wmf.mediawiki_wikitext_history dataset in the data lake for retrieving historical information about a domain, and on wmf.mediawiki_wikitext_current for obtaining the perennial sources labels for domains.
To facilitate retrieval, these snapshots are exported as an SQLite database with the following table:
sqlite> .schema domains
CREATE TABLE IF NOT EXISTS "domains" (
  "wiki_db" TEXT,
  "domain" TEXT,
  "page_distinct_cnt" INTEGER,
  "add_user_distinct_cnt" INTEGER,
  "ref_max_lifespan_mean" REAL,
  "ref_max_lifespan_p25" REAL,
  "ref_max_lifespan_median" REAL,
  "ref_max_lifespan_p75" REAL,
  "ref_real_lifespan_mean" REAL,
  "ref_real_lifespan_p25" REAL,
  "ref_real_lifespan_median" REAL,
  "ref_real_lifespan_p75" REAL,
  "num_edits_mean" REAL,
  "num_edits_p25" INTEGER,
  "num_edits_median" INTEGER,
  "num_edits_p75" INTEGER,
  "sur_edit_ratio_mean" REAL,
  "sur_edit_ratio_p25" REAL,
  "sur_edit_ratio_median" REAL,
  "sur_edit_ratio_p75" REAL,
  "psl_local" TEXT,
  "psl_enwiki" TEXT,
  "snapshot" TEXT
);
CREATE INDEX "ix_domains_wiki_db_domain" ON "domains" ("wiki_db","domain");
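As a minimal sketch of how an inference service could look up a domain's precomputed features at request time, the snippet below creates a trimmed-down version of the table (a subset of the columns above, with fabricated values) and queries it by the indexed (wiki_db, domain) pair:

```python
import sqlite3

# Trimmed copy of the "domains" table: only a few of the schema's columns,
# with made-up values purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE IF NOT EXISTS "domains" (
        "wiki_db" TEXT, "domain" TEXT, "page_distinct_cnt" INTEGER,
        "ref_max_lifespan_mean" REAL, "psl_local" TEXT, "snapshot" TEXT
    )"""
)
# The composite index matching the lookup pattern used at inference time.
conn.execute(
    'CREATE INDEX "ix_domains_wiki_db_domain" ON "domains" ("wiki_db","domain")'
)
conn.execute(
    "INSERT INTO domains VALUES (?, ?, ?, ?, ?, ?)",
    ("enwiki", "example.com", 1200, 350.5, "generally_reliable", "2024-06"),
)

# Fetch the feature row for a single (wiki_db, domain) pair.
row = conn.execute(
    "SELECT ref_max_lifespan_mean, psl_local FROM domains "
    "WHERE wiki_db = ? AND domain = ?",
    ("enwiki", "example.com"),
).fetchone()
print(row)  # (350.5, 'generally_reliable')
```

Because the index covers (wiki_db, domain), this per-domain lookup stays fast even as the snapshot grows to millions of rows.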
This database is then uploaded to the feature-sets container in Swift, which is world-readable, i.e. has the read ACL '.r:*,.rlistings'. Available snapshots can be listed via:
$ curl 'https://thanos-swift.discovery.wmnet/v1/AUTH_research/feature-sets?prefix=reference-risk'
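Programmatically, the same listing can be consumed to pick the most recent snapshot. The helper below is hypothetical (it is not part of any existing codebase), and it assumes the container API returns a plain newline-separated listing of object names; the sample listing is fabricated:

```python
from urllib.parse import urlencode

# Base URL of the world-readable feature-sets container (internal hostname).
BASE = "https://thanos-swift.discovery.wmnet/v1/AUTH_research/feature-sets"


def listing_url(prefix="reference-risk"):
    """Build the Swift container listing URL for a given object prefix."""
    return f"{BASE}?{urlencode({'prefix': prefix})}"


def latest_snapshot(listing_text):
    """Pick the lexically greatest object name from a newline-separated
    listing; with date-stamped names this is the most recent snapshot."""
    names = [line for line in listing_text.strip().splitlines() if line]
    return sorted(names)[-1] if names else None


# Fabricated listing for illustration; real object names may differ.
sample = "reference-risk/2024-04.sqlite\nreference-risk/2024-05.sqlite\n"
print(latest_snapshot(sample))  # reference-risk/2024-05.sqlite
```

In production the listing text would come from an HTTP GET against listing_url(); no authentication is needed given the public read ACL.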
- If you have a minimal codebase that you used to run the first tests with the model, could you please share it?
The original source for the model lives in the reference-quality repo.
An adapted version of it that works with the generated sqlite databases has been added to knowledge-integrity and can be used by installing v0.8.3.
- State what team will own the model and please share some main point of contacts.
- What is the current latency and throughput of the model, if you have tested it? We don't need anything precise at this stage, just some ballpark numbers to figure out how the model performs with the expected inputs. For example, does the model take ms/seconds/etc. to respond to queries? How does it react when 1/10/20/etc. requests are made in parallel? If you don't have these numbers don't worry, open the task and we'll figure something out while we discuss next steps!
- Is there an expected frequency in which the model will have to be retrained with new data?
- What are the resources required to train the model and what was the dataset size?
- Have you checked if the output of your model is safe from a human rights point of view? Is there any risk of it being offensive for somebody? Even if you have any slight worry or corner case, please tell us!
- Everything else that is relevant in your opinion.