Abstract
Finding out which e-shops offer a specific product is a central challenge for building integrated product catalogs and comparison shopping portals. Determining whether two offers refer to the same product involves extracting a set of features (product attributes) from the web pages containing the offers and comparing these features using a matching function. The existing gold standards for product matching have two shortcomings: (i) they only contain offers from a small number of e-shops and thus do not properly cover the heterogeneity that is found on the Web. (ii) they only provide a small number of generic product attributes and therefore cannot be used to evaluate whether detailed product attributes have been correctly extracted from textual product descriptions. To overcome these shortcomings, we have created two public gold standards: The WDC Product Feature Extraction Gold Standard consists of over 500 product web pages originating from 32 different websites on which we have annotated all product attributes (338 distinct attributes) which appear in product titles, product descriptions, as well as tables and lists. The WDC Product Matching Gold Standard consists of over \(75\,000\) correspondences between 150 products (mobile phones, TVs, and headphones) in a central catalog and offers for these products on the 32 web sites. To verify that the gold standards are challenging enough, we ran several baseline feature extraction and matching methods, resulting in F-score values in the range 0.39 to 0.67. In addition to the gold standards, we also provide a corpus consisting of 13 million product pages from the same websites which might be useful as background knowledge for training feature extraction and matching methods.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
Retail e-commerce sales worldwide from 2014 to 2019 - http://www.statista.com/statistics/379046/worldwide-retail-e-commerce-sales/.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
References
Gopalakrishnan, V., Iyengar, S.P., Madaan, A., Rastogi, R., Sengamedu, S.: Matching product titles using web-based enrichment. In: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM 2012, pp. 605–614. ACM, New York (2012)
Kannan, A., Givoni, I.E., Agrawal, R., Fuxman, A.: Matching unstructured product offers to structured product specifications. In: 17th ACM SIGKDD International Conference On Knowledge Discovery and Data Mining, pp. 404–412 (2011)
Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. Proc. VLDB Endowment 3(1–2), 484–493 (2010)
Köpcke, H., Thor, A., Thomas, S., Rahm, E.: Tailoring entity resolution for matching product offers. In: Proceedings of the 15th International Conference on Extending Database Technology, pp. 545–550. ACM (2012)
Le, Q.V., Mikolov, T.: Distributed representations of sentences, documents. arXiv preprint arXiv:1405.4053 (2014)
McAuley, J., Targett, C., Shi, Q., van den Hengel, A.: Image-based recommendations on styles and substitutes. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 43–52. ACM (2015)
Melli, G.: Shallow semantic parsing of product offering titles (for better automatic hyperlink insertion). In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2014, pp. 1670–1678. ACM, New York (2014)
Meusel, R., Petrovski, P., Bizer, C.: The webdatacommons microdata, RDFa and microformat dataset series. In: Mika, P., Tudorache, T., Bernstein, A., Welty, C., Knoblock, C., Vrandečić, D., Groth, P., Noy, N., Janowicz, K., Goble, C. (eds.) ISWC 2014. LNCS, vol. 8796, pp. 277–292. Springer, Heidelberg (2014). doi:10.1007/978-3-319-11964-9_18
Meusel, R., Primpeli, A., Meilicke, C., Paulheim, H., Bizer, C.: Exploiting microdata annotations to consistently categorize product offers at web scale. In: Stuckenschmidt, H., Jannach, D. (eds.) EC-Web 2015. LNBIP, vol. 239, pp. 83–99. Springer, Heidelberg (2015). doi:10.1007/978-3-319-27729-5_7
Nguyen, H., Fuxman, A., Paparizos, S., Freire, J., Agrawal, R.: Synthesizing products for online catalogs. Proc. VLDB Endowment 4(7), 409–418 (2011)
Petrovski, P., Bryl, V., Bizer, C.: Integrating product data from websites offering microdata markup. In: Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion, pp. 1299–1304. International World Wide Web Conferences Steering Committee (2014)
Petrovski, P., Bryl, V., Bizer, C.: Learning regular expressions for the extraction of product attributes from e-commerce microdata (2014)
Qiu, D., Barbosa, L., Dong, X.L., Shen, Y., Srivastava, D.: Dexter: large-scale discovery and extraction of product specifications on the web. Proc. VLDB Endowment 8(13), 2194–2205 (2015)
Ristoski, P., Mika, P.: Enriching product ads with metadata from HTML annotations. In: Proceedings of the 13th Extended Semantic Web Conference (2015, to appear)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Petrovski, P., Primpeli, A., Meusel, R., Bizer, C. (2017). The WDC Gold Standards for Product Feature Extraction and Product Matching. In: Bridge, D., Stuckenschmidt, H. (eds) E-Commerce and Web Technologies. EC-Web 2016. Lecture Notes in Business Information Processing, vol 278. Springer, Cham. https://doi.org/10.1007/978-3-319-53676-7_6
Download citation
DOI: https://doi.org/10.1007/978-3-319-53676-7_6
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-53675-0
Online ISBN: 978-3-319-53676-7
eBook Packages: Computer ScienceComputer Science (R0)