Using LLMs for the Extraction and Normalization of Product Attribute Values

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 14918))

Included in the following conference series:

European Conference on Advances in Databases and Information Systems

158 Accesses
1 Citations

Abstract

Product offers on e-commerce websites often consist of a product title and a textual product description. In order to enable features such as faceted product search or to generate product comparison tables, it is necessary to extract structured attribute-value pairs from the unstructured product titles and descriptions and to normalize the extracted values to a single, unified scale for each attribute. This paper explores the potential of using large language models (LLMs), such as GPT-3.5 and GPT-4, to extract and normalize attribute values from product titles and descriptions. We experiment with different zero-shot and few-shot prompt templates for instructing LLMs to extract and normalize attribute-value pairs. We introduce the Web Data Commons - Product Attribute Value Extraction (WDC-PAVE) benchmark dataset for our experiments. WDC-PAVE consists of product offers from 59 different websites which provide schema.org annotations. The offers belong to five different product categories, each with a specific set of attributes. The dataset provides manually verified attribute-value pairs in two forms: (i) directly extracted values and (ii) normalized attribute values. The normalization of the attribute values requires systems to perform the following types of operations: name expansion, generalization, unit of measurement conversion, and string wrangling. Our experiments demonstrate that GPT-4 outperforms the PLM-based extraction methods SU-OpenTag, AVEQA, and MAVEQA by 10%, achieving an F1-score of 91%. For the extraction and normalization of product attribute values, GPT-4 achieves a similar performance to the extraction scenario, while being particularly strong at string wrangling and name expansion.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 139.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

The WDC Gold Standards for Product Feature Extraction and Product Matching

Data Driven Discovery of Attribute Dictionaries

Enriching Product Ads with Metadata from HTML Annotations

Notes

References

Agrawal, M., Hegselmann, S., Lang, H., Kim, Y., Sontag, D.: Large language models are few-shot clinical information extractors. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 1998–2022 (2022)
Google Scholar
Blume, A., Zalmout, N., Ji, H., Li, X.: Generative models for product attribute extraction. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pp. 575–585 (2023)
Google Scholar
Brinkmann, A., Shraga, R., Bizer, C.: Product attribute value extraction using large language models. arXiv preprint arXiv:2310.12537 (2023)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186 (2019)
Google Scholar
Fang, C., Li, X., Fan, Z., Xu, J., Nag, K., et al.: LLM-ensemble: optimal large language model ensemble method for e-commerce product attribute value extraction (2024). arXiv:2403.00863 [cs]
Ghani, R., Probst, K., Liu, Y., Krema, M., Fano, A.: Text mining for product attribute extraction. ACM SIGKDD Explorations Newsl 8(1), 41–48 (2006)
Article Google Scholar
Goel, A., Gueta, A., Gilon, O., Liu, C., Erell, S., et al.: LLMs accelerate annotation for medical information extraction. In: Proceedings of the 3rd Machine Learning for Health Symposium, pp. 82–100 (2023)
Google Scholar
Jaimovitch-López, G., Ferri, C., Hernández-Orallo, J., Martínez-Plumed, F., Ramírez-Quintana, M.J.: Can language models automate data wrangling? Mach. Learn. 112(6), 2053–2082 (2023)
Article MathSciNet Google Scholar
Jain, M., Bhattacharya, S., Jain, H., Shaik, K., Chelliah, M.: Learning cross-task attribute-attribute similarity for multi-task attribute-value extraction. In: Proceedings of the 4th Workshop on e-Commerce and NLP, pp. 79–87 (2021)
Google Scholar
Kozareva, Z., Li, Q., Zhai, K., Guo, W.: Recognizing salient entities in shopping queries. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 107–111 (2016)
Google Scholar
Nederstigt, L.J., Aanen, S.S., Vandic, D., Frasincar, F.: FLOPPIES: a framework for large-scale ontology population of product information from tabular data in e-commerce stores. Decis. Support Syst. 59, 296–311 (2014)
Article Google Scholar
Parekh, T., Hsu, I.H., Huang, K.H., Chang, K.W., Peng, N.: Geneva: benchmarking generalizability for event argument extraction with hundreds of event types and argument roles. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pp. 3664–3686 (2023)
Google Scholar
Primpeli, A., Peeters, R., Bizer, C.: The WDC training dataset and gold standard for large-scale product matching. In: Companion Proceedings of The 2019 World Wide Web Conference, pp. 381–386 (2019)
Google Scholar
Putthividhya, D., Hu, J.: Bootstrapped named entity recognition for product attribute extraction. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 1557–1567 (2011)
Google Scholar
van Rooij, G., Sewnarain, R., Skogholt, M., van der Zaan, T., Frasincar, F., et al.: A data type-driven property alignment framework for product duplicate detection on the web. In: Proceedings of 17th International Web Information Systems Engineering Conference, pp. 380–395 (2016)
Google Scholar
Roy, K., Goyal, P., Pandey, M.: Exploring generative frameworks for product attribute value extraction. Expert Syst. Appl. 243, 122850 (2024)
Article Google Scholar
Sabeh, K., Kacimi, M., Gamper, J.: CAVE: correcting attribute values in e-commerce profiles. In: Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pp. 4965–4969 (2022)
Google Scholar
Shinzato, K., Yoshinaga, N., Xia, Y., Chen, W.T.: Simple and effective knowledge-driven query expansion for QA-based product attribute extraction. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pp. 227–234 (2022)
Google Scholar
Valstar, N., Frasincar, F., Brauwers, G.: APFA: Automated product feature alignment for duplicate detection. Expert Syst. Appl. 174, 114759 (2021)
Article Google Scholar
Vandic, D., Van Dam, J.W., Frasincar, F.: Faceted product search powered by the semantic web. Decis. Support Syst. 53(3), 425–437 (2012)
Article Google Scholar
Wang, Q., Yang, L., Kanagal, B., Sanghai, S., Sivakumar, D., et al.: Learning to extract attribute value from product via question answering: a multi-task approach. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 47–55 (2020)
Google Scholar
Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., et al.: Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022)
Xu, H., Wang, W., Mao, X., Lan, M.: Scaling up open tagging from tens to thousands: comprehension empowered attribute value extraction from product title. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5214–5223 (2019)
Google Scholar
Yan, J., Zalmout, N., Liang, Y., Grant, C., Ren, X., et al.: AdaTag: multi-attribute value extraction from product profiles with adaptive decoding. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pp. 4694–4705 (2021)
Google Scholar
Yang, L., Wang, Q., Wang, J., Quan, X., Feng, F., et al.: MixPAVE: mix-prompt tuning for few-shot product attribute value extraction. In: Findings of the Association for Computational Linguistics: ACL 2023, pp. 9978–9991 (2023)
Google Scholar
Yang, L., Wang, Q., Yu, Z., Kulkarni, A., Sanghai, S., et al.: Mave: a product dataset for multi-source attribute value extraction. In: Proceedings of the 15th ACM International Conference on Web Search and Data Mining, pp. 1256–1265 (2022)
Google Scholar
Zhang, L., Zhu, M., Huang, W.: A framework for an ontology-based E-commerce product information retrieval system. J. Comput. 4(6), 436–443 (2009)
Article Google Scholar
Zhang, X., Zhang, C., Li, X., Dong, X.L., Shang, J., et al.: OA-Mine: open-world attribute mining for e-commerce products with weak supervision. In: Proceedings of the ACM Web Conference 2022, pp. 3153–3161 (2022)
Google Scholar
Zheng, G., Mukherjee, S., Dong, X.L., Li, F.: OpenTag: open attribute value extraction from product profiles. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1049–1058 (2018)
Google Scholar

Download references

Author information

Authors and Affiliations

University of Mannheim, Schloss, 68161, Mannheim, Germany
Alexander Brinkmann, Nick Baumann & Christian Bizer

Authors

Alexander Brinkmann
View author publications
You can also search for this author in PubMed Google Scholar
Nick Baumann
View author publications
You can also search for this author in PubMed Google Scholar
Christian Bizer
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Alexander Brinkmann .

Editor information

Editors and Affiliations

Lebanese American University Engineering School, Lebanese American University, Chouran Beirut, Lebanon
Joe Tekli
Free University of Bozen-Bolzano, Bozen-Bolzano, Italy
Johann Gamper
Université de Pau et des Pays de l’Adour, Anglet, France
Richard Chbeir
Open University of Cyprus, Nicosia, Cyprus
Yannis Manolopoulos

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Brinkmann, A., Baumann, N., Bizer, C. (2024). Using LLMs for the Extraction and Normalization of Product Attribute Values. In: Tekli, J., Gamper, J., Chbeir, R., Manolopoulos, Y. (eds) Advances in Databases and Information Systems. ADBIS 2024. Lecture Notes in Computer Science, vol 14918. Springer, Cham. https://doi.org/10.1007/978-3-031-70626-4_15

Download citation

DOI: https://doi.org/10.1007/978-3-031-70626-4_15
Published: 01 September 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70628-8
Online ISBN: 978-3-031-70626-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Using LLMs for the Extraction and Normalization of Product Attribute Values

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

The WDC Gold Standards for Product Feature Extraction and Product Matching

Data Driven Discovery of Attribute Dictionaries

Enriching Product Ads with Metadata from HTML Annotations

Notes

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Subscribe and save

Buy Now

Navigation

Using LLMs for the Extraction and Normalization of Product Attribute Values

Abstract

Access this chapter

Subscribe and save

Buy Now

Similar content being viewed by others

The WDC Gold Standards for Product Feature Extraction and Product Matching

Data Driven Discovery of Attribute Dictionaries

Enriching Product Ads with Metadata from HTML Annotations

Notes

References

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation