Abstract
Product offers on e-commerce websites often consist of a product title and a textual product description. In order to enable features such as faceted product search or to generate product comparison tables, it is necessary to extract structured attribute-value pairs from the unstructured product titles and descriptions and to normalize the extracted values to a single, unified scale for each attribute. This paper explores the potential of using large language models (LLMs), such as GPT-3.5 and GPT-4, to extract and normalize attribute values from product titles and descriptions. We experiment with different zero-shot and few-shot prompt templates for instructing LLMs to extract and normalize attribute-value pairs. We introduce the Web Data Commons - Product Attribute Value Extraction (WDC-PAVE) benchmark dataset for our experiments. WDC-PAVE consists of product offers from 59 different websites which provide schema.org annotations. The offers belong to five different product categories, each with a specific set of attributes. The dataset provides manually verified attribute-value pairs in two forms: (i) directly extracted values and (ii) normalized attribute values. The normalization of the attribute values requires systems to perform the following types of operations: name expansion, generalization, unit of measurement conversion, and string wrangling. Our experiments demonstrate that GPT-4 outperforms the PLM-based extraction methods SU-OpenTag, AVEQA, and MAVEQA by 10%, achieving an F1-score of 91%. For the extraction and normalization of product attribute values, GPT-4 achieves a similar performance to the extraction scenario, while being particularly strong at string wrangling and name expansion.
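To make the four normalization operation types concrete, the sketch below applies one rule-based example of each to a single value. It only illustrates what each operation means; it is not the approach studied in the paper, which instructs LLMs through prompts. All lookup tables, function names, target units, and example values are hypothetical and are not taken from WDC-PAVE.

```python
import re

# Hypothetical lookup tables (assumptions for illustration only; the
# benchmark's actual target vocabularies and scales are not shown here).
BRAND_EXPANSIONS = {"hp": "Hewlett-Packard", "wd": "Western Digital"}
COLOR_GENERALIZATIONS = {"navy": "Blue", "crimson": "Red", "charcoal": "Gray"}
TO_GIGABYTES = {"gb": 1.0, "tb": 1000.0}


def expand_name(value: str) -> str:
    """Name expansion: map an abbreviation to its full form if known."""
    cleaned = value.strip()
    return BRAND_EXPANSIONS.get(cleaned.lower(), cleaned)


def generalize(value: str) -> str:
    """Generalization: map a specific value to a broader category if known."""
    cleaned = value.strip()
    return COLOR_GENERALIZATIONS.get(cleaned.lower(), cleaned)


def convert_to_gb(value: str) -> float:
    """Unit of measurement conversion: express a capacity in gigabytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(gb|tb)\s*", value, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized capacity: {value!r}")
    amount, unit = match.groups()
    return float(amount) * TO_GIGABYTES[unit.lower()]


def wrangle_dimensions(value: str) -> str:
    """String wrangling: reduce a free-form dimension string to 'AxBxC'."""
    numbers = re.findall(r"\d+(?:\.\d+)?", value)
    return "x".join(numbers)


if __name__ == "__main__":
    print(expand_name("HP"))
    print(generalize("Navy"))
    print(convert_to_gb("2 TB"))
    print(wrangle_dimensions("10 x 20 x 5 in"))
```

Hand-written rules like these cover only the values someone anticipated in advance, which is one reason prompting an LLM to perform the same operations is worth evaluating.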
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Brinkmann, A., Baumann, N., Bizer, C. (2024). Using LLMs for the Extraction and Normalization of Product Attribute Values. In: Tekli, J., Gamper, J., Chbeir, R., Manolopoulos, Y. (eds) Advances in Databases and Information Systems. ADBIS 2024. Lecture Notes in Computer Science, vol 14918. Springer, Cham. https://doi.org/10.1007/978-3-031-70626-4_15
DOI: https://doi.org/10.1007/978-3-031-70626-4_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-70628-8
Online ISBN: 978-3-031-70626-4
eBook Packages: Computer Science, Computer Science (R0)