Using LLMs for the Extraction and Normalization of Product Attribute Values

  • Conference paper
Advances in Databases and Information Systems (ADBIS 2024)

Abstract

Product offers on e-commerce websites often consist of a product title and a textual product description. In order to enable features such as faceted product search or to generate product comparison tables, it is necessary to extract structured attribute-value pairs from the unstructured product titles and descriptions and to normalize the extracted values to a single, unified scale for each attribute. This paper explores the potential of using large language models (LLMs), such as GPT-3.5 and GPT-4, to extract and normalize attribute values from product titles and descriptions. We experiment with different zero-shot and few-shot prompt templates for instructing LLMs to extract and normalize attribute-value pairs. We introduce the Web Data Commons - Product Attribute Value Extraction (WDC-PAVE) benchmark dataset for our experiments. WDC-PAVE consists of product offers from 59 different websites which provide schema.org annotations. The offers belong to five different product categories, each with a specific set of attributes. The dataset provides manually verified attribute-value pairs in two forms: (i) directly extracted values and (ii) normalized attribute values. The normalization of the attribute values requires systems to perform the following types of operations: name expansion, generalization, unit of measurement conversion, and string wrangling. Our experiments demonstrate that GPT-4 outperforms the PLM-based extraction methods SU-OpenTag, AVEQA, and MAVEQA by 10%, achieving an F1-score of 91%. For the extraction and normalization of product attribute values, GPT-4 achieves a similar performance to the extraction scenario, while being particularly strong at string wrangling and name expansion.
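The four normalization operation types named in the abstract can be illustrated with a minimal sketch. It is not the paper's method, which prompts LLMs to perform these steps; it only shows, with hand-written rules, what each operation does to an extracted value. The lookup tables, attribute names, example values, and the decimal convention 1 TB = 1000 GB are all assumptions made for illustration and are not taken from WDC-PAVE.

```python
import re

# Hypothetical lookup tables; the entries are illustrative only.
NAME_EXPANSIONS = {"hp": "Hewlett-Packard", "wd": "Western Digital"}
GENERALIZATIONS = {"navy blue": "Blue", "crimson": "Red"}
# Assumed unified scale for capacity: gigabytes, decimal convention.
TO_GIGABYTES = {"gb": 1.0, "tb": 1000.0}


def expand_name(value):
    """Name expansion: replace a known abbreviation with its full name."""
    cleaned = value.strip()
    return NAME_EXPANSIONS.get(cleaned.lower(), cleaned)


def generalize(value):
    """Generalization: map a specific value to a broader category."""
    cleaned = value.strip()
    return GENERALIZATIONS.get(cleaned.lower(), cleaned)


def capacity_in_gb(value):
    """Unit conversion: parse '<number> <unit>' and convert to gigabytes.

    Returns None when the value does not match the expected pattern.
    """
    match = re.fullmatch(
        r"\s*(\d+(?:\.\d+)?)\s*(gb|tb)\s*", value, flags=re.IGNORECASE
    )
    if match is None:
        return None
    return float(match.group(1)) * TO_GIGABYTES[match.group(2).lower()]


def wrangle_part_number(value):
    """String wrangling: drop whitespace and hyphens, then uppercase."""
    return re.sub(r"[\s\-]+", "", value).upper()


# Hypothetical mapping from attribute name to its normalization operation.
NORMALIZERS = {
    "Brand": expand_name,
    "Color": generalize,
    "Capacity": capacity_in_gb,
    "Part Number": wrangle_part_number,
}


def normalize(extracted):
    """Normalize each extracted value; unknown attributes are only stripped."""
    return {
        attribute: NORMALIZERS.get(attribute, str.strip)(value)
        for attribute, value in extracted.items()
    }
```

Under these assumptions, `normalize({"Brand": "WD", "Capacity": "4 TB"})` yields `{"Brand": "Western Digital", "Capacity": 4000.0}`. Rule tables of this kind cover only the cases they enumerate, which is the limitation that motivates asking an LLM to perform the normalization instead.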

Notes

  1. https://github.com/wbsg-uni-mannheim/wdc-pave

  2. https://schema.org/

  3. https://webdatacommons.org/

  4. https://commoncrawl.org/

  5. https://webdatacommons.org/largescaleproductcorpus/v2/

  6. https://platform.openai.com/docs/api-reference

  7. https://json-schema.org/

  8. https://platform.openai.com/docs/guides/embeddings/

  9. https://openai.com/pricing

Corresponding author

Correspondence to Alexander Brinkmann.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Brinkmann, A., Baumann, N., Bizer, C. (2024). Using LLMs for the Extraction and Normalization of Product Attribute Values. In: Tekli, J., Gamper, J., Chbeir, R., Manolopoulos, Y. (eds) Advances in Databases and Information Systems. ADBIS 2024. Lecture Notes in Computer Science, vol 14918. Springer, Cham. https://doi.org/10.1007/978-3-031-70626-4_15

  • DOI: https://doi.org/10.1007/978-3-031-70626-4_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70628-8

  • Online ISBN: 978-3-031-70626-4

  • eBook Packages: Computer Science, Computer Science (R0)
