%0 Conference Proceedings
%T A Comprehensive Analysis of Memorization in Large Language Models
%A Kiyomaru, Hirokazu
%A Sugiura, Issa
%A Kawahara, Daisuke
%A Kurohashi, Sadao
%Y Mahamood, Saad
%Y Minh, Nguyen Le
%Y Ippolito, Daphne
%S Proceedings of the 17th International Natural Language Generation Conference
%D 2024
%8 September
%I Association for Computational Linguistics
%C Tokyo, Japan
%F kiyomaru-etal-2024-comprehensive-analysis
%X This paper presents a comprehensive study that investigates memorization in large language models (LLMs) from multiple perspectives. Experiments are conducted with the Pythia and LLM-jp model suites, both of which offer LLMs with over 10B parameters and full access to their pre-training corpora. Our findings include: (1) memorization is more likely to occur with larger model sizes, longer prompt lengths, and frequent texts, which aligns with findings in previous studies; (2) memorization is less likely to occur for texts not trained during the latter stages of training, even if they frequently appear in the training corpus; (3) the standard methodology for judging memorization can yield false positives, and texts that are infrequent yet flagged as memorized typically result from causes other than true memorization.
%U https://aclanthology.org/2024.inlg-main.45
%P 584-596