
DOI: 10.1145/3637528.3671458
Tutorial
Open access

Bias and Unfairness in Information Retrieval Systems: New Challenges in the LLM Era

Published: 24 August 2024

Abstract

With the rapid advancement of large language models (LLMs), information retrieval (IR) systems, such as search engines and recommender systems, have undergone a significant paradigm shift. This evolution, while heralding new opportunities, introduces new challenges, particularly in terms of bias and unfairness, which may threaten the information ecosystem. In this paper, we present a comprehensive survey of existing work on the emerging and pressing bias and unfairness issues that arise when LLMs are integrated into IR systems. We first unify bias and unfairness issues as distribution mismatch problems, laying the groundwork for categorizing mitigation strategies through distribution alignment. We then systematically examine the specific bias and unfairness issues arising at three critical stages of LLM integration into IR systems: data collection, model development, and result evaluation. In doing so, we review and analyze recent literature, focusing on the definitions, characteristics, and corresponding mitigation strategies of these issues. Finally, we identify and highlight open problems and challenges for future work, aiming to inspire researchers and stakeholders in the IR field and beyond to better understand and mitigate bias and unfairness in IR in the LLM era. We also maintain a GitHub repository of relevant papers and resources in this rising direction at https://github.com/KID-22/LLM-IR-Bias-Fairness-Survey.
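To make the distribution-mismatch framing concrete, the following minimal Python sketch is purely illustrative (the function names and the simple score re-weighting heuristic are assumptions of this summary, not methods from the survey). It measures how far the group-level exposure induced by a ranking drifts from a target distribution via KL divergence, then nudges ranking scores so the observed and target distributions align. Any alignment-based mitigation, whether re-ranking, data re-sampling, or training-time regularization, can be read as reducing such a divergence.

```python
import numpy as np

def exposure_distribution(ranked_groups, num_groups, gamma=0.85):
    """Exposure each group receives under a position-discounted ranking.

    ranked_groups: group id of each ranked item, top position first.
    gamma: per-position exposure decay (a DCG-like discount), an assumed choice.
    """
    exposure = np.zeros(num_groups)
    for pos, g in enumerate(ranked_groups):
        exposure[g] += gamma ** pos
    return exposure / exposure.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): how far the observed distribution p drifts from the target q."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def align_scores(scores, groups, target, num_groups, lr=0.5, steps=200):
    """Toy alignment loop: learn per-group score offsets that push the
    ranking's exposure distribution toward `target` (boost under-exposed
    groups, penalise over-exposed ones)."""
    offsets = np.zeros(num_groups)
    for _ in range(steps):
        ranking = np.argsort(-(scores + offsets[groups]))
        observed = exposure_distribution(groups[ranking], num_groups)
        offsets -= lr * (observed - target)  # gradient-like nudge toward target
    return offsets

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    groups = rng.integers(0, 2, size=50)         # two item groups, e.g. human- vs LLM-written
    scores = rng.normal(size=50) + 0.8 * groups  # a retriever that systematically favours group 1
    target = np.array([0.5, 0.5])                # desired exposure (uniform here, for illustration)

    before = exposure_distribution(groups[np.argsort(-scores)], 2)
    offsets = align_scores(scores, groups, target, 2)
    after = exposure_distribution(groups[np.argsort(-(scores + offsets[groups]))], 2)

    print("KL before alignment:", kl_divergence(before, target))
    print("KL after alignment: ", kl_divergence(after, target))
```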




    Published In

    KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
    August 2024
    6901 pages
    ISBN:9798400704901
    DOI:10.1145/3637528
    This work is licensed under a Creative Commons Attribution 4.0 International License.


    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 24 August 2024


    Author Tags

    1. bias
    2. fairness
    3. information retrieval
    4. large language model

    Qualifiers

    • Tutorial

    Funding Sources

    • National Natural Science Foundation of China (NSFC)
    • Youth Innovation Promotion Association CAS
    • National Key R&D Program of China

    Conference

    KDD '24

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

