
DOI: 10.1145/3630106.3658979
Research Article | Open Access

Collective Constitutional AI: Aligning a Language Model with Public Input

Published: 05 June 2024

Abstract

There is growing consensus that language model (LM) developers should not be the sole deciders of LM behavior, creating a need for methods that enable the broader public to collectively shape the behavior of LM systems that affect them. To address this need, we present Collective Constitutional AI (CCAI): a multi-stage process for sourcing and integrating public input into LMs—from identifying a target population to sourcing principles to training and evaluating a model. We demonstrate the real-world practicality of this approach by creating what is, to our knowledge, the first LM fine-tuned with collectively sourced public input and evaluating this model against a baseline model trained with established principles from an LM developer. Our quantitative evaluations demonstrate several benefits of our approach: the CCAI-trained model shows lower bias across nine social dimensions compared to the baseline model, while maintaining equivalent performance on language, math, and helpful-harmless evaluations. Qualitative comparisons of the models suggest that the models differ on the basis of their respective constitutions, e.g., when prompted with contentious topics, the CCAI-trained model tends to generate responses that reframe the matter positively instead of refusing. These results demonstrate a promising, tractable pathway toward publicly informed development of language models.
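The abstract describes the principle-sourcing stage only at a high level. As a purely illustrative sketch, and not the authors' actual pipeline, the short Python program below shows one plausible way collectively voted statements could be filtered into constitution-style principles. The Statement fields, the 70% support threshold, the 30-vote minimum, and the phrasing template are all hypothetical assumptions introduced for this example.

# Hypothetical sketch (not the authors' code): turning publicly voted
# statements into constitution-style principles, in the spirit of the
# CCAI process summarized above. All thresholds are illustrative.

from dataclasses import dataclass

@dataclass
class Statement:
    text: str        # the normative statement participants voted on
    agree: int       # participants voting "agree"
    disagree: int    # participants voting "disagree"
    unsure: int      # participants voting "pass/unsure"

    @property
    def support(self) -> float:
        """Fraction of non-pass votes that agree with the statement."""
        voted = self.agree + self.disagree
        return self.agree / voted if voted else 0.0

def build_constitution(statements, min_support=0.7, min_votes=30):
    """Keep statements with broad support and a minimum vote count,
    then phrase each one as a constitutional principle."""
    selected = [
        s for s in statements
        if s.support >= min_support and (s.agree + s.disagree) >= min_votes
    ]
    return [
        f"Choose the response that best follows: {s.text}"
        for s in sorted(selected, key=lambda s: s.support, reverse=True)
    ]

if __name__ == "__main__":
    polled = [
        Statement("The AI should be as helpful to the user as possible.", 90, 5, 5),
        Statement("The AI should never give medical advice.", 40, 45, 15),
    ]
    for principle in build_constitution(polled):
        print(principle)

In the paper's actual process, statements were gathered and voted on through an online deliberation platform (Polis); this sketch conveys only the general shape of an aggregation step, not the method used.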



Published In

FAccT '24: Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency
June 2024
2580 pages
ISBN: 9798400704505
DOI: 10.1145/3630106
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. AI alignment
  2. AI bias
  3. AI ethics
  4. collective alignment
  5. generative AI
  6. human-centered AI
  7. participatory AI
  8. reinforcement learning from human feedback
  9. value alignment
