
DOI: 10.1145/3632620.3671097
research-article
Open access

Using Benchmarking Infrastructure to Evaluate LLM Performance on CS Concept Inventories: Challenges, Opportunities, and Critiques

Published: 12 August 2024

Abstract

BACKGROUND AND CONTEXT. The pace of advancement of large language models (LLMs) motivates the use of existing infrastructure to automate the evaluation of LLM performance on computing education tasks. Concept inventories are well suited for evaluation because of their careful design and prior validity evidence.
OBJECTIVES. Our research explores the feasibility of using an automated benchmarking framework to evaluate computer science (CS) concept inventories. We explore three primary objectives: evaluation of LLM performance on the SCS1 and BDSI concept inventories; an informal expert panel review of items with discrepancies between LLM and expected student performance; and a description of the challenges of using benchmarking infrastructure as a methodological innovation.
METHOD. We used the Holistic Evaluation of Language Models (HELM) framework to evaluate the SCS1 and BDSI against 10 LLMs with zero-shot and few-shot in-context learning: GPT (3.5, 4.0), Claude (1.3, 2.0, 2.1), Llama (7B, 13B, 70B), Mistral v0.1 7B, and Mixtral 8x7B. We used psychometric data from prior studies to measure knowledge levels for each LLM run. We then conducted an informal expert review to qualitatively explore how question design, CS content knowledge, and LLM design may explain differences between LLM and expected student performance.
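
As a rough illustration of the zero-shot versus few-shot multiple-choice setup described above, the sketch below shows one way such an evaluation could be wired up. This is a minimal, hypothetical sketch, not the paper's actual HELM configuration: the MCQItem class, the prompt format, the demo item, and the query_model placeholder are all illustrative assumptions.

    # Hypothetical sketch of zero-shot / few-shot MCQ evaluation (not the paper's HELM setup).
    # `query_model` is a placeholder for whatever LLM client is available.
    from dataclasses import dataclass

    @dataclass
    class MCQItem:
        stem: str
        options: list[str]   # e.g. ["A) 1", "B) 2", ...]
        answer: str          # correct option letter, e.g. "A"

    def format_prompt(item: MCQItem, examples=()) -> str:
        """Zero-shot when `examples` is empty; few-shot (in-context learning) otherwise."""
        parts = []
        for ex in examples:
            parts.append(ex.stem + "\n" + "\n".join(ex.options) + f"\nAnswer: {ex.answer}\n")
        parts.append(item.stem + "\n" + "\n".join(item.options) + "\nAnswer:")
        return "\n".join(parts)

    def query_model(prompt: str) -> str:
        """Placeholder: swap in a real completion call for the model under evaluation."""
        return "A"  # dummy reply so the sketch runs end to end

    def accuracy(items, shots=()) -> float:
        correct = sum(query_model(format_prompt(it, shots)).strip().upper().startswith(it.answer)
                      for it in items)
        return correct / len(items)

    if __name__ == "__main__":
        demo = [MCQItem("What does 3 % 2 evaluate to in Python?",
                        ["A) 1", "B) 2", "C) 0", "D) 3"], "A")]
        print(accuracy(demo))  # 1.0 with the dummy reply above

In the study itself, the per-item response pattern from each LLM run, rather than a single accuracy number, is what feeds the psychometric analysis.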
FINDINGS. Our quantitative analysis found that most LLM response patterns reflected a below-average introductory computing student on the SCS1 and did not fit the psychometric 2PL model for the BDSI. Our qualitative analysis identified that LLMs performed well on code infill questions but poorly on nested conditionals, runtime analysis, and longer questions. We also identified several methodological challenges related to item security, item translation, and item structuring when using HELM.
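
For reference, the standard form of the two-parameter logistic (2PL) IRT model named in the findings gives the probability that a respondent with ability \theta answers item i correctly, where a_i is the item's discrimination and b_i its difficulty:

    P_i(\theta) = \frac{1}{1 + \exp\left[-a_i(\theta - b_i)\right]}

A response pattern that cannot be placed on the \theta ability scale with acceptable fit under this model is, roughly, what the BDSI result above refers to as not fitting the 2PL model.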
IMPLICATIONS. We consider the feasibility of using automated benchmarking as a methodology to support more reproducible, replicable, and rigorous investigations to understand the intersection of LLM capabilities, computing concepts, and assessment design. We also consider connections between psychometric approaches and LLM evaluations to inform the design of computing assessments that are more resilient to LLM advancements.



      Published In

      ICER '24: Proceedings of the 2024 ACM Conference on International Computing Education Research - Volume 1
      August 2024
      539 pages
      ISBN: 9798400704758
      DOI: 10.1145/3632620
      This work is licensed under a Creative Commons Attribution 4.0 International License.

      Publisher

      Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 12 August 2024


      Author Tags

      1. benchmarking
      2. computing education
      3. concept inventories
      4. large language models
      5. psychometrics

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • Stanford Institute for Human-Centered Artificial Intelligence
      • OpenAI
      • Stanford Accelerator for Learning
      • McCoy Family Center for Ethics in Society
      • Center for Research on Foundation Models

      Conference

      ICER 2024

      Acceptance Rates

      Overall Acceptance Rate 189 of 803 submissions, 24%


