Computer Science > Software Engineering

arXiv:2401.06461 (cs)

[Submitted on 12 Jan 2024 (v1), last revised 30 Jul 2024 (this version, v5)]

Title: Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers

Authors: Yuling Shi, Hongyu Zhang, Chengcheng Wan, Xiaodong Gu

Abstract: Large language models have catalyzed an unprecedented wave in code generation. While achieving significant advances, they blur the distinctions between machine- and human-authored source code, causing integrity and authenticity issues for software artifacts. Previous methods such as DetectGPT have proven effective in discerning machine-generated texts, but they do not identify and harness the unique patterns of machine-generated code. Thus, their applicability falters when applied to code. In this paper, we carefully study the specific patterns that characterize machine- and human-authored code. Through a rigorous analysis of code attributes such as lexical diversity, conciseness, and naturalness, we expose unique patterns inherent to each source. We particularly notice that the syntactic segmentation of code is a critical factor in identifying its provenance. Based on our findings, we propose DetectCodeGPT, a novel method for detecting machine-generated code, which improves DetectGPT by capturing the distinct stylized patterns of code. Diverging from conventional techniques that depend on external LLMs for perturbations, DetectCodeGPT perturbs the code corpus by strategically inserting spaces and newlines, ensuring both efficacy and efficiency. Experimental results show that our approach significantly outperforms state-of-the-art techniques in detecting machine-generated code.
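The whitespace-perturbation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `perturb_whitespace` and `perturbation_score`, the probabilities, the number of perturbations, and the DetectGPT-style scoring rule (original log-likelihood minus the mean log-likelihood of perturbed copies) are all assumptions chosen for illustration. The `log_likelihood` callable is a hypothetical stand-in for a scoring model that the caller supplies.

```python
import random
from statistics import mean
from typing import Callable, Optional


def perturb_whitespace(
    code: str,
    space_prob: float = 0.2,     # assumed value, not taken from the paper
    newline_prob: float = 0.2,   # assumed value, not taken from the paper
    rng: Optional[random.Random] = None,
) -> str:
    """Randomly widen existing spaces and add blank lines.

    Leading indentation is left untouched. This character-level sketch does
    not parse the code, so whitespace inside string literals may also change.
    """
    rng = rng or random.Random()
    out = []
    at_line_start = True
    for ch in code:
        out.append(ch)
        if ch == "\n":
            at_line_start = True
            if rng.random() < newline_prob:
                out.append("\n")  # insert an extra blank line
        elif ch in " \t":
            # Only widen spaces that separate tokens, never indentation.
            if ch == " " and not at_line_start and rng.random() < space_prob:
                out.append(" ")
        else:
            at_line_start = False
    return "".join(out)


def perturbation_score(
    code: str,
    log_likelihood: Callable[[str], float],  # hypothetical scorer supplied by the caller
    n_perturbations: int = 20,               # assumed value
    rng: Optional[random.Random] = None,
) -> float:
    """DetectGPT-style discrepancy: how far the score drops under perturbation."""
    rng = rng or random.Random()
    perturbed = [
        log_likelihood(perturb_whitespace(code, rng=rng))
        for _ in range(n_perturbations)
    ]
    return log_likelihood(code) - mean(perturbed)
```

Under this sketch, a larger score means the snippet's likelihood drops more sharply when its whitespace is disturbed. A DetectGPT-style detector treats that as evidence of machine authorship, subject to a threshold that would have to be calibrated on labeled data.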

Comments:	Accepted by the 47th International Conference on Software Engineering (ICSE 2025). Code available at this https URL
Subjects:	Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as:	arXiv:2401.06461 [cs.SE]
	(or arXiv:2401.06461v5 [cs.SE] for this version)
	https://doi.org/10.48550/arXiv.2401.06461

Submission history

From: Yuling Shi
[v1] Fri, 12 Jan 2024 09:15:20 UTC (14,414 KB)
[v2] Wed, 24 Jan 2024 14:57:42 UTC (14,545 KB)
[v3] Sun, 24 Mar 2024 01:20:49 UTC (2,719 KB)
[v4] Mon, 8 Jul 2024 08:45:55 UTC (4,745 KB)
[v5] Tue, 30 Jul 2024 09:26:04 UTC (4,759 KB)