research-article

ICSD: An Automatic System for Insecure Code Snippet Detection in Stack Overflow over Heterogeneous Information Network

Authors:

Qi Xiong

ACSAC '18: Proceedings of the 34th Annual Computer Security Applications Conference

Pages 542 - 552

https://doi.org/10.1145/3274694.3274742

Published: 03 December 2018

Abstract

As modern social coding platforms such as Stack Overflow grow in popularity, so do their potential security risks (e.g., insecure code can easily be embedded and distributed). To address this largely overlooked issue, we exploit social coding properties in addition to code content to automatically detect insecure code snippets in Stack Overflow. To determine whether a given code snippet is insecure, we not only analyze its content but also use the various relations among users, badges, questions, answers, code snippets, and keywords in Stack Overflow. To model these rich semantic relationships, we first introduce a structured heterogeneous information network (HIN) as the representation and then use a meta-path based approach to incorporate higher-level semantics and build relatedness over code snippets. We then propose a novel network embedding model, snippet2vec, for representation learning in the HIN, in which both the HIN structure and semantics are maximally preserved. Finally, we construct a multi-view fusion classifier for insecure code snippet detection. To the best of our knowledge, this is the first work to use both code content and social coding properties to address code security issues on modern software coding platforms. Comprehensive experiments on data collected from Stack Overflow validate the effectiveness of the resulting system, ICSD, which integrates the proposed method, in comparison with alternative approaches.
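
To make the idea of meta-path based relatedness concrete, the following is a minimal sketch and not the paper's implementation. It assumes a single hypothetical meta-path, snippet -> keyword -> snippet, over invented toy data, and scores relatedness with a PathSim-style normalization; the abstract does not say which meta-paths or similarity measure ICSD actually uses. The names `snippet_keywords`, `metapath_counts`, and `pathsim` are illustrative only.

```python
# Hypothetical toy data: which keywords appear in which code snippets.
# None of these snippets or keywords come from the paper.
snippet_keywords = {
    "s1": {"TrustManager", "checkServerTrusted"},
    "s2": {"TrustManager", "SSLContext"},
    "s3": {"StringBuilder"},
}


def metapath_counts(contains):
    """Count path instances of snippet -> keyword -> snippet for every pair.

    Two snippets are connected once per keyword they share, so the count
    is the size of the intersection of their keyword sets.
    """
    counts = {}
    for a, keywords_a in contains.items():
        for b, keywords_b in contains.items():
            counts[(a, b)] = len(keywords_a & keywords_b)
    return counts


def pathsim(counts, a, b):
    """PathSim-style similarity for a symmetric meta-path (an assumption here).

    Normalizes the number of paths between a and b by the number of paths
    from each snippet back to itself.
    """
    denom = counts[(a, a)] + counts[(b, b)]
    return 2 * counts[(a, b)] / denom if denom else 0.0


counts = metapath_counts(snippet_keywords)
print(pathsim(counts, "s1", "s2"))  # snippets sharing one keyword
print(pathsim(counts, "s1", "s3"))  # snippets sharing no keyword
```

In a fuller HIN, other meta-paths (for example through answers or users) would each produce their own relatedness matrix, giving different views of how snippets relate.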

Index Terms

ICSD: An Automatic System for Insecure Code Snippet Detection in Stack Overflow over Heterogeneous Information Network
1. Computing methodologies
  1. Machine learning
    1. Machine learning algorithms
2. Security and privacy
  1. Software and application security
    1. Software security engineering

Information & Contributors

Information

Published In

ACSAC '18: Proceedings of the 34th Annual Computer Security Applications Conference

December 2018

766 pages

ISBN:9781450365697

DOI:10.1145/3274694

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

ACSA: Applied Computing Security Assoc

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article
Research
Refereed limited

Funding Sources

U.S. National Science Foundation
WV HEPC

Conference

ACSAC '18

ACSAC '18: 2018 Annual Computer Security Applications Conference

December 3 - 7, 2018

San Juan, PR, USA

Acceptance Rates

Overall Acceptance Rate 104 of 497 submissions, 21%
