DOI: 10.1145/3274694.3274742
research-article

ICSD: An Automatic System for Insecure Code Snippet Detection in Stack Overflow over Heterogeneous Information Network

Published: 03 December 2018

Abstract

As modern social coding platforms such as Stack Overflow grow in popularity, their potential security risks grow as well (e.g., insecure code can easily be embedded and distributed). To address this largely overlooked issue, we exploit social coding properties in addition to code content to automatically detect insecure code snippets in Stack Overflow. To determine whether a given code snippet is insecure, we not only analyze its content but also use the various relations among users, badges, questions, answers, code snippets, and keywords in Stack Overflow. To model these rich semantic relationships, we first introduce a structured heterogeneous information network (HIN) for representation and then use a meta-path based approach to incorporate higher-level semantics and build relatedness over code snippets. We then propose a novel network embedding model named snippet2vec for representation learning in the HIN, in which both the HIN structure and its semantics are maximally preserved. Finally, a multi-view fusion classifier is constructed for insecure code snippet detection. To the best of our knowledge, this is the first work to use both code content and social coding properties to address code security issues in modern software coding platforms. Comprehensive experiments on data collected from Stack Overflow validate the effectiveness of the developed system, ICSD, which integrates our proposed method, by comparison with alternative approaches.
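
The idea of meta-path based relatedness over code snippets can be illustrated with a small sketch. The abstract does not state which meta-paths or similarity measure ICSD uses, so everything below is an assumption made for illustration: the toy data, the meta-path snippet-answer-user-answer-snippet, and the choice of PathSim (a standard similarity measure from the HIN literature) are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical toy HIN: which answers contain each snippet, and who
# posted each answer. None of these names come from the paper.
SNIPPET_TO_ANSWERS = {"s1": ["a1"], "s2": ["a2"], "s3": ["a3"]}
ANSWER_TO_USERS = {"a1": ["alice"], "a2": ["alice"], "a3": ["bob"]}


def half_path(first_hop, second_hop):
    """Compose two relations: count paths from each source to each endpoint."""
    reach = {}
    for src, mids in first_hop.items():
        counts = Counter()
        for mid in mids:
            counts.update(second_hop.get(mid, []))
        reach[src] = counts
    return reach


def path_count(reach, x, y):
    """Count instances of the symmetric meta-path between x and y."""
    return sum(n * reach[y][node] for node, n in reach[x].items())


def pathsim(reach, x, y):
    """PathSim: 2 * paths(x, y) / (paths(x, x) + paths(y, y))."""
    denom = path_count(reach, x, x) + path_count(reach, y, y)
    return 2 * path_count(reach, x, y) / denom if denom else 0.0


# snippet -> user reachability (the first half of the symmetric meta-path).
reach = half_path(SNIPPET_TO_ANSWERS, ANSWER_TO_USERS)
print(pathsim(reach, "s1", "s2"))  # snippets posted by the same user
print(pathsim(reach, "s1", "s3"))  # snippets posted by different users
```

Under this sketch, two snippets are related when their answers were posted by the same user. Other meta-paths (for example through shared keywords or questions) would be built the same way by composing different relations in half_path.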


      Published In

      ACSAC '18: Proceedings of the 34th Annual Computer Security Applications Conference
      December 2018
      766 pages
      ISBN:9781450365697
      DOI:10.1145/3274694

      In-Cooperation

      • ACSA: Applied Computing Security Assoc

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Code Security
      2. Heterogeneous Information Network
      3. Multi-view Fusion
      4. Network Representation Learning
      5. Social Coding

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Funding Sources

      • U.S. National Science Foundation
      • WV HEPC

      Conference

      ACSAC '18

      Acceptance Rates

      Overall Acceptance Rate 104 of 497 submissions, 21%
