DOI: 10.1145/3372297.3423360
research-article

RTFM! Automatic Assumption Discovery and Verification Derivation from Library Document for API Misuse Detection

Published: 02 November 2020

Abstract

To use library APIs correctly, a developer must follow the library's guidance and respect certain constraints, which we call integration assumptions (IAs). Violating these assumptions can have serious consequences, introducing security-critical flaws such as use-after-free, NULL-dereference, and authentication errors. Checking a program for compliance with IAs takes significant effort and needs to be automated. A promising direction is to recover IAs automatically from library documentation using Natural Language Processing (NLP) and then verify, through code analysis, that a program's API usage is consistent with them. A practical solution along this line, however, must overcome several key challenges, particularly discovering IAs in loosely formatted documents and interpreting their informal descriptions to identify complicated constraints (e.g., data-/control-flow relations between different APIs).
In this paper, we present a new technique for automated assumption discovery and verification derivation from library documents. Our approach, called Advance, uses a suite of innovations to address these challenges. More specifically, we leverage the observation that IAs tend to express strong sentiment to emphasize the importance of a constraint, particularly a security-critical one, and use a new sentiment analysis model to accurately recover them from loosely formatted documents. These IAs are then processed with an embedding model to resolve hidden references to APIs and parameters and to identify the information-flow relations that are expected to be followed. Next, our approach runs frequent subtree mining to discover the grammatical units in IA sentences that tend to indicate categories of constraints with potential security implications. These units are mapped to verification code snippets, organized in line with the IA sentence's grammatical structure, which can be assembled into verification code executed through CodeQL to discover misuses inside a program. We implemented this design and evaluated it on 5 popular libraries (OpenSSL, SQLite, libpcap, libdbus and libxml2) and 39 real-world applications. Our analysis discovered 193 API misuses, including 139 flaws never reported before.
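
The first stage described in the abstract, picking out strongly worded constraint sentences from documentation, can be illustrated with a minimal sketch. This is not the Advance sentiment analysis model: it swaps in a plain keyword heuristic as a stand-in, and the cue list, the `CandidateIA` type, the sample documentation text, and the `lib_open()`/`lib_close()` API names are all hypothetical, chosen only for illustration.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical cue phrases. The paper uses a trained sentiment model;
# this keyword list is only a stand-in for illustration.
STRONG_CUES = ("must", "must not", "should not", "never", "required")

# Matches identifiers written with trailing parentheses, e.g. "lib_close()".
API_NAME = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\(\)")


@dataclass
class CandidateIA:
    sentence: str     # the documentation sentence
    cues: List[str]   # cue phrases found in it
    apis: List[str]   # API names mentioned explicitly


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_candidate_ias(doc: str) -> List[CandidateIA]:
    """Return sentences that contain at least one strong cue phrase."""
    candidates = []
    for sentence in split_sentences(doc):
        lowered = sentence.lower()
        cues = [c for c in STRONG_CUES
                if re.search(rf"\b{re.escape(c)}\b", lowered)]
        if cues:
            candidates.append(
                CandidateIA(sentence, cues, API_NAME.findall(sentence)))
    return candidates


# Invented documentation text, not quoted from any real library.
DOC = (
    "lib_open() returns a handle to the resource. "
    "The caller must release the handle with lib_close() when it is no longer needed. "
    "The returned pointer must not be freed by the application."
)

for ia in find_candidate_ias(DOC):
    print(ia.cues, ia.apis, "|", ia.sentence)
```

In this toy input the third sentence names no API at all ("the returned pointer"), which is the kind of hidden reference the abstract says the embedding model is meant to resolve; the keyword heuristic above makes no attempt at that step.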




Published In

CCS '20: Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security
October 2020
2180 pages
ISBN: 9781450370899
DOI: 10.1145/3372297
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. API misuse
  2. documentation analysis
  3. integration assumption
  4. verification code generation

Qualifiers

  • Research-article

Conference

CCS '20

Acceptance Rates

Overall Acceptance Rate 1,261 of 6,999 submissions, 18%

Cited By

  • (2024) Discovering API usage specifications for security detection using two-stage code mining. Cybersecurity 7:1, DOI 10.1186/s42400-024-00224-w, online publication date: 3-Oct-2024
  • (2024) API Misuse Detection via Probabilistic Graphical Model. Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis (88-99), DOI 10.1145/3650212.3652112, online publication date: 11-Sep-2024
  • (2024) APP-Miner: Detecting API Misuses via Automatically Mining API Path Patterns. 2024 IEEE Symposium on Security and Privacy (SP) (4034-4052), DOI 10.1109/SP54263.2024.00043, online publication date: 19-May-2024
  • (2023) UVSCAN. Proceedings of the 32nd USENIX Conference on Security Symposium (3421-3438), DOI 10.5555/3620237.3620429, online publication date: 9-Aug-2023
  • (2023) Jeu de mots paronomasia: a StackOverflow-driven bug discovery approach. Cybersecurity 6:1, DOI 10.1186/s42400-023-00153-0, online publication date: 3-Apr-2023
  • (2023) "Get in Researchers; We're Measuring Reproducibility": A Reproducibility Study of Machine Learning Papers in Tier 1 Security Conferences. Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (3433-3459), DOI 10.1145/3576915.3623130, online publication date: 15-Nov-2023
  • (2023) APICad: Augmenting API Misuse Detection through Specifications from Code and Documents. Proceedings of the 45th International Conference on Software Engineering (245-256), DOI 10.1109/ICSE48619.2023.00032, online publication date: 14-May-2023
  • (2023) NLP-based Cross-Layer 5G Vulnerabilities Detection via Fuzzing Generated Run-Time Profiling. 2023 IEEE 12th International Conference on Cloud Networking (CloudNet) (194-202), DOI 10.1109/CloudNet59005.2023.10490042, online publication date: 1-Nov-2023
  • (2022) The inconsistency of documentation: a study of online C standard library documents. Cybersecurity 5:1, DOI 10.1186/s42400-022-00118-9, online publication date: 2-Jul-2022
  • (2022) FUM - A Framework for API Usage constraint and Misuse Classification. 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) (673-684), DOI 10.1109/SANER53432.2022.00085, online publication date: Mar-2022
