Nothing Special   »   [go: up one dir, main page]

skip to main content
research-article
Public Access

Graspan: A Single-machine Disk-based Graph System for Interprocedural Static Analyses of Large-scale Systems Code

Published: 04 April 2017 Publication History

Abstract

There is more than a decade-long history of using static analysis to find bugs in systems such as Linux. Most of the existing static analyses developed for these systems are simple checkers that find bugs based on pattern matching. Despite the presence of many sophisticated interprocedural analyses, few of them have been employed to improve checkers for systems code due to their complex implementations and poor scalability. In this paper, we revisit the scalability problem of interprocedural static analysis from a "Big Data" perspective. That is, we turn sophisticated code analysis into Big Data analytics and leverage novel data processing techniques to solve this traditional programming language problem. We develop Graspan, a disk-based parallel graph system that uses an edge-pair centric computation model to compute dynamic transitive closures on very large program graphs.
We implement context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases such as Linux shows that their Graspan implementations scale to millions of lines of code and are much simpler than their original implementations. Moreover, we show that these analyses can be used to augment the existing checkers; these augmented checkers uncovered 132 new NULL pointer bugs and 1308 unnecessary NULL tests in Linux 4.4.0-rc5, PostgreSQL 8.3.9, and Apache httpd 2.2.18.

References

[1]
The findbugs Java static checker. http://findbugs.sourceforge.net/, 2015.
[2]
The Coverity code checker. http://www.coverity.com/, 2016.
[3]
The GrammaTech CodeSonar static checker, 2016.
[4]
The HP Fortify static checker, 2016.
[5]
The KlocWork static checker, 2016.
[6]
The LLVMLinux project. http://llvm.linuxfoundation.org/, 2016.
[7]
The LogicBlox Datalog engine. http://www.logicblox.com/, 2016.
[8]
Personal communication with John Criswell, 2016.
[9]
A. Aiken, S. Bugrara, I. Dillig, T. Dillig, B. Hackett, and P. Hawkins. An overview of the Saturn project. In PASTE, pages 43--48, 2007.
[10]
R. Alur. Marrying words and trees. In PODS, pages 233--242, 2007.
[11]
R. Alur, M. Benedikt, K. Etessami, P. Godefroid, T. Reps, and M. Yannakakis. Analysis of recursive state machines. ACM Trans. Program. Lang. Syst., 27(4):786--818, 2005.
[12]
R. Alur and P. Madhusudan. Visibly pushdown languages. In STOC, pages 202--211, 2004.
[13]
M. D. Atkinson, J.-R. Sack, N. Santoro, and T. Strothotte. Min-max heaps and generalized priority queues. Commun. ACM, 29(10):996--1000, 1986.
[14]
T. Ball, B. Cook, V. Levin, and S. K. Rajamani. SLAM and static driver verifier: Technology transfer of formal methods inside microsoft. In IFM, pages 1--20, 2004.
[15]
T. Ball, R. Majumdar, T. Millstein, and S. K. Rajamani. Automatic predicate abstraction of c programs. In PLDI, pages 203--213, 2001.
[16]
O. Bastani, S. Anand, and A. Aiken. Specification inference using context-free language reachability. In POPL, pages 553--566, 2015.
[17]
M. Bravenboer and Y. Smaragdakis. Strictly declarative specification of sophisticated points-to analyses. In OOPSLA, pages 243--262, 2009.
[18]
F. Brown, A. Notzli, and D. Engler. How to build static checking systems using orders of magnitude less code. In ASPLOS, pages 143--157, 2016.
[19]
S. Bugrara and A. Aiken. Verifying the safety of user pointer dereferences. In IEEE S&P, pages 325--338, 2008.
[20]
C. Cadar, D. Dunbar, and D. Engler. KLEE: Unassisted and automatic generation of high-coverage tests for complex systems programs. In OSDI, pages 209--224, 2008.
[21]
C. Cadar, V. Ganesh, P. M. Pawlowski, D. L. Dill, and D. R. Engler. EXE: Automatically generating inputs of death. In CCS, pages 322--335, 2006.
[22]
R. Chen, X. Ding, P. Wang, H. Chen, B. Zang, and H. Guan. Computation and communication efficient graph processing with distributed immutable view. In HPDC, pages 215--226, 2014.
[23]
R. Chen, J. Shi, Y. Chen, and H. Chen. PowerLyra: Differentiated graph computation and partitioning on skewed graphs. In EuroSys, pages 1:1--1:15, 2015.
[24]
A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler. An empirical study of operating systems errors. In SOSP, pages 73--88, 2001.
[25]
R. DeLine and M. F\"ahndrich. Enforcing high-level protocols in low-level software. In PLDI, pages 59--69, 2001.
[26]
N. Dor, S. Adams, M. Das, and Z. Yang. Software validation via scalable path-sensitive value flow analysis. In ISSTA, pages 12--22, 2004.
[27]
D. Engler. Making finite verification of raw C code easier than writing a test case. In RV. Invited talk.
[28]
D. Engler, B. Chelf, A. Chou, and S. Hallem. Checking system rules using system-specific, programmer-written compiler extensions. In OSDI, pages 1--1, 2000.
[29]
D. Engler, D. Y. Chen, S. Hallem, A. Chou, and B. Chelf. Bugs as deviant behavior: A general approach to inferring errors in systems code. In SOSP, pages 57--72, 2001.
[30]
S. Fink, E. Yahav, N. Dor, G. Ramalingam, and E. Geay. Effective typestate verification in the presence of aliasing. In ISSTA, pages 133--144, 2006.
[31]
J. S. Foster, M. F\"ahndrich, and A. Aiken. A theory of type qualifiers. In PLDI, pages 192--203, 1999.
[32]
J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In OSDI, pages 17--30, 2012.
[33]
J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. GraphX: Graph processing in a distributed dataflow framework. In OSDI, pages 599--613, 2014.
[34]
J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. GraphX: Graph processing in a distributed dataflow framework. In OSDI, pages 599--613, 2014.
[35]
S. Hallem, B. Chelf, Y. Xie, and D. Engler. A system and language for building system-specific, static analyses. In PLDI, pages 69--82, 2002.
[36]
W.-S. Han, S. Lee, K. Park, J.-H. Lee, M.-S. Kim, J. Kim, and H. Yu. TurboGraph: A fast parallel graph engine handling billion-scale graphs in a single PC. In KDD, pages 77--85, 2013.
[37]
M. Hind. Pointer analysis: Haven't we solved this problem yet? In PASTE, pages 54--61, 2001.
[38]
S. Horwitz, T. Reps, and M. Sagiv. Demand interprocedural dataflow analysis. In FSE, pages 104--115, 1995.
[39]
G. F. Italiano. Amortized efficiency of a path retrieval data structure. Theor. Comput. Sci., 48(2--3):273--281, 1986.
[40]
R. Johnson and D. Wagner. Finding user/kernel pointer bugs with type inference. In USENIX Security, pages 9--9, 2004.
[41]
G. Kastrinis and Y. Smaragdakis. Hybrid context-sensitivity for points-to analysis. In PLDI, pages 423--434, 2013.
[42]
J. Kodumal and A. Aiken. The set constraint/CFL reachability connection in practice. In PLDI, pages 207--218, 2004.
[43]
J. Kodumal and A. Aiken. Regularly annotated set constraints. In PLDI, pages 331--341, 2007.
[44]
A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-scale graph computation on just a PC. In OSDI, pages 31--46, 2012.
[45]
M. S. Lam, S. Guo, and J. Seo. SociaLite: Datalog extensions for efficient social network analysis. In ICDE, pages 278--289, 2013.
[46]
B. W. Lampson. Hints for computer system design. In SOSP, pages 33--48, 1983.
[47]
C. Lattner, A. Lenharth, and V. Adve. Making context-sensitive points-to analysis with heap cloning practical for the real world. In PLDI, pages 278--289, 2007.
[48]
Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: A tool for finding copy-paste and related bugs in operating system code. In OSDI, pages 20--20, 2004.
[49]
Z. Li and Y. Zhou. PR-Miner: Automatically extracting implicit programming rules and detecting violations in large software code. In FSE, pages 306--315, 2005.
[50]
Z. Lin, M. Kahng, K. M. Sabrin, D. H. P. Chau, H. Lee, and U. Kang. MMap: Fast billion-scale graph computation on a pc via memory mapping. In BigData, pages 159--164, 2014.
[51]
Y. Liu and A. Milanova. Static analysis for inference of explicit information flow. In PASTE, pages 50--56, 2008.
[52]
Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endow., 5(8):716--727, 2012.
[53]
G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, G. Czajkowski, and G. Inc. Pregel: A system for large-scale graph processing. In SIGMOD, pages 135--146, 2010.
[54]
D. Melski and T. Reps. Interconvertibility of a class of set constraints and context-free-language reachability. Theoretical Computer Science, 248:29--98, 2000.
[55]
D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: a timely dataflow system. In SOSP, pages 439--455. ACM, 2013.
[56]
G. C. Necula, J. Condit, M. Harren, S. McPeak, and W. Weimer. CCured: Type-safe retrofitting of legacy software. ACM Trans. Program. Lang. Syst., 27(3):477--526, 2005.
[57]
D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In SOSP, pages 456--471, 2013.
[58]
Y. Padioleau, J. Lawall, R. R. Hansen, and G. Muller. Documenting and automating collateral evolutions in linux device drivers. In EuroSys, pages 247--260, 2008.
[59]
N. Palix, G. Thomas, S. Saha, C. Calvès, J. Lawall, and G. Muller. Faults in linux: Ten years later. In ASPLOS, pages 305--318, 2011.
[60]
M. Pundir, L. M. Leslie, I. Gupta, and R. H. Campbell. Zorro: Zero-cost reactive failure recovery in distributed graph processing. In SoCC, pages 195--208, 2015.
[61]
J. Rehof and M. F\"ahndrich. Type-based flow analysis: From polymorphic subtyping to CFL-reachability. In POPL, pages 54--66, 2001.
[62]
T. Reps. Solving demand versions of interprocedural analysis problems. In CC, pages 389--403, 1994.
[63]
T. Reps. Shape analysis as a generalized path problem. In PEPM, pages 1--11, 1995.
[64]
T. Reps. Program analysis via graph reachability. Information and Software Technology, 40(11--12):701--726, 1998.
[65]
T. Reps, S. Horwitz, and M. Sagiv. Precise interprocedural dataflow analysis via graph reachability. In POPL, pages 49--61, 1995.
[66]
T. Reps, S. Horwitz, M. Sagiv, and G. Rosay. Speeding up slicing. In FSE, pages 11--20, 1994.
[67]
L. Roditty and U. Zwick. A fully dynamic reachability algorithm for directed graphs with an almost linear update time. In STOC, pages 184--191, 2004.
[68]
A. Roy, L. Bindschaedler, J. Malicevic, and W. Zwaenepoel. Chaos: Scale-out graph processing from secondary storage. In SOSP, pages 410--424, 2015.
[69]
A. Roy, I. Mihailovic, and W. Zwaenepoel. X-Stream: Edge-centric graph processing using streaming partitions. In SOSP, pages 472--488, 2013.
[70]
C. Rubio-González, H. S. Gunawi, B. Liblit, R. H. Arpaci-Dusseau, and A. C. Arpaci-Dusseau. Error propagation analysis for file systems. In PLDI, pages 270--280, 2009.
[71]
C. Rubio-González and B. Liblit. Defective error/pointer interactions in the linux kernel. In ISSTA, pages 111--121, 2011.
[72]
M. Sagiv, T. Reps, and S. Horwitz. Precise interprocedural dataflow analysis with applications to constant propagation. Theoretical Computer Science, 167(1--2):131--170, 1996.
[73]
M. Sharir and A. Pnueli. Two approaches to interprocedural data flow analysis. In S. Muchnick and N. Jones, editors, Program Flow Analysis: Theory and Applications, pages 189--234. Prentice Hall, 1981.
[74]
A. Shkapsky, M. Yang, M. Interlandi, H. Chiu, T. Condie, and C. Zaniolo. Big data analytics with datalog queries on spark. In SIGMOD, pages 1135--1149, 2016.
[75]
J. Shun and G. E. Blelloch. Ligra: A lightweight graph processing framework for shared memory. In PPoPP, pages 135--146, 2013.
[76]
Y. Smaragdakis, M. Bravenboer, and O. Lhoták. Pick your contexts well: Understanding object-sensitivity. In POPL, pages 17--30, 2011.
[77]
Y. Smaragdakis, G. Kastrinis, and G. Balatsouras. Introspective analysis: Context-sensitivity, across the board. In PLDI, pages 485--495, 2014.
[78]
M. Sridharan and R. Bodik. Refinement-based context-sensitive points-to analysis for Java. In PLDI, pages 387--400, 2006.
[79]
M. Sridharan, D. Gopan, L. Shan, and R. Bodik. Demand-driven points-to analysis for Java. In OOPSLA, pages 59--76, 2005.
[80]
H. Tang, X. Wang, L. Zhang, B. Xie, L. Zhang, and H. Mei. Summary-based context-sensitive data-dependence analysis in presence of callbacks. In POPL, pages 83--95, 2015.
[81]
K. Vora, R. Gupta, and G. Xu. Synergistic analysis of evolving graphs. ACM Trans. Archit. Code Optim., 13(4):32:1--32:27, 2016.
[82]
K. Vora, R. Gupta, and G. Xu. Kickstarter: Fast and accurate computations on streaming graphs via trimmed approximations. In ASPLOS, 2017.
[83]
K. Vora, S. C. Koduru, and R. Gupta. ASPIRE: Exploiting asynchronous parallelism in iterative algorithms using a relaxed consistency based dsm. In OOPSLA, pages 861--878, 2014.
[84]
K. Vora, G. Xu, and R. Gupta. Load the edges you need: A generic I/O optimization for disk-based graph processing. In USENIX ATC, pages 507--522, 2016.
[85]
G. Wang, W. Xie, A. Demers, and J. Gehrke. Asynchronous large-scale graph processing made easy. In CIDR, 2013.
[86]
J. Wang, M. Balazinska, and D. Halperin. Asynchronous and fault-tolerant recursive datalog evaluation in shared-nothing engines. PVLDB, 8(12):1542--1553, 2015.
[87]
K. Wang, G. Xu, Z. Su, and Y. D. Liu. GraphQ: Graph query processing with abstraction refinement\textemdashprogrammable and budget-aware analytical queries over very large graphs on a single PC. In USENIX ATC, pages 387--401, 2015.
[88]
X. Wang, N. Zeldovich, M. F. Kaashoek, and A. Solar-Lezama. Towards optimization-safe systems: Analyzing the impact of undefined behavior. In SOSP, pages 260--275, 2013.
[89]
C. Weiss, C. Rubio-González, and B. Liblit. Database-backed program analysis for scalable error propagation. In ICSE, pages 586--597, 2015.
[90]
J. Whaley and M. Lam. Cloning-based context-sensitive pointer alias analysis using binary decision diagrams. In PLDI, pages 131--144, 2004.
[91]
G. Xu, A. Rountev, and M. Sridharan. Scaling CFL-reachability-based points-to analysis using context-sensitive must-not-alias analysis. In ECOOP, pages 98--122, 2009.
[92]
D. Yan, G. Xu, and A. Rountev. Demand-driven context-sensitive alias analysis for Java. In ISSTA, pages 155--165, 2011.
[93]
J. Yang, C. Sar, and D. Engler. EXPLODE: A lightweight, general system for finding serious storage system errors. In OSDI, pages 10--10, 2006.
[94]
M. Yannakakis. Graph-theoretic methods in database theory. In PODS, pages 230--242, 1990.
[95]
D. M. Yellin. Speeding up dynamic transitive closure for bounded degree graphs. Acta Inf., 30(4):369--384, 1993.
[96]
S. Yong, S. Horwitz, and T. Reps. Pointer analysis for programs with structures and casting. In PLDI, pages 91--103, 1999.
[97]
Q. Zhang, M. R. Lyu, H. Yuan, and Z. Su. Fast algorithms for Dyck-CFL-reachability with applications to alias analysis. In PLDI, pages 435--446, 2013.
[98]
Q. Zhang and Z. Su. Context-sensitive data dependence analysis via linear conjunctive language reachability. In POPL, pages 344--358, 2017.
[99]
Q. Zhang, X. Xiao, C. Zhang, H. Yuan, and Z. Su. Efficient subcubic alias analysis for C. In OOPSLA, pages 829--845, 2014.
[100]
D. Zheng, D. Mhembere, R. Burns, J. Vogelstein, C. E. Priebe, and A. S. Szalay. FlashGraph: processing billion-node graphs on an array of commodity ssds. In FAST, pages 45--58, 2015.
[101]
X. Zheng and R. Rugina. Demand-driven alias analysis for C. In POPL, pages 197--208, 2008.
[102]
X. Zhu, W. Han, and W. Chen. GridGraph: Large scale graph processing on a single machine using 2-level hierarchical partitioning. In USENIX ATC, pages 375--386, 2015.

Cited By

View all
  • (2024)Boosting the Performance of Alias-Aware IFDS Analysis with CFL-Based Environment TransformersProceedings of the ACM on Programming Languages10.1145/36898048:OOPSLA2(2633-2661)Online publication date: 8-Oct-2024
  • (2024)Iterative-Epoch Online Cycle Elimination for Context-Free Language ReachabilityProceedings of the ACM on Programming Languages10.1145/36498628:OOPSLA1(1437-1462)Online publication date: 29-Apr-2024
  • (2023)Recursive State Machine Guided Graph Folding for Context-Free Language ReachabilityProceedings of the ACM on Programming Languages10.1145/35912337:PLDI(318-342)Online publication date: 6-Jun-2023
  • Show More Cited By

Index Terms

  1. Graspan: A Single-machine Disk-based Graph System for Interprocedural Static Analyses of Large-scale Systems Code

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM SIGPLAN Notices
      ACM SIGPLAN Notices  Volume 52, Issue 4
      ASPLOS '17
      April 2017
      811 pages
      ISSN:0362-1340
      EISSN:1558-1160
      DOI:10.1145/3093336
      Issue’s Table of Contents
      • cover image ACM Conferences
        ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems
        April 2017
        856 pages
        ISBN:9781450344654
        DOI:10.1145/3037697
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 04 April 2017
      Published in SIGPLAN Volume 52, Issue 4

      Check for updates

      Author Tags

      1. disk-based systems
      2. graph processing
      3. static analysis

      Qualifiers

      • Research-article

      Funding Sources

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)321
      • Downloads (Last 6 weeks)46
      Reflects downloads up to 21 Nov 2024

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)Boosting the Performance of Alias-Aware IFDS Analysis with CFL-Based Environment TransformersProceedings of the ACM on Programming Languages10.1145/36898048:OOPSLA2(2633-2661)Online publication date: 8-Oct-2024
      • (2024)Iterative-Epoch Online Cycle Elimination for Context-Free Language ReachabilityProceedings of the ACM on Programming Languages10.1145/36498628:OOPSLA1(1437-1462)Online publication date: 29-Apr-2024
      • (2023)Recursive State Machine Guided Graph Folding for Context-Free Language ReachabilityProceedings of the ACM on Programming Languages10.1145/35912337:PLDI(318-342)Online publication date: 6-Jun-2023
      • (2022)Taming transitive redundancy for context-free language reachabilityProceedings of the ACM on Programming Languages10.1145/35633436:OOPSLA2(1556-1582)Online publication date: 31-Oct-2022
      • (2021)RisGraph: A Real-Time Streaming System for Evolving Graphs to Support Sub-millisecond Per-update Analysis at Millions Ops/sProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3457263(513-527)Online publication date: 9-Jun-2021
      • (2021)Research on the Optimization Algorithm of Big Data Computing System2021 International Wireless Communications and Mobile Computing (IWCMC)10.1109/IWCMC51323.2021.9498813(1783-1787)Online publication date: 28-Jun-2021
      • (2020)Large-scale graph processing systems: a surveyFrontiers of Information Technology & Electronic Engineering10.1631/FITEE.190012721:3(384-404)Online publication date: 1-Apr-2020
      • (2024)Exploring Scalability of Value-Flow Graph ConstructionProceedings of the 2024 4th International Conference on Artificial Intelligence, Automation and High Performance Computing10.1145/3690931.3690951(112-117)Online publication date: 19-Jul-2024
      • (2024)HardTaint: Production-Run Dynamic Taint Analysis via Selective Hardware TracingProceedings of the ACM on Programming Languages10.1145/36897688:OOPSLA2(1615-1640)Online publication date: 8-Oct-2024
      • (2024)Context-Free Language Reachability via Skewed TabulationProceedings of the ACM on Programming Languages10.1145/36564518:PLDI(1830-1853)Online publication date: 20-Jun-2024
      • Show More Cited By

      View Options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Login options

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media