

SEAL: Integrating Program Analysis and Repository Mining

Published: 24 July 2023

Abstract

Software projects are complex technical and organizational systems involving large numbers of artifacts and developers. To understand and tame software complexity, a wide variety of program analysis techniques have been developed for bug detection, program comprehension, verification, and more. At the same time, repository mining techniques aim at obtaining insights into the inner socio-technical workings of software projects at a larger scale. While both program analysis and repository mining have been successful on their own, they are largely isolated, which leaves considerable potential for synergies untapped. We present SEAL, the first integrated approach that combines low-level program analysis with high-level repository information. SEAL maps repository information, mined from the development history of a project, onto a low-level intermediate program representation, making it available for state-of-the-art program analysis. SEAL’s integrated approach allows us to efficiently address software engineering problems that span multiple levels of abstraction, from low-level data flow to high-level organizational information. To demonstrate its merits and practicality, we use SEAL to determine which code changes modify central parts of a given software project and how authors interact (indirectly) with each other through code, and we demonstrate that putting static analysis results into a socio-technical context improves their expressiveness and interpretability.

1 Introduction

Software systems are among the most complex human-made systems today. To understand the inner workings and external qualities of complex software systems, researchers and developers have devised a variety of program analysis techniques to extract relevant information, including bug finders, program verifiers, and code metric tools.
The rise of open-source software has triggered the development of repository mining techniques that extract organizational information from software repositories. For example, researchers have investigated how code changes evolve on different platforms [59] or analyzed the characteristics of uncompilable code [21]. Other approaches analyze the socio-technical interaction around a software project to better understand how developers collaborate [7, 24, 25, 26].
Such high-level repository mining approaches typically analyze code at the file, textual, or syntactic level and thus miss important information that is encoded only in the underlying program semantics. Information on the program semantics is often accessible only to low-level static program analysis techniques, such as data-flow analyses. While high-level repository analyses are sufficient to obtain good first results, information that is hidden in the program’s semantics (e.g., information on data flow) is overlooked. This insufficiency precludes important use cases, such as attributing bugs and security vulnerabilities directly to the commits and developers that introduced them or know best how to fix them. For illustration, let us consider the example in Figure 1: Alice commits a change \(c_{\mathrm{1}}\); later, Bob also changes the same file with commit \(c_{\mathrm{2}}\). With a purely textual or syntactical repository analysis, no connection between the two changes can be inferred, except that both change the same file. By mapping the changes to a control-flow graph, we can leverage powerful program analysis techniques to detect previously hidden data dependencies between the changes and, in turn, infer that \(c_{\mathrm{2}}\) could be related to a bug in \(c_{\mathrm{1}}\).
Fig. 1.
Fig. 1. Combining low-level program analysis with high-level repository analyses: Alice makes a change \(c_{\mathrm{1}}\) to a source code file creating revision 1. Later, Bob modifies the same file at a different location with \(c_{\mathrm{2}}\) creating revision 2. A typical repository analysis cannot find a connection between \(c_{\mathrm{1}}\) and \(c_{\mathrm{2}}\) , except that both modified the same, possibly very large, file. If we map the changes of Alice and Bob onto the corresponding control-flow graph and run a data-flow analysis, then we discover a data-flow dependency (red) from \(c_{\mathrm{2}}\) to \(c_{\mathrm{1}}\) .
Our central hypothesis is that a combination of low-level program analysis (in particular, data-flow analysis) with high-level repository information empowers us to answer practically and scientifically relevant questions that neither can answer alone. Many questions are difficult to answer solely based on data from a high-level repository analysis, for example, “Which developers are affected by a change?” or “Which portion of a code base is influenced by a particular change?” The underlying problem is that repository analyses often do not have precise data-flow information at their disposal, if any, or try to include this information in an ad hoc manner, which makes it difficult to apply them to a wide range of practical settings. Conversely, state-of-the-art program analysis techniques and tools do not have organizational information (e.g., about versions or developers) at their disposal. This problem is hard to solve by just combining different tools, because the high-level information needs to map correctly to the program’s operational semantics and to a fairly low-level program representation. We seek to integrate high-level repository information with low-level information in a principled way and to make the combined information available for interpretation.
For this purpose, we have developed SEAL, a parameterizable approach that combines change information from a project’s history with information on the program semantics computed by an inter-procedural, flow- and context-sensitive data-flow analysis. SEAL’s integrated approach allows us to precisely analyze data-flow interactions between commits—a previously infeasible endeavor. More generally, SEAL establishes a mapping from a high-level information source (in our case, commit information) to an intermediate program representation that is suited for writing precise, low-level program analyses, conceptually decoupling the information source from the actual analysis. SEAL defines a general and parameterizable relation between data flows and high-level information and allows us, for the first time, to embed statements on the program semantics into the socio-technical context of a software project.
We implemented SEAL on top of LLVM [35] and PhASAR [51] in the form of a parameterizable framework that adds commit information to the compiler’s intermediate representation (LLVM IR) and combines it with an inter-procedural data-flow analysis to determine which commits interact with each other at the level of data-flow. We designed SEAL to be modular and reusable to enable the commit information to be used by different static program analyses that target LLVM IR.
By means of a diverse set of 13 open-source projects, we demonstrate the practicality of SEAL. Specifically, we apply it to four relevant software analysis problems and demonstrate how a combination of commit and data-flow information can be leveraged to solve these problems: (1) We use SEAL to detect potentially impactful changes that modify central code (i.e., code that interacts with lots of other code) in a software project. (2) We demonstrate the applicability of SEAL to socio-technical analyses by analyzing the socio-technical structure that arises from interactions among authors via commits. (3) We apply our approach to analyze which authors could be affected by a change. (4) Furthermore, we discuss how data generated by SEAL can be used to enrich existing data-flow analyses by making them change-aware and putting their results into a socio-technical context without paying the full costs for follow-up analyses.
In summary, we make the following contributions:
A novel approach, called SEAL, that combines high-level repository mining and precise low-level program analysis.
An open-source implementation of SEAL built on top of LLVM [35] and PhASAR [51] as well as a modified implementation of Clang that allows one to inject repository change information during compilation, which is integrated with PhASAR, allowing for precise data-flow analyses.
An evaluation consisting of four different studies, three that demonstrate the applicability of SEAL, evaluated on 13 open-source projects, and a showcase of how SEAL can be used to make existing data-flow analyses change-aware.
All results and a replication package are publicly available.1 Our implementation and additional evaluation tools are open source and available on our project website.2

2 SEAL at A Glance

In this section, we provide an overview of SEAL and show how mapping repository information onto a compiler IR enables SEAL to combine high-level data with a specialized data-flow analysis. We define the concept of commit interactions, an abstraction that merges both kinds of information and enables us to reason about how data flows connect seemingly unrelated commits. Then, we describe in detail how SEAL can compute these commit interactions using the mapped repository information and a static taint analysis.

2.1 Code Annotation

In a preparation step, we map information from the version control system into a representation on which we later conduct the program analysis. As program representation, we use the compiler’s intermediate representation (IR), which is a common abstraction used in modern compilers and static analysis tools. It is important to note that SEAL and the definitions in Section 2.2 are based on, but not restricted to, a given IR. A key mechanism of our approach is that, during the construction of the IR, we add information to the specific IR instructions that relate them to the commit that introduced the corresponding code. The compiler determines the last change for each source-code line by accessing repository meta-data (e.g., git-blame3) and then annotates the commit hash to the respective instruction. Each commit itself is a snapshot of the repository that represents the changes compared to the previous commit.
Figure 2(a) lists a C program that serves as our running example. The hash of the commit that introduced each source-code line is shown on the right ( \(\triangleright\) ). During compilation, SEAL adds this information to the IR, as shown in Figure 2(b). Each of the IR instructions is annotated by its corresponding commit hash (right side).
Fig. 2.
Fig. 2. Running example: commit information at source-code level and its mapping to the IR.

2.2 Commit Interactions

We devise a formal framework that serves as a basis for SEAL. In particular, the framework defines the abstract structure of commit interactions and commit-interaction paths, based on a given relationship between program elements (data flows, in our case).
A program \(\mathcal {P}\) is a composition of instructions \(i_{\mathrm{1}},\ \ldots ,\ i_{\mathrm{n}} \in \mathcal {P}\) stemming from a sequence of commits \(c_{\mathrm{1}},\ \ldots ,\ c_{\mathrm{m}} \in \mathcal {C}\) , where \(\mathcal {C}\) is the set of all commits from the repository.
Definition 1.
Let base be a function that maps an IR instruction \(i \in \mathcal {P}\) to the corresponding base commit \(c \in \mathcal {C}\), which is attached during compilation.
The base commit represents the initial commit that introduced the source line in question to the code base, as illustrated on the right-hand side of Figure 2(b).
Definition 2.
Function instructions computes the set of all instructions added by commit c (i.e., the set of instructions whose base commit is c).
\begin{equation*} {\mbox{instructions}}(\mathit {c}) = \lbrace \: i\ \,|\ \,{\mbox{base}}(i) = c \ \,\wedge \ \, i \in \mathcal {P} \:\rbrace \end{equation*}
Applied to our example of Figure 2(b), instructions returns for commit \(\texttt {3e8882e}\) a set containing the two instructions alloca in Line 3 and store in Line 6.
Next, we define interactions between commits. We define the relation \(\rightsquigarrow\) (relates to) to represent data-flow interactions. In general, our framework abstracts from any concrete relationship and does not require specific properties of the relation \(\rightsquigarrow\). However, in a concrete instantiation, certain properties can be defined by the user to express the desired analysis semantics. In our case, we require \(\rightsquigarrow\) to be transitive and not symmetric, because the effects of data-flow interactions are directional: A write to a variable only affects subsequent reads, not preceding ones.
Definition 3.
For two commits \(c_{\mathrm{1}},\ c_{\mathrm{2}} \in \mathcal {C}\) , interactions computes all tuples of instructions related by \(\rightsquigarrow\) that belong to \(c_{\mathrm{1}}\) and \(c_{\mathrm{2}}\) .
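In the notation of Definitions 1 and 2, this can be spelled out as follows (our formalization of the stated definition):
\begin{equation*} {\mbox{interactions}}(c_{\mathrm{1}}, c_{\mathrm{2}}) = \lbrace \: (i_{\mathrm{1}}, i_{\mathrm{2}})\ \,|\ \, i_{\mathrm{1}} \rightsquigarrow i_{\mathrm{2}} \ \,\wedge \ \, {\mbox{base}}(i_{\mathrm{1}}) = c_{\mathrm{1}} \ \,\wedge \ \, {\mbox{base}}(i_{\mathrm{2}}) = c_{\mathrm{2}} \:\rbrace \end{equation*}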
Function interactions applied to \(\texttt {3e8882e}\) and \(\texttt {0872f49}\) , for our running example in Figure 2(b), would return an empty set, since there are no data flows between instructions added by \(\texttt {3e8882e}\) and instructions added by \(\texttt {0872f49}\) . Applied to \(\texttt {ea8426c}\) and \(\texttt {0872f49}\) , we obtain: \(\lbrace \:(\,\color{green}{\texttt{alloca}}\!:\!4, \color{green}{\texttt{load}}\!:\!9\,), (\,\color{green}{\texttt{store}}\!:\!7, \color{green}{\texttt{load}}\!:\!9\,),\:\dots \:\rbrace\) , where \(\color{green}{\texttt{inst}}\) \(\!:\!n\) denotes instruction inst at line number n from Figure 2(b).
Once we have created a fixed mapping from commit information to IR and determined the interactions for our program, we can investigate interactions between commits, derived from interactions between instructions.
Definition 4.
The constructor \(\textrm {CI}\) for a commit interaction is a function that maps a pair of instructions that interact with respect to \(\rightsquigarrow\) to a pair of commits.
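Written out in the same notation (again our formalization), the constructor simply lifts an interacting pair of instructions to their base commits:
\begin{equation*} \textrm {CI}(i_{\mathrm{1}}, i_{\mathrm{2}}) = (\,{\mbox{base}}(i_{\mathrm{1}}),\ {\mbox{base}}(i_{\mathrm{2}})\,) \quad \mbox{for} \ i_{\mathrm{1}} \rightsquigarrow i_{\mathrm{2}} \end{equation*}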
Intuitively, two commits interact when code added by the first commit interacts with code from the second commit. In our example, we obtain the commit interaction \((\, \texttt {ea8426c}\,,\ \texttt {0872f49} \,)\) for the pair \((\, \color{green}{\texttt{alloca}}\!:\!4, \color{green}{\texttt{load}}\!:\!9 \,)\) of instructions. That is, code changes from \(\texttt {ea8426c}\) interact with changes introduced by \(\texttt {0872f49}\) .
To further aggregate information from commit interactions and add a context-specific meaning, we group commit interactions for every instruction into a commit-interaction path.
Definition 5.
The constructor \(\mathrm{CIP}\) for a commit-interaction path takes an instruction \(i \in I\) and outputs a tuple whose first element is the set of all commits that interact with the base commit of i, except the base commit itself; the second element is the base commit for i.
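One way to state this formally (our formalization, consistent with the example below):
\begin{equation*} \mathrm{CIP} (i) = (\,\lbrace \: {\mbox{base}}(i^{\prime })\ \,|\ \, i^{\prime } \rightsquigarrow i \:\rbrace \setminus \lbrace {\mbox{base}}(i) \rbrace \,,\ {\mbox{base}}(i)\,) \end{equation*}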
A commit-interaction path aggregates over multiple commit interactions based on a target instruction i. All commits, except the base commit of the target instruction i, belonging to the interaction set of i are merged into a single set. For the ret instruction in Line 12 from Figure 2(b), we obtain the corresponding \(\mathrm{CIP} (\color{green}{\texttt{ret}}:12) =\) \((\lbrace \, \texttt {c4d9b1a},\,\texttt {ea8426c} \,\rbrace \,,\ \texttt {0872f49})\) , which indicates that two other commits, c4d9b1a and ea8426c, interact with the base commit 0872f49 at instruction ret.

2.3 Computing Commit Interactions

For computing commit interactions, we have implemented a flow- and fully context-sensitive, alias-aware, inter-procedural taint analysis based on Interprocedural Distributive Environments (IDE) [50].
IDE is an algorithmic framework for implementing data-flow analyses. To check whether a property of interest holds at a certain point in a program, IDE constructs a so-called exploded super-graph (ESG). An ESG is constructed by replacing each node in the program’s inter-procedural control-flow graph with a bipartite graph representation of the corresponding flow function. Flow functions for identity (id), generating (gen), and removing (kill) data-flow facts are distributive and can be represented as bipartite graphs, as Figure 3 shows. Thus, all gen/kill problems, such as uninitialized variables, available expressions, reaching definitions, and taint analysis, can be expressed in IDE. If a node \((i,d)\) in the ESG is reachable from a special tautological node \(\Lambda\), then the data-flow fact d ( \(\in D\), the data-flow fact domain) holds at instruction i ( \(\in I\), the set of program instructions).
In addition, ESG edges can be annotated with lambda functions that specify value computations, which are solved over a separate value domain V. These so-called edge functions allow one to encode an additional value computation problem that is solved while performing the reachability check. The runtime of IDE is \(\mathcal {O}(|N| \cdot |D|^3)\) [50], where \(|N|\) is the number of nodes in the inter-procedural control-flow graph and \(|D|\) is the size of the data-flow domain D. Thus, the analysis efficiency highly depends on the size of the underlying data-flow domain; the value domain V can even be infinite without affecting the algorithm’s complexity. For example, rather than encoding a linear-constant propagation using flow functions that operate on the data-flow fact domain \(D := \langle v, c \rangle\), which comprises tuples of program variables \(v \in \mathcal {V}\) and their constant integer values \(c \in \mathbb {Z}\), a linear-constant propagation can be encoded much more efficiently using \(D := \mathcal {V}\) and \(V := \mathbb {Z}\). This enables the IDE framework to propagate only constant program variables as data-flow facts while computing their constant values on the separate edge-function domain.
The effect of a set of instructions can be summarized by composing flow (and edge) functions. The composition \(h = g \circ f\) of two flow functions f and g, called a jump function, can be obtained by combining their bipartite graph representations: h is produced by merging the nodes of g with the corresponding nodes of the domain of f. Once a summary for a complete procedure p has been constructed, it can be (re)applied in each subsequent context in which p is called. Figure 4 shows an excerpt of a program and its respective ESG for the taint analysis that SEAL uses to compute commit interactions.
Fig. 3.
Fig. 3. Distributive flow functions and their bipartite graph representations.
Fig. 4.
Fig. 4. Excerpt of the exploded super-graph for analysis \(\mathcal {T}\) conducted on the program shown in Figure 2(b). Identity edge functions ( \(\lambda x.x\) ) have been omitted. Solid arrows ( \(\rightarrow\) ) indicate individual flow (and edge) functions. The two-headed arrow ( \(\twoheadrightarrow\) ) indicates a single jump function j that summarized the effects of the complete function on %0; the remaining jump functions have been omitted to avoid cluttering. The jump function j specifies the value computation problem \(j = \lbrace \texttt {0872f49}\rbrace \cup \lbrace \texttt {c4d9b1a}\rbrace \cup \lbrace \texttt {c4d9b1a}\rbrace \cup \emptyset\) and evaluates to %0 \(\rightsquigarrow \lbrace \texttt {0872f49}, \texttt {c4d9b1a}\rbrace\) .
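To make the gen/kill flow functions of Figure 3 and their composition into jump functions concrete, consider the following minimal C++ sketch (our illustration, not PhASAR’s implementation):

```cpp
// Minimal sketch (ours, not PhASAR's implementation) of distributive
// gen/kill flow functions as set transformers, and of their composition
// into a jump function.
#include <functional>
#include <set>
#include <string>

using Fact = std::string;  // a data-flow fact, e.g., the variable "%0"
using FactSet = std::set<Fact>;
using FlowFn = std::function<FactSet(const FactSet &)>;

// gen: fact d starts to hold (e.g., a variable becomes tainted).
FlowFn gen(Fact d) {
  return [d](const FactSet &in) {
    FactSet out = in;
    out.insert(d);
    return out;
  };
}

// kill: fact d ceases to hold (e.g., a strong update overwrites it).
FlowFn kill(Fact d) {
  return [d](const FactSet &in) {
    FactSet out = in;
    out.erase(d);
    return out;
  };
}

// Composition h = g . f summarizes the effect of executing f, then g.
// IDE performs this composition symbolically by merging the bipartite
// graph representations; here, we compose the functions directly.
FlowFn compose(FlowFn f, FlowFn g) {
  return [f, g](const FactSet &in) { return g(f(in)); };
}

int main() {
  // Jump function for a block that taints %0, taints %1, then kills %0.
  FlowFn jump = compose(gen("%0"), compose(gen("%1"), kill("%0")));
  FactSet result = jump(FactSet{});  // yields {"%1"}
  return result.size() == 1 ? 0 : 1;
}
```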
Taint analysis is a parameterizable analysis that tracks values that have been tainted by one or more sources through the program and that reports potential leaks if a tainted value reaches a sink. Sources and sinks may comprise functions and instructions. The taint analysis \(\mathcal {T}\) that we use in our experiments tracks data flows between the instructions of a given target program. It treats all variable declarations as sources and propagates these variables through the program. As we are interested in all instructions that interact with tainted variables, our set of sinks is empty. We then lift the data flows (i.e., the interactions of instructions with each other) to their respective commits, such that we can determine commit interactions (cf. Section 2.2).
We define the relevant intra-procedural (normal) flow and edge functions, which use our definitions from Section 2.2 to access commit information, formally in Figures 5 and 6. For the sake of brevity, we omit a formal description of the inter-procedural (i.e., call, return, and call-to-return) flow and edge functions and describe them only informally. The call and return flow functions map actual parameters onto the formal parameters at a call site, and vice versa at a callee’s exit instructions (return or throw instructions). The call-to-return flow function generates flow facts for calls to heap-allocating functions, such as malloc() or operator new(), and propagates, alongside the call site, all data-flow facts that are not involved in the function call under analysis. The call and return edge functions are realized as identity, and the call-to-return edge function forwards to the normal edge function implementation.
Fig. 5.
Fig. 5. \(\mathcal {T}\) ’s normal flow functions. Each type of instruction that is important to the analysis is associated with its respective distributive flow function that specifies the ESG edge that needs to be constructed. Function pts() returns the points-to set of a given value. We assume that \(x \in\) pts(x) always holds.
Fig. 6.
Fig. 6. \(\mathcal {T}\) ’s normal edge functions that specify a computation along the ESG edge \(d_{\mathrm{1}} \rightarrow d_{\mathrm{2}}\) for a given instruction and data-flow facts \(d_{\mathrm{1}}, d_{\mathrm{2}}\) . \(\mathbb {C}\) represents a constant literal. The function pts() retrieves the points-to set for a given variable. We assume that \(x \in\) pts(x) always holds.
The analysis \(\mathcal {T}\), starting at the program’s entry point main, taints the target program’s variables (e.g., alloca instructions, cf. Figure 2(b)) as they occur and propagates them as data-flow facts through the program. When analyzing libraries, the analysis treats every publicly accessible function as an entry point. Each tainted variable (i.e., data-flow fact) is associated with a set of commits, which is encoded in lambda calculus using IDE’s edge functions. Initially, this set contains only the data-flow fact’s base commit, produced by \(base(i)\). Whenever an instruction i interacts with one of the data-flow facts d, \({\mbox{base}}(i)\) is added to d’s associated set of commits. The set of commits of a data-flow fact d can be overwritten if d is the target of a store instruction: In this case, all elements of d’s set are removed, and the commit of the store instruction itself as well as all elements of the set associated with the stored data-flow fact are added. An excerpt of the exploded super-graph for our taint analysis \(\mathcal {T}\) conducted on the program from Figure 2(b) is shown in Figure 4.
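This commit-set bookkeeping can be viewed as edge functions over the value domain \(V = \mathcal {P}(\mathcal {C})\). The following C++ sketch illustrates their effect (names and structure are ours, for illustration; the actual implementation encodes these functions for IDE’s solver):

```cpp
// Sketch of T's commit-set bookkeeping as IDE edge functions over the
// value domain V = P(C); names and structure are ours, for illustration.
#include <functional>
#include <set>
#include <string>

using Commit = std::string;
using CommitSet = std::set<Commit>;
using EdgeFn = std::function<CommitSet(const CommitSet &)>;

// A fresh data-flow fact starts with only its base commit.
EdgeFn init(Commit baseCommit) {
  return [baseCommit](const CommitSet &) { return CommitSet{baseCommit}; };
}

// Instruction i interacts with fact d: add base(i) to d's commit set.
EdgeFn interact(Commit baseOfI) {
  return [baseOfI](const CommitSet &in) {
    CommitSet out = in;
    out.insert(baseOfI);
    return out;
  };
}

// Fact d is the target of a store: replace d's set by the commit of the
// store itself plus the set associated with the stored fact.
EdgeFn overwrite(Commit storeCommit, CommitSet storedFactCommits) {
  return [=](const CommitSet &) {
    CommitSet out = storedFactCommits;
    out.insert(storeCommit);
    return out;
  };
}
```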
In the context of our general definitions from Section 2.2, \(\mathcal {T}\) computes the interactions (data flows) and constructs a \(\mathrm{CIP} _{i}\) for every instruction \(i \in \mathcal {P}\) as output.

2.3.1 Indirect Function Calls.

Our IDE-based taint analysis can “see” through indirect call sites [50]. The IDE algorithm is guided through the program under analysis by an inter-procedural control-flow graph (ICFG) that includes call-graph information. We use a call-graph algorithm that resolves indirect calls to function pointers or virtual functions using points-to information computed by a scalable (inter-procedural) Andersen-style [1] points-to analysis. From the analysis’s perspective, there is no difference between C and C++. Calls to function pointers are resolved by computing the points-to set of the respective function pointer. Calls to virtual functions are resolved by computing the points-to set of the respective receiver object to find the corresponding virtual function table and, thus, statically determine potential callee targets.

2.3.2 Soundness and Completeness.

Our taint analysis presented in Section 2.3 is unsound. This is for good reasons: Implementing an analysis that computes a more complex semantic property on realistic C/C++ programs in a sound manner and in an inter-procedural (i.e., whole-program) setting is virtually impossible, or would introduce so much imprecision that it renders the analysis results unusable [40]. Instead, our analysis aims at soundiness [40], a well-known notion in static analysis. Soundy analyses apply sensible under-approximations to compute meaningful results in an inter-procedural analysis setting and are widely accepted in the static analysis community [40, 56]. A soundy analysis, for instance, would sanely assume that system calls and calls to libC behave as expected: Calls to such functions are not analyzed; instead, a summary that models their effects is consulted whenever they have a relevant effect on the client analysis. This is also why all static analyses used for compiler optimization that aim at computing more complex properties are intra-procedural only.
With respect to completeness, our taint analysis is set up to analyze all functions whose definitions are available. The call targets of system calls and calls to libC are typically available only as declarations or are modeled as intrinsic functions by the LLVM framework. LLVM represents specific low-level functions, such as memcpy or memset, as intrinsic functions for which there are no definitions; these function declarations describe only semantics. It is thus up to the code generator to replace them with a software or hardware implementation when generating machine code for the desired target architecture. Our taint analysis hence models calls to the system and libC by applying summaries that describe their effects on points-to, call-graph, and data-flow information. All other call sites, for which the corresponding call targets are available only as declarations, are soundily [40] modeled using the identity transformation.

3 Implementation

In this section, we explain how we instantiated SEAL on top of Clang and LLVM, creating Vlang,4 a modified version of Clang. In what follows, we provide an overview of the full commit-analysis pipeline and show how we compile and analyze real-world projects with the help of Vlang and the analysis framework PhASAR [51].

3.1 Lowering Commit Information to LLVM-IR

The first step of our analysis pipeline (cf. Figure 7) is the lowering step from an abstract syntax tree (AST) to LLVM-IR, in which we enrich LLVM-IR with commit meta-data. During the compilation of a translation unit, Vlang computes for each AST node the last commit that modified the related code and adds it as meta-data to the corresponding generated IR instructions. When lowering an instruction, Vlang queries our framework for relevant commit information, providing the corresponding file, line number, and line offset from the AST node’s expansion location (i.e., the location in the file before macro expansion). The framework then computes the blame of the file using the library libgit2.5 It is important to note that the method for identifying last modifying commits can be configured by the user; in our study, we use git blame. Vlang then generates a commit meta-data object from the commit hash that the framework returns for the line in question and attaches it to the instruction.
Fig. 7.
Fig. 7. Overview of the full build and commit-analysis pipeline. WLLVM injects Vlang into the project’s compilation process to generate LLVM-IR files, containing commit meta-data for every translation unit. Then, all LLVM-IR files are combined into one whole-program file, which is subsequently analyzed to compute commit interactions.
For illustration, consider our example of Figure 2(a): On the right, we show commit hashes per code line. During compilation, Vlang creates an AST and lowers it to LLVM-IR. Figure 2(b) shows the LLVM-IR output after lowering. Let us start with the first statement in Line 3. During lowering, Vlang creates two instructions for this statement: one allocation for the stack variable (alloca in Line 4 of Figure 2(b)) and one to initialize it to 20 (store in Line 7). As one can see, Vlang added commit meta-data to both of these instructions referenced by the meta-data tag !Commit and an identifier ID. The ID !9 points to the meta-data section of the file containing the commit hash. For the sake of simplicity, we leave out the actual meta-data nodes and depict the hash on the right-hand side of Figure 2(b).
After Vlang has processed the input file, all generated LLVM-IR instructions that are related to commits from the project’s Git repository are tagged with the corresponding commit meta-data. The enriched LLVM-IR serves as input to our data-flow analysis (cf. Section 3.3).
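The attachment itself boils down to a standard use of LLVM’s meta-data API. A minimal sketch (the helper name is ours; Vlang’s actual implementation differs):

```cpp
#include "llvm/IR/Instruction.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"

// Attach a commit hash to an IR instruction as !Commit meta-data.
void annotateWithCommit(llvm::Instruction *Inst, llvm::StringRef CommitHash) {
  llvm::LLVMContext &Ctx = Inst->getContext();
  llvm::MDNode *Node =
      llvm::MDNode::get(Ctx, llvm::MDString::get(Ctx, CommitHash));
  Inst->setMetadata("Commit", Node);  // rendered as, e.g., !Commit !9
}
```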

3.2 Creating a Whole-program Bitcode File

A precise commit-interaction analysis requires the data-flow analysis to be inter-procedural, i.e., whole-program and context-sensitive [53]. Analyzing every compilation unit in a separate bitcode file leads to approximations whenever the callees of a call site are defined in another translation unit. We thus implemented our analysis to be whole-program. To create a whole-program bitcode file from the project’s source code, we inject our tool Vlang into the build process. This enables us to reuse the existing build scripts by using Whole Program LLVM (WLLVM)6 as a compiler wrapper, which invokes Vlang and generates and links bitcode files.

3.3 Taint Analysis for Commit Interactions

We implement our taint analysis \(\mathcal {T}\) as an IDE [50] analysis in the PhASAR [51] framework. PhASAR has been built on top of LLVM and provides, among others, a generic IDE solver implementation and all required infrastructure (e.g., control-flow analysis) to solve concrete client data-flow analysis problems.
PhASAR’s generic IDE [50] solver operates on a problem interface type whose implementations correspond to concrete data-flow analysis problems. The interface mainly comprises flow and edge function factories that are queried by the data-flow solver to construct the exploded super-graph and to solve the value computation problems that are specified along its edges. We implemented these flow and edge function factories according to our descriptions in Section 2.3.
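Schematically, such a problem implementation has the following shape (an illustrative sketch with our own names and simplified signatures; PhASAR’s actual interface is templated over an analysis domain and differs in its details):

```cpp
#include <functional>
#include <set>
#include <string>

struct Instruction;         // placeholder for an IR instruction
using Fact = const void *;  // placeholder for a data-flow fact
using FactSet = std::set<Fact>;
using FlowFn = std::function<FactSet(const FactSet &)>;
using CommitSet = std::set<std::string>;
using EdgeFn = std::function<CommitSet(const CommitSet &)>;

// The IDE solver queries these factories while constructing the ESG.
struct CommitInteractionProblem {
  // Flow function for the intra-procedural CFG edge Curr -> Succ.
  FlowFn normalFlow(const Instruction *Curr, const Instruction *Succ);
  // Edge function for the ESG edge between fact D1 at Curr and D2 at Succ.
  EdgeFn normalEdge(const Instruction *Curr, Fact D1,
                    const Instruction *Succ, Fact D2);
  // Analogous factories exist for call, return, and call-to-return edges.
};
```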

3.4 Data Overview

SEAL aims to be extensible, and its results are meant to be integrable into existing study setups and tools. For this purpose, SEAL has a four-layered design, where at each layer the relevant information can be extracted for use by external tools. In what follows, we give a short overview of the data that is produced by each layer and describe how other tools can access it.
AST-level commit information. In the preparation step, before data-flow analysis, SEAL’s compiler extension Vlang makes commit information accessible through Clang’s AST. The commit data is provided by a general abstraction in Vlang that offers an interface to map AST nodes to commit hashes. Through this interface, tools have an easy way to query commit information related to a particular AST node (e.g., to combine commit information with error messages).
LLVM-IR with commit information. During LLVM-IR code generation, Vlang attaches commit information provided by the AST interface to the generated LLVM-IR in the form of meta-data. This way, commit information is attached to LLVM’s internal representation and can be queried, like any other meta-data, through the usual framework API (e.g., to attribute LLVM’s warnings about missed optimizations with author information).
Data-flow-based commit interactions. After commit interactions have been computed, they can be accessed within LLVM’s analysis infrastructure, enabling other analyses to query this information. As described in Section 4.2, this enables other analyses to attach socio-technical information to their analysis results (e.g., an analysis that detected an SQL injection can automatically determine the developers interacting with the vulnerable code and include them in the resolution process). In addition, the raw commit-interaction data can be exported into a YAML file for further analysis.
Aggregated socio-technical information. To ease the analysis of data-flow-based commit interactions, SEAL provides different graph aggregations of the raw commit-interaction data, including commit, author, and commit-author graphs, which we use in Section 4.1. With these graphs, existing approaches have a straightforward way to integrate the data-flow-based commit information.

4 Commit Interactions: Applications

To illustrate the merits and potential of SEAL, we discuss research problems from different domains that can be addressed only by a combination of high-level repository information and low-level data-flow information. More importantly, some problems can be analyzed only if the high-level information is already available during analysis. We will use these problems in the next section to evaluate SEAL from two angles: qualitatively, using scenarios we found in real-world software projects demonstrating how the questions can be answered using SEAL, and quantitatively, using a number of real-world case studies demonstrating that our approach is indeed practical.

4.1 Commit Interaction Graph Analysis

First, we address three problems related to the field of repository mining and similar areas of research that are hard to approach using only high-level or low-level information, but become much more tangible when combining these two. For our evaluation, we aggregate commit interactions in a commit interaction graph. Not only does a graph representation come very naturally, because of the data-flow relation that is used to compute commit interactions, but it also allows us to use methods from graph theory to reason about commit interactions. We can also apply transformations to the graph to access different kinds of information, e.g., information about commit authors.
Central code. With commit interactions, we are able to quantify the impact of the changes made by a commit on the program’s data-flow dependency structure. This is related to the area of change impact analysis, where researchers have developed a multitude of techniques to estimate the impact of a change on a software project, using either high-level information or dependency information [36, 38]. While there are approaches that use both [15, 18, 20, 31], with SEAL, we can combine both kinds of information at the same time in a joint analysis.
Previously, Zimmermann and Nagappan [63] used dependency graphs to identify central program units in a software project and highlight their role as a proxy for defect-prone code. In a similar fashion, Ferreira et al. [14] investigated how interactions between functions in the presence of preprocessor directives relate to the occurrence of vulnerabilities. With commit interactions, we can identify code locations that are central in the dependency structure of the software system under analysis at a much finer granularity and, in addition, link them to commit meta-data (e.g., when or by whom the code was introduced). Changes to such central code are interesting, because their effect on the data that flows through the program is likely very high. For example, the function int align(int i) from the audio codec opus—one of the subject projects of our evaluation in Section 5—calculates how much memory an object of size i needs when it is stored with proper memory alignment (Figure 8). The function receives the size of the object via the parameter i. That parameter is then used to calculate the size with alignment, which is then returned from the function, meaning that there is a data flow from the parameter to the return value. As a consequence, there are also data flows between values passed to the function and usages of its return value. Also, the function is widely used throughout opus with 67 usages across 8 of 22 source files, which causes it to have many data flow connections to many different locations in the system. So, it is fair to say that this function is indeed central. To show how commit interactions can be used to identify such central code, we formulate the following problem:
Fig. 8.
Fig. 8. The function align in file src/opus_private.h from opus revision 348e694.
P1: Which fraction of commits affects central code?
It is important to reiterate at this point that this problem, while being interesting in itself, serves here as a showcase demonstrating SEAL’s ability to address this and similar problems in practice and research in a more systematic and efficient way than was previously possible. Before we measure the centrality of the code introduced by a commit, we first define how we represent commit interactions in a commit interaction graph.
Definition 6.
The commit interaction graph of a program is a directed graph with commits as nodes and interactions among commits as directed edges: \(\textrm {CIG} = (\:\mathcal {C^{\prime }}, \mathcal {CI}\:).\)
In this definition, \(\mathcal {CI}\) refers to the set of all commit interactions of a program, and \(\mathcal {C^{\prime }} \subseteq \mathcal {C}\) is the set of commits that participate in at least one commit interaction, that is, \(\mathcal {C^{\prime }}=\,\lbrace \: c_{\mathrm{1}}, c_{\mathrm{2}}\ \,|\ \,(c_{\mathrm{1}}, c_{\mathrm{2}}) \in \mathcal {CI} \:\rbrace\). Note that this graph may contain multi-edges, since there may be commit interactions that originate from different instructions but have the same base commits.
With commit interactions represented in a graph, we can now employ methods from graph theory to identify commits affecting central code. We identify such commits by identifying central nodes in the commit interaction graph using the node degrees of the commits. We use the node degree to measure centrality, since it relates to our definition of central code quite well; other centrality measures are possible as well. A high node degree means that a commit participates in interactions with many other commits, which is an indicator that the code introduced by that commit is central in the dependency structure of the software system under analysis. We are particularly interested in commits that introduce a relatively small change, since such commits are more easily overlooked than very large commits. This does not mean, however, that large commits cannot introduce central code. In fact, just because of their size, we expect large commits to be very likely to introduce, at least, some central code. For our example from opus, each usage of function align produces incoming interactions (size of the object) and outgoing interactions (size with alignment), and thus, commits associated with that function have a high node degree. Therefore, any—even small—change to align that affects its return value affects every location the function is used at. At the same time, the function itself and consequently any commit that touches only that function is only a few lines long. This scenario is illustrated in Figure 9(a), where a small change to central code introduces many outgoing interactions with other commits. In Figure 9(b), new code is introduced that is not central by itself, but merely interacts with central code. In this case, only a few interactions are attributed to the new commit. This scenario illustrates that commits touching central code can have a large effect on the rest of a software project and, if such commits can be detected, this information might guide testing and review efforts on these critical changes to the code base. SEAL helps to identify such cases.
Fig. 9.
Fig. 9. The influence of changes related to central code on commit interactions. The colored boxes represent different commits and arrows denote commit interactions.
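To make the centrality measure concrete, here is a minimal sketch (our names, not SEAL’s implementation) that treats the CIG as an edge list with multi-edges and computes node degrees:

```cpp
// Sketch: the CIG as an edge list (multi-edges kept, cf. Definition 6)
// and node degrees as a simple centrality measure; names are ours.
#include <map>
#include <string>
#include <utility>
#include <vector>

using Commit = std::string;
using CIG = std::vector<std::pair<Commit, Commit>>;  // (c1, c2) edges

// Node degree (incoming + outgoing) per commit; a high degree suggests
// that the commit's code is central in the dependency structure.
std::map<Commit, int> nodeDegrees(const CIG &edges) {
  std::map<Commit, int> degree;
  for (const auto &[from, to] : edges) {
    ++degree[from];  // outgoing interaction
    ++degree[to];    // incoming interaction
  }
  return degree;
}
```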
Author interactions. Commits consist not only of code changes; they also contain meta-data, such as when and by which author a commit was created. From these data, SEAL can extract information about socio-technical interactions in a software project. Socio-technical interactions are often analyzed using communication or collaboration relationships between developers at the file or function level and may also include dependencies between artifacts [17, 24, 25, 26, 42]. However, indirect dependencies are missed this way, even though they may be relevant to properly characterize certain aspects of the socio-technical interactions in a project, e.g., for classifying the roles of developers.
For example, consider the function ssh_handle_packets from libssh in Figure 10. In this function, a context object, which is returned by ssh_poll_get_ctx, is passed to ssh_poll_ctx_dopoll, creating an indirect dependency between the two functions. This indirect dependency is interesting, because information carried by the context object could be relevant for the computation in ssh_poll_ctx_dopoll. Thus, if a developer modifies the context object in ssh_poll_get_ctx, then he/she also indirectly influences the function ssh_poll_ctx_dopoll but might not be aware of that. With SEAL, we can incorporate such indirect dependencies, which can be detected only with a data-flow analysis, into socio-technical analyses, and thus uncover hidden dependencies between developers. We formulate the following problem to demonstrate that SEAL’s information on commit interactions is useful for answering socio-technical questions about a software project:
Fig. 10.
Fig. 10. Shortened version of function ssh_handle_packets from libssh. The context object ctx returned by ssh_poll_get_ctx in line 7 is passed to ssh_poll_ctx_dopoll in line 9, creating an indirect data flow between the two functions. SEAL can detect such indirect connections.
P2: Which authors interact via commit interactions and what are the characteristics of the arising socio-technical structure?
By lifting SEAL’s commit interaction graph to author information, we can easily identify which authors interact with each other indirectly via data flow. For this purpose, we project the commit interaction graph such that the authors of the commits become nodes and interactions between authors become directed edges. This projection can be implemented via vertex identification, that is, all nodes whose commits have the same author are identified with each other. The remaining edges then represent interactions between authors at the data-flow level. With this information, we can see not only which authors interact with each other, but also which authors interact with especially many (or few) other authors, for example, to determine an author’s role in a software project [8, 25].
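A sketch of this projection (our names; the authorOf mapping stems from commit meta-data; dropping an author’s interactions with their own commits is a choice we make for illustration):

```cpp
// Sketch of the author projection via vertex identification; names are
// ours, and the authorOf mapping stems from commit meta-data.
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Commit = std::string;
using Author = std::string;
using CIG = std::vector<std::pair<Commit, Commit>>;
using AuthorGraph = std::set<std::pair<Author, Author>>;

AuthorGraph projectToAuthors(const CIG &edges,
                             const std::map<Commit, Author> &authorOf) {
  AuthorGraph ag;
  for (const auto &[from, to] : edges) {
    const Author &a = authorOf.at(from);
    const Author &b = authorOf.at(to);
    if (a != b)  // drop an author's interactions with their own commits
      ag.insert({a, b});
  }
  return ag;
}
```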
Commit–author interactions. A commit can interact with commits from one (Figure 11(a)) or multiple authors (Figure 11(b)). A high number of authors participating in interactions for a commit suggests that its author needs to be familiar with code from many different developers, rather than only with their own code or the code of a few developers. This, again, may have implications for the bug-proneness of the associated source code [23] and for emerging coordination requirements between authors [22]. Another example where commit–author interaction data is useful is selecting potential candidates for code review by determining which authors’ code is affected by the commit to be reviewed. To collect this information, we need both commit interaction data (data flow) and commit meta-data (author names). We demonstrate that, with SEAL, we can indeed combine these, addressing the following problem:
P3: How many authors interact via commit interactions?
Fig. 11.
Fig. 11. Code from a commit can interact with code authored by one or multiple developers.
To address this problem, we combine the commit and author interaction graphs. The resulting graph contains all commits and authors as nodes, and a directed edge from a commit \(c_{\mathrm{1}}\) to an author a if and only if there is an edge \((c_{\mathrm{1}}, c_{\mathrm{2}})\) in the program’s commit interaction graph and a is the author of commit \(c_{\mathrm{2}}\). Note that this graph can also be constructed directly from information contained in the commit interaction graph. The outgoing node degree of a commit node gives us the number of interacting authors, showing whether a commit interacts with many or only a few authors.
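Following the same scheme, the commit–author edges can be derived directly from the CIG (again a sketch with our names):

```cpp
// Sketch: deriving commit-author edges directly from the CIG; the
// out-degree of a commit is then its number of interacting authors.
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Commit = std::string;
using Author = std::string;
using CIG = std::vector<std::pair<Commit, Commit>>;

// Edge (c1, a): commit c1 interacts with some commit c2 authored by a.
std::set<std::pair<Commit, Author>>
commitAuthorEdges(const CIG &edges,
                  const std::map<Commit, Author> &authorOf) {
  std::set<std::pair<Commit, Author>> ca;
  for (const auto &[c1, c2] : edges)
    ca.insert({c1, authorOf.at(c2)});
  return ca;
}
```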

4.2 Socio-technical Data-flow Analysis

Data-flow analysis is typically not concerned with commit or socio-technical information, but it can benefit greatly from such additional information. First, computing commit interactions helps solve a multitude of additional data-flow problems with virtually no additional overhead compared to computing commit interactions only. Second, combining information on commit interactions computed by SEAL with the information of a client data-flow analysis provides new insights that were previously locked away. As an example, taint analysis is frequently used to detect code-injection vulnerabilities, such as SQL injections, but it can only report the detected potential security issues; it cannot attribute its findings to a specific project version, author(s), or development team.
As described in Section 2.3, our commit interaction analysis \(\mathcal {T}\) needs to exhaustively compute the precise, full exploded supergraph using IDE [50] to generate the commit data. It propagates all variables of a given target program and therefore computes all data flows for all program variables. Thus, \(\mathcal {T}\) also solves all data-flow problems that are concerned with data flows of program variables. All variations of taint analysis—for any given set of sources and sinks—can therefore be directly answered using the exploded supergraph that has already been constructed for \(\mathcal {T}\). Reusing parts of a previously computed exploded supergraph for a new analysis is beneficial, since IDE’s [50] runtime complexity is \(\mathcal {O}(|N|\cdot |D|^3)\).
More importantly, a taint analysis that is set up to detect SQL injections, for example, can not only be solved on \(\mathcal {T}\)’s exploded supergraph but, in contrast to a traditional data-flow analysis, can also access commit information. SEAL allows one to compose different data-flow analyses with commit information and, for the first time, allows us to embed statements obtained by program analysis into the socio-technical context of a software project. To demonstrate that augmenting client data-flow analysis results with information on commit interactions indeed provides novel insights, we formulate the following problem:
P4: Can commit information be utilized to gain additional insights from traditional data-flow analyses by making them change-aware?
To showcase how SEAL can be used to enrich existing data-flow analyses, we employ an existing taint analysis from PhASAR. We parameterize the taint analysis for detecting SQL injection vulnerabilities, and we enrich it with commit information. As an example, consider Figure 12, which depicts a program snippet that is vulnerable to SQL injection attacks. Any user input is considered tainted and must be sanitized by a call to function sanitizeSQLString before it is sent to the SQL database server using the sink function executeQuery.
Fig. 12.
Fig. 12. A program that is vulnerable to SQL injections. Each \(\triangleright\) indicates the commit that last modified each line. The complete code example is shown later in Figure 21.
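Such a client analysis is parameterized by sources, sanitizers, and sinks. A hypothetical configuration sketch (getUserInput is an assumed source function; the sanitizer and sink names are taken from Figure 12; this is not PhASAR’s actual configuration format):

```cpp
// Hypothetical parameterization of the SQL-injection client analysis.
// getUserInput is an assumed source; sanitizeSQLString and executeQuery
// are the sanitizer and sink from Figure 12. Not PhASAR's actual format.
#include <set>
#include <string>

struct TaintConfig {
  std::set<std::string> sources;     // functions whose results are tainted
  std::set<std::string> sanitizers;  // functions that remove taint
  std::set<std::string> sinks;       // functions that must not see taint
};

const TaintConfig sqliConfig{
    {"getUserInput"},       // assumed user-input source
    {"sanitizeSQLString"},  // sanitizer (Figure 12)
    {"executeQuery"},       // sink (Figure 12)
};
```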
By allowing the taint analysis to exchange information with the commit analysis \(\mathcal {T}\), SEAL can attribute the findings directly to the commits, authors, and development teams that are involved in the critical data flows (and potential security vulnerabilities) reported by the taint analysis. This allows one to determine which commit introduced a potential vulnerability, which authors worked on the code that caused the involved data flows, and which developers should look into the reported issue. A further illustration of how commit data can help to find code authors related to an SQL injection vulnerability is given in Section 7.2. In practice, static analysis—especially if conducted in a whole-program manner—produces lots of potential findings, many of which may be false positives [2, 19], such that it has become a real challenge to prioritize and check them. It is important to note that SEAL cannot determine per se whether a finding is a true or false positive with respect to the original analysis’s semantics. But with socio-technical information, one can contextualize the results to assess their likelihood, and SEAL helps to prioritize and distribute them by providing socio-technical context information. For example, findings that concern multiple people are potentially more complex and could therefore be processed later, or findings could be filtered to involve developers only from a specific team, so that people who potentially understand the issue better can look at it.

5 SEAL in Action

To demonstrate SEAL’s merits and potential, we apply it to investigate the problems presented in Section 4. Again, our goal is not to evaluate these problems in full detail, as this would surely require an entire research paper on its own. Instead, we aim at demonstrating that, with SEAL, we are indeed able to tackle such problems in a way that was hard to achieve without it.
Experimental setup. We use PhASAR in its most precise configuration for our analysis \(\mathcal {T}\). It uses a call graph based on points-to information computed by an Andersen-style [1] pointer analysis to resolve indirect call sites, and it provides our client taint analysis with control-flow, type-hierarchy, and points-to information. Note that this setting aims at soundiness. Soundy analyses, introduced by Livshits et al. [39], use sensible under-approximations to cope with hard-to-analyze language features that would otherwise produce unduly imprecise results.
We selected 13 open-source C/C++ projects from a diverse set of application domains and with different sizes to increase external validity. Table 1 lists these projects along with relevant information. It also lists the revisions that we used for our analysis.
Table 1.
Project   Domain        LOC      Commits  Authors  Revision
bison     Parser        591,687  26,281   253      849ba01b8b
brotli    Compression   34,833   1,030    87       ce222e317e
curl      Web tools     195,685  26,949   871      1803be5746
grep      UNIX utils    619,598  23,794   262      70517057c9
gzip      Compression   622,480  22,193   242      7d3a3c0a12
htop      UNIX utils    25,775   2,243    156      44d1200ca4
libpng    File format   74,571   4,098    58       a37d483651
libssh    Protocol      95,235   5,126    115      cd15043656
libtiff   File format   88,561   3,470    45       1373f8dacb
lrzip     Compression   19,215   935      25       465afe830f
lz4       Compression   18,813   2,541    130      bdc9d3b0c1
opus      Codec         70,267   4,077    107      7b05f44f4b
xz        Compression   38,441   1,298    22       e7da44d515
Table 1. Subject Projects
Metrics for bison, grep, and gzip include submodules.

5.1 Commit Interaction Graph Analysis

In the first part of our evaluation, we address the problems P1–P3 of Section 4: (1) quantitatively using our subject projects and (2) qualitatively highlighting interesting insights we obtained.
Central code. P1 is concerned with the fraction of commits that affect code that is central in the dependency structure of a program. Thereby, we are particularly interested in commits that introduce a relatively small change (cf. Section 4).
As an example, Figure 13 depicts a scatterplot of the two relevant variables—commit size and node degree in the commit interaction graph—for opus and htop.7 The horizontal line denotes the 20th percentile and the vertical line the 80th percentile, respectively, putting small commits with a high node degree in the upper left quadrant. The marginal distributions show that most commits are relatively small and have a low node degree. While the distribution of the commit size has a similar shape for all of our subject projects, there are some differences when it comes to node degrees. For example, for opus, we have three clusters: The biggest cluster consists of commits with a low node degree (below 400), followed by a smaller cluster with medium-sized node degrees (400–800), and an even smaller one where the commits have very high node degrees (above 800). It is this latter cluster that is relevant to identify small commits with central code.
Fig. 13.
Fig. 13. Number of insertions vs. node degree for commits including kernel density estimations.
As an example of a change to central code, we qualitatively inspect commit \(\texttt {348e694}\) from opus (marked in Figure 13(a)). This commit has a very high node degree but touches only one line, which belongs to the function int align(int i) (cf. Figure 14), which we already established to play a central role in the program’s data flow in Section 4. This demonstrates how even a small change can have a huge influence on the data that flows through a program, and it also shows that commit interactions carry different information than code churn.
Fig. 14.
Fig. 14. Commit 348e694 from opus changes only a single line in the function align.
A second example is commit \(\texttt {5e4b182}\) from htop. This commit refactors functions that are intended to replace C’s memory allocation functions, and it shows a very high node degree. These functions were initially introduced in commit \(\texttt {a1f7f28}\), whose successor commit \(\texttt {b54d2dd}\) replaced all occurrences of the original memory allocation functions in htop with the new substitutes, contributing to their central role in the project. Commit \(\texttt {b54d2dd}\) also has a high node degree, but it is much larger and touches 42 different files. This example matches the scenario described in Figure 9(a), which we used to motivate our notion of central code: Commit \(\texttt {5e4b182}\) modifies central code and, thus, affects many commit interactions despite being relatively small.
Investigating P1 demonstrates that, with SEAL, one can gain deep insights into the structure of a software project by combining high-level repository information and low-level data-flow information into novel software metrics, e.g., estimating the importance of a change.
Author interactions. P2 is concerned with the socio-technical structure of a software project demonstrating how meta-data associated with commits can be utilized with SEAL. In particular, we investigate how developers interact with each other at a data-flow level.
As an example, Figure 15 shows for each author of opus, libssh, and libtiff (blue dots) the number of surviving commits (commits that occur in the commit interaction graph) and the number of other authors his/her commits interact with at a data-flow level. It is immediately apparent that, in opus, there is one main developer who authored the vast majority of commits and hence, interacts with code from all other authors. All other authors submitted only comparatively few commits to the project. This pattern can be observed in many of our subject projects, especially the smaller ones. One notable exception to this pattern is libtiff (Figure 15(c)), which has more authors that contributed a larger number of commits. Interestingly, despite this inequality in the distribution of number of commits, the number of interacting authors is more evenly distributed. That is, there are authors who introduce interactions to code from many other authors with only a few commits, whereas the other authors’ code interacts with only very few other authors. Such information is useful to identify which authors contributed to central or only peripheral parts of a project [25], which is interesting, since changes to central code can have a bigger impact on the project.
Fig. 15.
Fig. 15. Number of commits vs. number of interacting authors for authors, including kernel density estimations.
SEAL is able to identify interactions between authors that cannot be detected by purely textual or syntactical approaches, which are commonly used when analyzing socio-technical aspects of software projects [17, 24, 25, 26]. Figure 16 shows the difference between author interactions computed by a file-based approach, where one considers two commits as interacting if they edit the same file (co-edits), and \(\textrm {CI}\)-based author interactions as computed by SEAL. We notice that SEAL identifies fewer interactions than the file-based approach, which is apparent in the negative range of the y-axis (orange). This result is in line with recent findings that a file-based approach reports many spurious links that a more precise static analysis can avoid [18]. In addition to removing spurious interactions, for almost all of our subject projects, SEAL also finds previously unknown interactions between authors that the file-based approach could not detect, especially in larger projects. This includes interactions across files and among distant code fragments that are connected via data flow.
Fig. 16. Changes between \(\textrm {CI}\) -based and file-based author interactions. Each column represents an individual author. Orange boxes depict links to other authors that are produced by the file-based approach but are not inferred by \(\textrm {CI}\) , as there exists no data-flow indicating a connection between the two authors. Blue boxes depict additional links that \(\textrm {CI}\) discovers through data-flow information that a file-based approach does not.
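Stated formally, writing \(\mathrm{files}(c)\) for the set of files edited by a commit \(c\) (notation introduced here for brevity), the file-based baseline considers two commits \(c_1\) and \(c_2\) interacting iff
\[ \mathrm{files}(c_1) \cap \mathrm{files}(c_2) \neq \emptyset , \]
and two authors interacting iff at least one pair of their commits co-edits in this sense.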
As an example, let us consider the author of commit \(\texttt {3659e8c}\) of libssh. This author contributed only one commit affecting source code, which happens to implement asynchronous socket handling and, according to the commit message, “is intended as a ground work for making libssh asynchronous.” The implementation is mostly restricted to a single file, so approaches based on file- or function-level co-edits can only ever find author interactions within that file. However, since socket handling is integral to libssh, this commit actually interacts with code from all over the code base, written by many different authors (we consider it a central commit according to P1). Some of the interactions even originate from indirect dependencies that can only be detected with a precise data-flow analysis. Indeed, the indirect dependency in the function ssh_handle_packets, as described in Section 4, involves the commit in this example. With SEAL, we not only detect this case but also find 50 additional authors whose code interacts, indirectly via data flow, with code from the author in question.
Another common approach for calculating interactions between authors relies on call relations between functions. Figure 17 compares SEAL to an analysis that extracts call-graph data directly from LLVM (i.e., one that establishes a link between two authors if a function to which one contributed code calls a function to which the other contributed code). It is important to note that both the call-graph-based approach and SEAL report fewer author interactions than the file-based approach (cf. Figures 16 and 17), which is further evidence that many links inferred by the file-based approach are spurious. Furthermore, the \(\textrm {CI}\) -based approach finds considerably more links than the call-graph-based approach, especially for larger projects, because SEAL also considers indirect dependencies that can propagate across functions that are not connected via a call relation. We refer the reader to Section 7.1 for a detailed example illustrating the role of indirect dependencies.
Fig. 17. Changes between \(\textrm {CI}\) -based and call-graph-based author interactions. Each column represents an individual author. Orange boxes depict links to other authors that are produced by the call-graph-based approach, but not by the \(\textrm {CI}\) -based approach, as there is no data flow indicating a connection between the two authors. Blue boxes depict additional links that \(\textrm {CI}\) discovers through data-flow information that a call-graph-based approach does not.
Commit–author interactions. P3 is concerned with identifying which other authors' code a commit affects via data flow. To address this problem, we need both commit interactions and author information. Figure 18 shows, for each commit, its number of interacting authors, normalized by the number of distinct authors per subject project that have at least one commit participating in a commit interaction. The violin plot visualizes the associated probability density. There are significant differences across projects. An extreme case is xz: All surviving commits are from the same author and, hence, every commit interacts with commits from that one author. For other projects with few authors, such as gzip and lrzip, most commits interact only with code from few other authors. As projects grow and gain more contributors, most commits tend to interact with more authors (relative to the total number of authors of a project), as can be observed, for example, for bison and lz4. However, we observe differences not only between projects of different size but also between projects of similar size with a similar number of contributors. For example, while most commits in opus interact with about half of the authors, the results for libssh show two groups of commits: one where commits interact with comparatively few authors and one where commits interact with comparatively many authors. There could be various reasons for such differences, and they cannot all be explained by commit interactions alone. For example, the roles of the authors in a project can influence with which authors their commits interact [25], and the architecture of a project could also be relevant. The crucial point is that, with SEAL, we are able to study these kinds of socio-technical questions in the first place.
Fig. 18. Number of distinct authors each commit interacts with, normalized by the number of authors per project. Violins show the associated probability densities.
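Stated formally, with \(\mathrm{author}(c)\) denoting the author of commit \(c\) (shorthand introduced here), the normalized value plotted for a commit \(c\) is
\[ \mathit{ia}(c) = \frac{|\{\, \mathrm{author}(c') \mid (c, c') \in \textrm {CI} \,\}|}{|\{\, \mathrm{author}(c') \mid \exists\, c''\colon (c', c'') \in \textrm {CI} \vee (c'', c') \in \textrm {CI} \,\}|} . \]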

5.2 Socio-technical Data-flow Analysis

To illustrate that making a data-flow analysis change-aware provides additional insights, let us apply SEAL's augmented taint analysis (Section 4.2) to the program shown in Figure 12. The taint analysis reveals an SQL injection vulnerability in Line 11. It detects the undesired data flow by checking the data-flow information for variable argv already computed by the commit analysis. The data-flow path that causes the potential SQL injection vulnerability comprises the following sequence of instructions: \(p = i_{\mathrm{9}} \rightarrow i_{\mathrm{10}}^{\mathrm{callsite}} \rightarrow i_{\mathrm{2}} \rightarrow i_{\mathrm{10}}^{\mathrm{retsite}} \rightarrow i_{\mathrm{11}}\) .
Besides answering data-flow queries directly, the taint analysis can access any information on commit interactions at any point in the program. For example, the analysis can query the commit that generated an instruction with \({\mbox{base}}(i)\) . It can therefore determine that the commit \({\mbox{base}}(i_{\mathrm{3}})\) does (contrary to the intended semantics of sanitize) circumvent the sanitization of variable s. Or, using \(\mathrm{CIP} (i)\) , the analysis can determine all commits (and thus all authors) involved in the disallowed data-flow path. This is highly relevant for software development practice, since the findings of any data-flow analysis can now be associated with the commits and developers that have been involved in the code for which an issue has been found. Combining commit and data-flow information opens up a multitude of useful scenarios: The taint analysis, in our example, can compute \(\mathrm{CIP} (i_{\mathrm{13}})\) and therefore report that the authors of the commits \(\texttt {3e8882e}\) and \(\texttt {ea8426c}\) have been working on the code between which the undesired data flow has been found; it can report the potential SQL injection directly to these authors. Since the commit information is data-flow sensitive, the taint analysis can relate commits and their corresponding authors to the SQL injection even when they did not touch the part of the code that executes the SQL statement directly. Furthermore, the analysis can report that commit \(\texttt {ea8426c}\) and its respective author (knowingly or unknowingly) introduced the vulnerability by sanitizing the input only when the test switch is disabled. This ability opens up a large number of interesting follow-up research questions.
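To give a flavor of how client analyses could issue such queries, the following minimal C++ sketch shows one possible encoding. The metadata kind name seal.commit, the result type CommitAnalysisResult, and the helpers authorOf and commitInteractionPath are illustrative assumptions, not SEAL's actual API.

    #include "llvm/IR/Instruction.h"
    #include "llvm/IR/Metadata.h"
    #include "llvm/Support/raw_ostream.h"
    #include <optional>
    #include <string>
    #include <vector>

    // base(i): the commit that generated instruction i, assuming SEAL attaches
    // it as a string metadata node under the (made-up) kind "seal.commit".
    std::optional<std::string> baseCommit(const llvm::Instruction &I) {
      if (auto *MD = I.getMetadata("seal.commit"))
        if (MD->getNumOperands() > 0)
          if (auto *S = llvm::dyn_cast<llvm::MDString>(MD->getOperand(0)))
            return S->getString().str();
      return std::nullopt;
    }

    // CIP(i): all commits involved in the data flowing into i. Both the result
    // type and the accessor are hypothetical stand-ins for the analysis result.
    struct CommitAnalysisResult {
      std::vector<std::string>
      commitInteractionPath(const llvm::Instruction &I) const;
    };

    // Hypothetical lookup, e.g., backed by the repository's commit meta-data.
    std::string authorOf(const std::string &Commit);

    // Report every commit (and its author) involved in a flagged data-flow path.
    void reportInvolvedAuthors(const CommitAnalysisResult &R,
                               const llvm::Instruction &Sink) {
      for (const std::string &C : R.commitInteractionPath(Sink))
        llvm::errs() << C << " by " << authorOf(C) << "\n";
    }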

6 Threats to Validity

Internal validity. Lowering commit information to LLVM-IR is technically involved and might therefore introduce errors into our commit mapping. To validate the correctness of our commit meta-data mapping, we devised a validation procedure based on LLVM's debug meta-data. The procedure computes the commit information on the fly, based on debug meta-data, and compares it to the commit meta-data annotated by our lowering strategy. This way, we guarantee that our meta-data are at least as precise as if we had computed them from LLVM debug information. Note that using LLVM debug information in general (and not only for validation) would constrain our approach to debug builds, so building our own lowering strategy is necessary to support release builds.
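For concreteness, a condensed C++ sketch of the per-function validation follows; blameLookup (the precomputed git-blame table) and baseCommit (reading our annotated meta-data) are hypothetical helpers, whereas the debug-location accessors are LLVM's actual API.

    #include "llvm/ADT/StringRef.h"
    #include "llvm/IR/DebugInfoMetadata.h"
    #include "llvm/IR/Function.h"
    #include "llvm/IR/InstIterator.h"
    #include <optional>
    #include <string>

    // Hypothetical helpers: the commit annotated by our lowering strategy and a
    // git-blame table mapping (file, line) to the last-modifying commit.
    std::optional<std::string> baseCommit(const llvm::Instruction &I);
    std::string blameLookup(llvm::StringRef File, unsigned Line);

    // Check, for every instruction that carries a debug location, that the
    // annotated commit matches the commit derived from debug meta-data.
    bool validate(llvm::Function &F) {
      for (llvm::Instruction &I : llvm::instructions(F)) {
        const llvm::DebugLoc &DL = I.getDebugLoc();
        if (!DL)
          continue; // e.g., compiler-generated instruction without a location
        std::string Expected = blameLookup(DL->getFilename(), DL.getLine());
        std::optional<std::string> Annotated = baseCommit(I);
        if (!Annotated || *Annotated != Expected)
          return false; // the lowering lost or corrupted the mapping
      }
      return true;
    }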
We use git blame to determine the last modifying commit, which offers only line-based precision. From what we have seen, this does not distort the overall picture, as we were still able to locate many interesting cases in our subject projects, demonstrating the general merits and potential of SEAL. Nevertheless, to open up our framework for further improvements, we have set it up such that users can exchange the commit-querying functionality.
Another source that may introduce imprecision into our mapping is compiler optimizations. For certain code transformations, it is not clear how the commit meta-data should be handled, for example, when the common-subexpression-elimination pass removes one of two equivalent subexpressions that originate from code added by different commits. To circumvent the influence of compiler optimizations, we run our analysis passes before all optimization passes. A more general but laborious solution would be to modify all optimization passes to preserve and update commit meta-data, but running our analysis first renders this unnecessary and produces the same results.
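A minimal sketch of this scheduling with LLVM's new pass manager follows; the pass name CommitAnnotationPass is a stand-in for our actual annotation pass, while the pipeline-start callback is LLVM's real extension point.

    #include "llvm/IR/Module.h"
    #include "llvm/IR/PassManager.h"
    #include "llvm/Passes/PassBuilder.h"

    // Stand-in for the pass that attaches commit meta-data to instructions.
    struct CommitAnnotationPass : llvm::PassInfoMixin<CommitAnnotationPass> {
      llvm::PreservedAnalyses run(llvm::Module &M,
                                  llvm::ModuleAnalysisManager &) {
        // ... annotate the instructions of M with commit meta-data ...
        return llvm::PreservedAnalyses::all();
      }
    };

    // Hook the pass in at the very start of the pipeline, before any
    // optimization pass can transform the IR.
    void schedule(llvm::PassBuilder &PB) {
      PB.registerPipelineStartEPCallback(
          [](llvm::ModulePassManager &MPM, llvm::OptimizationLevel) {
            MPM.addPass(CommitAnnotationPass());
          });
    }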
External validity. In our evaluation, we test our approach on several relevant problems and scenarios, aiming to cover a wide range of use cases on real-world projects that vary in size, age, and maturity. We selected 13 common C/C++ open-source projects from different application domains and analyzed them qualitatively and quantitatively. Our results revealed many interesting events in the analyzed real-world software projects, which demonstrates the potential of our approach and shows that combining repository information with precise data-flow analysis can uncover previously unobservable interactions.
Since our approach and a large part of our implementation are language-independent, we see no roadblock in principle for using SEAL on projects written in languages close to C/C++, such as Rust, Go, or Swift, which already have mature LLVM front-ends.

7 Discussion

In the following, we illustrate the advantages of determining commit interactions based on data flow, highlight key strengths of SEAL, and discuss potential limitations. Furthermore, we lay out potential applications of SEAL and discuss how existing study setups and tools could benefit from data-flow-based \(\textrm {CI}\) .

7.1 Data-flow-based Commit Interactions

Example. SEAL leverages a data-flow analysis to determine which commits interact and, by that, which code written by one developer influences the code of other developers. In Section 5.1, we demonstrated that SEAL can identify small but central changes to a code base by using information on data flow. We also showed that, by lifting SEAL's commit interaction graph to author information, we can build an author-interaction graph that establishes interactions based on data-flow relationships between the authors' code. Author interactions hint at coordination requirements between authors [4, 5, 34]: When one author's code consumes data produced by another author, the two should coordinate. Compared to existing approaches to computing coordination requirements, such as file-based [26] or call-graph-based approaches [41], considering data flow has two clear benefits: First, spurious connections can be excluded (if there is no data flow between two code parts, they do not influence each other). Second, data flows, especially inter-procedural data flows, reveal dependencies between code that current approaches cannot find.
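Lifting the commit-interaction relation \(\textrm {CI}\) to authors can be stated as follows (again with \(\mathrm{author}(c)\) as shorthand): authors \(a\) and \(b\) interact iff
\[ \exists\, c_a, c_b\colon \mathrm{author}(c_a) = a \wedge \mathrm{author}(c_b) = b \wedge (c_a, c_b) \in \textrm {CI} . \]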
For illustration, let us highlight the conceptual differences between the data produced by existing approaches and data-flow-based commit interactions in an exemplary study in which we want to determine coordination requirements between developers based on code dependencies. Consider the example code in Figure 19: a data processing service consisting of the service itself (Figure 19(a)), a user-data access layer (Figure 19(b)), and a computation worker (Figure 19(c)). As in real-world projects, the different parts of the code base are implemented by different developers (names shown on the right).
Fig. 19. Conceptual example of a software project with multiple authors and files.
To demonstrate key differences in the data produced, we compare the interactions computed by SEAL against the two commonly used approaches for computing artifact coupling, file-based and call-graph-based, which we also used in our evaluation (Section 5). The coordination graphs computed by the different approaches are depicted in Figure 20. When comparing the file-based graph (Figure 20(a)) to the other two, we notice that link 1 between Eric and Ada is included only in the file-based graph. The point is that link 1 is actually spurious as, at the code level, there is no coordination requirement between the two functions in the file. The reason is that the file-based approach over-approximates by considering everything in a file as related, without taking program semantics into account [18]. A programmer can easily see that these two functions are alternatives and, therefore, Eric and Ada can work independently. Furthermore, the file-based graph lacks links that the other two graphs contain, such as those between Sven and Leonie and between Sven and Eric, because it cannot capture dependencies across files.
Fig. 20. Coordination requirements between authors as determined by different artifact coupling approaches.
When we compare the call-graph-based graph with SEAL, we notice that SEAL inverts the direction of link 2 and finds two additional links (3, 4). SEAL inverts the direction of link 2 because, from a data-flow perspective, Sven's code consumes data produced only by Eric but does not supply input to Eric's code.8 Considering link 3, we see a bidirectional connection between Sven and Leonie, capturing the fact that data from Sven are used as input to Leonie's code and vice versa. That is, SEAL differentiates between function calls that pass data and those that do not. In contrast, with a call-graph approach, we find only that there is a connection between Sven and Leonie and between Sven and Eric. With SEAL, however, we additionally know that Eric's code does not depend on Sven's code, since no input is passed to Eric's implementation, whereas Leonie's code does depend on Sven's code, as it works on input provided by Sven.
Many studies treat collaboration links between developers as undirected [41] or cannot infer a direction for links between artifacts, such as co-changes retrieved from files that are committed together [33]. With SEAL, we obtain directional information that is based on the underlying data flows. This code-based directionality adds value compared to temporal directionality [28], as it encodes who is using whose code. Part of this information is also present in call graphs, but there we can infer only who calls whose code, not, for instance, which data are passed to a function and how the returned data are used. Ignoring this information, we would miss indirect data dependencies, such as the one between Eric and Leonie that produces link 4. This link arises from the fact that Sven forwards data computed by Eric's loading function to Leonie's compute implementation. It hints at an important coordination requirement that should not be missed: When Eric changes what user data are loaded, Leonie's implementation must be able to handle the change. Hence, coordination between Eric and Leonie is important before Eric makes such a change. This kind of coordination requirement is found only by SEAL, as detecting it requires a whole-program inter-procedural data-flow analysis.
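Since Figure 19 shows the example only schematically, the following self-contained C++ sketch reproduces the pattern in code; all names are made up for illustration.

    #include <cstddef>
    #include <string>
    #include <vector>

    // Eric: user-data access layer. No input flows into this function.
    std::vector<std::string> loadUserData() {
      return {"alice", "bob"}; // placeholder data
    }

    // Leonie: computation worker, operating on whatever data she receives.
    std::size_t compute(const std::vector<std::string> &Users) {
      return Users.size(); // placeholder computation
    }

    // Sven: the service. The call to loadUserData passes no data into Eric's
    // code (call edge only), but its result flows via Sven into Leonie's code
    // (link 3); this creates the indirect Eric-to-Leonie dependency (link 4),
    // although their functions never call each other.
    std::size_t runService() {
      std::vector<std::string> Users = loadUserData(); // link 2 (inverted)
      return compute(Users); // Leonie's result flows back to Sven
    }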
Integration of SEAL into other research. A good example of SEAL's merits arises from the work of Mauerer et al. [41]: In a large-scale empirical study, they explore the relationship between software quality metrics and socio-technical congruence by analyzing the alignment of social communication structures with technical dependencies. Important for our discussion is that they extract these dependencies from the version control system: Two artifacts are related when they appear in the same commit, and two artifacts are dependent when they have static language-level dependencies, such as call-graph connections and type references. As explained in our example above, data flows carry additional information compared to file-based or call-graph-based approaches and could thus enrich the artifact dependency networks of Mauerer et al. [41].
SEAL cannot only help determine artifact relationships; with our author-interaction graph, approaches that analyze developer communities [28] or organizational structures [27] can also profit from detailed data-flow-based interaction data. Consider the work of Joblin et al. [28], who mine developer communities based on commit information and source-code structure: They use a function-based and a committer–author-based approach to determine related authors and may thus miss important non-local links between developers, as conceptually shown in our example. With SEAL, they could extend their developer information with data-flow-based author interactions, expanding their author network and removing potential spurious links.
Another example in which SEAL's commit interactions can be beneficial is coordinating bug fixes or semantic code changes in large code bases or ecosystems. It has been reported that, in large industrial code bases, even small code changes can cause severe bugs in other parts of the code [62] (e.g., a pointer can now be null, an integer variable can now take values larger than x, or a list that was previously sorted is now unsorted). As preventing changes is not an option (applying a bug fix that patches a vulnerability cannot be delayed indefinitely), developers need ways to coordinate and let others know about a change. With the commit–author interaction data produced by SEAL, we can easily determine whose code could be affected by a change. Consider link 4 from Eric to Leonie, which SEAL found because data produced by Eric are passed via Sven to Leonie's code. With this information, Eric can proactively ask Leonie to review his proposed change and ensure that her code still works with it.

7.2 Socio-technical Data-flow Analysis

Example. Interpreting or acting on program analysis findings is a difficult task, especially in larger software projects [2]. One reason is that common program analysis tools focus on technical aspects, showing what is wrong in the code or highlighting conceptual problems [44, 45]. Typically, they do not put their findings into a socio-technical context, ignoring the social structure around the code. Manually fitting this information onto the analysis results post hoc is cumbersome and difficult: Developers do not have this information in their heads, and tools such as git blame provide only raw information that is not contextualized with regard to the analysis semantics. With SEAL, we build a bridge and automatically attach socio-technical information to low-level analysis findings.
Consider the example in Figure 21 (cf. Section 4.2), where a program analysis tool, such as PhASAR, identified an SQL injection.
Fig. 21. A program that is vulnerable to SQL injections. Each \(\triangleright\) indicates the commit that last modified the line, and right next to it is the name of the commit's author.
After analyzing the code, the tool reports a possible SQL injection vulnerability at Line 11, arising from a data flow from variable sani (Line 10), which contains unsanitized user input. Following this report, an engineer from the company's security team investigates the problem and finds that the recent change 5341f7b, depicted in Figure 22, introduced the offending variable in Line 10. Based on this information, the security engineer contacts Leonie and asks her to fix the SQL injection. However, this initial conclusion is wrong! Leonie's code did not introduce the SQL injection; the actual problem lies in the implementation of sanitize, where unsanitized data are leaked when running in test mode. This is likely discovered only later, when Leonie digs into her code and the implementation of sanitize and refers the problem back to the security engineer.
Fig. 22. Commit 5341f7b in which Leonie wants to prevent SQL injections by adding a call to sanitize.
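Since Figure 21's code is shown only as an image, here is a hypothetical reconstruction of the decisive part, the sanitize function with its test-mode bypass; names and structure are assumptions based on the description above.

    #include <string>

    static bool testMode = false;                // assumed global test switch
    std::string escapeSql(const std::string &S); // assumed escaping routine

    // In test mode, input is passed through unmodified, so unsanitized data
    // reaches callers such as Leonie's query construction.
    std::string sanitize(const std::string &Input) {
      if (testMode)
        return Input;          // the leak that causes the SQL injection
      return escapeSql(Input); // proper escaping on the regular path
    }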
Misattributions like this arise from missing information and cost valuable developer time. From the security engineer's point of view, the tool report was plausible, and Leonie's change seemed related, so the ticket was forwarded to her. With SEAL, this scenario could have played out differently, had the tool's initial report been combined with the socio-technical information obtained by SEAL's data-flow analysis. By attaching the commit-interaction path of the offending call instruction for executeQuery, which contains all commits that influenced the data flowing into the call, we can determine all authors that are involved. Figure 23 depicts an exemplary error message in which the SQL injection tool's finding is contextualized with socio-technical information. In our example, the security engineer would see the report about the SQL injection together with the information that the two developers Leonie and Eric are involved. The security engineer could then assign the ticket to both of them and involve Eric from the beginning.
Fig. 23. A socio-technical bug report for a found SQL injection, linking the author of the vulnerable code and possibly related authors whose code interacts with the vulnerable code location through data flow (example based on gokart: https://www.praetorian.com/blog/introducing-gokart/).
Integration of SEAL into other research. Inspired by our example, a number of existing program analysis tools qualify to be extended with additional socio-technical information. For example, Bessey et al. [2] present an extensive experience report that details how industrial-grade static analyzers are used to find bugs in the real world and how these tools are perceived by companies and software developers: When confronted with analysis findings, developers tend to get emotional; after all, it is their code that is supposedly flawed. Experience shows that the more peers are involved in discussing and dealing with analysis findings, the more likely it is that someone can diagnose an error (or identify a reported error as a false positive), report on experiences with similar errors, and eventually fix it [2]. SEAL provides this additional socio-technical information: For instance, it reports the set of authors involved in a given analysis finding, which makes it easier to assign errors to the relevant developers.
Furthermore, the information produced by program analysis tools can be filtered according to the attached socio-technical information (e.g., an analysis tool used within an integrated development environment could show only the findings that actually have a socio-technical connection to the developer using it). In this vein, Harman and O'Hearn report on the difficulty of making bug reports more actionable [19]. In particular, they debunk the implicit assumption that analysis and testing technology following the “ROFL” (Report Only Failure List) strategy is enough to get engineers to fix reported bugs. Software developers are never short on lists of fault and failure reports that need to be taken care of; any additional piece of information that helps them prioritize these tasks is valuable. Harman and O'Hearn underline the importance of the following information for finding and prioritizing bugs: (i) relevance, the developer to whom the bug report is sent is one of the set of suitable people to fix the bug; (ii) context, the bug can be understood effectively; (iii) timeliness, the information arrives in time to allow an effective bug fix; and (iv) debug payload, the information provided by the tool makes the fix process efficient (reproduced from Reference [19]). SEAL helps deliver exactly such information.
The socio-technical interaction information provided by SEAL may also benefit the scheduling of code reviews: Developers with many data dependencies on the changed code can be suggested as reviewers, as the changes could potentially interact with their code and introduce bugs. In 2008, a Debian developer accidentally broke a random number generator in a particular version of OpenSSL with what was thought to be a fix [49], which shows that, in practice, it can be quite difficult to assign suitable reviewers to a given pull request. With the help of information as computed by SEAL, this could possibly have been avoided by allowing peers familiar with this complex part of OpenSSL's code base to intervene.

7.3 Limitations

A bottleneck of using SEAL is clearly the computational cost of the underlying data-flow analysis. Table 2 depicts the overhead generated by SEAL for the projects of our study (cf. Section 5.1). We see that, for the average project, computing the blame data adds roughly two minutes. Notably, the time for computing blame information correlates with the history length of a project ( \(\rho _{\text{pearson}} = 0.75\) and \(\rho _{\text{spearman}} = 0.90\) ), so projects with longer histories should expect somewhat more overhead.9 Overall, for nearly all projects, the analysis time dominates the overhead. As described in Section 2.3, we tuned SEAL's analysis to be as precise as practically feasible. Specifically, we made our analysis context-sensitive, alias-aware, and inter-procedural. From our point of view, this cost is acceptable, since SEAL is designed to run once to capture a full and precise picture of a given software project. Nevertheless, the underlying data-flow analysis is highly configurable: PhASAR allows one to select different helper analyses and to change each analysis's parameters to trade off precision and performance. For instance, PhASAR lets its users choose a less precise but faster points-to analysis. Similarly, one can choose a call-graph algorithm that underapproximates information and does not resolve indirect function calls, instead of the one we chose, which more expensively identifies potential call targets at indirect function calls using points-to information and type hierarchies. Especially C++ developers seem to minimize the number of indirect jumps [52], so underapproximating call-graph algorithms may still provide enough precision for a user's needs. Through these tuning knobs, users can reduce the analysis time if (slightly) less precise results are acceptable for their setting.
Table 2. Blame-annotation Overhead ( \(t_{\text{Blame}}\) ) and Analysis Time ( \(t_{\text{Analysis}}\) ) Measurements in Seconds
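Because PhASAR's concrete configuration API differs across versions, we refrain from quoting it here; instead, the following hypothetical C++ sketch illustrates the kind of tuning knobs meant above. The enumerators mirror common choices in PhASAR-style frameworks but are not PhASAR's actual API.

    // Hypothetical configuration illustrating the precision/performance knobs.
    enum class CallGraphKind {
      NoResolve, // skip indirect calls: cheap, underapproximating
      CHA,       // class-hierarchy based
      OTF        // on-the-fly, points-to guided (our choice: precise, costly)
    };
    enum class PointsToKind { TypeBased /* fast */, Andersen /* precise */ };

    struct AnalysisConfig {
      CallGraphKind CallGraph = CallGraphKind::OTF;
      PointsToKind PointsTo = PointsToKind::Andersen;
      bool ContextSensitive = true;
    };

    // A cheaper configuration for users who accept (slightly) less precision.
    AnalysisConfig makeFastConfig() {
      return {CallGraphKind::NoResolve, PointsToKind::TypeBased, false};
    }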
Another limitation of SEAL is that the granularity of commit data is currently line-based, which is the common granularity used by git and other tools that work on git data. Previous work has demonstrated for different use cases [16, 37] that a syntax-based approach can, at additional cost, improve precision. With token-based blame information, as proposed by Germán et al. [16], SEAL could be even more precise.
Commit interactions based on data-flow dependencies provide a new way to infer dependencies between developers based on how their code interacts. However, there might be other kinds of dependencies that give rise to coordination requirements or are otherwise of interest. For example, external communication via the file system or network, interactions through non-functional properties, side channels, and operating-system-level functionality could also require coordination between developers. Therefore, even though SEAL covers data-flow dependencies, we should keep broadening the scope and explore new kinds of information, currently unavailable, for determining developer interactions.

8 Related Work

SEAL can be applied in different areas, since many techniques can profit from incorporating either repository information or detailed low-level program information. We therefore discuss work that is directly related to our evaluation and put our work in the context of related areas that can benefit from our commit analysis.
Code complexity metrics. Various code complexity metrics have been proposed [13]. Tornhill [57] discusses multiple software metrics and analysis approaches that companies use on real-world software projects to drive decision making and support software maintenance. Software metrics help developers focus on important code regions and maximize their improvement efforts. In their systematic mapping study, Varela et al. [58] categorize almost 300 different source code metrics from 226 studies by the programming paradigms for which they are used, compare how and from which systems the metrics are extracted, and rank them by the number of occurrences in studies.
A close look at Tornhill [57] and Varela et al. [58] reveals that nearly all common software metrics rely only on syntactical information. Very few go beyond syntax and, even if they do, they mostly consider information that is simple to obtain, such as inheritance relationships. We see potential for improvement here: Combined analysis approaches, such as commit interactions, can refine results by putting them into a historical context, especially when the analyses take change information into account.
Bug prediction. Bug prediction focuses on identifying and predicting potentially buggy code locations. D'Ambros et al. [9] compare a wide range of bug prediction approaches. Many of them focus on high-level information, such as change metrics (e.g., number of revisions), source code metrics (e.g., depth of inheritance tree), and code churn, and do not incorporate lower-level information, such as data flow, which can be used to approximate the program semantics. Khatri and Singh [32] performed a SWOT analysis of cross-project defect prediction (CPDP), analyzing a wide range of approaches. Interestingly, one key opportunity for improving CPDP that they highlight is the integration of more process metrics, such as the number of developers working on a module. Our work enables CPDP approaches to integrate process metrics that also incorporate low-level interactions between code changes or authors.
Program analysis. Program analysis techniques have been used successfully to prove specific properties about a program [12] and to detect bugs [48]. Program analysis tools, such as SpotBugs or clang-tidy, build on these techniques to catch bugs early in the development cycle. However, typical program analysis techniques, including srcML, analyze only one specific version of a program and do not incorporate version control information, which precludes detecting evolutionary problems, for example, architectural decay, measured as the gradual deterioration of the coupling between classes. In addition, one could extend existing regression analyses by incorporating version control information directly into the analysis semantics. By utilizing both high-level repository information and data-flow information in a joint analysis, one can bridge this gap.
Socio-technical software analytics. In a special issue on software analytics, Menzies and Zimmermann highlight the importance of using analytical methods that incorporate data from real-world software projects to reason about software development processes [43]. They predict that the field will develop and benefit from more and different data sources. In the same vein, in a meta-analysis of socio-technical software-engineering research, Storey et al. [54] examine a wide range of publications to determine the current state of the art and areas for improvement. They highlight that many research papers employ a data-driven approach. So far, the data sources used (e.g., issues, bugs, or commits) often do not incorporate low-level dependency information as provided by data-flow analysis. For example, the state of the art in socio-technical analyses is to use function-level semantic coupling [26] or fine-grained co-edits [17] to represent interdependencies between software artifacts. Our work addresses this shortcoming by providing a conceptual framework that combines the existing data sources with program semantics. This information can then be used to discover previously hidden socio-technical connections and coupling between developers that are invisible when looking only at syntactical information.
Software repository mining. Software repository mining focuses on gathering, modeling, and studying the data and software artifacts produced by developers during the software development process [10]. Kagdi et al. [29] state that source code changes are the fundamental unit of software evolution but also note that current version control systems do not provide information about code semantics. This moved the initial focus of the research area toward change meta-data, a shortcoming that we address with SEAL.
Change impact analysis. The goal of change-impact analysis is to determine the consequences of a change using dependency analysis techniques, such as data-flow or control-flow analyses [3].
In a survey, Li et al. [38] analyze 23 different code-based change-impact analyses and build a framework to compare them. They state that many approaches use traditional program analysis techniques and highlight that useful information for improving change-impact analysis can be obtained with repository mining techniques. For example, Kagdi et al. [30] run their conceptual coupling analysis on the current and previous versions of the source code to improve analysis precision. Lehnert [36] lists only a few approaches that combine repository mining with traditional program analysis. For example, Kagdi and Maletic [31] use high-level information and dependency information separately and check whether predictions based on either of the two agree. Conceptually, SEAL can improve existing change-impact analysis approaches, as we combine syntactic and semantic information into one joint analysis.
In a user study, Hanam et al. [18] have shown that semantic relations computed by static analysis help users complete code-review tasks faster than purely syntactic relations. Their approach uses abstract interpretation in combination with an AST-based diff to extract semantic relations from JavaScript projects and reduce unwanted “noise” in purely syntactic relations. In contrast to their approach, SEAL has information about all commits available during the analysis, allowing for even more control in determining which interactions are noise and which are not. Still, the results of Hanam et al. support our claim that combining static analysis with change information is beneficial for change-impact analysis.
AST-based analysis. There are tools that combine repository information with syntax information. With such light-weight syntax-based repository mining tools, most notably Boa [11], researchers can gather repository metrics and high-level code information for a wide range of software projects. However, these tools do not model language semantics and do not allow us to attribute socio-technical information to the results of more sophisticated program analyses.
Some research in this direction builds on srcML, “an infrastructure for the exploration, analysis, and manipulation of source code” [6]. srcML has been used, among other things, for type checking [46], program slicing [47], and pointer analysis [64]. However, srcML provides only an AST-based view of the code and does not support more sophisticated analyses, such as data-flow analysis, because it does not model language semantics. For example, srcML does not run the preprocessor or model C/C++ language features, such as overload resolution or template instantiation, which are important for inferring semantics. That is why we built SEAL on top of Clang and LLVM, an industry-strength compiler framework, enabling us to combine program analysis with repository mining. In any case, syntax analysis is less precise than data-flow analysis in determining dependencies between program parts.
Variability-aware analysis. Variability-aware analysis aims at efficiently analyzing variant-rich software systems [55]. The key idea is that, instead of analyzing all variants individually, a variational program representation (i.e., a program representation that retains all points of variability) is analyzed [61]. The goal is to save analysis effort by reusing analysis results across variants [60]. Variability-aware analysis is related to SEAL's approach in that it incorporates multiple variants of a software system (possibly generated by different variability implementation or configuration mechanisms) in a single program analysis run. In contrast to SEAL, the goal is performance; incorporating historical and socio-technical repository information is not in scope.

9 Conclusion

State-of-the-art software repository analyses often have no precise information about a program's operational semantics at their disposal, or they include selected information only in an ad hoc manner. On the flip side, program analyses, such as data-flow analysis, do not have access to repository information, which restricts the interpretability of their results by excluding the socio-technical context. SEAL bridges this gap by conceptually mapping repository-specific information into the compiler's internal representation. The mapped information can be used by specialized data-flow analyses to infer relationships between commits (i.e., to determine commit interactions) or by existing data-flow analyses to augment their results with repository information.
In an evaluation of 13 open-source projects, we have demonstrated that SEAL and the generated data-flow-aware repository information can be utilized to answer relevant questions in research and practice. The first part of our evaluation shows that, with SEAL, we can obtain new insights that could not have been found with existing methods that do not integrate both repository information and data-flow information. Our qualitative analysis has uncovered interesting cases, for example, where textually small changes have a far-reaching impact, which could only be pinned down by considering data flows. The second part of the evaluation demonstrates how repository information computed by SEAL can be utilized to augment existing program analyses. This allows us to put analysis results into a socio-technical context for further interpretation, for example, by relating an SQL injection vulnerability directly to the involved developers.
Overall, our evaluation demonstrates that SEAL can be used to uncover previously hidden interactions and to augment existing analyses with socio-technical information, enabling insights that were previously out of reach.

Footnotes

3
Git-blame is a version-control mechanism that annotates each line in a file with the commit that last modified it.
7
The plot excludes a few very large commits that can be considered outliers (e.g., import from the old repository, large-scale code reformatting). We excluded outliers using Tukey’s fence ( \(k=3\) ), which removes data points that lie more than k times the interquartile range above or below the first or third quartile.
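Concretely, a data point \(x\) is kept iff \(Q_1 - k \cdot \mathrm{IQR} \le x \le Q_3 + k \cdot \mathrm{IQR}\) , with \(\mathrm{IQR} = Q_3 - Q_1\) and \(k = 3\) .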
8
This is, in itself, neither a drawback nor an advantage of SEAL compared to call-graph-based approaches, as the meaning of directionality depends on the use case and follows from the actual research question (e.g., whereas determining a link between two artifacts does not require directionality, a link between developers does if we want to determine who is using whose code).
9
For the history-length correlation, we observed only curl as an outlier, which took particularly long to compute its blame data. A manual inspection of the project revealed that many files in curl contain very old code (committed before 2007), which, combined with the number of commits in the history, means that the blame computation often needs to traverse a very large part of the history.

References

[1]
Lars Andersen. 1994. Program Analysis and Specialization for the C Programming Language. Ph.D. Dissertation. University of Copenhagen.
[2]
Al Bessey, Ken Block, Ben Chelf, Andy Chou, Bryan Fulton, Seth Hallem, Charles Henri-Gros, Asya Kamsky, Scott McPeak, and Dawson Engler. 2010. A few billion lines of code later: Using static analysis to find bugs in the real world. Commun. ACM 53 (2010), 66–75.
[3]
Shawn Bohner and Robert Arnold. 1996. Software Change Impact Analysis. IEEE.
[4]
Marcelo Cataldo and James D. Herbsleb. 2013. Coordination breakdowns and their impact on development productivity and software failures. IEEE Trans. Softw. Eng. 39, 3 (2013), 343–360.
[5]
Marcelo Cataldo, James D. Herbsleb, and Kathleen M. Carley. 2008. Socio-technical congruence: A framework for assessing the impact of technical and work dependencies on software development productivity. In Proceedings of the International Symposium on Empirical Software Engineering and Measurement (ESEM). ACM, 2–11.
[6]
Michael Collard and Jonathan Maletic. 2016. srcML 1.0: Explore, analyze, and manipulate source code. In Proceedings of the International Conference on Software Maintenance and Evolution (ICSME). IEEE, 649.
[7]
Kevin Crowston and James Howison. 2005. The social structure of free and open source software development. First Mond. 10, 2 (2005).
[8]
Kevin Crowston, Kangning Wei, Qing Li, and James Howison. 2006. Core and periphery in free/libre and open source software team communications. In Proceedings of the Hawaii International Conference on Systems Science (HICSS). IEEE.
[9]
Marco D’Ambros, Michele Lanza, and Romain Robbes. 2010. An extensive comparison of bug prediction approaches. In Proceedings of the Working Conference on Mining Software Repositories (MSR). IEEE, 31–41.
[10]
Marco D’Ambros and Romain Robbes. 2011. Effective mining of software repositories. In Proceedings of the International Conference on Software Maintenance (ICSM). IEEE, 598.
[11]
Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan, and Tien N. Nguyen. 2013. Boa: A language and infrastructure for analyzing ultra-large-scale software repositories. In Proceedings of the International Conference on Software Engineering (ICSE). IEEE, 422–431.
[12]
Pär Emanuelsson and Ulf Nilsson. 2008. A comparative study of industrial static analysis tools. Electron. Notes Theor. Comput. Sci. 217 (2008), 5–21.
[13]
Norman Fenton and Shari Pfleeger. 1996. Software Metrics - A Practical and Rigorous Approach. International Thomson.
[14]
Gabriel Ferreira, Momin Malik, Christian Kästner, Jürgen Pfeffer, and Sven Apel. 2016. Do #ifdefs influence the occurrence of vulnerabilities? An empirical study of the Linux kernel. In Proceedings of the International Software Product Line Conference (SPLC). ACM, 65–73.
[15]
Daniel Germán, Ahmed Hassan, and Gregorio Robles. 2009. Change impact graphs: Determining the impact of prior codechanges. Inf. Softw. Technol. 51 (2009), 1394–1408.
[16]
Daniel M. Germán, Bram Adams, and Kate Stewart. 2019. cregit: Token-level blame information in git version control repositories. Empir. Softw. Eng. 24, 4 (2019), 2725–2763.
[17]
Christoph Gote, Ingo Scholtes, and Frank Schweitzer. 2021. Analysing time-stamped co-editing networks in software development teams using git2net. Empir. Softw. Eng. 26, 4 (2021).
[18]
Quinn Hanam, Ali Mesbah, and Reid Holmes. 2019. Aiding code change understanding with semantic change impact analysis. In Proceedings of the International Conference on Software Maintenance and Evolution (ICSME). IEEE, 202–212.
[19]
Mark Harman and Peter O’Hearn. 2018. From start-ups to scale-ups: Opportunities and open problems for static and dynamic program analysis. In Proceedings of the International Workshop on Source Code Analysis and Manipulation (SCAM). IEEE, 1–23.
[20]
Lile Hattori, Gilson Pereira dos Santos Jr., Fernando Cardoso, and Marcus Sampaio. 2008. Mining software repositories for software change impact analysis: A case study. In Proceedings of the Brazilian Symposium on Databases. SBC, 210–223.
[21]
Jincheng He, Sitao Min, Kelechi Ogudu, Michael Shoga, Alex Polak, Iordanis Fostiropoulos, Barry Boehm, and Pooyan Behnamghader. 2020. The characteristics and impact of uncompilable code changes on software quality evolution. In Proceedings of the International Conference on Software Quality, Reliability and Security (QRS). IEEE, 418–429.
[22]
Claus Hunsen, Janet Siegmund, and Sven Apel. 2020. On the fulfillment of coordination requirements in open-source software projects: An exploratory study. Empir. Softw. Eng. 25, 6 (2020), 4379–4426.
[23]
Timea Illes-Seifert and Barbara Paech. 2008. Exploring the relationship of history characteristics and defect count: An empirical study. In Proceedings of the ISSTA Workshop on Defects in Large Software Systems. ACM, 11–15.
[24]
Andrejs Jermakovics, Alberto Sillitti, and Giancarlo Succi. 2011. Mining and visualizing developer networks from version control systems. In Proceedings of the Workshop on Cooperative and Human Aspects of Software Engineering (CHASE). ACM, 24–31.
[25]
Mitchell Joblin, Sven Apel, Claus Hunsen, and Wolfgang Mauerer. 2017. Classifying developers into core and peripheral: An empirical study on count and network metrics. In Proceedings of the International Conference on Software Engineering (ICSE). IEEE/ACM, 164–174.
[26]
Mitchell Joblin, Sven Apel, and Wolfgang Mauerer. 2017. Evolutionary trends of developer coordination: A network approach. Empir. Softw. Eng. 22, 4 (2017), 2050–2094.
[27]
Mitchell Joblin, Barbara Eckl-Ganser, Thomas Bock, Angelika Schmid, Janet Siegmund, and Sven Apel. 2022. Hierarchical and hybrid organizational structures in open-source software projects: A longitudinal study. ACM Trans. Softw. Eng. Methodol. (2022).
[28]
Mitchell Joblin, Wolfgang Mauerer, Sven Apel, Janet Siegmund, and Dirk Riehle. 2015. From developer networks to verified communities: A fine-grained approach. In Proceedings of the International Conference on Software Engineering (ICSE). IEEE, 563–573.
[29]
Huzefa Kagdi, Michael Collard, and Jonathan Maletic. 2007. A survey and taxonomy of approaches for mining software repositories in the context of software evolution. J. Softw. Maint. Evol.: Res. Pract. 19, 2 (2007), 77–131.
[30]
Huzefa Kagdi, Malcom Gethers, Denys Poshyvanyk, and Michael Collard. 2010. Blending conceptual and evolutionary couplings to support change impact analysis in source code. In Proceedings of the Working Conference on Reverse Engineering (WCRE). IEEE, 119–128.
[31]
Huzefa Kagdi and Jonathan Maletic. 2007. Combining single-version and evolutionary dependencies for software-change prediction. In Proceedings of the Working Conference on Mining Software Repositories (MSR). IEEE, 17–17.
[32]
Yogita Khatri and Sandeep Singh. 2022. Cross project defect prediction: A comprehensive survey with its SWOT analysis. Innovations in Systems and Software Engineering 18, 2 (2022), 263–281.
[33]
Hassan Khosravi and Recep Colak. 2009. Exploratory analysis of co-change graphs for code refactoring. In Proceedings of the International Conference on Advances in Artificial Intelligence (AI). Springer-Verlag, 219–223.
[34]
Irwin Kwan, Adrian Schröter, and Daniela E. Damian. 2011. Does socio-technical congruence have an effect on software build success? A study of coordination in a software project. IEEE Trans. Softw. Eng. 37, 3 (2011), 307–324.
[35]
Chris Lattner and Vikram S. Adve. 2004. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization (CGO). IEEE, 75–88.
[36]
Steffen Lehnert. 2011. A Review of Software Change Impact Analysis. Citeseer.
[37]
Olaf Leßenich, Sven Apel, and Christian Lengauer. 2015. Balancing precision and performance in structured merge. Autom. Softw. Eng. 22, 3 (2015), 367–397.
[38]
Bixin Li, Xiaobing Sun, Hareton Leung, and Sai Zhang. 2013. A survey of code-based change impact analysis techniques. Softw. Test. Verif. Reliab. 23, 8 (2013), 613–646.
[39]
Benjamin Livshits, Manu Sridharan, Yannis Smaragdakis, Ondrej Lhoták, José Amaral, Bor-Yuh Chang, Samuel Guyer, Uday Khedker, Anders Møller, and Dimitrios Vardoulakis. 2015. In defense of soundiness: A manifesto. Commun. ACM 58, 2 (2015), 44–46.
[40]
Benjamin Livshits, Manu Sridharan, Yannis Smaragdakis, Ondřej Lhoták, J. Nelson Amaral, Bor-Yuh Evan Chang, Samuel Z. Guyer, Uday P. Khedker, Anders Møller, and Dimitrios Vardoulakis. 2015. In defense of soundiness: A manifesto. Commun. ACM 58, 2 (2015), 44–46.
[41]
W. Mauerer, M. Joblin, D. A. Tamburri, C. Paradis, R. Kazman, and S. Apel. 2022. In search of socio-technical congruence: A large-scale longitudinal study. IEEE Trans. Softw. Eng. 48 (2022), 3159–3184.
[42]
Andrew Meneely and Laurie Williams. 2011. Socio-technical developer networks: Should we trust our measurements? In Proceedings of the International Conference on Software Engineering (ICSE). IEEE/ACM, 281–290.
[43]
Tim Menzies and Thomas Zimmermann. 2013. Software analytics: So what? IEEE Softw. 30 (2013), 31–37.
[44]
Marcus Nachtigall, Lisa Nguyen Quang Do, and Eric Bodden. 2019. Explaining static analysis—A perspective. In Proceedings of the International Conference on Automated Software Engineering (ASE). IEEE, 29–32.
[45]
Marcus Nachtigall, Michael Schlichtig, and Eric Bodden. 2022. A large-scale study of usability criteria addressed by static analysis tools. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA). ACM, 532–543.
[46]
Christian Newman, Jonathan Maletic, and Michael Collard. 2016. srcType: A tool for efficient static type resolution. In Proceedings of the International Conference on Software Maintenance and Evolution (ICSME). IEEE, 604–606.
[47]
Christian Newman, Tessandra Sage, Michael Collard, Hakam Alomari, and Jonathan Maletic. 2016. srcSlice: A tool for efficient static forward slicing. In Companion Volume ICSE. ACM, 621–624.
[48]
Flemming Nielson, Hanne Nielson, and Chris Hankin. 1999. Principles of Program Analysis. Springer-Verlag.
[49]
OpenSSLRandomNumberGeneratorBug. 2008. Predictable random number generator discovered in the Debian version of OpenSSL. Retrieved from https://en.wikinews.org/wiki/Predictable_random_number_generator_discovered_in_the_Debian_version_of_OpenSSL.
[50]
Shmuel Sagiv, Thomas Reps, and Susan Horwitz. 1996. Precise interprocedural dataflow analysis with applications to constant propagation. Theor. Comput. Sci. 167, 1&2 (1996), 131–170.
[51]
Philipp Schubert, Ben Hermann, and Eric Bodden. 2019. PhASAR: An inter-procedural static analysis framework for C/C++. In Proceedings of the International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS). Springer-Verlag, 393–410.
[52]
Philipp Dominik Schubert, Ben Hermann, and Eric Bodden. 2021. Lossless, persisted summarization of static callgraph, points-to and data-flow analysis. In Proceedings of the 35th European Conference on Object-Oriented Programming (ECOOP 2021). Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2:1–2:31.
[53]
Yannis Smaragdakis, Martin Bravenboer, and Ondrej Lhoták. 2011. Pick your contexts well: Understanding object-sensitivity. In Proceedings of the Symposium on Principles of Programming Languages (POPL). ACM, 17–30.
[54]
Margaret-Anne Storey, Neil Ernst, Courtney Williams, and Eirini Kalliamvakou. 2020. The who, what, how of software engineering research: A socio-technical framework. Empir. Softw. Eng. 25 (2020), 4097–4129.
[55]
Thomas Thüm, Sven Apel, Christian Kästner, Ina Schaefer, and Gunter Saake. 2014. A classification and survey of analysis strategies for software product lines. ACM Comput. Surv. 47, 1 (2014), 6:1–6:45.
[56]
John Toman and Dan Grossman. 2017. Taming the static analysis beast. In Proceedings of the 2nd Summit on Advances in Programming Languages (SNAPL 2017). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 18:1–18:14.
[57]
Adam Tornhill. 2018. Software Design X-Rays. Pragmatic Bookshelf.
[58]
Alberto Varela, Héctor Pérez-González, Francisco Martínez-Pérez, and Carlos Soubervielle-Montalvo. 2017. Source code metrics: A systematic mapping study. J. Syst. Softw. 128 (2017), 164–197.
[59]
Markos Viggiato, Johnatan Oliveira, Eduardo Figueiredo, Pooyan Jamshidi, and Christian Kästner. 2019. How do code changes evolve in different platforms? A mining-based investigation. In Proceedings of the International Conference on Software Maintenance and Evolution (ICSME). IEEE, 218–222.
[60]
Alexander von Rhein, Jörg Liebig, Andreas Janker, Christian Kästner, and Sven Apel. 2018. Variability-aware static analysis at scale: An empirical study. ACM Trans. Softw. Eng. Methodol. 27, 4 (2018), 18:1–18:33.
[61]
Eric Walkingshaw, Christian Kästner, Martin Erwig, Sven Apel, and Eric Bodden. 2014. Variational data structures: Exploring tradeoffs in computing with variability. In Proceedings of the International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software (SPLASH). ACM, 213–226.
[62]
T. Winters, T. Manshreck, and H. Wright. 2020. Software Engineering at Google: Lessons Learned from Programming Over Time. O’Reilly Media.
[63]
Thomas Zimmermann and Nachiappan Nagappan. 2008. Predicting defects using network analysis on dependency graphs. In Proceedings of the International Conference on Software Engineering (ICSE). IEEE/ACM, 531–540.
[64]
Vlas Zyrianov, Christian Newman, Drew Guarnera, Michael Collard, and Jonathan Maletic. 2019. srcPtr: A framework for implementing static pointer analysis approaches. In Proceedings of the International Conference on Program Comprehension (ICPC). IEEE/ACM, 144–147.


