CN111045920B - Workload-aware multi-branch software change-level defect prediction method

Info

Publication number: CN111045920B
Application number: CN201910967466.1A
Authority: CN
Inventors: 蔡亮; 钟文枫; 刘力华; 张昕东; 鄢萌; 夏鑫; 李善平
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2019-10-12
Filing date: 2019-10-12
Publication date: 2021-05-04
Anticipated expiration: 2039-10-12
Also published as: CN111045920A

Abstract

The invention discloses a workload-aware multi-branch software change-level defect prediction method, which belongs to the field of change-level code defect prediction and comprises the following steps: extracting change meta-information, labeling data, calculating the features of each branch with multi-branch processing, and training and applying a workload-aware model. By adding a workload-aware module, the method helps developers find more defects in as little time as possible and makes the prediction results practical to act on.

Description

Workload-aware multi-branch software change-level defect prediction method

Technical Field

The invention belongs to the field of change level code defect prediction, and particularly relates to a workload-aware multi-branch software change level defect prediction method.

Background

Take the Commit Guru tool as an example (C. Rosen, B. Grawi, E. Shihab. Commit Guru: Analytics and Risk Prediction of Software Commits. In Proceedings of the 10th Joint Meeting on Foundations of Software Engineering, 2015). The tool uses 14 metrics of a software change (including the numbers of code lines added to and deleted from files, developer experience, etc.) and predicts defects in software changes with a logistic regression model. It has the following disadvantages:

1. Only the main branch is analyzed, which does not match the multi-branch development mode commonly used in companies.

2. Without a workload-aware module, the results still take a lot of time to review and are hard to act on.

These problems are also widespread in other change-level defect prediction tools.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a workload-aware multi-branch software change-level defect prediction method.

The invention is realized by the following technical scheme: a workload-aware multi-branch software change-level defect prediction method specifically comprises the following steps:

Step one: extracting change meta-information:

Code change meta-information is extracted from the code repository using the facilities provided by the Git tool. The code change meta-information includes: the number of added code lines, the number of deleted code lines, the author, the time, the before-and-after diff information, the before-and-after change relationship (parent and child changes), the names of the changed files, the change description, etc.
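As an illustration of step one, the following is a minimal sketch of reading change meta-information from the output of `git log --numstat`. The record layout in `LOG_FORMAT`, the `Change` class and the function names are assumptions made for this example and are not taken from the invention; the diff text itself is omitted here.

```python
import subprocess
from dataclasses import dataclass, field

# Assumed record layout (not from the source): byte 0x01 starts a commit,
# byte 0x1f separates the header fields hash, parents, author, date, subject.
LOG_FORMAT = "%x01%H%x1f%P%x1f%an%x1f%aI%x1f%s"


@dataclass
class Change:
    sha: str
    parents: list      # parent hashes, i.e. the "before" side of the change relationship
    author: str
    time: str          # ISO 8601 author date
    description: str
    files: dict = field(default_factory=dict)  # path -> (added lines, deleted lines)

    @property
    def lines_added(self):
        return sum(added for added, _ in self.files.values())

    @property
    def lines_deleted(self):
        return sum(deleted for _, deleted in self.files.values())


def parse_log(text):
    """Parse the output of `git log --numstat --format=LOG_FORMAT`."""
    changes = []
    for record in text.split("\x01")[1:]:
        header, _, body = record.partition("\n")
        sha, parents, author, time, description = header.split("\x1f", 4)
        change = Change(sha, parents.split(), author, time, description)
        for line in body.splitlines():
            parts = line.split("\t")
            if len(parts) != 3:
                continue  # blank separator lines
            added, deleted, path = parts
            # Binary files are reported as "-"; they are counted as 0 lines here.
            change.files[path] = (int(added) if added.isdigit() else 0,
                                  int(deleted) if deleted.isdigit() else 0)
        changes.append(change)
    return changes


def read_repository(repo_path):
    """Run git on a local repository (requires git on PATH)."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--all", "--numstat",
         f"--format={LOG_FORMAT}"],
        capture_output=True, text=True, check=True)
    return parse_log(result.stdout)
```

The parser is kept separate from the git call so that it can be exercised on captured text; the child changes of each change can be derived afterwards by inverting the `parents` lists.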

Step two: data annotation:

Code changes are labeled with the SZZ algorithm and divided into two classes, defect changes and non-defect changes:

(2.1) Keyword analysis is performed on the change descriptions from step one to find and mark the changes that repair defects, i.e. the repair-type changes.

(2.2) For each repair-type change from step (2.1), the code lines it deleted are obtained from the before-and-after diff information of step one, and noise among those lines is removed to obtain the defect code lines.

(2.3) The changes that introduced the defect code lines are found with the blame command provided by the Git tool; these are the defect changes. All other changes are marked as non-defect changes.
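Steps (2.1) and (2.2) can be sketched as follows. The keyword list and the definition of noise (blank lines and comment-only lines) are assumptions made for this example, since the text does not specify them; the diff is assumed to be a unified diff of a single file.

```python
import re

# Hypothetical keyword list; the source does not say which keywords are used.
FIX_KEYWORDS = re.compile(r"\b(fix(e[sd])?|bugs?|defects?|patch)\b", re.IGNORECASE)

# Matches a unified-diff hunk header and captures the first old-file line number.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+\d+(?:,\d+)? @@")


def is_fix_change(description):
    """Step (2.1): keyword analysis of the change description."""
    return FIX_KEYWORDS.search(description) is not None


def is_noise(code):
    """Assumed noise definition: blank lines and comment-only lines."""
    stripped = code.strip()
    return stripped == "" or stripped.startswith(("#", "//", "/*", "*"))


def deleted_lines(diff_text):
    """Step (2.2): map old-file line number -> deleted code, noise removed."""
    result = {}
    old_line = None  # stays None until the first hunk header is seen
    for line in diff_text.splitlines():
        match = HUNK_HEADER.match(line)
        if match:
            old_line = int(match.group(1))
            continue
        if old_line is None or line.startswith("\\"):
            continue  # file headers, "\ No newline at end of file"
        if line.startswith("-"):
            if not is_noise(line[1:]):
                result[old_line] = line[1:]
            old_line += 1
        elif not line.startswith("+"):
            old_line += 1  # context line, present in the old file
    return result
```

The returned line numbers refer to the file as it was before the repair-type change, which is the version that step (2.3) inspects with the blame command.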

Step three: calculate the characteristics of each branch:

Using the change meta-information obtained in step one and the code change labels from step two, the following ten features are calculated: the number of subsystems involved in the change (NS), the number of files involved in the change (NF), the degree of dispersion of the change at file level (Entropy), the number of developers who have changed the files (NDEV), the average time interval since the last change of the changed files (AGE), the average number of unique (deduplicated) historical changes involving the changed files (NUC), the average number of code lines of the files before the change (LT), whether the change is a repair-type change (FIX), the experience of the developer (EXP), and the experience of the developer at subsystem level (SEXP). NDEV, AGE, NUC, EXP and SEXP are subjected to multi-branch processing as follows:

(3.1) Multi-branch processing of NDEV, AGE and NUC:

(a) A before-and-after change relationship graph of the code repository is built from the before-and-after change relationships obtained in step one.

(b) The first code change is processed, and a first data structure is created to record the history information of all files on the branch where the current code change is located. The history information includes the changes involving each file, the authors involving each file, the current number of code lines of each file, etc.

(c) The code changes are processed one by one in the order of the times obtained in step one. When the current code change is processed, its number of child changes and its number of parent changes are first obtained from the change relationship graph of step (a). If the number of child changes of the current code change is greater than or equal to 2, the current code change starts a new branch: the current branch is updated, then an empty data structure is created for the new branch and all data of the current branch are copied into it. If the number of parent changes of the current code change is greater than or equal to 2, the current code change merges branches, and the data of all branches involved in the merge are merged: the changes involving each file and the authors involving each file are obtained by taking the union over the branches, and the current number of code lines of each file is obtained by adding the number of newly added code lines to the old number of code lines of the file and subtracting the number of deleted code lines. If the number of child changes of the current code change is less than 2 or its number of parent changes is less than 2, the current branch is updated: for each file modified by the current change, the changes involving the file, the authors involving the file and the current number of code lines of the file are modified in turn.

(d) Step (c) is repeated until all code changes have been processed.
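The branch handling of steps (a) to (d) can be sketched as below. This is a simplified illustration under stated assumptions, not the implementation of the invention: changes are given as dictionaries with hypothetical keys `id`, `parents`, `author` and `files`, the copy for a new branch is made when the first change on that branch is processed, the line count at a merge is taken from the first merged branch that contains the file, and only NDEV and an unaveraged NUC are computed (AGE and LT are omitted).

```python
import copy
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FileHistory:
    changes: set = field(default_factory=set)  # ids of changes that touched the file
    authors: set = field(default_factory=set)  # developers who touched the file
    lines: int = 0                             # current number of code lines


def merge_states(states):
    """Merge branch states: changes and authors are united per file."""
    merged = {}
    for state in states:
        for path, hist in state.items():
            if path not in merged:
                # Assumption: line count comes from the first branch holding the file.
                merged[path] = FileHistory(lines=hist.lines)
            merged[path].changes |= hist.changes
            merged[path].authors |= hist.authors
    return merged


def history_features(changes):
    """changes: chronologically ordered dicts. Returns {id: (NDEV, NUC)}."""
    child_count = Counter(p for c in changes for p in c["parents"])
    state_after = {}  # change id -> branch state after that change
    features = {}
    for c in changes:
        parents = c["parents"]
        if not parents:
            state = {}                                  # first change of a history
        elif len(parents) >= 2:
            state = merge_states([state_after[p] for p in parents])
        elif child_count[parents[0]] >= 2:
            state = copy.deepcopy(state_after[parents[0]])  # new branch: copy data
        else:
            state = state_after[parents[0]]             # same branch: update in place
        touched = [state[p] for p in c["files"] if p in state]
        ndev = len(set().union(*(h.authors for h in touched)))
        nuc = len(set().union(*(h.changes for h in touched)))
        features[c["id"]] = (ndev, nuc)
        for path, (added, deleted) in c["files"].items():
            hist = state.setdefault(path, FileHistory())
            hist.changes.add(c["id"])
            hist.authors.add(c["author"])
            hist.lines += added - deleted
        state_after[c["id"]] = state
    return features
```

Because each branch works on its own copy, a change on one branch does not see authors or changes that exist only on a sibling branch until the two are merged.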

(3.2) Multi-branch processing of EXP and SEXP:

(i) A before-and-after change relationship graph of the code repository is built from the before-and-after change relationships obtained in step one.

(ii) The first code change is processed, and a second data structure is created to record the experience data of all developers on the branch where the current code change is located. The developer experience data include the changes involving each developer, the subsystems involving each developer, the changes involving each subsystem, etc.

(iii) The code changes are processed one by one in the order of the times obtained in step one. When the current code change is processed, its number of child changes and its number of parent changes are first obtained from the change relationship graph. If the number of child changes of the current code change is greater than or equal to 2, the current code change starts a new branch: the current branch is updated, then an empty data structure is created for the new branch and all data of the current branch are copied into it. If the number of parent changes of the current code change is greater than or equal to 2, the current change merges branches, and the data of all branches involved in the merge are merged: the changes involving each developer, the subsystems involving each developer and the changes involving each subsystem are obtained by taking the union over the branches. If the number of child changes of the current code change is less than 2 or its number of parent changes is less than 2, the current branch is updated: for the developer of the current change, the changes involving the developer, the subsystems involving the developer and the changes involving each subsystem are updated in turn.

(iv) Step (iii) is repeated until all code changes have been processed.
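The second data structure and its merge by union can be sketched as follows. The class and method names are hypothetical, and the way EXP and SEXP are read from the three mappings (EXP as the number of changes by the developer, SEXP as the number of those changes that touched the given subsystems) is an assumption for illustration; the text above only states which data are recorded.

```python
from collections import defaultdict
from dataclasses import dataclass, field


def _set_map():
    return defaultdict(set)


@dataclass
class Experience:
    dev_changes: dict = field(default_factory=_set_map)        # developer -> change ids
    dev_subsystems: dict = field(default_factory=_set_map)     # developer -> subsystems
    subsystem_changes: dict = field(default_factory=_set_map)  # subsystem -> change ids

    def record(self, change_id, developer, subsystems):
        """Ordinary update of the current branch for one change."""
        self.dev_changes[developer].add(change_id)
        for subsystem in subsystems:
            self.dev_subsystems[developer].add(subsystem)
            self.subsystem_changes[subsystem].add(change_id)

    def exp(self, developer):
        return len(self.dev_changes.get(developer, set()))

    def sexp(self, developer, subsystems):
        in_subsystems = set().union(
            *(self.subsystem_changes.get(s, set()) for s in subsystems))
        return len(self.dev_changes.get(developer, set()) & in_subsystems)


def merge_experience(branches):
    """Merge branch data: every mapping is united key by key."""
    merged = Experience()
    for branch in branches:
        for dev, ids in branch.dev_changes.items():
            merged.dev_changes[dev] |= ids
        for dev, subsystems in branch.dev_subsystems.items():
            merged.dev_subsystems[dev] |= subsystems
        for subsystem, ids in branch.subsystem_changes.items():
            merged.subsystem_changes[subsystem] |= ids
    return merged
```

Storing sets of change identifiers rather than counters is what makes the union safe: a change that is reachable from both merged branches is counted only once.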

Step four: the workload-aware model, which comprises a classification module and a sorting module:

(4.1) a classification module:

The 10 features obtained in step three are input into the classification module, which classifies with logistic regression according to the following formula:

$$p = \frac{1}{1 + e^{-\omega^{T}\chi}}$$

where p is the defect probability, ω^T is the set of weight values obtained by training, and χ is the feature set of the change.

In the training stage, each change passes through steps one, two and three in turn to obtain its 10 features and its label; the features and labels are input into the classifier, and the classifier is trained until the loss function reaches its minimum, which yields the trained classification module.

In the prediction stage, the change to be predicted passes through steps one, two and three in turn to obtain its 10 features. These 10 features are input into the trained classifier, which outputs the defect probability p of the change.
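The training and prediction stages can be illustrated with a small logistic regression written from scratch. This is a sketch under assumptions: plain batch gradient descent on the log loss stands in for whatever optimizer is actually used, the learning rate and epoch count are arbitrary, and the toy data have two features instead of ten (the second feature is a constant 1 that plays the role of an intercept).

```python
import math


def sigmoid(z):
    """Numerically safe logistic function 1 / (1 + e^(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, x):
    """Defect probability p for one change with feature vector x."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)))


def train(samples, labels, learning_rate=0.1, epochs=2000):
    """Fit weights by batch gradient descent on the log loss (assumed optimizer)."""
    n = len(samples[0])
    weights = [0.0] * n
    for _ in range(epochs):
        gradient = [0.0] * n
        for x, y in zip(samples, labels):
            error = predict(weights, x) - y
            for i, v in enumerate(x):
                gradient[i] += error * v
        weights = [w - learning_rate * g / len(samples)
                   for w, g in zip(weights, gradient)]
    return weights


# Toy data: label 1 marks a defect change, label 0 a non-defect change.
toy_samples = [[0.0, 1.0], [1.0, 1.0], [3.0, 1.0], [4.0, 1.0]]
toy_labels = [0, 0, 1, 1]
toy_weights = train(toy_samples, toy_labels)
```

In practice the features would usually be scaled before training, since raw counts such as LT and NUC can differ by orders of magnitude.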

(4.2) a sorting module:

The number of newly added code lines and the number of deleted code lines are added to obtain the number of changed code lines; the defect probability p is then divided by the number of changed code lines to obtain the code defect density. The defect changes are sorted by code defect density in descending order, and the sorted change sequence is returned.
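The sorting module amounts to a single sort key. In the sketch below the tuple layout and the function name are hypothetical, and the guard against changes with zero changed lines is an added assumption that the text does not address.

```python
def rank_by_defect_density(changes):
    """changes: iterable of (change_id, p, lines_added, lines_deleted) tuples.

    Returns the changes sorted by p / (lines_added + lines_deleted), largest first.
    """
    def density(item):
        _, p, lines_added, lines_deleted = item
        # Assumption: a change with no changed lines is treated as one line.
        return p / max(lines_added + lines_deleted, 1)

    return sorted(changes, key=density, reverse=True)


predictions = [
    ("a", 0.9, 400, 100),  # density 0.0018
    ("b", 0.6, 10, 10),    # density 0.03
    ("c", 0.3, 2, 0),      # density 0.15
]
ranked = rank_by_defect_density(predictions)
```

Here the change with the highest probability, `a`, is reviewed last because inspecting its 500 changed lines costs far more effort per expected defect than the two small changes.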

Compared with the prior art, the invention has the following beneficial effects. A before-and-after change relationship graph is built from the before-and-after relationships of software changes, and on the basis of this graph all changes on all branches are processed in order of change time, which better matches the multi-branch mode of software development; the implementation is described in step three. A sorting module is added: the defect probability predicted by the classification module is divided by the number of changed code lines to obtain the code defect density, the changes are sorted by code defect density in descending order, and the software defect prediction result is fed back to developers. With the sorting module, code with a high defect density is fed back to developers first, which improves the efficiency of software defect repair.

Drawings

FIG. 1 is a workload-aware multi-branch software change-level fault prediction algorithm model framework diagram of the present invention.

Detailed Description

Functional expression:

Given the original code change information extracted from a Git repository, the information is input into the workload-aware model to obtain a sorted sequence of defect changes. In addition, the git diff command is used to obtain the diff information of each change, and the git blame command is used to find the change that introduced a given code line, which completes the data labeling work.
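A minimal sketch of the git blame step is given below. It assumes the `--line-porcelain` output format of git blame, and it blames the parent of the repair-type change (`fix_sha^`) so that the deleted lines still exist in the inspected version; the function names and arguments are hypothetical.

```python
import re
import subprocess

# In porcelain output each blamed line starts with
# "<commit hash> <original line> <final line> [<group size>]".
BLAME_HEADER = re.compile(r"^([0-9a-f]{40,64}) (\d+) (\d+)")


def parse_blame(porcelain_text):
    """Map final line number -> hash of the change that last touched the line."""
    introduced = {}
    for line in porcelain_text.splitlines():
        match = BLAME_HEADER.match(line)
        if match:
            introduced[int(match.group(3))] = match.group(1)
    return introduced


def defect_changes(repo_path, fix_sha, path, line_numbers):
    """Blame the given lines in the parent of a repair-type change."""
    found = set()
    for number in line_numbers:
        result = subprocess.run(
            ["git", "-C", repo_path, "blame", "--line-porcelain",
             "-L", f"{number},{number}", f"{fix_sha}^", "--", path],
            capture_output=True, text=True, check=True)
        found.update(parse_blame(result.stdout).values())
    return found
```

Source lines in porcelain output are prefixed with a tab, so they cannot be mistaken for a header line by the regular expression.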

The specific process of the method is shown in fig. 1, and comprises the following steps:

Step one: extracting change meta-information

Step two: data annotation

Changes are labeled with a data labeling algorithm from the field of software engineering defect prediction and divided into defect changes and non-defect changes. The invention adopts the commonly used SZZ data labeling algorithm to label the changes:

(2.2) For each repair-type change from step (2.1), the code lines it deleted are obtained from the before-and-after diff information of step one, and noise among those lines is removed to obtain the defect code lines.

Step three: computing characteristics of individual branches

There are 14 change features in total, divided into 5 dimensions. The features are described as follows:

Using the change meta-information obtained in step one and the change labels from step two, the following ten features are calculated: the number of subsystems involved in the change (NS), the number of files involved in the change (NF), the degree of dispersion of the change at file level (Entropy), the number of developers who have changed the files (NDEV), the average time interval since the last change of the changed files (AGE), the average number of unique (deduplicated) historical changes involving the changed files (NUC), the average number of code lines of the files before the change (LT), whether the change is a repair-type change (FIX), the experience of the developer (EXP), and the experience of the developer at subsystem level (SEXP). NDEV, AGE, NUC, EXP and SEXP are subjected to multi-branch processing as follows:

(3.1) Multi-branch processing of NDEV, AGE and NUC:

The features NDEV, AGE and NUC are collectively called history-dimension features. When they are calculated, each branch of the code repository maintains a data structure that stores file history information, including all changes involving each file, its authors, its current number of code lines, etc.

When a change that creates or merges branches is processed, the data structures of the branches must be split or merged. In the engineering implementation, the method maintains, for each file and each developer, a set of data structures dedicated to recording history-dimension data. When a branch is created, the data structures are copied and each branch maintains its own separate set. When branches are merged, the data structures are merged to obtain the data structure of the merged branch. Because these data structures consume a large amount of memory, the authors implemented a simple memory scheduler modeled on memory scheduling algorithms, so that the whole model can run normally on large projects.

The specific steps are as follows:

(d) Repeating step (c) to process all code changes.

(3.2) Multi-branch processing of EXP and SEXP:

The features EXP and SEXP are collectively called experience-dimension features. When they are calculated, each branch builds a data structure that stores the development data of developers, including the changes involving each developer, the subsystems involving each developer and the changes involving each subsystem. Throughout the process the experience-dimension features are calculated from this data structure, and the data structure is continuously updated. When a branch is created, the data structures are copied and each branch maintains its own separate set. When branches are merged, the data structures are merged to obtain the data structure of the merged branch. The specific steps are as follows:

(iv) Step (iii) is repeated until all changes have been processed.

(4.1) a classification module:

Because ND and NF are highly correlated, as are REXP and EXP, the features ND and REXP are removed. Because LA and LD are used in the sorting module, the features LA and LD are removed as well. After ND, REXP, LA and LD are removed, the remaining 10 features, i.e. the 10 features obtained in step three, are input into the classification module, which classifies with logistic regression according to the following formula:

$$p = \frac{1}{1 + e^{-\omega^{T}\chi}}$$

where p is the defect probability, ω^T is the set of weight values obtained by training, and χ is the feature set of the change.

(4.2) a sorting module:

The number of newly added code lines and the number of deleted code lines are added to obtain the number of changed code lines; the defect probability p is divided by the number of changed code lines to obtain a ratio; the defect changes are sorted by this ratio in descending order, and the sorted change sequence is returned.

Examples

Time-wise cross-validation is used for validation, and the experimental results are compared with the existing EALR and OneWay methods from the defect prediction field on five metrics: recall, precision, F1-score, PCI and IFA. The results are shown in Tables 1 and 2. In tests on six projects, the method found about 15% more defects than the EALR method, with recall improved by approximately 47% on average. The detailed results are as follows:

TABLE 1 comparison of results of the methods herein with those of the EALR method

TABLE 2 comparison of results from the methods herein with those of OneWay (OW)

Claims

1. A workload-aware multi-branch software change-level defect prediction method is characterized by comprising the following steps:

step one: extracting change meta-information:

extracting code change meta-information from a code repository using the facilities provided by the Git tool, the code change meta-information including: the number of added code lines, the number of deleted code lines, the author, the time, the before-and-after diff information, the before-and-after change relationship, the names of the changed files and the change description;

step two: data annotation:

(2.1) performing keyword analysis on the change descriptions from step one, and finding and marking the changes that repair defects, namely the repair-type changes;

(2.2) for each repair-type change from step (2.1), obtaining the code lines it deleted from the before-and-after diff information of step one, and removing noise among those lines to obtain the defect code lines;

(2.3) finding the changes that introduced the defect code lines by using the blame command provided by the Git tool, these changes being the defect changes; all other changes are marked as non-defect changes;

step three: calculate the characteristics of each branch:

calculating the following ten features by using the change meta-information obtained in step one and the code change labels from step two: the number of subsystems involved in the change NS, the number of files involved in the change NF, the degree of dispersion of the change at file level Entropy, the number of developers who have changed the files NDEV, the average time interval since the last change of the changed files AGE, the average number of unique historical changes involving the changed files NUC, the average number of code lines of the files before the change LT, whether the change is a repair-type change FIX, the experience of the developer EXP, and the experience of the developer at subsystem level SEXP; NDEV, AGE, NUC, EXP and SEXP are subjected to multi-branch processing as follows:

(3.1) multi-branch processing of NDEV, AGE and NUC:

(a) building a before-and-after change relationship graph of the code repository from the before-and-after change relationships obtained in step one;

(b) processing the first code change, and creating a first data structure for recording the history information of all files on the branch where the current code change is located, the history information comprising the changes involving each file, the authors involving each file and the current number of code lines of each file;

(c) processing the code changes one by one in the order of the times obtained in step one; when the current code change is processed, first obtaining its number of child changes and its number of parent changes from the change relationship graph of step (a); if the number of child changes of the current code change is greater than or equal to 2, the current code change starts a new branch: the current branch is updated, then an empty data structure is created for the new branch and all data of the current branch are copied into it; if the number of parent changes of the current code change is greater than or equal to 2, the current code change merges branches, and the data of all branches involved in the merge are merged: the changes involving each file and the authors involving each file are obtained by taking the union over the branches, and the current number of code lines of each file is obtained by adding the number of newly added code lines to the old number of code lines of the file and subtracting the number of deleted code lines; if the number of child changes of the current code change is less than 2 or its number of parent changes is less than 2, the current branch is updated, and for each file modified by the current change, the changes involving the file, the authors involving the file and the current number of code lines of the file are modified in turn;

(d) repeating step (c) until all code changes have been processed;

(3.2) multi-branch processing of EXP and SEXP:

(i) building a before-and-after change relationship graph of the code repository from the before-and-after change relationships obtained in step one;

(ii) processing the first code change, and creating a second data structure for recording the experience data of all developers on the branch where the current code change is located, the developer experience data comprising the changes involving each developer, the subsystems involving each developer and the changes involving each subsystem;

(iii) processing the code changes one by one in the order of the times obtained in step one; when the current code change is processed, first obtaining its number of child changes and its number of parent changes from the change relationship graph; if the number of child changes of the current code change is greater than or equal to 2, the current code change starts a new branch: the current branch is updated, then an empty data structure is created for the new branch and all data of the current branch are copied into it; if the number of parent changes of the current code change is greater than or equal to 2, the current change merges branches, and the data of all branches involved in the merge are merged, the changes involving each developer, the subsystems involving each developer and the changes involving each subsystem being obtained by taking the union over the branches; if the number of child changes of the current code change is less than 2 or its number of parent changes is less than 2, the current branch is updated: for the developer of the current change, the changes involving the developer, the subsystems involving the developer and the changes involving each subsystem are updated in turn;

(iv) repeating step (iii) until all code changes have been processed;

(4.1) a classification module:

inputting the 10 features obtained in step three into the classification module, wherein the classification module classifies with logistic regression according to the following formula:

$$p = \frac{1}{1 + e^{-\omega^{T}\chi}}$$

where p is the defect probability, ω^T is the set of weight values obtained by training, and χ is the feature set of the change;

in the training stage, each change passes through steps one, two and three in turn to obtain its 10 features and its label, the features and labels are input into the classifier, and the classifier is trained until the loss function reaches its minimum, yielding the trained classification module;

in the prediction stage, the change to be predicted passes through steps one, two and three in turn to obtain its 10 features; the 10 features of the code change are input into the trained classifier, and the classifier outputs the defect probability p of the change;

(4.2) a sorting module: