CN114816997B - Defect prediction method based on graph neural network and bidirectional GRU feature extraction - Google Patents
Defect prediction method based on graph neural network and bidirectional GRU feature extraction Download PDFInfo
- Publication number
- CN114816997B CN114816997B CN202210323112.5A CN202210323112A CN114816997B CN 114816997 B CN114816997 B CN 114816997B CN 202210323112 A CN202210323112 A CN 202210323112A CN 114816997 B CN114816997 B CN 114816997B
- Authority
- CN
- China
- Prior art keywords
- node
- software
- vector
- code
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 53
- 230000007547 defect Effects 0.000 title claims abstract description 38
- 238000013528 artificial neural network Methods 0.000 title claims abstract description 25
- 230000002457 bidirectional effect Effects 0.000 title claims abstract description 13
- 238000000605 extraction Methods 0.000 title claims abstract description 11
- 238000012549 training Methods 0.000 claims abstract description 22
- 238000003062 neural network model Methods 0.000 claims abstract description 14
- 239000013598 vector Substances 0.000 claims description 75
- 239000011159 matrix material Substances 0.000 claims description 29
- 230000001419 dependent effect Effects 0.000 claims description 24
- 230000006870 function Effects 0.000 claims description 17
- 230000004913 activation Effects 0.000 claims description 4
- 238000006243 chemical reaction Methods 0.000 claims description 4
- 238000003058 natural language processing Methods 0.000 claims description 4
- 238000011176 pooling Methods 0.000 claims description 4
- 238000005295 random walk Methods 0.000 claims description 3
- 238000005259 measurement Methods 0.000 abstract description 4
- 230000008569 process Effects 0.000 description 9
- 230000002950 deficient Effects 0.000 description 6
- 238000012360 testing method Methods 0.000 description 6
- 238000004458 analytical method Methods 0.000 description 4
- 238000010586 diagram Methods 0.000 description 4
- 239000000284 extract Substances 0.000 description 4
- 230000003993 interaction Effects 0.000 description 4
- 230000002776 aggregation Effects 0.000 description 3
- 238000004220 aggregation Methods 0.000 description 3
- 238000013135 deep learning Methods 0.000 description 2
- 239000012634 fragment Substances 0.000 description 2
- 238000000691 measurement method Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000003012 network analysis Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 230000004075 alteration Effects 0.000 description 1
- 230000006835 compression Effects 0.000 description 1
- 238000007906 compression Methods 0.000 description 1
- 125000004122 cyclic group Chemical group 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 239000003550 marker Substances 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000004806 packaging method and process Methods 0.000 description 1
- 238000006116 polymerization reaction Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 230000003068 static effect Effects 0.000 description 1
- 238000000547 structure data Methods 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3672—Test management
- G06F11/3684—Test management for test design, e.g. generating new test cases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3672—Test management
- G06F11/3688—Test management for test execution, e.g. scheduling of test suites
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Quality & Reliability (AREA)
- Computer Hardware Design (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Stored Programmes (AREA)
Abstract
The application provides a defect prediction method based on graph neural network and bidirectional GRU feature extraction, which comprises the steps of firstly, carrying out network modeling on a software system, and learning a network structure by using a neural network model to obtain software dependency characteristics; then constructing a software system source file into an abstract syntax tree, extracting a token sequence, and modeling by using a two-way gating recursion unit to obtain deep semantic features of the software system; and combining the obtained software dependency characteristics with the semantic characteristics of the obtained source codes to obtain final mixed characteristics, training a classifier for defect prediction based on the obtained mixed characteristics, constructing a prediction model by combining the two different types of measurement indexes, extracting deep semantic characteristics of the software source codes by using a bidirectional GRU, extracting dependency characteristics among software modules by using a graph neural network, and combining the two characteristics for predicting defects, thereby improving the accuracy of prediction.
Description
Technical Field
The application relates to the field of software engineering software defect prediction, in particular to a defect prediction method based on a graph neural network and bidirectional GRU feature extraction.
Background
With the increasing size and complexity of software, the probability of defects in the software increases, as does the cost of testing source code and producing high quality software products. In practice, the test resources are limited, some serious failures may lead to significant functional losses, large amounts of data corruption, and even to a crash of the entire software system, and therefore limited resources should be allocated to the high risk modules that are more prone to failure.
In early work, researchers proposed a variety of metrics to build predictive models, typically using static code properties and defect locations, by which defects were predicted by manually defined features such as number of lines of code, previous errors, number of methods in a file, etc., which were easily obtained from source code using automated tools. However, the characteristics of the traditional manual design only consider the complexity of codes, in order to extract richer source code semantics and grammar information, researchers model the software source code to be similar to a text sequence in natural language processing, such as a neural network model based on an abstract grammar tree, so that the source code can be well simulated, meanwhile, the structural information of a program is also reserved, the abstract grammar tree is traversed, the node types needing to be reserved are selected, each source file is parsed into a series of code marks, and the code marks are input into a deep neural network, so that the semantic characteristics of the software source code are obtained.
A disadvantage of conventional code metrics and semantic features is that they focus on only a single element, and rarely consider the interaction information between elements, the available information content is limited. In recent years, network metrics derived from concepts in the field of social network analysis have attracted attention from a wide range of researchers. The analysis based on the network takes the modules as nodes, extracts the dependency relationship among the modules as edges to form a software source code network, and establishes a prediction model by using the obtained network model. Network metrics take into account interactions between modules to model information flow and topology in software, which are not captured by software code metrics.
Therefore, the method in the prior art has the technical problem of low software defect prediction accuracy.
Disclosure of Invention
Aiming at the defects of the prior art, the application provides a defect prediction method based on a graph neural network and bidirectional GRU feature extraction, which extracts richer software features and improves the accuracy of prediction.
In order to achieve the above object, the following technical scheme is adopted:
the defect prediction method based on the graph neural network and the bidirectional GRU feature extraction comprises the following steps:
s1: performing network modeling on the software system, and learning a network structure by using a neural network model to obtain the software dependency characteristics;
s2: constructing a software system source file into an abstract syntax tree, extracting a token sequence, and modeling by using a two-way gating recursion unit to obtain deep semantic features of the software system;
s3: combining the software dependency characteristics obtained in the step S1 with the semantic characteristics of the source codes obtained in the step S2 to obtain final mixed characteristics, and training a classifier for defect prediction based on the obtained mixed characteristics.
In one embodiment, step S1 includes:
s1.1: analyzing a software source code file by using an open source tool to construct a software dependent network model;
s1.2: learning a software dependent network by using a network embedding method to obtain an embedding vector of a node;
s1.3: and (3) constructing a graph neural network model, and inputting the software dependent network model constructed in the step S1.1 and the embedded vector of the node obtained in the step S1.2 as the graph neural network model to obtain the software dependent relationship characteristic.
In one embodiment, step S1.1 comprises:
s1.1.1: and performing dependency relation scanning on various files generated after source code compiling by using an open source tool, and storing the files as a uniform file format:
s1.1.2: analyzing and extracting the dependency relationship among various files, performing network modeling, and constructing a software dependent network model, wherein the software dependent network model is an undirected and unauthorized network SDN= (V, E), and the node V is a network with no direction and no rights i (v i E V) represents a class file or interface file in a software system, if there is a dependency between two nodes, there is a relation between themStrip connecting edge e ij, wherein ,eij =(v i ,v j )∈E。
In one embodiment, step S1.2 comprises:
s1.2.1: the Node2vec method is utilized, the network is learned and converted into a Node sequence in a biased random walk mode, and the Node sequence obtained through conversion is used as a text sequence in natural language processing;
s1.2.2: training is carried out through a word vector model skip gram according to the obtained node sequence, and low-dimensional vector representation of the node is obtained and used as an embedded vector of the node.
In one embodiment, S1.3 comprises:
s1.3.1: taking the software dependent network model obtained in the step S1.1 and the node embedded vector X obtained in the step S1.2 as inputs of the graph neural network,i V is the number of nodes, d is the dimension of the node embedding vector, N (V) = { u E V i (V, u) E) represents the neighbor set of node V, +.>Implicit embedded vector representing target node v at model k-th layer, initially input, ++>
S1.3.2: the node vector mode is iteratively updated through a graph neural network GCN by using convolution operation, each node in the network aggregates the neighbor node characteristics in iteration and is combined with an embedded vector obtained by iteration of the node on the previous layer to obtain a new layer of embedded vector, each node is sequentially iterated until the last layer is output to obtain updated node characteristics, wherein the convolution layer of the GCN is expressed as follows:
wherein A is an adjacency matrix,is to add a self-connecting adjacency matrix, I N Is an identity matrix>Is a degree matrix-> Representation matrix->The value of the element corresponding to the ith row and the jth column (l) Is the output of layer I, Θ (l) Is a parameter of the first layer, sigma (·) represents the activation function, H (l+1) The output of the layer 1 is the output of the layer 1, and the updated node characteristic is the software dependency characteristic.
In one embodiment, after step S1.3.2, the method further comprises:
the output probability value of each node is obtained through the softmax layer, and the specific formula is as follows:
wherein ,is an adjacent matrix normalized by the Laplace matrix, X is an embedded vector of a node, and W is an initial characteristic (0) ∈R C×H and W(1) ∈R H×2 The weight matrix of two layers is respectively, C is the dimension of each node of the input layer, H is the dimension of each node of the hidden layer, Z epsilon R N×2 Is the output probability value of all nodes;
training is carried out by taking the predicted value and the real label as targets, and the adopted loss function is as follows:
wherein ,is a label node index set, Z lf Is GCN is the predictive vector of tag l, f is the dimension of the predictive vector, y lf Is the true one-hot vector of label l, with the goal of reducing the error between the true label and the predicted label.
In one embodiment, constructing a software system source file into an abstract syntax tree, extracting a token sequence, includes:
converting the source file into an abstract syntax tree using a javalang tool;
traversing an abstract syntax tree, preserving nodes of the type that need to be extracted, each source file is parsed into a series of code tags [ w ] 1 ,w 2 ,…,w n ]Constructing a word list V with a fixed size from the parsed code marks, converting each code mark into a low-dimensional continuous real-valued vector representation by using word2vec training word list to obtain a token embedded matrixd represents the dimension of the code tag and the token embedding matrix is used to store the token sequence.
In one embodiment, modeling using a bi-directional gating recursion unit obtains deep semantic features of a software system, comprising:
inputting a sequence of code tokens for each class file into the sequence of GRU units, each code token w i Using a real value vector x i =[x 1 ,x 2 ,…,x d ]Representing, entering into one of the GRU units, training each GRU unit by predicting a next code token in the sequence;
trained GRU units generate a status output for each code tagQuantity s i =[s 1 ,s 2 ,…,s d ]Representing capturing the semantic distribution of the code tag according to the context of the code tag;
a series of code tags [ w ] for each java source file 1 ,w 2 ,…,w n ]Input into trained GRU to obtain token state output sequence s 1 ,s 2 ,…,s n ]The output sequence of token states is aggregated into a file vector through a pooling layer as a deep semantic feature of the software system corresponding to the source file.
The above technical solutions in the embodiments of the present application at least have one or more of the following technical effects:
firstly, modeling a software system as a software network by utilizing a complex network theory, and then carrying out embedded learning on the network by combining a graph neural network model of a deep learning technology to obtain network measurement characteristics, namely software dependency characteristics; secondly, extracting rich semantic information in a source file of the software system by using a two-way gating recursion unit; third, two types of software metrology features are combined and applied to software defect prediction. That is, the application adopts the combination of the two different types of measurement indexes to construct a prediction model, uses the bidirectional GRU to extract the deep semantic features of the software source codes, uses the graph neural network to extract the dependency features between the software modules, and combines the two features to predict the defects, thereby improving the accuracy of prediction.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a frame diagram of a defect prediction method based on a graph neural network and bidirectional GRU feature extraction according to an embodiment of the present application;
FIG. 2 is a process diagram of building a software dependent network in accordance with an embodiment of the present application;
FIG. 3 is a process diagram of the method of the present application applied to defect prediction.
Detailed Description
The present inventors have found through a great deal of research and practice that:
some methods in the prior art adopt the traditional code measurement method, and the defects of the code measurement method index and the semantic features are that only single elements are focused, interaction information among the elements is rarely considered, and the available information content is limited. In recent years, network metrics derived from concepts in the field of social network analysis have attracted attention from a wide range of researchers. The analysis based on the network takes the modules as nodes, extracts the dependency relationship among the modules as edges to form a software source code network, and establishes a prediction model by using the obtained network model. Network metrics take into account interactions between modules to model information flow and topology in software, which are not captured by software code metrics. However, in other research works it has been shown that network metrics are not more advantageous than code metrics and are not commonplace.
In the field of software engineering, the embedded learning model for processing the network structure data includes deepflk, node2vec, struct 2vec and the like, and the principle is that the network structure is converted into a series of node sequences, and then the word vector model is used for learning the node representation. The graph neural network model in the deep learning field can capture network structural characteristics as well and is not applied to the software defect prediction field. The idea is to multiply the adjacent matrix of the whole network structure with the original node characteristic matrix to obtain the embedded characteristics of the hidden layer, and iterate in sequence, thereby realizing a multi-layer network. Compared with the traditional embedded learning method, the graph neural network can learn the structural relation of each node and the neighborhood thereof, and can fuse the characteristic attribute carried by each node into the graph neural network to perform more comprehensive learning.
Based on the above consideration, the application adopts the combination of the two different types of measurement indexes to construct a prediction model, uses the bidirectional GRU to extract the deep semantic features of the software source codes, uses the graph neural network to extract the dependency relationship features between the software modules, and combines the two features to predict the defects.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The embodiment of the application provides a defect prediction method based on a graph neural network and bidirectional GRU feature extraction, which comprises the following steps:
s1: performing network modeling on the software system, and learning a network structure by using a neural network model to obtain the software dependency characteristics;
s2: constructing a software system source file into an abstract syntax tree, extracting a token sequence, and modeling by using a two-way gating recursion unit to obtain deep semantic features of the software system;
s3: combining the software dependency characteristics obtained in the step S1 with the semantic characteristics of the source codes obtained in the step S2 to obtain final mixed characteristics, and training a classifier for defect prediction based on the obtained mixed characteristics.
In general, step S1 is to extract the software dependency characteristics by constructing a graph neural network, step S2 is to extract the deep semantic characteristics by a bi-directional gating recursion unit, and step S3 combines the two characteristics for predicting defects.
Referring to fig. 1, a frame diagram of a defect prediction method based on a graph neural network and bidirectional GRU feature extraction is provided in an embodiment of the present application.
In one embodiment, step S1 includes:
s1.1: analyzing a software source code file by using an open source tool to construct a software dependent network model;
s1.2: learning a software dependent network by using a network embedding method to obtain an embedding vector of a node;
s1.3: and (3) constructing a graph neural network model, and inputting the software dependent network model constructed in the step S1.1 and the embedded vector of the node obtained in the step S1.2 as the graph neural network model to obtain the software dependent relationship characteristic.
In one embodiment, step S1.1 comprises:
s1.1.1: and performing dependency relation scanning on various files generated after source code compiling by using an open source tool, and storing the files as a uniform file format:
s1.1.2: analyzing and extracting the dependency relationship among various files, performing network modeling, and constructing a software dependent network model, wherein the software dependent network model is an undirected and unauthorized network SDN= (V, E), and the node V is a network with no direction and no rights i (v i E V) represents a class file or interface file in a software system, and if there is a dependency relationship between two nodes, there is a continuous edge e between them ij, wherein ,eij =(v i ,v j )∈E。
In the specific implementation process, a software system developed in Java language is used as an analysis object, and a dependency relation scanning is carried out on a class file generated after source code compiling, a jar file formed by source code packaging or a zip compression package containing source code through a Dependencyfinder tool and is stored as an XML file. The XML file stores the analysis result of the DependencyFinder on Java source code, and the dependency relationship between the basic information and elements of three granularity elements of the package, the class and the method in the source code is represented by a nested structure. The < package > tag of the outermost layer represents a packet, < class > represents a class, < feature > represents a method/field, and the < outload > and < inbound > tags of the innermost layer represent a dependency and a depended relationship, respectively. The embodiment uses an autonomously developed parsing program to parse the tags of the parsed XML file, extracts the dependency relationships among the software classes from the parsed XML file, and stores the dependency relationships as a net network file format for downstream work.
Referring to fig. 2, for the process of modeling a network, part (a) of fig. 2 is 5 Java file code fragments, part (b) of fig. 2 is a class-dependent network corresponding to each other, each node in the network represents a class, and the edges are the dependency relationships between two class nodes. For example, class B is a subclass of class A, and according to the inheritance of the class B and the class A, there is a continuous edge (B-A) pointed to A by B. C-I represents an interface implementation relationship, C-D represents a parameter type dependency relationship, and A-C, D-A represents an aggregation relationship.
In SDN, the dependency of two class nodes mainly considers the following 3 cases:
(1) Inheritance, provided class v c1 Inheritance to class v c2 Or implement interface v c2 Then there is a directed edge e 12 =<v c1 ,v c2 >;
(2) Polymerization, if class v c1 Comprises class v c2 The attribute of (2) then has a directed edge e 12 =<v c1 ,v c2 >;
(3) Parameters, provided that class v c1 Class v is invoked by the method of (1) c2 The method of (2) has a directed edge e 12 =<v c1 ,v c2 >。
In one embodiment, step S1.2 comprises:
s1.2.1: the Node2vec method is utilized, the network is learned and converted into a Node sequence in a biased random walk mode, and the Node sequence obtained through conversion is used as a text sequence in natural language processing;
s1.2.2: training is carried out through a word vector model skip gram according to the obtained node sequence, and low-dimensional vector representation of the node is obtained and used as an embedded vector of the node.
Specifically, the Node2vec method is a graph data embedding method, and S1.2.2 is to embed each Node v in the network i (v i E V) into d-dimensional token vectorI V I is the number of nodes, where some target node V is characterizedVector is->These node feature vectors will serve as initial inputs to the neural network of the subsequent graph.
In one embodiment, S1.3 comprises:
s1.3.1: taking the software dependent network model obtained in the step S1.1 and the node embedded vector X obtained in the step S1.2 as inputs of the graph neural network,i V is the number of nodes, d is the dimension of the node embedding vector, N (V) = { u E V i (V, u) E) represents the neighbor set of node V, +.>Implicit embedded vector representing target node v at model k-th layer, initially input, ++>
S1.3.2: the node vector mode is iteratively updated through a graph neural network GCN by using convolution operation, each node in the network aggregates the neighbor node characteristics in iteration and is combined with an embedded vector obtained by iteration of the node on the previous layer to obtain a new layer of embedded vector, each node is sequentially iterated until the last layer is output to obtain updated node characteristics, wherein the convolution layer of the GCN is expressed as follows:
wherein A is an adjacency matrix,is to add a self-connecting adjacency matrix, I N Is an identity matrix>Is a degree matrix-> Representation matrix->The value of the element corresponding to the ith row and the jth column (l) Is the output of layer I, Θ (l) Is a parameter of the first layer, sigma (·) represents the activation function, H (l+1) The output of the layer 1 is the output of the layer 1, and the updated node characteristic is the software dependency characteristic.
Specifically, the node vector mode is iteratively updated by GCN using convolution operation, and the target node v aggregates (Aggregate) the features of the neighboring nodes a, b, c, d around the target node v and then combines with the feature x of the target node v v Combining (Combine) to obtain an embedded vector of a new layer, which is expressed as follows:
where k represents the current layer, AGGREGATE (·) represents the aggregate function, CONCAT (·) represents the join function, W is a trainable weight matrix, σ (·) represents the activate function, such as the ReLU function,an embedding vector of a neighbor node u representing the target node v at the k-1 layer +.>An embedded vector aggregate representation at the k-1 layer for all neighbor nodes of the target node v,is the embedded vector of the target node v at the current layer, is the embedded vector obtained from the target node v at the previous layer +.>And the aggregate vector of its neighbor nodes at the current layer +.>Combined by the COMBINE (·) function.
Common aggregation functions include summation (sum), average (MEAN), maximum (max) and the like, and the method adopts a graph convolution neural network in a graph neural network model, adopts an element-averaged aggregation function MEAN and uses a ReLU activation function. The function of COMBINE (·) is to splice the feature aggregated by the neighboring node with the feature transferred by the layer on the target node.
In one embodiment, after step S1.3.2, the method further comprises:
the output probability value of each node is obtained through the softmax layer, and the specific formula is as follows:
wherein ,is an adjacent matrix normalized by the Laplace matrix, X is an embedded vector of a node, and W is an initial characteristic (0) ∈R C×H and W(1) ∈R H×2 The weight matrix of two layers is respectively, C is the dimension of each node of the input layer, H is the dimension of each node of the hidden layer, Z epsilon R N×2 Is the output probability value of all nodes;
training is carried out by taking the predicted value and the real label as targets, and the adopted loss function is as follows:
wherein ,is a label node index set, Z lf Is GCN is the predictive vector of tag l, f is the dimension of the predictive vector, y lf Is the true one-hot vector of label l, with the goal of reducing the error between the true label and the predicted label.
Specifically, since the GCN is a supervised model, the training process needs to provide a real label of the node, so that the model parameters are continuously adjusted in the back propagation process, and the model parameters are optimal. The corresponding node characteristics obtained finally are the characteristics updated in the hidden layer of the GCN, namely, the characteristics of the node updated in S1.3.2 (such as 32 dimensions). And outputting a probability value (also called a predicted value, which is 2-dimensional in the embodiment, and finally predicts a softmax probability value, and continuously making the probability value closer to a real tag) for parameter tuning optimization in GCN back propagation.
In one embodiment, constructing a software system source file into an abstract syntax tree, extracting a token sequence, includes:
converting the source file into an abstract syntax tree using a javalang tool;
traversing an abstract syntax tree, preserving nodes of the type that need to be extracted, each source file is parsed into a series of code tags [ w ] 1 ,w 2 ,…,w n ]Constructing a word list V with a fixed size from the parsed code marks, converting each code mark into a low-dimensional continuous real-valued vector representation by using word2vec training word list to obtain a token embedded matrixd represents the dimension of the code tag and the token embedding matrix is used to store the token sequence.
Specifically, the source file is converted into an abstract syntax tree using the javalang tool, and the input may be a code fragment or a complete Java file, but must be all complete code. The application adopts class granularity conversion, namely each java file is converted into an abstract syntax tree.
Then determining the extracted node types, and only extracting three types of nodes as token marks by the method:
(1) Method call nodes (method invocation) and class instantiation nodes (classdeclartion), i.e. those nodes that record method names and class names;
(2) Declaration nodes, such as method declarations, class declarations, etc.;
(3) Control flow nodes, including IfStatement, whileStatement, forStatement, throwStatement, catchClause, etc., are simply labeled as their node type.
Since the names of methods, classes and types are generally specific to a particular item, methods with the same name are either rare or have different functions in different items. Thus for cross-project prediction, the extracted tokens are node types of AST, not specific names, thereby enabling generalization of the model.
Traversing the abstract syntax tree, only preserving the nodes of the type that need to be extracted, each source file is parsed into a series of code tags [ w ] 1 ,w 2 ,…,w n ]Constructing a word list V with a fixed size by using the code marks, converting each code mark into a low-dimensional continuous real-value vector representation by using word2vec training word list to obtain a token embedded matrixd represents the dimension of the code label.
In one embodiment, modeling using a bi-directional gating recursion unit obtains deep semantic features of a software system, comprising:
inputting a sequence of code tokens for each class file into the sequence of GRU units, each code token w i Using a real value vector x i =[x 1 ,x 2 ,…,x d ]Representing input into a GRU unit by predicting the next in sequenceTraining each GRU unit by a code token;
trained GRU units generate a state output vector s for each code tag i =[s 1 ,s 2 ,…,s d ]Representing capturing the semantic distribution of the code tag according to the context of the code tag;
a series of code tags [ w ] for each java source file 1 ,w 2 ,…,w n ]Input into trained GRU to obtain token state output sequence s 1 ,s 2 ,…,s n ]The output sequence of token states is aggregated into a file vector through a pooling layer as a deep semantic feature of the software system corresponding to the source file.
In particular, since the GRU is a cyclic network, the model parameters for all GRU units are the same. Output vector s i Can be combined with the input vector x i To simplify the training model, the dimensions of both are kept the same in this embodiment. s is(s) i Representing capturing its semantic distribution according to the context of the code tag. Code mark w i Output state vector s of (2) i Marking w from previous code by calculating posterior distribution 1:i Context semantics of (2) to predict the next code marker w i+1 。
In the implementation process, a Java file is converted into a series of code marks Each code mark w i Using a real value vector x i =[x 1 ,x 2 ,…,x d ]And (3) representing. At time i, the transition equation for the GRU is:
r i =σ(W r w i +U r h i-1 +b r )
z i =σ(W z w i +U z h i-1 +b z )
wherein ri Is to reset the gate, determine how much information of the previous time, z i Is an update gate for determining the current memory contentAnd information h of the previous time i-1 How to pass on to the next unit. These two gating vectors determine which information can ultimately be the output of the gating loop. />Is the current memory content, h i Is the information that is ultimately input to the next cell. />Are weight matrices, d is the input dimension of the real-valued vector, and m is the output dimension. b r ,b z ,b h Is the offset value.
To further enhance the ability of the loop layer to capture dependent information, the present application employs a bi-directional GRU in which hidden states in two directions are connected to form a new state:
wherein ,is the element information obtained in forward GRU at time i,/and>the cell information obtained by the reverse GRU is finally input into the pooling layer for sampling. Since the importance of different statements is considered to be different, e.g., a method call statement may contain more information, the largest pool is employed by default to capture the most important semantic information. The vector produced finally->The semantic feature vector representation of the Java file is obtained.
The implementation process of step S3 is as follows:
first, each java file is marked as defective or non-defective according to the defects posted in each project. And then, carrying out different types of feature extraction on the software system according to the steps S1 and S2 (as shown in figure 1), and combining the extracted software dependency relationship features with the source code semantic features for training a machine learning classifier. Finally, the trained model is used to predict whether the new instance is defective or non-defective. When intra-project defect prediction is performed, the instances of the training set and the test set are from the same project. When performing cross-project defect prediction, the training instance and the test instance come from different projects.
In the specific implementation process, the application adopts a multi-layer perceptron as a prediction model. The trained model is used to predict whether the new instance is defective or non-defective. As shown in FIG. 3, when intra-project defect prediction is performed, the instances of the training set and the test set are from the same project, namely project A. When performing cross-project defect prediction, a prediction model is trained using the instances in project A and test instances in project B are predicted using the model.
The scope of the present application is not limited to the above-described embodiments, and it is apparent that various modifications and variations can be made to the present application by those skilled in the art without departing from the scope and spirit of the application. It is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (6)
1. The defect prediction method based on the graph neural network and the bidirectional GRU feature extraction is characterized by comprising the following steps of:
s1: performing network modeling on the software system, and learning a network structure by using a neural network model to obtain the software dependency characteristics;
s2: constructing a software system source file into an abstract syntax tree, extracting a token sequence, and modeling by using a two-way gating recursion unit to obtain deep semantic features of the software system;
s3: combining the software dependency characteristics obtained in the step S1 with the semantic characteristics of the source codes obtained in the step S2 to obtain final mixed characteristics, and training a classifier for defect prediction based on the obtained mixed characteristics;
wherein, step S1 includes:
s1.1: analyzing a software source code file by using an open source tool to construct a software dependent network model;
s1.2: learning a software dependent network by using a network embedding method to obtain an embedding vector of a node;
s1.3: constructing a graph neural network model, and inputting the software dependent network model constructed in the step S1.1 and the embedded vector of the node obtained in the step S1.2 as the graph neural network model to obtain the software dependent relationship characteristic;
s1.3 specifically comprises:
s1.3.1: taking the software dependent network model obtained in the step S1.1 and the node embedded vector X obtained in the step S1.2 as inputs of the graph neural network,v is the number of nodes, d is the dimension of the node embedded vector,n (V) = { u ε V| (V, u) ∈E) represents the neighbor set of node V, +.>Implicit embedded vector representing target node v at model k-th layer, initially input, ++>
S1.3.2: the node vector mode is iteratively updated through a graph neural network GCN by using convolution operation, each node in the network aggregates the neighbor node characteristics in iteration and is combined with an embedded vector obtained by iteration of the node on the previous layer to obtain a new layer of embedded vector, each node is sequentially iterated until the last layer is output to obtain updated node characteristics, wherein the convolution layer of the GCN is expressed as follows:
wherein A is an adjacency matrix,is to add a self-connecting adjacency matrix, I N Is an identity matrix>Is a matrix of degrees that is a function of the degree, representation matrix->The value of the element corresponding to the ith row and the jth column (l) Is the output of layer I, Θ (l) Is of the first layerParameters, σ (·) represent activation function, H (l+1) The output of the layer 1 is the output of the layer 1, and the updated node characteristic is the software dependency characteristic.
2. The defect prediction method of claim 1, wherein step S1.1 comprises:
s1.1.1: and performing dependency relation scanning on various files generated after source code compiling by using an open source tool, and storing the files as a uniform file format:
s1.1.2: analyzing and extracting the dependency relationship among various files, performing network modeling, and constructing a software dependent network model, wherein the software dependent network model is an undirected and unauthorized network SDN= (V, E), and the node V is a network with no direction and no rights i (v i E V) represents a class file or interface file in a software system, and if there is a dependency relationship between two nodes, there is a continuous edge e between them ij, wherein ,eij =(v i ,v j )∈E。
3. The defect prediction method of claim 1, wherein step S1.2 comprises:
s1.2.1: the Node2vec method is utilized, the network is learned and converted into a Node sequence in a biased random walk mode, and the Node sequence obtained through conversion is used as a text sequence in natural language processing;
s1.2.2: training is carried out through a word vector model skip gram according to the obtained node sequence, and low-dimensional vector representation of the node is obtained and used as an embedded vector of the node.
4. The defect prediction method of claim 1, wherein after step S1.3.2, the method further comprises:
the output probability value of each node is obtained through the softmax layer, and the specific formula is as follows:
wherein ,is an adjacent matrix normalized by the Laplace matrix, X is an embedded vector of a node, and W is an initial characteristic (0) ∈R C×H and W(1) ∈R H×2 The weight matrix of two layers is respectively, C is the dimension of each node of the input layer, H is the dimension of each node of the hidden layer, Z epsilon R N×2 Is the output probability value of all nodes;
training is carried out by taking the predicted value and the real label as targets, and the adopted loss function is as follows:
wherein ,is a label node index set, Z lf Is GCN is the predictive vector of tag l, f is the dimension of the predictive vector, y lf Is the true one-hot vector of label l, with the goal of reducing the error between the true label and the predicted label.
5. The defect prediction method of claim 1, wherein constructing the software system source file into an abstract syntax tree, extracting the token sequence, comprises:
converting the source file into an abstract syntax tree using a javalang tool;
traversing an abstract syntax tree, preserving nodes of the type that need to be extracted, each source file is parsed into a series of code tags [ w ] 1 ,w 2 ,…,w n ]Constructing a word list V with a fixed size from the parsed code marks, converting each code mark into a low-dimensional continuous real-valued vector representation by using word2vec training word list to obtain a token embedded matrixd represents the dimension of the code tag and the token embedding matrix is used to store the token sequence.
6. The defect prediction method of claim 1, wherein modeling using a bi-directional gating recursion unit obtains deep semantic features of a software system, comprising:
inputting a sequence of code tokens for each class file into the sequence of GRU units, each code token w i Using a real value vector x i =[x 1 ,x 2 ,…,x d ]Representing, entering into one of the GRU units, training each GRU unit by predicting a next code token in the sequence;
trained GRU units generate a state output vector s for each code tag i =[s 1 ,s 2 ,…,s d ]Representing capturing the semantic distribution of the code tag according to the context of the code tag;
a series of code tags [ w ] for each java source file 1 ,w 2 ,…,w n ]Input into trained GRU to obtain token state output sequence s 1 ,s 2 ,…,s n ]The output sequence of token states is aggregated into a file vector through a pooling layer as a deep semantic feature of the software system corresponding to the source file.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210323112.5A CN114816997B (en) | 2022-03-29 | 2022-03-29 | Defect prediction method based on graph neural network and bidirectional GRU feature extraction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210323112.5A CN114816997B (en) | 2022-03-29 | 2022-03-29 | Defect prediction method based on graph neural network and bidirectional GRU feature extraction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114816997A CN114816997A (en) | 2022-07-29 |
CN114816997B true CN114816997B (en) | 2023-08-18 |
Family
ID=82532126
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210323112.5A Active CN114816997B (en) | 2022-03-29 | 2022-03-29 | Defect prediction method based on graph neural network and bidirectional GRU feature extraction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114816997B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115617694B (en) * | 2022-11-30 | 2023-03-10 | 中南大学 | Software defect prediction method, system, device and medium based on information fusion |
CN116521899B (en) * | 2023-05-08 | 2024-03-26 | 中国传媒大学 | Improved graph neural network-based document level relation extraction method and system |
CN116383089B (en) * | 2023-05-29 | 2023-08-04 | 云南大学 | Statement level software defect prediction system based on ordinary differential equation diagram neural network |
CN117251376B (en) * | 2023-10-09 | 2024-03-19 | 湖北大学 | Software defect prediction method and system |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110825615A (en) * | 2019-09-23 | 2020-02-21 | 中国科学院信息工程研究所 | Software defect prediction method and system based on network embedding |
CN111259388A (en) * | 2020-01-09 | 2020-06-09 | 中山大学 | Malicious software API (application program interface) calling sequence detection method based on graph convolution |
DE202020101664U1 (en) * | 2020-02-07 | 2020-07-07 | Google Llc | Computer-aided graph optimization |
CN111783100A (en) * | 2020-06-22 | 2020-10-16 | 哈尔滨工业大学 | Source code vulnerability detection method for code graph representation learning based on graph convolution network |
CN112069822A (en) * | 2020-09-14 | 2020-12-11 | 上海风秩科技有限公司 | Method, device and equipment for acquiring word vector representation and readable medium |
CN112541180A (en) * | 2020-12-16 | 2021-03-23 | 北京理工大学 | Software security vulnerability detection method based on grammatical features and semantic features |
CN112633550A (en) * | 2020-11-23 | 2021-04-09 | 成都唐源电气股份有限公司 | RNN-based catenary fault trend prediction method, equipment and storage medium |
CN112966271A (en) * | 2021-03-18 | 2021-06-15 | 中山大学 | Malicious software detection method based on graph convolution network |
CN114185769A (en) * | 2021-11-16 | 2022-03-15 | 南京航空航天大学 | Software defect prediction method and terminal based on bidirectional long-short term memory neural network |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11544105B2 (en) * | 2019-09-11 | 2023-01-03 | Google Llc | Recommendations for scheduling jobs on distributed computing devices |
US11481418B2 (en) * | 2020-01-02 | 2022-10-25 | International Business Machines Corporation | Natural question generation via reinforcement learning based graph-to-sequence model |
-
2022
- 2022-03-29 CN CN202210323112.5A patent/CN114816997B/en active Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110825615A (en) * | 2019-09-23 | 2020-02-21 | 中国科学院信息工程研究所 | Software defect prediction method and system based on network embedding |
CN111259388A (en) * | 2020-01-09 | 2020-06-09 | 中山大学 | Malicious software API (application program interface) calling sequence detection method based on graph convolution |
DE202020101664U1 (en) * | 2020-02-07 | 2020-07-07 | Google Llc | Computer-aided graph optimization |
CN111783100A (en) * | 2020-06-22 | 2020-10-16 | 哈尔滨工业大学 | Source code vulnerability detection method for code graph representation learning based on graph convolution network |
CN112069822A (en) * | 2020-09-14 | 2020-12-11 | 上海风秩科技有限公司 | Method, device and equipment for acquiring word vector representation and readable medium |
CN112633550A (en) * | 2020-11-23 | 2021-04-09 | 成都唐源电气股份有限公司 | RNN-based catenary fault trend prediction method, equipment and storage medium |
CN112541180A (en) * | 2020-12-16 | 2021-03-23 | 北京理工大学 | Software security vulnerability detection method based on grammatical features and semantic features |
CN112966271A (en) * | 2021-03-18 | 2021-06-15 | 中山大学 | Malicious software detection method based on graph convolution network |
CN114185769A (en) * | 2021-11-16 | 2022-03-15 | 南京航空航天大学 | Software defect prediction method and terminal based on bidirectional long-short term memory neural network |
Non-Patent Citations (1)
Title |
---|
钱冠群 ; 张莉 ; 张林 ; Philip Lew ; .软件静态结构的依赖网络建模方法与特性分析.计算机科学.2008,(第11期),全文. * |
Also Published As
Publication number | Publication date |
---|---|
CN114816997A (en) | 2022-07-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114816997B (en) | Defect prediction method based on graph neural network and bidirectional GRU feature extraction | |
CN110597735B (en) | Software defect prediction method for open-source software defect feature deep learning | |
CN111913702B (en) | Method for identifying key classes in software system based on graph neural network | |
CN113468888A (en) | Entity relation joint extraction method and device based on neural network | |
CN113312447A (en) | Semi-supervised log anomaly detection method based on probability label estimation | |
CN112417063B (en) | Heterogeneous relation network-based compatible function item recommendation method | |
CN115145551A (en) | Intelligent auxiliary system for machine learning application low-code development | |
CN115858825A (en) | Equipment fault diagnosis knowledge graph construction method and device based on machine learning | |
CN101546290B (en) | Method for improving accuracy of quality forecast of class hierarchy in object-oriented software | |
CN115659966A (en) | Rumor detection method and system based on dynamic heteromorphic graph and multi-level attention | |
CN116521560A (en) | Multi-feature fusion emperor class detection method based on graph neural network | |
Ouyang et al. | Semantic enrichment of object associations across federated BIM semantic graphs in a common data environment | |
CN117290238B (en) | Software defect prediction method and system based on heterogeneous relational graph neural network | |
CN114238524B (en) | Satellite frequency-orbit data information extraction method based on enhanced sample model | |
CN114881032A (en) | Hierarchical category named entity recognition model design method based on multi-task learning | |
CN111783688B (en) | Remote sensing image scene classification method based on convolutional neural network | |
CN115438190B (en) | Power distribution network fault auxiliary decision knowledge extraction method and system | |
CN117056226A (en) | Cross-project software defect number prediction method based on transfer learning | |
CN117236374A (en) | Layering interpretation method based on fully developed material graph neural network | |
CN113342982B (en) | Enterprise industry classification method integrating Roberta and external knowledge base | |
CN116910190A (en) | Method, device and equipment for acquiring multi-task perception model and readable storage medium | |
CN117439800B (en) | Network security situation prediction method, system and equipment | |
CN113377422B (en) | Self-recognition technical liability method based on deep learning identification | |
CN113326352B (en) | Sub-event relation identification method based on heterogeneous event graph | |
CN117555776A (en) | Software defect prediction method based on dual learning and deep semantic mining |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |