
CN115203408A - Intelligent labeling method for multi-modal test data - Google Patents

Intelligent labeling method for multi-modal test data

Info

Publication number
CN115203408A
CN115203408A
Authority
CN
China
Prior art keywords
image
data
labeling
gradient
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210723506.XA
Other languages
Chinese (zh)
Inventor
张骁雄
周晓磊
范强
严浩
王芳潇
江春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210723506.XA priority Critical patent/CN115203408A/en
Publication of CN115203408A publication Critical patent/CN115203408A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/35: Clustering; Classification (of unstructured textual data, G06F16/30)
    • G06F16/55: Clustering; Classification (of still image data, G06F16/50)
    • G06F16/75: Clustering; Classification (of video data, G06F16/70)


Abstract

The invention discloses an intelligent labeling method for multi-modal test data, belongs to the technical field of computer applications, and aims to label text, image and video data. For text, intelligent labeling is realized with an unsupervised complex-network technique: word vector weights are assigned according to eccentricity centrality and degree centrality, and labels are then given accordingly. For images, a labeling model is built with a supervised machine learning method: the image is segmented, noise and over-segmentation are filtered out, the segmented images are classified with each semantic concept as a category, and the classifier is adjusted based on feedback information to improve classification precision. For videos, labeling is performed with machine learning: the video frames serve as segmentation points, the video is converted into pictures, content analysis is carried out, valid data are extracted and stored, and the pictures are then analyzed and labeled. By introducing suitable intelligent learning methods for each data format, the invention effectively improves the efficiency and precision of test data labeling.

Description

Intelligent labeling method for multi-modal test data
Technical Field
The invention relates to an intelligent labeling method and system for multi-modal test data, and belongs to the technical field of data processing.
Background
Text labeling technology can help product test data collectors quickly screen effective information, and can support text processing technologies such as text classification, text retrieval and automatic summarization. Document data such as technical materials and official documents are labeled, and key information such as contracts, product models and manufacturers is extracted. Text labeling mainly completes the acquisition and structured processing of basic document information, and the labeling that follows sentence segmentation, paragraph segmentation and word segmentation of the text content.
The historical test data include a large number of images and videos; by labeling the image and video data, key information such as document titles, subject terms, abstracts, indexes, image entities, scenes, places and persons can be extracted. Images of known types, already collected and organized with labels, form a training library; a model is trained on this library with a convolutional neural network, and the model is then used to classify and label unknown images. The results are stored in a structured form and entered into a database for retrieval and other business operations.
In the prior art, a model of each semantic concept is built by learning from manually labeled training video data, and the model is then used to classify unlabeled video data sets and assign the corresponding semantic concepts. Because machine learning theory is relatively mature, this is generally considered an appropriate approach to the video annotation problem, and current video annotation mainly relies on machine learning theory to improve annotation accuracy.
Disclosure of Invention
The invention aims to overcome the technical defects in the prior art and to solve the labeling of multiple classes of product test data. It provides an intelligent labeling method for multi-modal test data that realizes intelligent labeling of text, image and video data and, through an unsupervised complex-network technique, a supervised machine learning method and a machine learning method, effectively improves the efficiency and accuracy of intelligent labeling of text, images and videos.
The invention specifically adopts the following technical scheme: an intelligent labeling method for multi-modal test data comprises the following steps:
step SS1: inputting test data, wherein the test data comprises text data, image data and video data;
step SS2: executing an unsupervised text labeling method based on a complex network based on the text data; based on the image data, executing an image classification labeling method based on supervised machine learning; executing a video labeling method based on machine learning based on the video data;
and step SS3: performing word segmentation, stop-word filtering and other text preprocessing on the text data with the unsupervised complex-network-based text labeling method to obtain a preliminary word segmentation result, completing the semantic mapping of node words, assigning weights to the word components of the segmentation result by combining eccentricity centrality and degree centrality, obtaining the labels of the text data according to the weights, and completing the labeling of the text data; identifying targets in the image data with the image classification labeling method based on supervised machine learning, adding image semantic content and keywords for unknown images, and giving corresponding category information to unclassified images to finish the labeling of the image data; extracting key frames from the video data with the video labeling method based on machine learning, and extracting the entity, scene, place and person features of the key-frame images with an OCR character recognition method to complete the recognition of text content; identifying typical image data with an image target recognition algorithm and marking it in the image, displaying the recognized marks under intelligent feature extraction, where clicking a mark positions it to the specified area in the image;
and step SS4: summarizing the labeling result of the text data, the labeling result of the image data and the labeling result of the video data obtained in the step SS3 to generate a test data labeling result set.
As a preferred embodiment, the step SS3 specifically includes: the complex network is composed of points and edges; the minimum unit capable of representing complete semantic information in text data is a sentence, so sentences are represented by nodes and the structural feature analysis of the text is carried out in units of sentences; the edge definition principle is that if two sentences have a common noun an edge is generated to connect them, otherwise no edge is generated; if two sentences in the network share an edge, i.e. have a common noun, they may describe the same topic or convey supplementary material on the same topic, and although the two sentences may contain repeated redundant information, their contents are the most closely related; a complex network is constructed through the common-noun relationship between sentences, finally yielding the complex network of the text data; after preprocessing, the nouns generated by each sentence in the text data are mapped into the complex network of the text data, and two matrices A and W are defined according to the concepts of an adjacency matrix and N-order matrix weights, where N is the number of nodes or sentences, the matrix A represents the edge relationships between sentences, and the matrix W represents the sentence weights; in the A matrix, a_ij = a_ji = 1 if there is an edge between node i and node j, and 0 otherwise; in the W matrix, the edge weight w_ij = w_ji is the number of times a common word appears in node i and node j.
As a preferred embodiment, the step SS3 of assigning weights to the word components of the word segmentation result by combining eccentricity centrality and degree centrality and obtaining the text labels according to the weights specifically includes:
for a simple connected graph G, the node set is V(G) and the edge set is E(G); for two nodes u, v ∈ V(G), the distance between them is defined as the length of the shortest path between them, denoted d_G(u, v); the eccentricity ε(v) of node v is the maximum of the distances from node v to the other nodes, and deg_G(v) denotes the degree of node v; if deg_G(v) = 1, node v is called a good connection point; the number of good connection points in the simple connected graph G is denoted ω(G), and the eccentric connectivity index, denoted ξ^c(G), is expressed as:

ξ^c(G) = Σ_{v∈V(G)} deg_G(v)·ε(v)

The total eccentricity index is defined as ζ(G), expressed as:

ζ(G) = Σ_{v∈V(G)} ε(v)

ζ(G, x) = Σ_{v∈V(G)} x^ε(v)

where ζ(G) = (ζ(G, x))' evaluated at x = 1.
The dot degree centrality represents the number of nodes directly connected with the node, an undirected graph is (n-1), a directed graph is (in-degree and out-degree), and the dot degree centrality is divided into absolute and relative;
the absolute point degree centrality of degree centrality, intermediate centrality, and approximate centrality are respectively expressed as:
Figure BDA0003712514400000043
Figure BDA0003712514400000044
Figure BDA0003712514400000045
the degree centrality, the middle centrality, and the relative point centrality of the approximate centrality are respectively expressed as:
C RDi =d(i)/(n-1)
C RBi =2C ABi /[(n-1)(n-2)]
Figure BDA0003712514400000046
and finally, allocating labels to the text data according to the centrality.
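As a quick illustration of the relative centrality formulas above, the following sketch (not part of the patent) computes relative degree, betweenness and closeness centrality on a small undirected graph with networkx; the 5-node path graph is an arbitrary example, and networkx's normalized betweenness matches the 2·C_ABi/[(n-1)(n-2)] form given above.

```python
# Illustrative check of the relative centrality formulas on a toy graph.
import networkx as nx

G = nx.path_graph(5)            # 5 nodes in a line: 0-1-2-3-4
n = G.number_of_nodes()

bc = nx.betweenness_centrality(G)   # normalized: 2*C_AB / ((n-1)(n-2)) for undirected graphs
cc = nx.closeness_centrality(G)     # relative closeness: (n-1) / sum of distances

for i in G.nodes:
    c_rd = G.degree(i) / (n - 1)    # relative degree centrality C_RD = d(i)/(n-1)
    print(i, round(c_rd, 3), round(bc[i], 3), round(cc[i], 3))
```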
As a preferred embodiment, the image classification labeling method based on supervised machine learning in step SS3 specifically includes:
step SS321: an image tag management step including: selecting a plurality of labels for each training image;
step SS322: the feature extraction step, comprising: extracting the color, texture, shape and spatial features of the image, and describing the image features with the scale-invariant feature transform algorithm SIFT, the speeded-up robust features algorithm SURF and the histogram of oriented gradients algorithm HOG;
step SS323: training an algorithm model, comprising the following steps: generating a classification model by using a Support Vector Machine (SVM), a convolutional neural network, BOOSTING and a random forest method to obtain a classifier, and identifying a target in an image by using the classification model;
step SS324: a target detection analysis step, comprising: adding image semantic content and keywords aiming at unknown images, giving corresponding category information to the unclassified images, and completing the labeling of the images.
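A minimal sketch of steps SS322-SS324 is shown below, assuming HOG features from scikit-image and an SVM classifier from scikit-learn; the random training images and the label names are placeholders rather than data from the patent.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def hog_features(gray_image):
    """Step SS322: extract a HOG descriptor from a grayscale image."""
    return hog(gray_image, orientations=8, pixels_per_cell=(16, 16),
               cells_per_block=(2, 2))

# placeholder training data: ten random 64x64 "images" with two labels
rng = np.random.default_rng(0)
train_imgs = rng.random((10, 64, 64))
train_labels = ["target"] * 5 + ["background"] * 5

X = np.array([hog_features(img) for img in train_imgs])
clf = SVC(kernel="rbf", C=1.0).fit(X, train_labels)        # step SS323: train the classifier

# step SS324: give category information to an unknown image
unknown = rng.random((64, 64))
print(clf.predict([hog_features(unknown)]))
```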
As a preferred embodiment, the scale invariant feature transform algorithm SIFT specifically includes:
the building of the scale space comprises the following steps: determining and down-sampling a fuzzy scale, and blurring an image by adopting Gaussian convolution, wherein a Gaussian difference function of the image is as follows:
D(x, y, σ) = (G(x, y, kσ) - G(x, y, σ)) * I(x, y),  with G(x, y, σ) = (1/(2πσ^2)) exp(-(x^2 + y^2)/(2σ^2)) and k the scale factor between adjacent scales
in the formula, x and y represent pixel coordinates, and σ is a variance.
The detection of the candidate extremum in the scale space comprises the following steps: searching image positions on all scales, and identifying potential interest points which are invariable in scale and rotation through a Gaussian difference function;
feature point positioning, comprising: determining the position and scale at each candidate location by fitting a refined model, obtaining the ratio of the principal gradient direction to the other directions by computing the second-derivative Hessian matrix of the DoG, and keeping only local feature points whose ratio is smaller than a given value; the Hessian matrix is expressed as:

H = [ I_xx  I_xy ]
    [ I_xy  I_yy ]

where I_xx denotes the second partial derivative in the x direction, I_yy the second partial derivative in the y direction, and I_xy the second partial derivative in the xy direction;
feature point direction assignment, comprising: assigning one or more directions to each keypoint location based on the local gradient directions of the image; all subsequent operations on the image data are performed relative to the direction, scale and location of the keypoint, thereby providing invariance to these transformations; for the same original image, the larger the scale, the larger the window; conversely, if the window size is unchanged, an image at a larger scale contains less information; the gradient magnitude m(x, y) and gradient direction θ(x, y) of each sampling point (x, y) in the window are calculated as:

m(x, y) = sqrt( (L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2 )

θ(x, y) = tan^(-1)( (L(x, y+1) - L(x, y-1)) / (L(x+1, y) - L(x-1, y)) )
the characteristic point description comprises the following steps: measuring the local gradient of the image on a selected scale in a neighborhood around each key point;
and feature point matching, comprising: matching keypoints between two images amounts to identifying the nearest neighbour, but in some cases the second-nearest neighbour is not clearly separated from the nearest, possibly because of noise or other reasons; in this case the ratio of the nearest to the second-nearest distance is taken, and if this ratio is greater than 0.8 the match is rejected, which eliminates about 90% of false matches.
As a preferred embodiment, the feature point description specifically includes: first, the neighborhood range used to compute the feature descriptor is determined; the neighborhood around the feature point is divided into 4×4 sub-regions, each sub-region serves as a seed point, and each seed point has 8 directions; unlike the computation of the main direction of the feature point, the gradient direction histogram of each sub-region here divides 0°-360° into 8 direction ranges of 45° each, so that each seed point carries gradient intensity information in 8 directions.
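A hedged sketch of SIFT keypoint detection and the 0.8 ratio test described above is given below, using OpenCV; the image file names are placeholders.

```python
import cv2

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)       # placeholder path
img2 = cv2.imread("reference.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder path

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# nearest and second-nearest neighbour for every descriptor
matcher = cv2.BFMatcher()
knn = matcher.knnMatch(des1, des2, k=2)

# keep a match only when the nearest/second-nearest distance ratio is below 0.8
good = [m for m, n in knn if m.distance < 0.8 * n.distance]
print(len(good), "matches kept after the ratio test")
```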
As a preferred embodiment, the speeded up robust features algorithm SURF specifically includes:
the building of the scale space comprises the following steps: determining the blur scale and down-sampling, blurring the image with Gaussian convolution, and constructing a Hessian matrix; for a function f(x, y), the Hessian matrix H consists of its partial derivatives:

H(f(x, y)) = [ ∂²f/∂x²    ∂²f/∂x∂y ]
             [ ∂²f/∂x∂y   ∂²f/∂y²  ]

The discriminant of the H matrix is:

det(H) = (∂²f/∂x²)(∂²f/∂y²) - (∂²f/∂x∂y)²

In SURF, using a second-order standard Gaussian function as the filter, the H matrix is obtained by computing the second-order partial derivatives through convolution with specific kernels:

H(X, t) = [ L_xx(X, t)  L_xy(X, t) ]
          [ L_xy(X, t)  L_yy(X, t) ]

L(X, t) = G(t) * I(X)

L(X, t) is the representation of the image at different resolutions, given by the convolution of a Gaussian kernel G(t) with the image function I(X); to balance the error between the exact and the approximate values, the discriminant of the H matrix becomes:

det(H_approx) = D_xx·D_yy - (0.9·D_xy)^2
the detection of the candidate extremum in the scale space comprises the following steps: searching image positions on all scales, and identifying potential interest points which are invariable in scale and rotation through a Gaussian difference function;
feature point positioning, comprising: determining the position and scale at each candidate location by fitting a refined model, obtaining the ratio of the principal gradient direction to the other directions by computing the second-derivative Hessian matrix of the DoG, and keeping only local feature points whose ratio is smaller than a given value; the Hessian matrix is expressed as:

H = [ I_xx  I_xy ]
    [ I_xy  I_yy ]

where I_xx denotes the second partial derivative in the x direction, I_yy the second partial derivative in the y direction, and I_xy the second partial derivative in the xy direction;
feature point direction assignment, comprising: assigning one or more directions to each keypoint location based on the local gradient directions of the image; all subsequent operations on the image data are performed relative to the direction, scale and location of the keypoint, thereby providing invariance to these transformations; for the same original image, the larger the scale, the larger the window; conversely, if the window size is unchanged, an image at a larger scale contains less information; the gradient magnitude m(x, y) and gradient direction θ(x, y) of each sampling point (x, y) in the window are calculated as:

m(x, y) = sqrt( (L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2 )

θ(x, y) = tan^(-1)( (L(x, y+1) - L(x, y-1)) / (L(x+1, y) - L(x-1, y)) )
the characteristic point description comprises the following steps: measuring the local gradient of the image on a selected scale in a neighborhood around each key point;
and feature point matching, comprising: matching keypoints between two images amounts to identifying the nearest neighbour, but in some cases the second-nearest neighbour is not clearly separated from the nearest, possibly because of noise or other reasons; in this case the ratio of the nearest to the second-nearest distance is taken, and if this ratio is greater than 0.8 the match is rejected, which eliminates about 90% of false matches.
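The sketch below is a small numerical illustration of the approximated determinant det(H_approx) = D_xx·D_yy - (0.9·D_xy)^2 on a Gaussian-smoothed test image; the finite-difference derivatives, the random image and the threshold are illustrative choices, not the SURF box-filter implementation itself.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(1)
image = gaussian_filter(rng.random((128, 128)), sigma=2.0)  # smoothed test image

# second-order derivatives approximated by finite differences
Dx = np.gradient(image, axis=1)
Dy = np.gradient(image, axis=0)
Dxx = np.gradient(Dx, axis=1)
Dyy = np.gradient(Dy, axis=0)
Dxy = np.gradient(Dx, axis=0)

det_approx = Dxx * Dyy - (0.9 * Dxy) ** 2
# count strong responses as candidate interest points
threshold = det_approx.mean() + 3 * det_approx.std()
print(int((det_approx > threshold).sum()), "candidate points")
```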
As a preferred embodiment, the histogram of oriented gradients algorithm HOG specifically includes:
Standardization of the gamma space and color space: the whole image is normalized in order to reduce the influence of illumination; in the texture intensity of an image, local surface exposure contributes a large proportion, so this compression effectively reduces local shadows and illumination changes in the image; because the color information contributes little, the image is generally first converted to a gray-scale map;

I(x, y) = I(x, y)^gamma

Image gradient calculation: the gradients of the image in the horizontal and vertical directions are computed, and the gradient magnitude and direction at each pixel position are computed from them; the derivative operation not only captures contours, shadows and some texture information, but also further weakens the influence of illumination. The gradient of pixel (x, y) in the image is:

G_x(x, y) = H(x+1, y) - H(x-1, y)

G_y(x, y) = H(x, y+1) - H(x, y-1)

where G_x(x, y), G_y(x, y) and H(x, y) respectively denote the horizontal gradient, the vertical gradient and the pixel value at pixel (x, y) of the input image; the gradient magnitude and gradient direction at pixel (x, y) are respectively:

G(x, y) = sqrt( G_x(x, y)^2 + G_y(x, y)^2 )

α(x, y) = tan^(-1)( G_y(x, y) / G_x(x, y) )
Constructing a gradient direction histogram for each cell unit provides an encoding of the local image region while maintaining weak sensitivity to the pose and appearance of objects in the image;
The cell units are grouped into larger blocks and the gradient histograms are normalized within each block; because local illumination changes and changes in foreground-background contrast make the range of gradient intensities very large, the gradient intensity must be normalized, and normalization further compresses illumination, shadows and edges;
and collecting HOG features, collecting all overlapped blocks in the detection window with the HOG features, and combining the overlapped blocks into a final feature vector for classification.
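A minimal sketch of the gradient computation above follows: the horizontal and vertical differences G_x and G_y, the per-pixel magnitude and orientation, and an 8-bin orientation histogram for one cell; the cell size, bin count and normalization constant are illustrative.

```python
import numpy as np

def cell_histogram(gray, bins=8):
    H = gray.astype(float)
    Gx = np.zeros_like(H)
    Gy = np.zeros_like(H)
    Gx[:, 1:-1] = H[:, 2:] - H[:, :-2]   # G_x(x, y) = H(x+1, y) - H(x-1, y)
    Gy[1:-1, :] = H[2:, :] - H[:-2, :]   # G_y(x, y) = H(x, y+1) - H(x, y-1)

    magnitude = np.sqrt(Gx ** 2 + Gy ** 2)
    angle = np.degrees(np.arctan2(Gy, Gx)) % 360.0

    # orientation histogram weighted by gradient magnitude, then L2-normalized
    hist, _ = np.histogram(angle, bins=bins, range=(0.0, 360.0), weights=magnitude)
    return hist / (np.linalg.norm(hist) + 1e-6)

cell = np.random.default_rng(2).random((8, 8)) * 255
print(cell_histogram(cell))
```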
As a preferred embodiment, the use of the support vector machine SVM in step SS323 includes: input: training data set T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i ∈ X = R^n, y_i ∈ {+1, -1}, i = 1, 2, ..., N;
output: the separating hyperplane and the classification decision function;
a suitable kernel function K(x, z) and a penalty parameter C > 0 are selected, and the following convex quadratic programming problem is constructed and solved:

min over α:  (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j K(x_i, x_j) - Σ_{i=1}^{N} α_i

s.t.  Σ_{i=1}^{N} α_i y_i = 0,  0 ≤ α_i ≤ C,  i = 1, 2, ..., N

giving the optimal solution α* = (α_1*, α_2*, ..., α_N*)^T.
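As a rough illustration of this training step, the sketch below fits an SVM with an RBF kernel K(x, z) and penalty parameter C on toy two-class data using scikit-learn, then reads back the dual solution (scikit-learn exposes alpha_i * y_i as dual_coef_); the data and parameter values are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="rbf", C=1.0).fit(X, y)     # solves the dual problem internally

print("number of support vectors:", clf.support_.shape[0])
print("alpha_i * y_i (first five):", clf.dual_coef_[0][:5])
print("decision value for a new point:", clf.decision_function([[1.5, 1.5]]))
```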
As a preferred embodiment, the method for video annotation based on machine learning in step SS3 specifically includes:
Assume there are M modalities; sample x_i is expressed in the M modalities as x_i^(1), x_i^(2), ..., x_i^(M).
Assume there are D distance measures d_1(·,·), d_2(·,·), ..., d_D(·,·); then M × D graphs can be generated from the M modalities and the D distance measures, where W_(m-1)×D+k,ij denotes the weight between samples i and j in the graph generated from the m-th modality and the k-th distance metric.
For time continuity, C-graphs can also be constructed; here we consider two graphs, i.e. C =2; the first graph considers the relationship between every two adjacent samples, i.e. it considers that there is a high probability that every sample has the same concept as its adjacent samples, which is expressed as:
Figure BDA0003712514400000101
the other diagram considers the relationship between each sample and its adjacent 6 samples, and the weight is determined according to the position of this sample, and the concrete form is:
Figure BDA0003712514400000102
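The sketch below illustrates the graph construction just described: M modality feature sets and D distance measures give M × D similarity graphs, plus one temporal-continuity graph linking adjacent samples; turning a distance into an edge weight with a Gaussian kernel (and the sigma value) is an assumption of this sketch, not something specified in the text.

```python
import numpy as np

def build_graphs(modality_features, distances, sigma=1.0):
    """modality_features: list of (n, d_m) arrays; distances: list of callables."""
    n = modality_features[0].shape[0]
    graphs = []
    for Xm in modality_features:              # m = 1..M
        for dist in distances:                # k = 1..D
            W = np.zeros((n, n))
            for i in range(n):
                for j in range(n):
                    W[i, j] = np.exp(-dist(Xm[i], Xm[j]) / sigma)
            graphs.append(W)
    # temporal-continuity graph: adjacent samples are connected
    T = np.zeros((n, n))
    for i in range(n - 1):
        T[i, i + 1] = T[i + 1, i] = 1.0
    graphs.append(T)
    return graphs

rng = np.random.default_rng(4)
feats = [rng.random((5, 3)), rng.random((5, 2))]              # M = 2 modalities
dists = [lambda a, b: np.linalg.norm(a - b),                  # Euclidean
         lambda a, b: float(np.abs(a - b).sum())]             # Manhattan, D = 2
print(len(build_graphs(feats, dists)))   # M*D + 1 = 5 graphs
```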
compared with the prior art, the invention has the following beneficial effects: the invention discloses an intelligent multi-mode test data labeling method which is mainly used for solving the labeling of multi-type data of a product test. The method comprises three parts, namely intelligent text labeling based on a complex network unsupervised technology, intelligent image labeling based on supervised machine learning and intelligent video labeling based on machine learning. The method can be used for carrying out quick and accurate intelligent marking on the data according to the data characteristics of texts, images and videos, and greatly reduces the marking time cost of the product test data.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a flowchart of intelligent text labeling according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating intelligent extraction of text content features according to the present invention;
FIG. 4 is a schematic diagram of the annotation data management of the present invention;
FIG. 5 is a process for automatically labeling text content according to the present invention;
FIG. 6 is a flowchart illustrating intelligent labeling of images according to an embodiment of the present invention;
fig. 7 is a flowchart illustrating intelligent video annotation according to an embodiment of the present invention.
Detailed Description
As shown in fig. 1 to 7, the invention provides an intelligent labeling method for multi-modal test data, comprising the following steps:
step SS1: inputting test data, wherein the test data comprises text data, image data and video data;
step SS2: executing an unsupervised text labeling method based on a complex network based on the text data; executing an image classification labeling method based on supervised machine learning based on the image data; executing a video annotation method based on machine learning based on the video data;
and step SS3: performing word segmentation, stop-word filtering and other text preprocessing on the text data with the unsupervised complex-network-based text labeling method to obtain a preliminary word segmentation result, completing the semantic mapping of node words, assigning weights to the word components of the segmentation result by combining eccentricity centrality and degree centrality, obtaining the labels of the text data according to the weights, and completing the labeling of the text data; identifying targets in the image data with the image classification labeling method based on supervised machine learning, adding image semantic content and keywords for unknown images, and giving corresponding category information to unclassified images to finish the labeling of the image data; extracting key frames from the video data with the video labeling method based on machine learning, and extracting the entity, scene, place and person features of the key-frame images with an OCR character recognition method to complete the recognition of text content; identifying typical image data with an image target recognition algorithm and marking it in the image, displaying the recognized marks under intelligent feature extraction, where clicking a mark positions it to the specified area in the image;
and step SS4: summarizing the labeling result of the text data, the labeling result of the image data and the labeling result of the video data obtained in the step SS3 to generate a test data labeling result set.
(1) The text intelligent labeling based on the complex network unsupervised technology comprises the following specific steps:
a. text pre-processing
The flow of text preprocessing generally comprises the following steps: obtaining original text, segmenting words, cleaning the text, standardizing, extracting features, modeling and the like.
(1) The original text is the text data of the product test to be processed;
(2) Word segmentation: both Chinese and English are processed on the basis of words, but the segmentation approach differs because of the characteristics of each language. In most cases English words can be separated directly by spaces, whereas Chinese grammar is more complex, so a third-party library such as jieba is used for word segmentation.
(3) Text cleaning: unnecessary punctuation, stop words and the like are removed, and cleaning is performed in steps: punctuation removal, conversion of English to lower case, number normalization, removal of stop words and low-frequency words, and removal of unnecessary tags.
(4) Normalization, word-shape reduction (Lemmatization) and stem extraction (Stemming).
(5) And (4) feature extraction, namely extracting text features by adopting TF-IDF, word2Vec, countVectorizer and other modes.
(6) In an actual working scenario, some other processing may be needed; for example, because users often make spelling mistakes in practice, spelling correction (spell correction) and the like may be required.
The invention selects the Chinese Lexical Analysis System ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System), developed by the Institute of Computing Technology of the Chinese Academy of Sciences, as the tool for automatic word segmentation of the text; the system not only supports Chinese word segmentation and part-of-speech tagging, but also provides keyword recognition, support for user-defined dictionaries and other functions.
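A minimal sketch of this preprocessing chain is shown below; it substitutes the open-source jieba segmenter for ICTCLAS, uses a tiny illustrative stop-word list, and extracts TF-IDF features with scikit-learn, so the library choices and the placeholder texts are assumptions rather than the patent's exact tooling.

```python
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

STOP_WORDS = {"的", "了", "和", "是", "在"}   # tiny illustrative stop-word list

def preprocess(doc: str) -> str:
    """Segment a Chinese document into words and drop stop words."""
    tokens = [w for w in jieba.cut(doc) if w.strip() and w not in STOP_WORDS]
    return " ".join(tokens)

docs = ["某型产品试验大纲的文本内容", "试验结果分析报告的文本内容"]   # placeholder test texts
corpus = [preprocess(d) for d in docs]

vectorizer = TfidfVectorizer()               # TF-IDF feature extraction (step 5 above)
tfidf = vectorizer.fit_transform(corpus)
print(tfidf.shape)
```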
b. Complex network construction
Since a complex network is composed of points and edges, and the minimum unit capable of representing complete semantic information in a text is a sentence, sentences are represented by nodes and the structural feature analysis of the text is carried out in units of sentences, which is reliable. An edge is defined by generating an edge between two sentences if they have a common noun, and generating no edge otherwise. If two sentences in the network share an edge, i.e. have a common noun, they may describe the same topic or convey supplementary material on the same topic, and although the two sentences may contain redundant information, their contents are the most closely related. A complex network is constructed through the common-noun relationship between sentences, finally yielding the text complex network.
After preprocessing, the nouns generated by each sentence in the text are mapped into the network. Two matrices A and W are defined according to the concept of the adjacency matrix and N-order matrix weights (N is the number of nodes or sentences); the A matrix represents the edge relationships between sentences and the W matrix represents the sentence weights. In the A matrix, a_ij = a_ji = 1 if there is an edge between node i and node j, and 0 otherwise. In the W matrix, the edge weight w_ij = w_ji is the number of times a common word appears in node i and node j.
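The sketch below builds the A and W matrices just described from per-sentence noun sets; the toy sentences are placeholders, and the assumption is simply that the nouns of each sentence have already been extracted during preprocessing.

```python
import numpy as np

def build_network(sentence_nouns):
    """sentence_nouns: list of sets, one set of nouns per sentence."""
    n = len(sentence_nouns)
    A = np.zeros((n, n), dtype=int)   # adjacency: edge if two sentences share a noun
    W = np.zeros((n, n), dtype=int)   # weight: number of shared nouns
    for i in range(n):
        for j in range(i + 1, n):
            shared = sentence_nouns[i] & sentence_nouns[j]
            if shared:
                A[i, j] = A[j, i] = 1
                W[i, j] = W[j, i] = len(shared)
    return A, W

A, W = build_network([{"产品", "试验"}, {"试验", "数据"}, {"标注"}])
print(A)
print(W)
```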
c. Word weight calculation: the weights of the word components are determined by computing eccentricity centrality and degree centrality, and the labels of the text are assigned according to the weights.
For a simple connected graph G, the vertex set is V(G) and the edge set is E(G). For two vertices u, v ∈ V(G), the distance between them is defined as the length of the shortest path between them, denoted d_G(u, v). The eccentricity ε(v) of vertex v is the maximum of the distances from v to the other vertices. deg_G(v) denotes the degree of vertex v; if deg_G(v) = 1, v is called a good connection point. The number of good connection points in graph G is denoted ω(G). The eccentric connectivity index is denoted ξ^c(G) and expressed as

ξ^c(G) = Σ_{v∈V(G)} deg_G(v)·ε(v)

The total eccentricity index is defined as ζ(G), expressed as

ζ(G) = Σ_{v∈V(G)} ε(v)

ζ(G, x) = Σ_{v∈V(G)} x^ε(v)

where ζ(G) = (ζ(G, x))' evaluated at x = 1.
Degree centrality represents the number of points directly connected to a point; in an undirected graph the maximum is (n - 1), while a directed graph distinguishes in-degree and out-degree; it is divided into absolute and relative forms.
The absolute forms of degree centrality, betweenness centrality and closeness centrality may be expressed as:

C_ADi = Σ_{j=1}^{n} a_ij = d(i)

C_ABi = Σ_{j<k} g_jk(i) / g_jk

C_APi = 1 / ( Σ_{j=1}^{n} d_ij )

where g_jk is the number of shortest paths between points j and k, g_jk(i) is the number of those paths passing through point i, and d_ij is the shortest-path distance between points i and j.
The relative forms of degree centrality, betweenness centrality and closeness centrality may be expressed as:

C_RDi = d(i)/(n - 1)

C_RBi = 2 C_ABi / [(n - 1)(n - 2)]

C_RPi = (n - 1) / Σ_{j=1}^{n} d_ij
finally, labels are allocated to the texts according to the centrality.
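A hedged sketch of this weighting step is given below: eccentricity and degree centrality are computed with networkx on the sentence graph, combined into a single score, and the highest-scoring nodes supply the labels; the equal 0.5/0.5 weighting of the two centralities is an illustrative choice, not taken from the patent.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (2, 3), (1, 3)])   # toy sentence graph

ecc = nx.eccentricity(G)            # eccentricity per node (graph must be connected)
deg = nx.degree_centrality(G)       # relative degree centrality d(i)/(n-1)

# combine the two centralities into one weight per node; smaller eccentricity scores higher
score = {v: 0.5 * deg[v] + 0.5 / ecc[v] for v in G.nodes}
labels = sorted(score, key=score.get, reverse=True)[:2]
print("label-bearing nodes:", labels)
```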
(2) The intelligent image annotation based on supervised machine learning comprises the following steps.
a. Image label management:
in order to balance the problem of image labels, the invention selects a plurality of labels for each training image by utilizing cooperative labeling, interactive labeling, semi-supervised labeling and supervised labeling.
b. And image preprocessing, which mainly performs graying, filtering and denoising, color space transformation and the like.
c. Extracting image color, texture and shape features by using the technologies of Scale Invariant Feature Transform (SIFT), speeded Up Robust Feature (SURF), histogram of Oriented Gradients (HOG) and the like;
d. combining the three extracted features, clustering them with the affinity propagation (neighbor propagation) method, and finding the respective cluster centers (see the sketch after this list);
e. taking each clustering center as a seed point, and performing region growth;
f. merging the results after the region growth to obtain a primary segmentation result;
g. calculating the similarity of adjacent segmentation areas;
k. determining whether the similarity is less than the preset similarity threshold: if so, merging the adjacent regions; otherwise, not merging;
l. outputting the image segmentation result;
m. training an image classifier with the training samples;
n. feeding the image to be labeled into the trained image classifier to complete the labeling of the image.
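A hedged sketch of the clustering step (d) referenced above follows: fused image features are clustered with affinity propagation in scikit-learn and the cluster centers are taken as seed points for region growing; the random feature matrix is placeholder data.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

features = np.random.default_rng(6).random((200, 16))   # placeholder fused SIFT/SURF/HOG features
ap = AffinityPropagation(random_state=0).fit(features)

seed_indices = ap.cluster_centers_indices_              # indices of cluster centers (seed points)
print(len(seed_indices), "seed points for region growing")
```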
(3) Intelligent video annotation based on machine learning
1) Extracting a video key frame;
the theme of the video is expressed in a clustering mode, the video frames are divided into a plurality of clusters through clustering, and corresponding frames are selected from each cluster as key frames after the process is finished. Firstly, initializing a clustering center; secondly, determining a reference frame which is divided into classes or a new clustering center which is used as the class by calculating the range between the clustering center and the current frame; and finally, selecting the video frame closest to the clustering center to process the video frame into a key frame.
a. The set of input video frame data is expressed as X = {x_1, x_2, …, x_n}, and the set is divided into a number of clusters given an initial cluster number k (k ≤ n).
b. Feature values are extracted from the set X based on the color-histogram attribute of the video frames, and the clusters are divided according to the extracted color feature values; the division process can be expressed as minimizing the clustering objective C, calculated as:

C = argmin Σ_{i=1}^{k} Σ_{x∈C_i} ||x - u_i||^2

where C = {C_1, C_2, …, C_k} is the clustering result and u_i is the mean of cluster C_i.
c. The feature vector x_1 corresponding to the first frame of the video is classified into the first class, and the color-histogram feature value corresponding to the first frame is taken as the initial centroid of that class.
d. Calculating the distance from the video frame to the centroid, and if the distance of the currently compared video frame is greater than a given initial threshold value T, classifying the frame into a new class; otherwise, the current frame is classified into the class closest to it, and the centroid of that class is updated.
e. Process d is repeated until the feature vector x_n corresponding to the last frame is either assigned to an existing class or taken as a new class center.
f. The video frame closest to each cluster center is selected as a key frame. The video key frames extracted by this algorithm have low redundancy, and they accurately reflect the content of the video.
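A minimal sketch of this key-frame selection follows, assuming OpenCV HSV color histograms as the frame feature and a fixed distance threshold T; the video path and the value of T are placeholders.

```python
import cv2
import numpy as np

def frame_hist(frame):
    """HSV color histogram used as the per-frame feature."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    h = cv2.calcHist([hsv], [0, 1], None, [16, 16], [0, 180, 0, 256])
    return cv2.normalize(h, h).flatten()

cap = cv2.VideoCapture("test_video.mp4")   # placeholder path
T = 0.5                                    # illustrative distance threshold
centroids, members = [], []                # one centroid and index list per class

idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    f = frame_hist(frame)
    if not centroids:
        centroids.append(f)
        members.append([idx])
    else:
        d = [np.linalg.norm(f - c) for c in centroids]
        k = int(np.argmin(d))
        if d[k] > T:                       # far from every class: start a new class
            centroids.append(f)
            members.append([idx])
        else:                              # join the nearest class and update its centroid
            members[k].append(idx)
            centroids[k] = centroids[k] + (f - centroids[k]) / len(members[k])
    idx += 1
cap.release()
print("number of clusters / key frames:", len(centroids))
```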
2) Carrying out image preprocessing such as filtering, graying, color space transformation and the like on the video frame image;
3) Extracting the color moment features, HSV (hue, saturation, value) correlograms, edge distribution histograms, HSV histograms, wavelet edges and co-occurrence texture feature maps of the frame images;
4) Fusing the 6 feature maps;
5) Training a video annotation classifier by using a video sample;
6) Inputting the feature graph into a video annotation classifier;
7) Obtaining the video annotation results.
(4) Vote-based multimodal data fusion
1) Feature selection. The text data are segmented into words, mapped to an N-dimensional space with a word-embedding model, and converted into matrix form. The image data are preprocessed, and the image information is reduced to a high-dimensional matrix through a feature extractor. The video is split into frames, and each frame is processed in the same way as an image.
2) Decision support: considering the fuzziness present in multi-source data, the random data are transformed and the confidence of the data with respect to a decision is calculated.
3) A suitable fuzzy quantization rule is selected according to the preference of the decision maker, the fuzzy quantization operator f(x) is solved, the weight vector w is determined from the fuzzy quantization operator, and the result is transformed.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (10)

1. The method for intelligently labeling the multi-modal test data is characterized by comprising the following steps of:
step SS1: inputting test data, wherein the test data comprises text data, image data and video data;
step SS2: executing an unsupervised text labeling method based on a complex network based on the text data; based on the image data, executing an image classification labeling method based on supervised machine learning; executing a video annotation method based on machine learning based on the video data;
step SS3: performing word segmentation, stop-word filtering and other text preprocessing on the text data with the unsupervised complex-network-based text labeling method to obtain a preliminary word segmentation result, completing the semantic mapping of node words, assigning weights to the word components of the segmentation result by combining eccentricity centrality and degree centrality, obtaining the labels of the text data according to the weights, and completing the labeling of the text data; identifying targets in the image data with the image classification labeling method based on supervised machine learning, adding image semantic content and keywords for unknown images, and giving corresponding category information to unclassified images to finish the labeling of the image data; extracting key frames from the video data with the video labeling method based on machine learning, and extracting the entity, scene, place and person features of the key-frame images with an OCR character recognition method to complete the recognition of text content; identifying typical image data with an image target recognition algorithm and marking it in the image, displaying the recognized marks under intelligent feature extraction, where clicking a mark positions it to the specified area in the image;
and step SS4: summarizing the labeling result of the text data, the labeling result of the image data and the labeling result of the video data obtained in the step SS3 to generate a test data labeling result set.
2. The method for intelligent labeling of multimodal experimental data as claimed in claim 1, wherein the step SS3 specifically comprises: the complex network is composed of points and edges; the minimum unit capable of representing complete semantic information in text data is a sentence, so sentences are represented by nodes and the structural feature analysis of the text is carried out in units of sentences; the edge definition principle is that if two sentences have a common noun an edge is generated to connect them, otherwise no edge is generated; if two sentences in the network share an edge, i.e. have a common noun, they may describe the same topic or convey supplementary material on the same topic, and although the two sentences may contain repeated redundant information, their contents are the most closely related; a complex network is constructed through the common-noun relationship between sentences, finally obtaining the complex network of the text data; after preprocessing, the nouns generated by each sentence in the text data are mapped into the complex network of the text data, and two matrices A and W are defined according to the concepts of an adjacency matrix and N-order matrix weights, where N is the number of nodes or sentences, the matrix A represents the edge relationships between sentences, and the matrix W represents the sentence weights; in the A matrix, a_ij = a_ji = 1 if there is an edge between node i and node j, and 0 otherwise; in the W matrix, the edge weight w_ij = w_ji is the number of times a common word appears in node i and node j.
3. The method for intelligently labeling multimodal experimental data as claimed in claim 1, wherein the step SS3 of assigning weights to the word components of the word segmentation result by combining eccentricity centrality and degree centrality and obtaining the text labels according to the weights specifically comprises:
for a simple connected graph G, the node set is V(G) and the edge set is E(G); for two nodes u, v ∈ V(G), the distance between them is defined as the length of the shortest path between them, denoted d_G(u, v); the eccentricity ε(v) of node v is the maximum of the distances from node v to the other nodes, and deg_G(v) denotes the degree of node v; if deg_G(v) = 1, node v is called a good connection point; the number of good connection points in the simple connected graph G is denoted ω(G), and the eccentric connectivity index, denoted ξ^c(G), is expressed as:

ξ^c(G) = Σ_{v∈V(G)} deg_G(v)·ε(v)

The total eccentricity index is defined as ζ(G), expressed as:

ζ(G) = Σ_{v∈V(G)} ε(v)

ζ(G, x) = Σ_{v∈V(G)} x^ε(v)

where ζ(G) = (ζ(G, x))' evaluated at x = 1.
Degree centrality represents the number of nodes directly connected to a node; in an undirected graph the maximum is (n - 1), while a directed graph distinguishes in-degree and out-degree; it is divided into absolute and relative forms;
the absolute forms of degree centrality, betweenness centrality and closeness centrality are respectively expressed as:

C_ADi = Σ_{j=1}^{n} a_ij = d(i)

C_ABi = Σ_{j<k} g_jk(i) / g_jk

C_APi = 1 / ( Σ_{j=1}^{n} d_ij )

where g_jk is the number of shortest paths between nodes j and k, g_jk(i) is the number of those paths passing through node i, and d_ij is the shortest-path distance between nodes i and j;
the relative forms of degree centrality, betweenness centrality and closeness centrality are respectively expressed as:

C_RDi = d(i)/(n - 1)

C_RBi = 2 C_ABi / [(n - 1)(n - 2)]

C_RPi = (n - 1) / Σ_{j=1}^{n} d_ij
and finally, allocating labels to the text data according to the centrality.
4. The method according to claim 1, wherein the image classification labeling method based on supervised machine learning in step SS3 specifically comprises:
step SS321: an image tag management step including: selecting a plurality of labels for each training image;
step SS322: the feature extraction step comprises the following steps: extracting the color, texture, shape and spatial features of the image, and describing the image features with the scale-invariant feature transform algorithm SIFT, the speeded-up robust features algorithm SURF and the histogram of oriented gradients algorithm HOG;
step SS323: the algorithm model training step comprises the following steps: generating a classification model by using a Support Vector Machine (SVM), a convolutional neural network, BOOSTING and a random forest method to obtain a classifier, and identifying a target in an image by using the classification model;
step SS324: a target detection analysis step, comprising: adding image semantic content and keywords aiming at unknown images, and giving corresponding category information to the unclassified images to finish the annotation of the images.
5. The method according to claim 4, wherein the scale invariant feature transform algorithm SIFT specifically comprises:
the building of the scale space comprises the following steps: determining and down-sampling a fuzzy scale, and blurring an image by adopting Gaussian convolution, wherein a Gaussian difference function of the image is as follows:
D(x, y, σ) = (G(x, y, kσ) - G(x, y, σ)) * I(x, y),  with G(x, y, σ) = (1/(2πσ^2)) exp(-(x^2 + y^2)/(2σ^2)) and k the scale factor between adjacent scales
in the formula, x and y represent pixel point coordinates, and sigma is a variance;
the detection of the candidate extremum in the scale space comprises the following steps: searching image positions on all scales, and identifying potential interest points which are invariable to the scales and the rotations through a Gaussian difference function;
feature point positioning, comprising: determining the position and scale at each candidate location by fitting a refined model, obtaining the ratio of the principal gradient direction to the other directions by computing the second-derivative Hessian matrix of the DoG, and keeping only local feature points whose ratio is smaller than a given value; the Hessian matrix is expressed as:

H = [ I_xx  I_xy ]
    [ I_xy  I_yy ]

where x, y represent pixel coordinates, I_xx denotes the second partial derivative in the x direction, I_yy the second partial derivative in the y direction, and I_xy the second partial derivative in the xy direction;
feature point direction assignment, comprising: assigning one or more directions to each keypoint location based on the local gradient directions of the image; all subsequent operations on the image data are performed relative to the direction, scale and location of the keypoint, thereby providing invariance to these transformations; for the same original image, the larger the scale, the larger the window; conversely, if the window size is unchanged, an image at a larger scale contains less information; the gradient magnitude m(x, y) and gradient direction θ(x, y) of each sampling point (x, y) in the window are calculated as:

m(x, y) = sqrt( (L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2 )

θ(x, y) = tan^(-1)( (L(x, y+1) - L(x, y-1)) / (L(x+1, y) - L(x-1, y)) )
the characteristic point description comprises the following steps: measuring the local gradient of the image on a selected scale in a neighborhood around each key point;
and feature point matching, comprising: matching keypoints between two images amounts to identifying the nearest neighbour, but in some cases the second-nearest neighbour is not clearly separated from the nearest, possibly because of noise or other reasons; in this case the ratio of the nearest to the second-nearest distance is taken, and if this ratio is greater than 0.8 the match is rejected, which eliminates about 90% of false matches.
6. The method for intelligent labeling of multimodal experimental data as claimed in claim 5, wherein the feature point description specifically comprises: first, the neighborhood range used to compute the feature descriptor is determined; the neighborhood around the feature point is divided into 4×4 sub-regions, each sub-region serves as a seed point, and each seed point has 8 directions; unlike the computation of the main direction of the feature point, the gradient direction histogram of each sub-region here divides 0°-360° into 8 direction ranges of 45° each, so that each seed point carries gradient intensity information in 8 directions in total.
7. The method for intelligent labeling of multimodal experimental data as claimed in claim 4, wherein the speeded up robust features algorithm SURF specifically comprises:
the building of the scale space comprises the following steps: determining the blur scale and down-sampling, blurring the image with Gaussian convolution, and constructing a Hessian matrix; for a function f(x, y), the Hessian matrix H consists of its partial derivatives:

H(f(x, y)) = [ ∂²f/∂x²    ∂²f/∂x∂y ]
             [ ∂²f/∂x∂y   ∂²f/∂y²  ]

The discriminant of the H matrix is:

det(H) = (∂²f/∂x²)(∂²f/∂y²) - (∂²f/∂x∂y)²

In SURF, using a second-order standard Gaussian function as the filter, the H matrix is obtained by computing the second-order partial derivatives through convolution with specific kernels:

H(X, t) = [ L_xx(X, t)  L_xy(X, t) ]
          [ L_xy(X, t)  L_yy(X, t) ]

L(X, t) = G(t) * I(X)

L(X, t) is the representation of the image at different resolutions, given by the convolution of a Gaussian kernel G(t) with the image function I(X); to balance the error between the exact and the approximate values, the discriminant of the H matrix becomes:

det(H_approx) = D_xx·D_yy - (0.9·D_xy)^2
detection of candidate extrema in a scale space, comprising: searching image positions on all scales, and identifying potential interest points which are invariable to the scales and the rotations through a Gaussian difference function;
feature point positioning, comprising: determining the position and scale at each candidate location by fitting a refined model, obtaining the ratio of the principal gradient direction to the other directions by computing the second-derivative Hessian matrix of the DoG, and keeping only local feature points whose ratio is smaller than a given value; the Hessian matrix is expressed as:

H = [ I_xx  I_xy ]
    [ I_xy  I_yy ]

where I_xx denotes the second partial derivative in the x direction, I_yy the second partial derivative in the y direction, and I_xy the second partial derivative in the xy direction;
feature point direction assignment, comprising: assigning one or more directions to each keypoint location based on the local gradient directions of the image; all subsequent operations on the image data are performed relative to the direction, scale and location of the keypoint, thereby providing invariance to these transformations; for the same original image, the larger the scale, the larger the window; conversely, if the window size is unchanged, an image at a larger scale contains less information; the gradient magnitude m(x, y) and gradient direction θ(x, y) of each sampling point (x, y) in the window are calculated as:

m(x, y) = sqrt( (L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2 )

θ(x, y) = tan^(-1)( (L(x, y+1) - L(x, y-1)) / (L(x+1, y) - L(x-1, y)) )
the characteristic point description comprises the following steps: measuring the local gradient of the image on a selected scale in a neighborhood around each key point;
and feature point matching, comprising: matching keypoints between two images amounts to identifying the nearest neighbour, but in some cases the second-nearest neighbour is not clearly separated from the nearest, possibly because of noise or other reasons; in this case the ratio of the nearest to the second-nearest distance is taken, and if this ratio is greater than 0.8 the match is rejected, which eliminates about 90% of false matches.
8. The method for intelligently labeling multi-modal experimental data according to claim 4, wherein the histogram of oriented gradients algorithm HOG specifically comprises:
standardizing the gamma space and color space: the whole image is normalized in order to reduce the influence of illumination; in the texture intensity of an image, local surface exposure contributes a large proportion, and because the color information contributes little, the image is generally first converted to a gray-scale map;

I(x, y) = I(x, y)^gamma

calculating the image gradient: the gradients of the image in the horizontal and vertical directions are computed, and the gradient magnitude and direction at each pixel position are computed from them; the derivative operation not only captures contours, shadows and some texture information, but also further weakens the influence of illumination; the gradient of pixel (x, y) in the image is:

G_x(x, y) = H(x+1, y) - H(x-1, y)

G_y(x, y) = H(x, y+1) - H(x, y-1)

where G_x(x, y), G_y(x, y) and H(x, y) respectively denote the horizontal gradient, the vertical gradient and the pixel value at pixel (x, y) of the input image; the gradient magnitude and gradient direction at pixel (x, y) are respectively:

G(x, y) = sqrt( G_x(x, y)^2 + G_y(x, y)^2 )

α(x, y) = tan^(-1)( G_y(x, y) / G_x(x, y) )
constructing a gradient direction histogram for each cell unit, providing a code for a local image area, and simultaneously keeping weak sensitivity to the posture and appearance of an object in an image;
the cell units are combined into a large block, the gradient histogram is normalized in the block, the variation range of the gradient intensity is very large due to the variation of local illumination and the variation of foreground-background contrast, the gradient intensity needs to be normalized, and the normalization can further compress illumination, shadow and edges;
and collecting HOG features, collecting all overlapped blocks in the detection window with the HOG features, and combining the overlapped blocks into a final feature vector for classification.
9. The method for intelligent labeling of multi-modal experimental data as claimed in claim 4, wherein the use of the support vector machine SVM in the step SS323 comprises: input: training data set T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i ∈ X = R^n, y_i ∈ {+1, -1}, i = 1, 2, ..., N;
output: the separating hyperplane and the classification decision function;
a suitable kernel function K(x, z) and a penalty parameter C > 0 are selected, and the following convex quadratic programming problem is constructed and solved:

min over α:  (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j K(x_i, x_j) - Σ_{i=1}^{N} α_i

s.t.  Σ_{i=1}^{N} α_i y_i = 0,  0 ≤ α_i ≤ C,  i = 1, 2, ..., N

giving the optimal solution α* = (α_1*, α_2*, ..., α_N*)^T.
10. The intelligent labeling method for multi-modal experimental data according to claim 1, wherein the video labeling method based on machine learning in the step SS3 specifically comprises:
assume there are M modalities; sample x_i is expressed in the M modalities as x_i^(1), x_i^(2), ..., x_i^(M);
assume there are D distance measures d_1(·,·), d_2(·,·), ..., d_D(·,·); then M × D graphs can be generated from the M modalities and the D distance measures, where W_(m-1)×D+k,ij denotes the weight between samples i and j in the graph generated from the m-th modality and the k-th distance metric;
for temporal continuity, C graphs can also be constructed; here two graphs are considered, i.e. C = 2; the first graph considers the relationship between every two adjacent samples, i.e. every sample is considered highly likely to have the same concept as its adjacent samples, which can be expressed as W_ij = 1 if |i - j| = 1 and W_ij = 0 otherwise;
the other graph considers the relationship between each sample and its 6 adjacent samples, with the weight determined by the position of the sample relative to the current one.
CN202210723506.XA 2022-06-24 2022-06-24 Intelligent labeling method for multi-modal test data Pending CN115203408A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210723506.XA CN115203408A (en) 2022-06-24 2022-06-24 Intelligent labeling method for multi-modal test data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210723506.XA CN115203408A (en) 2022-06-24 2022-06-24 Intelligent labeling method for multi-modal test data

Publications (1)

Publication Number Publication Date
CN115203408A true CN115203408A (en) 2022-10-18

Family

ID=83577331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210723506.XA Pending CN115203408A (en) 2022-06-24 2022-06-24 Intelligent labeling method for multi-modal test data

Country Status (1)

Country Link
CN (1) CN115203408A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102222101A (en) * 2011-06-22 2011-10-19 北方工业大学 Method for video semantic mining
CN110929729A (en) * 2020-02-18 2020-03-27 北京海天瑞声科技股份有限公司 Image annotation method, image annotation device and computer storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WILLYAN D. ABILHOA et al.: "A keyword extraction method from twitter messages represented as graphs", Applied Mathematics and Computation, vol. 240, pages 308-325, XP028848232, DOI: 10.1016/j.amc.2014.04.090 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115599966A (en) * 2022-12-15 2023-01-13 杭州欧若数网科技有限公司(Cn) Data locality measurement method and system for distributed graph data
CN116939320A (en) * 2023-06-12 2023-10-24 南京邮电大学 Method for generating multimode mutually-friendly enhanced video semantic communication

Similar Documents

Publication Publication Date Title
EP2805262B1 (en) Image index generation based on similarities of image features
Tarawneh et al. Invoice classification using deep features and machine learning techniques
CN108664996A (en) A kind of ancient writing recognition methods and system based on deep learning
WO2015096565A1 (en) Method and device for identifying target object in image
CN106055573B (en) Shoe print image retrieval method and system under multi-instance learning framework
CN104778242A (en) Hand-drawn sketch image retrieval method and system on basis of image dynamic partitioning
CN115203408A (en) Intelligent labeling method for multi-modal test data
CN106909895B (en) Gesture recognition method based on random projection multi-kernel learning
US11600088B2 (en) Utilizing machine learning and image filtering techniques to detect and analyze handwritten text
CN110737788B (en) Rapid three-dimensional model index establishing and retrieving method
CN110008365B (en) Image processing method, device and equipment and readable storage medium
CN114287005A (en) Negative sampling algorithm for enhancing image classification
CN113343920A (en) Method and device for classifying face recognition photos, electronic equipment and storage medium
Chakraborty et al. Application of daisy descriptor for language identification in the wild
Pengcheng et al. Fast Chinese calligraphic character recognition with large-scale data
Naiemi et al. Scene text detection using enhanced extremal region and convolutional neural network
Úbeda et al. Pattern spotting in historical documents using convolutional models
CN117076455A (en) Intelligent identification-based policy structured storage method, medium and system
CN105844299B (en) A kind of image classification method based on bag of words
Devareddi et al. An edge clustered segmentation based model for precise image retrieval
Zanwar et al. A comprehensive survey on soft computing based optical character recognition techniques
Chen et al. Trademark image retrieval system based on SIFT algorithm
Devareddi et al. Interlinked feature query-based image retrieval model for content-based image retrieval
Sousa et al. Word indexing of ancient documents using fuzzy classification
Rao et al. Region division for large-scale image retrieval

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20221018