
CN115203408A - Intelligent labeling method for multi-modal test data - Google Patents

Intelligent labeling method for multi-modal test data

Info

Publication number
CN115203408A
CN115203408A
Authority
CN
China
Prior art keywords
image
data
labeling
gradient
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210723506.XA
Other languages
Chinese (zh)
Inventor
张骁雄
周晓磊
范强
严浩
王芳潇
江春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210723506.XA priority Critical patent/CN115203408A/en
Publication of CN115203408A publication Critical patent/CN115203408A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/35: Clustering; Classification (of unstructured textual data, G06F16/30)
    • G06F16/55: Clustering; Classification (of still image data, G06F16/50)
    • G06F16/75: Clustering; Classification (of video data, G06F16/70)


Abstract

The invention discloses an intelligent labeling method for multi-modal test data, belongs to the technical field of computer applications, and aims to label text, image and video data. For text, intelligent labeling is realized with an unsupervised complex-network technique: word vector weights are assigned according to eccentricity centrality and degree centrality, and labels are then given accordingly. For images, a labeling model is built with a supervised machine learning method: the image is segmented, noise and over-segmentation are filtered out, the segmented images are classified with each semantic concept as a category, and the classifier is adjusted based on feedback information to improve classification precision. For videos, labeling is performed with machine learning: the video frames serve as segmentation points, the video is converted into pictures, content analysis is carried out, valid data are extracted and stored, and the pictures are then analyzed and labeled. By introducing suitable intelligent learning methods for each data format, the invention effectively improves the efficiency and precision of test data labeling.

Description

Intelligent labeling method for multi-modal test data
Technical Field
The invention relates to an intelligent labeling method and system for multi-modal test data, and belongs to the technical field of data processing.
Background
Text labeling technology can help product test data collectors quickly screen effective information, and can support text processing technologies such as text classification, text retrieval and automatic summarization. Document data such as technical materials and official documents are labeled, and key information such as contracts, product models and manufacturers is extracted. Text labeling mainly completes the acquisition and structured processing of basic document information, and the labeling that follows sentence segmentation, paragraph segmentation and word segmentation of the text content.
The historical test data include a large number of images and videos; by labeling the image and video data, key information such as document titles, subject terms, abstracts, indexes, image entities, scenes, places and persons can be extracted. Images of known types, already collected and organized with labels, form a training library; a model is trained on this library with a convolutional neural network, and the model is then used to classify and label unknown images. The results are stored in a structured form and entered into a database for retrieval and other business operations.
In the prior art, a model of each semantic concept is built by learning from manually labeled training video data, and the model is then used to classify unlabeled video data sets and assign the corresponding semantic concepts. Because machine learning theory is relatively mature, this is generally considered an appropriate approach to the video annotation problem, and current video annotation mainly relies on machine learning theory to improve annotation accuracy.
Disclosure of Invention
The invention aims to overcome the technical defects in the prior art and to solve the labeling of multiple classes of product test data. It provides an intelligent labeling method for multi-modal test data that realizes intelligent labeling of text, image and video data and, through an unsupervised complex-network technique, a supervised machine learning method and a machine learning method, effectively improves the efficiency and accuracy of intelligent labeling of text, images and videos.
The invention specifically adopts the following technical scheme: an intelligent labeling method for multi-modal test data comprises the following steps:
step SS1: inputting test data, wherein the test data comprises text data, image data and video data;
step SS2: executing an unsupervised text labeling method based on a complex network based on the text data; based on the image data, executing an image classification labeling method based on supervised machine learning; executing a video labeling method based on machine learning based on the video data;
and step SS3: performing word segmentation, stop-word filtering and other text preprocessing on the text data with the unsupervised complex-network-based text labeling method to obtain a preliminary word segmentation result, completing the semantic mapping of node words, assigning weights to the word components of the segmentation result by combining eccentricity centrality and degree centrality, obtaining the labels of the text data according to the weights, and completing the labeling of the text data; identifying targets in the image data with the image classification labeling method based on supervised machine learning, adding image semantic content and keywords for unknown images, and giving corresponding category information to unclassified images to finish the labeling of the image data; extracting key frames from the video data with the video labeling method based on machine learning, and extracting the entity, scene, place and person features of the key-frame images with an OCR character recognition method to complete the recognition of text content; identifying typical image data with an image target recognition algorithm and marking it in the image, displaying the recognized marks under intelligent feature extraction, where clicking a mark positions it to the specified area in the image;
and step SS4: summarizing the labeling result of the text data, the labeling result of the image data and the labeling result of the video data obtained in the step SS3 to generate a test data labeling result set.
As a preferred embodiment, the step SS3 specifically includes: the complex network is composed of points and edges; the minimum unit capable of representing complete semantic information in text data is a sentence, so sentences are represented by nodes and the structural feature analysis of the text is carried out in units of sentences; the edge definition principle is that if two sentences have a common noun an edge is generated to connect them, otherwise no edge is generated; if two sentences in the network share an edge, i.e. have a common noun, they may describe the same topic or convey supplementary material on the same topic, and although the two sentences may contain repeated redundant information, their contents are the most closely related; a complex network is constructed through the common-noun relationship between sentences, finally yielding the complex network of the text data; after preprocessing, the nouns generated by each sentence in the text data are mapped into the complex network of the text data, and two matrices A and W are defined according to the concepts of an adjacency matrix and N-order matrix weights, where N is the number of nodes or sentences, the matrix A represents the edge relationships between sentences, and the matrix W represents the sentence weights; in the A matrix, a_ij = a_ji = 1 if there is an edge between node i and node j, and 0 otherwise; in the W matrix, the edge weight w_ij = w_ji is the number of times a common word appears in node i and node j.
As a preferred embodiment, the step SS3 of assigning weights to the word components of the word segmentation result by combining eccentricity centrality and degree centrality and obtaining the text labels according to the weights specifically includes:
for a simple connected graph G, the node set is V(G) and the edge set is E(G); for two nodes u, v ∈ V(G), the distance between them is defined as the length of the shortest path between them, denoted d_G(u, v); the eccentricity ε(v) of node v is the maximum of the distances from node v to the other nodes, and deg_G(v) denotes the degree of node v; if deg_G(v) = 1, node v is called a good connection point; the number of good connection points in the simple connected graph G is denoted ω(G), and the eccentric connectivity index, denoted ξ^c(G), is expressed as:

ξ^c(G) = Σ_{v∈V(G)} deg_G(v)·ε(v)

The total eccentricity index is defined as ζ(G), expressed as:

ζ(G) = Σ_{v∈V(G)} ε(v)

ζ(G, x) = Σ_{v∈V(G)} x^ε(v)

where ζ(G) = (ζ(G, x))' evaluated at x = 1.
The dot degree centrality represents the number of nodes directly connected with the node, an undirected graph is (n-1), a directed graph is (in-degree and out-degree), and the dot degree centrality is divided into absolute and relative;
the absolute point degree centrality of degree centrality, intermediate centrality, and approximate centrality are respectively expressed as:
Figure BDA0003712514400000043
Figure BDA0003712514400000044
Figure BDA0003712514400000045
the degree centrality, the middle centrality, and the relative point centrality of the approximate centrality are respectively expressed as:
C RDi =d(i)/(n-1)
C RBi =2C ABi /[(n-1)(n-2)]
Figure BDA0003712514400000046
and finally, allocating labels to the text data according to the centrality.
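As a quick illustration of the relative centrality formulas above, the following sketch (not part of the patent) computes relative degree, betweenness and closeness centrality on a small undirected graph with networkx; the 5-node path graph is an arbitrary example, and networkx's normalized betweenness matches the 2·C_ABi/[(n-1)(n-2)] form given above.

```python
# Illustrative check of the relative centrality formulas on a toy graph.
import networkx as nx

G = nx.path_graph(5)            # 5 nodes in a line: 0-1-2-3-4
n = G.number_of_nodes()

bc = nx.betweenness_centrality(G)   # normalized: 2*C_AB / ((n-1)(n-2)) for undirected graphs
cc = nx.closeness_centrality(G)     # relative closeness: (n-1) / sum of distances

for i in G.nodes:
    c_rd = G.degree(i) / (n - 1)    # relative degree centrality C_RD = d(i)/(n-1)
    print(i, round(c_rd, 3), round(bc[i], 3), round(cc[i], 3))
```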
As a preferred embodiment, the image classification labeling method based on supervised machine learning in step SS3 specifically includes:
step SS321: an image tag management step including: selecting a plurality of labels for each training image;
step SS322: the feature extraction step, comprising: extracting the color, texture, shape and spatial features of the image, and describing the image features with the scale-invariant feature transform algorithm SIFT, the speeded-up robust features algorithm SURF and the histogram of oriented gradients algorithm HOG;
step SS323: training an algorithm model, comprising the following steps: generating a classification model by using a Support Vector Machine (SVM), a convolutional neural network, BOOSTING and a random forest method to obtain a classifier, and identifying a target in an image by using the classification model;
step SS324: a target detection analysis step, comprising: adding image semantic content and keywords aiming at unknown images, giving corresponding category information to the unclassified images, and completing the labeling of the images.
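A minimal sketch of steps SS322-SS324 is shown below, assuming HOG features from scikit-image and an SVM classifier from scikit-learn; the random training images and the label names are placeholders rather than data from the patent.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def hog_features(gray_image):
    """Step SS322: extract a HOG descriptor from a grayscale image."""
    return hog(gray_image, orientations=8, pixels_per_cell=(16, 16),
               cells_per_block=(2, 2))

# placeholder training data: ten random 64x64 "images" with two labels
rng = np.random.default_rng(0)
train_imgs = rng.random((10, 64, 64))
train_labels = ["target"] * 5 + ["background"] * 5

X = np.array([hog_features(img) for img in train_imgs])
clf = SVC(kernel="rbf", C=1.0).fit(X, train_labels)        # step SS323: train the classifier

# step SS324: give category information to an unknown image
unknown = rng.random((64, 64))
print(clf.predict([hog_features(unknown)]))
```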
As a preferred embodiment, the scale invariant feature transform algorithm SIFT specifically includes:
the building of the scale space comprises the following steps: determining and down-sampling a fuzzy scale, and blurring an image by adopting Gaussian convolution, wherein a Gaussian difference function of the image is as follows:
D(x, y, σ) = (G(x, y, kσ) - G(x, y, σ)) * I(x, y),  with G(x, y, σ) = (1/(2πσ^2)) exp(-(x^2 + y^2)/(2σ^2)) and k the scale factor between adjacent scales
in the formula, x and y represent pixel coordinates, and σ is a variance.
The detection of the candidate extremum in the scale space comprises the following steps: searching image positions on all scales, and identifying potential interest points which are invariable in scale and rotation through a Gaussian difference function;
feature point positioning, comprising: determining the position and scale at each candidate location by fitting a refined model, obtaining the ratio of the principal gradient direction to the other directions by computing the second-derivative Hessian matrix of the DoG, and keeping only local feature points whose ratio is smaller than a given value; the Hessian matrix is expressed as:

H = [ I_xx  I_xy ]
    [ I_xy  I_yy ]

where I_xx denotes the second partial derivative in the x direction, I_yy the second partial derivative in the y direction, and I_xy the second partial derivative in the xy direction;
feature point direction assignment, comprising: assigning one or more directions to each keypoint location based on the local gradient directions of the image; all subsequent operations on the image data are performed relative to the direction, scale and location of the keypoint, thereby providing invariance to these transformations; for the same original image, the larger the scale, the larger the window; conversely, if the window size is unchanged, an image at a larger scale contains less information; the gradient magnitude m(x, y) and gradient direction θ(x, y) of each sampling point (x, y) in the window are calculated as:

m(x, y) = sqrt( (L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2 )

θ(x, y) = tan^(-1)( (L(x, y+1) - L(x, y-1)) / (L(x+1, y) - L(x-1, y)) )
the characteristic point description comprises the following steps: measuring the local gradient of the image on a selected scale in a neighborhood around each key point;
and feature point matching, comprising: matching keypoints between two images amounts to identifying the nearest neighbour, but in some cases the second-nearest neighbour is not clearly separated from the nearest, possibly because of noise or other reasons; in this case the ratio of the nearest to the second-nearest distance is taken, and if this ratio is greater than 0.8 the match is rejected, which eliminates about 90% of false matches.
As a preferred embodiment, the feature point description specifically includes: first, the neighborhood range used to compute the feature descriptor is determined; the neighborhood around the feature point is divided into 4×4 sub-regions, each sub-region serves as a seed point, and each seed point has 8 directions; unlike the computation of the main direction of the feature point, the gradient direction histogram of each sub-region here divides 0°-360° into 8 direction ranges of 45° each, so that each seed point carries gradient intensity information in 8 directions.
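A hedged sketch of SIFT keypoint detection and the 0.8 ratio test described above is given below, using OpenCV; the image file names are placeholders.

```python
import cv2

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)       # placeholder path
img2 = cv2.imread("reference.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder path

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# nearest and second-nearest neighbour for every descriptor
matcher = cv2.BFMatcher()
knn = matcher.knnMatch(des1, des2, k=2)

# keep a match only when the nearest/second-nearest distance ratio is below 0.8
good = [m for m, n in knn if m.distance < 0.8 * n.distance]
print(len(good), "matches kept after the ratio test")
```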
As a preferred embodiment, the speeded up robust features algorithm SURF specifically includes:
the building of the scale space comprises the following steps: determining the blur scale and down-sampling, blurring the image with Gaussian convolution, and constructing a Hessian matrix; for a function f(x, y), the Hessian matrix H consists of its partial derivatives:

H(f(x, y)) = [ ∂²f/∂x²    ∂²f/∂x∂y ]
             [ ∂²f/∂x∂y   ∂²f/∂y²  ]

The discriminant of the H matrix is:

det(H) = (∂²f/∂x²)(∂²f/∂y²) - (∂²f/∂x∂y)²

In SURF, using a second-order standard Gaussian function as the filter, the H matrix is obtained by computing the second-order partial derivatives through convolution with specific kernels:

H(X, t) = [ L_xx(X, t)  L_xy(X, t) ]
          [ L_xy(X, t)  L_yy(X, t) ]

L(X, t) = G(t) * I(X)

L(X, t) is the representation of the image at different resolutions, given by the convolution of a Gaussian kernel G(t) with the image function I(X); to balance the error between the exact and the approximate values, the discriminant of the H matrix becomes:

det(H_approx) = D_xx·D_yy - (0.9·D_xy)^2
the detection of the candidate extremum in the scale space comprises the following steps: searching image positions on all scales, and identifying potential interest points which are invariable in scale and rotation through a Gaussian difference function;
feature point positioning, comprising: determining the position and scale at each candidate location by fitting a refined model, obtaining the ratio of the principal gradient direction to the other directions by computing the second-derivative Hessian matrix of the DoG, and keeping only local feature points whose ratio is smaller than a given value; the Hessian matrix is expressed as:

H = [ I_xx  I_xy ]
    [ I_xy  I_yy ]

where I_xx denotes the second partial derivative in the x direction, I_yy the second partial derivative in the y direction, and I_xy the second partial derivative in the xy direction;
feature point direction assignment, comprising: assigning one or more directions to each keypoint location based on the local gradient directions of the image; all subsequent operations on the image data are performed relative to the direction, scale and location of the keypoint, thereby providing invariance to these transformations; for the same original image, the larger the scale, the larger the window; conversely, if the window size is unchanged, an image at a larger scale contains less information; the gradient magnitude m(x, y) and gradient direction θ(x, y) of each sampling point (x, y) in the window are calculated as:

m(x, y) = sqrt( (L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2 )

θ(x, y) = tan^(-1)( (L(x, y+1) - L(x, y-1)) / (L(x+1, y) - L(x-1, y)) )
the characteristic point description comprises the following steps: measuring the local gradient of the image on a selected scale in a neighborhood around each key point;
and feature point matching, comprising: matching keypoints between two images amounts to identifying the nearest neighbour, but in some cases the second-nearest neighbour is not clearly separated from the nearest, possibly because of noise or other reasons; in this case the ratio of the nearest to the second-nearest distance is taken, and if this ratio is greater than 0.8 the match is rejected, which eliminates about 90% of false matches.
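The sketch below is a small numerical illustration of the approximated determinant det(H_approx) = D_xx·D_yy - (0.9·D_xy)^2 on a Gaussian-smoothed test image; the finite-difference derivatives, the random image and the threshold are illustrative choices, not the SURF box-filter implementation itself.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(1)
image = gaussian_filter(rng.random((128, 128)), sigma=2.0)  # smoothed test image

# second-order derivatives approximated by finite differences
Dx = np.gradient(image, axis=1)
Dy = np.gradient(image, axis=0)
Dxx = np.gradient(Dx, axis=1)
Dyy = np.gradient(Dy, axis=0)
Dxy = np.gradient(Dx, axis=0)

det_approx = Dxx * Dyy - (0.9 * Dxy) ** 2
# count strong responses as candidate interest points
threshold = det_approx.mean() + 3 * det_approx.std()
print(int((det_approx > threshold).sum()), "candidate points")
```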
As a preferred embodiment, the histogram of oriented gradients algorithm HOG specifically includes:
Standardization of the gamma space and color space: the whole image is normalized in order to reduce the influence of illumination; in the texture intensity of an image, local surface exposure contributes a large proportion, so this compression effectively reduces local shadows and illumination changes in the image; because the color information contributes little, the image is generally first converted to a gray-scale map;

I(x, y) = I(x, y)^gamma

Image gradient calculation: the gradients of the image in the horizontal and vertical directions are computed, and the gradient magnitude and direction at each pixel position are computed from them; the derivative operation not only captures contours, shadows and some texture information, but also further weakens the influence of illumination. The gradient of pixel (x, y) in the image is:

G_x(x, y) = H(x+1, y) - H(x-1, y)

G_y(x, y) = H(x, y+1) - H(x, y-1)

where G_x(x, y), G_y(x, y) and H(x, y) respectively denote the horizontal gradient, the vertical gradient and the pixel value at pixel (x, y) of the input image; the gradient magnitude and gradient direction at pixel (x, y) are respectively:

G(x, y) = sqrt( G_x(x, y)^2 + G_y(x, y)^2 )

α(x, y) = tan^(-1)( G_y(x, y) / G_x(x, y) )
Constructing a gradient direction histogram for each cell unit provides an encoding of the local image region while maintaining weak sensitivity to the pose and appearance of objects in the image;
The cell units are grouped into larger blocks and the gradient histograms are normalized within each block; because local illumination changes and changes in foreground-background contrast make the range of gradient intensities very large, the gradient intensity must be normalized, and normalization further compresses illumination, shadows and edges;
and collecting HOG features, collecting all overlapped blocks in the detection window with the HOG features, and combining the overlapped blocks into a final feature vector for classification.
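A minimal sketch of the gradient computation above follows: the horizontal and vertical differences G_x and G_y, the per-pixel magnitude and orientation, and an 8-bin orientation histogram for one cell; the cell size, bin count and normalization constant are illustrative.

```python
import numpy as np

def cell_histogram(gray, bins=8):
    H = gray.astype(float)
    Gx = np.zeros_like(H)
    Gy = np.zeros_like(H)
    Gx[:, 1:-1] = H[:, 2:] - H[:, :-2]   # G_x(x, y) = H(x+1, y) - H(x-1, y)
    Gy[1:-1, :] = H[2:, :] - H[:-2, :]   # G_y(x, y) = H(x, y+1) - H(x, y-1)

    magnitude = np.sqrt(Gx ** 2 + Gy ** 2)
    angle = np.degrees(np.arctan2(Gy, Gx)) % 360.0

    # orientation histogram weighted by gradient magnitude, then L2-normalized
    hist, _ = np.histogram(angle, bins=bins, range=(0.0, 360.0), weights=magnitude)
    return hist / (np.linalg.norm(hist) + 1e-6)

cell = np.random.default_rng(2).random((8, 8)) * 255
print(cell_histogram(cell))
```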
As a preferred embodiment, the use of the support vector machine SVM in step SS323 includes: input: training data set T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i ∈ X = R^n, y_i ∈ {+1, -1}, i = 1, 2, ..., N;
output: the separating hyperplane and the classification decision function;
a suitable kernel function K(x, z) and a penalty parameter C > 0 are selected, and the following convex quadratic programming problem is constructed and solved:

min over α:  (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j K(x_i, x_j) - Σ_{i=1}^{N} α_i

s.t.  Σ_{i=1}^{N} α_i y_i = 0,  0 ≤ α_i ≤ C,  i = 1, 2, ..., N

giving the optimal solution α* = (α_1*, α_2*, ..., α_N*)^T.
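As a rough illustration of this training step, the sketch below fits an SVM with an RBF kernel K(x, z) and penalty parameter C on toy two-class data using scikit-learn, then reads back the dual solution (scikit-learn exposes alpha_i * y_i as dual_coef_); the data and parameter values are illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="rbf", C=1.0).fit(X, y)     # solves the dual problem internally

print("number of support vectors:", clf.support_.shape[0])
print("alpha_i * y_i (first five):", clf.dual_coef_[0][:5])
print("decision value for a new point:", clf.decision_function([[1.5, 1.5]]))
```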
As a preferred embodiment, the method for video annotation based on machine learning in step SS3 specifically includes:
Assume there are M modalities; sample x_i is expressed in the M modalities as x_i^(1), x_i^(2), ..., x_i^(M).
Assume there are D distance measures d_1(·,·), d_2(·,·), ..., d_D(·,·); then M × D graphs can be generated from the M modalities and the D distance measures, where W_(m-1)×D+k,ij denotes the weight between samples i and j in the graph generated from the m-th modality and the k-th distance metric.
For time continuity, C-graphs can also be constructed; here we consider two graphs, i.e. C =2; the first graph considers the relationship between every two adjacent samples, i.e. it considers that there is a high probability that every sample has the same concept as its adjacent samples, which is expressed as:
Figure BDA0003712514400000101
the other diagram considers the relationship between each sample and its adjacent 6 samples, and the weight is determined according to the position of this sample, and the concrete form is:
Figure BDA0003712514400000102
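The sketch below illustrates the graph construction just described: M modality feature sets and D distance measures give M × D similarity graphs, plus one temporal-continuity graph linking adjacent samples; turning a distance into an edge weight with a Gaussian kernel (and the sigma value) is an assumption of this sketch, not something specified in the text.

```python
import numpy as np

def build_graphs(modality_features, distances, sigma=1.0):
    """modality_features: list of (n, d_m) arrays; distances: list of callables."""
    n = modality_features[0].shape[0]
    graphs = []
    for Xm in modality_features:              # m = 1..M
        for dist in distances:                # k = 1..D
            W = np.zeros((n, n))
            for i in range(n):
                for j in range(n):
                    W[i, j] = np.exp(-dist(Xm[i], Xm[j]) / sigma)
            graphs.append(W)
    # temporal-continuity graph: adjacent samples are connected
    T = np.zeros((n, n))
    for i in range(n - 1):
        T[i, i + 1] = T[i + 1, i] = 1.0
    graphs.append(T)
    return graphs

rng = np.random.default_rng(4)
feats = [rng.random((5, 3)), rng.random((5, 2))]              # M = 2 modalities
dists = [lambda a, b: np.linalg.norm(a - b),                  # Euclidean
         lambda a, b: float(np.abs(a - b).sum())]             # Manhattan, D = 2
print(len(build_graphs(feats, dists)))   # M*D + 1 = 5 graphs
```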
compared with the prior art, the invention has the following beneficial effects: the invention discloses an intelligent multi-mode test data labeling method which is mainly used for solving the labeling of multi-type data of a product test. The method comprises three parts, namely intelligent text labeling based on a complex network unsupervised technology, intelligent image labeling based on supervised machine learning and intelligent video labeling based on machine learning. The method can be used for carrying out quick and accurate intelligent marking on the data according to the data characteristics of texts, images and videos, and greatly reduces the marking time cost of the product test data.
Drawings
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a flowchart of intelligent text labeling according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating intelligent extraction of text content features according to the present invention;
FIG. 4 is a schematic diagram of the annotation data management of the present invention;
FIG. 5 is a process for automatically labeling text content according to the present invention;
FIG. 6 is a flowchart illustrating intelligent labeling of images according to an embodiment of the present invention;
fig. 7 is a flowchart illustrating intelligent video annotation according to an embodiment of the present invention.
Detailed Description
As shown in fig. 1 to 7, the invention provides an intelligent labeling method for multi-modal test data, comprising the following steps:
step SS1: inputting test data, wherein the test data comprises text data, image data and video data;
step SS2: executing an unsupervised text labeling method based on a complex network based on the text data; executing an image classification labeling method based on supervised machine learning based on the image data; executing a video annotation method based on machine learning based on the video data;
and step SS3: performing word segmentation, stop-word filtering and other text preprocessing on the text data with the unsupervised complex-network-based text labeling method to obtain a preliminary word segmentation result, completing the semantic mapping of node words, assigning weights to the word components of the segmentation result by combining eccentricity centrality and degree centrality, obtaining the labels of the text data according to the weights, and completing the labeling of the text data; identifying targets in the image data with the image classification labeling method based on supervised machine learning, adding image semantic content and keywords for unknown images, and giving corresponding category information to unclassified images to finish the labeling of the image data; extracting key frames from the video data with the video labeling method based on machine learning, and extracting the entity, scene, place and person features of the key-frame images with an OCR character recognition method to complete the recognition of text content; identifying typical image data with an image target recognition algorithm and marking it in the image, displaying the recognized marks under intelligent feature extraction, where clicking a mark positions it to the specified area in the image;
and step SS4: summarizing the labeling result of the text data, the labeling result of the image data and the labeling result of the video data obtained in the step SS3 to generate a test data labeling result set.
(1) The text intelligent labeling based on the complex network unsupervised technology comprises the following specific steps:
a. text pre-processing
The flow of text preprocessing generally comprises the following steps: obtaining original text, segmenting words, cleaning the text, standardizing, extracting features, modeling and the like.
(1) The original text is the text data of the product test to be processed;
(2) Word segmentation: both Chinese and English are processed on the basis of words, but the segmentation approach differs because of the characteristics of each language. In most cases English words can be separated directly by spaces, whereas Chinese grammar is more complex, so a third-party library such as jieba is used for word segmentation.
(3) Text cleaning: unnecessary punctuation, stop words and the like are removed, and cleaning is performed in steps: punctuation removal, conversion of English to lower case, number normalization, removal of stop words and low-frequency words, and removal of unnecessary tags.
(4) Normalization, word-shape reduction (Lemmatization) and stem extraction (Stemming).
(5) And (4) feature extraction, namely extracting text features by adopting TF-IDF, word2Vec, countVectorizer and other modes.
(6) In an actual working scenario, some other processing may be needed; for example, because users often make spelling mistakes in practice, spelling correction (spell correction) and the like may be required.
The invention selects the Chinese Lexical Analysis System ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System), developed by the Institute of Computing Technology of the Chinese Academy of Sciences, as the tool for automatic word segmentation of the text; the system not only supports Chinese word segmentation and part-of-speech tagging, but also provides keyword recognition, support for user-defined dictionaries and other functions.
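A minimal sketch of this preprocessing chain is shown below; it substitutes the open-source jieba segmenter for ICTCLAS, uses a tiny illustrative stop-word list, and extracts TF-IDF features with scikit-learn, so the library choices and the placeholder texts are assumptions rather than the patent's exact tooling.

```python
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

STOP_WORDS = {"的", "了", "和", "是", "在"}   # tiny illustrative stop-word list

def preprocess(doc: str) -> str:
    """Segment a Chinese document into words and drop stop words."""
    tokens = [w for w in jieba.cut(doc) if w.strip() and w not in STOP_WORDS]
    return " ".join(tokens)

docs = ["某型产品试验大纲的文本内容", "试验结果分析报告的文本内容"]   # placeholder test texts
corpus = [preprocess(d) for d in docs]

vectorizer = TfidfVectorizer()               # TF-IDF feature extraction (step 5 above)
tfidf = vectorizer.fit_transform(corpus)
print(tfidf.shape)
```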
b. Complex network construction
Since a complex network is composed of points and edges, and the minimum unit capable of representing complete semantic information in a text is a sentence, sentences are represented by nodes and the structural feature analysis of the text is carried out in units of sentences, which is reliable. An edge is defined by generating an edge between two sentences if they have a common noun, and generating no edge otherwise. If two sentences in the network share an edge, i.e. have a common noun, they may describe the same topic or convey supplementary material on the same topic, and although the two sentences may contain redundant information, their contents are the most closely related. A complex network is constructed through the common-noun relationship between sentences, finally yielding the text complex network.
After preprocessing, the nouns generated by each sentence in the text are mapped into the network. Two matrices A and W are defined according to the concept of the adjacency matrix and N-order matrix weights (N is the number of nodes or sentences); the A matrix represents the edge relationships between sentences and the W matrix represents the sentence weights. In the A matrix, a_ij = a_ji = 1 if there is an edge between node i and node j, and 0 otherwise. In the W matrix, the edge weight w_ij = w_ji is the number of times a common word appears in node i and node j.
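The sketch below builds the A and W matrices just described from per-sentence noun sets; the toy sentences are placeholders, and the assumption is simply that the nouns of each sentence have already been extracted during preprocessing.

```python
import numpy as np

def build_network(sentence_nouns):
    """sentence_nouns: list of sets, one set of nouns per sentence."""
    n = len(sentence_nouns)
    A = np.zeros((n, n), dtype=int)   # adjacency: edge if two sentences share a noun
    W = np.zeros((n, n), dtype=int)   # weight: number of shared nouns
    for i in range(n):
        for j in range(i + 1, n):
            shared = sentence_nouns[i] & sentence_nouns[j]
            if shared:
                A[i, j] = A[j, i] = 1
                W[i, j] = W[j, i] = len(shared)
    return A, W

A, W = build_network([{"产品", "试验"}, {"试验", "数据"}, {"标注"}])
print(A)
print(W)
```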
c. Word weight calculation: the weights of the word components are determined by computing eccentricity centrality and degree centrality, and the labels of the text are assigned according to the weights.
For a simple connected graph G, the vertex set is V(G) and the edge set is E(G). For two vertices u, v ∈ V(G), the distance between them is defined as the length of the shortest path between them, denoted d_G(u, v). The eccentricity ε(v) of vertex v is the maximum of the distances from v to the other vertices. deg_G(v) denotes the degree of vertex v; if deg_G(v) = 1, v is called a good connection point. The number of good connection points in graph G is denoted ω(G). The eccentric connectivity index is denoted ξ^c(G) and expressed as

ξ^c(G) = Σ_{v∈V(G)} deg_G(v)·ε(v)

The total eccentricity index is defined as ζ(G), expressed as

ζ(G) = Σ_{v∈V(G)} ε(v)

ζ(G, x) = Σ_{v∈V(G)} x^ε(v)

where ζ(G) = (ζ(G, x))' evaluated at x = 1.
Degree centrality represents the number of points directly connected to a point; in an undirected graph the maximum is (n - 1), while a directed graph distinguishes in-degree and out-degree; it is divided into absolute and relative forms.
The absolute forms of degree centrality, betweenness centrality and closeness centrality may be expressed as:

C_ADi = Σ_{j=1}^{n} a_ij = d(i)

C_ABi = Σ_{j<k} g_jk(i) / g_jk

C_APi = 1 / ( Σ_{j=1}^{n} d_ij )

where g_jk is the number of shortest paths between points j and k, g_jk(i) is the number of those paths passing through point i, and d_ij is the shortest-path distance between points i and j.
The relative forms of degree centrality, betweenness centrality and closeness centrality may be expressed as:

C_RDi = d(i)/(n - 1)

C_RBi = 2 C_ABi / [(n - 1)(n - 2)]

C_RPi = (n - 1) / Σ_{j=1}^{n} d_ij
finally, labels are allocated to the texts according to the centrality.
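A hedged sketch of this weighting step is given below: eccentricity and degree centrality are computed with networkx on the sentence graph, combined into a single score, and the highest-scoring nodes supply the labels; the equal 0.5/0.5 weighting of the two centralities is an illustrative choice, not taken from the patent.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (2, 3), (1, 3)])   # toy sentence graph

ecc = nx.eccentricity(G)            # eccentricity per node (graph must be connected)
deg = nx.degree_centrality(G)       # relative degree centrality d(i)/(n-1)

# combine the two centralities into one weight per node; smaller eccentricity scores higher
score = {v: 0.5 * deg[v] + 0.5 / ecc[v] for v in G.nodes}
labels = sorted(score, key=score.get, reverse=True)[:2]
print("label-bearing nodes:", labels)
```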
(2) The intelligent image annotation based on supervised machine learning comprises the following steps.
a. Image label management:
in order to balance the problem of image labels, the invention selects a plurality of labels for each training image by utilizing cooperative labeling, interactive labeling, semi-supervised labeling and supervised labeling.
b. And image preprocessing, which mainly performs graying, filtering and denoising, color space transformation and the like.
c. Extracting image color, texture and shape features by using the technologies of Scale Invariant Feature Transform (SIFT), speeded Up Robust Feature (SURF), histogram of Oriented Gradients (HOG) and the like;
d. combining the three extracted features, clustering them with the affinity propagation (neighbor propagation) method, and finding the respective cluster centers (see the sketch after this list);
e. taking each clustering center as a seed point, and performing region growth;
f. merging the results after the region growth to obtain a primary segmentation result;
g. calculating the similarity of adjacent segmentation areas;
k. determining whether the similarity is less than the preset similarity threshold: if so, merging the adjacent regions; otherwise, not merging;
l. outputting the image segmentation result;
m. training an image classifier with the training samples;
n. feeding the image to be labeled into the trained image classifier to complete the labeling of the image.
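A hedged sketch of the clustering step (d) referenced above follows: fused image features are clustered with affinity propagation in scikit-learn and the cluster centers are taken as seed points for region growing; the random feature matrix is placeholder data.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

features = np.random.default_rng(6).random((200, 16))   # placeholder fused SIFT/SURF/HOG features
ap = AffinityPropagation(random_state=0).fit(features)

seed_indices = ap.cluster_centers_indices_              # indices of cluster centers (seed points)
print(len(seed_indices), "seed points for region growing")
```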
(3) Intelligent video annotation based on machine learning
1) Extracting a video key frame;
the theme of the video is expressed in a clustering mode, the video frames are divided into a plurality of clusters through clustering, and corresponding frames are selected from each cluster as key frames after the process is finished. Firstly, initializing a clustering center; secondly, determining a reference frame which is divided into classes or a new clustering center which is used as the class by calculating the range between the clustering center and the current frame; and finally, selecting the video frame closest to the clustering center to process the video frame into a key frame.
a. The set of input video frame data is expressed as X = {x_1, x_2, …, x_n}, and the set is divided into a number of clusters given an initial cluster number k (k ≤ n).
b. Feature values are extracted from the set X based on the color-histogram attribute of the video frames, and the clusters are divided according to the extracted color feature values; the division process can be expressed as minimizing the clustering objective C, calculated as:

C = argmin Σ_{i=1}^{k} Σ_{x∈C_i} ||x - u_i||^2

where C = {C_1, C_2, …, C_k} is the clustering result and u_i is the mean of cluster C_i.
c. The feature vector x_1 corresponding to the first frame of the video is classified into the first class, and the color-histogram feature value corresponding to the first frame is taken as the initial centroid of that class.
d. Calculating the distance from the video frame to the centroid, and if the distance of the currently compared video frame is greater than a given initial threshold value T, classifying the frame into a new class; otherwise, the current frame is classified into the class closest to it, and the centroid of that class is updated.
e. Process d is repeated until the feature vector x_n corresponding to the last frame is either assigned to an existing class or taken as a new class center.
f. The video frame closest to each cluster center is selected as a key frame. The video key frames extracted by this algorithm have low redundancy, and they accurately reflect the content of the video.
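A minimal sketch of this key-frame selection follows, assuming OpenCV HSV color histograms as the frame feature and a fixed distance threshold T; the video path and the value of T are placeholders.

```python
import cv2
import numpy as np

def frame_hist(frame):
    """HSV color histogram used as the per-frame feature."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    h = cv2.calcHist([hsv], [0, 1], None, [16, 16], [0, 180, 0, 256])
    return cv2.normalize(h, h).flatten()

cap = cv2.VideoCapture("test_video.mp4")   # placeholder path
T = 0.5                                    # illustrative distance threshold
centroids, members = [], []                # one centroid and index list per class

idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    f = frame_hist(frame)
    if not centroids:
        centroids.append(f)
        members.append([idx])
    else:
        d = [np.linalg.norm(f - c) for c in centroids]
        k = int(np.argmin(d))
        if d[k] > T:                       # far from every class: start a new class
            centroids.append(f)
            members.append([idx])
        else:                              # join the nearest class and update its centroid
            members[k].append(idx)
            centroids[k] = centroids[k] + (f - centroids[k]) / len(members[k])
    idx += 1
cap.release()
print("number of clusters / key frames:", len(centroids))
```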
2) Carrying out image preprocessing such as filtering, graying, color space transformation and the like on the video frame image;
3) Extracting the color moment features, HSV (hue, saturation, value) correlograms, edge distribution histograms, HSV histograms, wavelet edges and co-occurrence texture feature maps of the frame images;
4) Fusing the 6 feature maps;
5) Training a video annotation classifier by using a video sample;
6) Inputting the feature graph into a video annotation classifier;
7) Obtaining the video annotation results.
(4) Vote-based multimodal data fusion
1) Feature selection. The text data are segmented into words, mapped to an N-dimensional space with a word-embedding model, and converted into matrix form. The image data are preprocessed, and the image information is reduced to a high-dimensional matrix through a feature extractor. The video is split into frames, and each frame is processed in the same way as an image.
2) Decision support: considering the fuzziness present in multi-source data, the random data are transformed and the confidence of the data with respect to a decision is calculated.
3) A suitable fuzzy quantization rule is selected according to the preference of the decision maker, the fuzzy quantization operator f(x) is solved, the weight vector w is determined from the fuzzy quantization operator, and the result is transformed.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (10)

1. The method for intelligently labeling the multi-modal test data is characterized by comprising the following steps of:
step SS1: inputting test data, wherein the test data comprises text data, image data and video data;
step SS2: executing an unsupervised text labeling method based on a complex network based on the text data; based on the image data, executing an image classification labeling method based on supervised machine learning; executing a video annotation method based on machine learning based on the video data;
step SS3: performing word segmentation, stop-word filtering and other text preprocessing on the text data with the unsupervised complex-network-based text labeling method to obtain a preliminary word segmentation result, completing the semantic mapping of node words, assigning weights to the word components of the segmentation result by combining eccentricity centrality and degree centrality, obtaining the labels of the text data according to the weights, and completing the labeling of the text data; identifying targets in the image data with the image classification labeling method based on supervised machine learning, adding image semantic content and keywords for unknown images, and giving corresponding category information to unclassified images to finish the labeling of the image data; extracting key frames from the video data with the video labeling method based on machine learning, and extracting the entity, scene, place and person features of the key-frame images with an OCR character recognition method to complete the recognition of text content; identifying typical image data with an image target recognition algorithm and marking it in the image, displaying the recognized marks under intelligent feature extraction, where clicking a mark positions it to the specified area in the image;
and step SS4: summarizing the labeling result of the text data, the labeling result of the image data and the labeling result of the video data obtained in the step SS3 to generate a test data labeling result set.
2. The method for intelligent labeling of multimodal experimental data as claimed in claim 1, wherein the step SS3 specifically comprises: the complex network is composed of points and edges; the minimum unit capable of representing complete semantic information in text data is a sentence, so sentences are represented by nodes and the structural feature analysis of the text is carried out in units of sentences; the edge definition principle is that if two sentences have a common noun an edge is generated to connect them, otherwise no edge is generated; if two sentences in the network share an edge, i.e. have a common noun, they may describe the same topic or convey supplementary material on the same topic, and although the two sentences may contain repeated redundant information, their contents are the most closely related; a complex network is constructed through the common-noun relationship between sentences, finally obtaining the complex network of the text data; after preprocessing, the nouns generated by each sentence in the text data are mapped into the complex network of the text data, and two matrices A and W are defined according to the concepts of an adjacency matrix and N-order matrix weights, where N is the number of nodes or sentences, the matrix A represents the edge relationships between sentences, and the matrix W represents the sentence weights; in the A matrix, a_ij = a_ji = 1 if there is an edge between node i and node j, and 0 otherwise; in the W matrix, the edge weight w_ij = w_ji is the number of times a common word appears in node i and node j.
3. The method for intelligently labeling multimodal experimental data as claimed in claim 1, wherein the step SS3 of assigning weights to the word components of the word segmentation result by combining eccentricity centrality and degree centrality and obtaining the text labels according to the weights specifically comprises:
for a simple connected graph G, the node set is V(G) and the edge set is E(G); for two nodes u, v ∈ V(G), the distance between them is defined as the length of the shortest path between them, denoted d_G(u, v); the eccentricity ε(v) of node v is the maximum of the distances from node v to the other nodes, and deg_G(v) denotes the degree of node v; if deg_G(v) = 1, node v is called a good connection point; the number of good connection points in the simple connected graph G is denoted ω(G), and the eccentric connectivity index, denoted ξ^c(G), is expressed as:

ξ^c(G) = Σ_{v∈V(G)} deg_G(v)·ε(v)

The total eccentricity index is defined as ζ(G), expressed as:

ζ(G) = Σ_{v∈V(G)} ε(v)

ζ(G, x) = Σ_{v∈V(G)} x^ε(v)

where ζ(G) = (ζ(G, x))' evaluated at x = 1.
Degree centrality represents the number of nodes directly connected to a node; in an undirected graph the maximum is (n - 1), while a directed graph distinguishes in-degree and out-degree; it is divided into absolute and relative forms;
the absolute forms of degree centrality, betweenness centrality and closeness centrality are respectively expressed as:

C_ADi = Σ_{j=1}^{n} a_ij = d(i)

C_ABi = Σ_{j<k} g_jk(i) / g_jk

C_APi = 1 / ( Σ_{j=1}^{n} d_ij )

where g_jk is the number of shortest paths between nodes j and k, g_jk(i) is the number of those paths passing through node i, and d_ij is the shortest-path distance between nodes i and j;
the relative forms of degree centrality, betweenness centrality and closeness centrality are respectively expressed as:

C_RDi = d(i)/(n - 1)

C_RBi = 2 C_ABi / [(n - 1)(n - 2)]

C_RPi = (n - 1) / Σ_{j=1}^{n} d_ij
and finally, allocating labels to the text data according to the centrality.
4. The method according to claim 1, wherein the image classification labeling method based on supervised machine learning in step SS3 specifically comprises:
step SS321: an image tag management step including: selecting a plurality of labels for each training image;
step SS322: the feature extraction step comprises the following steps: extracting the color, texture, shape and spatial features of the image, and describing the image features with the scale-invariant feature transform algorithm SIFT, the speeded-up robust features algorithm SURF and the histogram of oriented gradients algorithm HOG;
step SS323: the algorithm model training step comprises the following steps: generating a classification model by using a Support Vector Machine (SVM), a convolutional neural network, BOOSTING and a random forest method to obtain a classifier, and identifying a target in an image by using the classification model;
step SS324: a target detection analysis step, comprising: adding image semantic content and keywords aiming at unknown images, and giving corresponding category information to the unclassified images to finish the annotation of the images.
5. The method according to claim 4, wherein the scale invariant feature transform algorithm SIFT specifically comprises:
the building of the scale space comprises the following steps: determining and down-sampling a fuzzy scale, and blurring an image by adopting Gaussian convolution, wherein a Gaussian difference function of the image is as follows:
D(x, y, σ) = (G(x, y, kσ) - G(x, y, σ)) * I(x, y),  with G(x, y, σ) = (1/(2πσ^2)) exp(-(x^2 + y^2)/(2σ^2)) and k the scale factor between adjacent scales
in the formula, x and y represent pixel point coordinates, and sigma is a variance;
the detection of the candidate extremum in the scale space comprises the following steps: searching image positions on all scales, and identifying potential interest points which are invariable to the scales and the rotations through a Gaussian difference function;
feature point positioning, comprising: determining the position and scale at each candidate location by fitting a refined model, obtaining the ratio of the principal gradient direction to the other directions by computing the second-derivative Hessian matrix of the DoG, and keeping only local feature points whose ratio is smaller than a given value; the Hessian matrix is expressed as:

H = [ I_xx  I_xy ]
    [ I_xy  I_yy ]

where x, y represent pixel coordinates, I_xx denotes the second partial derivative in the x direction, I_yy the second partial derivative in the y direction, and I_xy the second partial derivative in the xy direction;
feature point direction assignment, comprising: assigning one or more directions to each keypoint location based on the local gradient directions of the image; all subsequent operations on the image data are performed relative to the direction, scale and location of the keypoint, thereby providing invariance to these transformations; for the same original image, the larger the scale, the larger the window; conversely, if the window size is unchanged, an image at a larger scale contains less information; the gradient magnitude m(x, y) and gradient direction θ(x, y) of each sampling point (x, y) in the window are calculated as:

m(x, y) = sqrt( (L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2 )

θ(x, y) = tan^(-1)( (L(x, y+1) - L(x, y-1)) / (L(x+1, y) - L(x-1, y)) )
the characteristic point description comprises the following steps: measuring the local gradient of the image on a selected scale in a neighborhood around each key point;
and feature point matching, comprising: matching keypoints between two images amounts to identifying the nearest neighbour, but in some cases the second-nearest neighbour is not clearly separated from the nearest, possibly because of noise or other reasons; in this case the ratio of the nearest to the second-nearest distance is taken, and if this ratio is greater than 0.8 the match is rejected, which eliminates about 90% of false matches.
6. The method for intelligent labeling of multimodal experimental data as claimed in claim 5, wherein the feature point description specifically comprises: first, the neighborhood range used to compute the feature descriptor is determined; the neighborhood around the feature point is divided into 4×4 sub-regions, each sub-region serves as a seed point, and each seed point has 8 directions; unlike the computation of the main direction of the feature point, the gradient direction histogram of each sub-region here divides 0°-360° into 8 direction ranges of 45° each, so that each seed point carries gradient intensity information in 8 directions in total.
7. The method for intelligent labeling of multimodal experimental data as claimed in claim 4, wherein the speeded up robust features algorithm SURF specifically comprises:
the building of the scale space comprises the following steps: determining the blur scale and down-sampling, blurring the image with Gaussian convolution, and constructing a Hessian matrix; for a function f(x, y), the Hessian matrix H consists of its partial derivatives:

H(f(x, y)) = [ ∂²f/∂x²    ∂²f/∂x∂y ]
             [ ∂²f/∂x∂y   ∂²f/∂y²  ]

The discriminant of the H matrix is:

det(H) = (∂²f/∂x²)(∂²f/∂y²) - (∂²f/∂x∂y)²

In SURF, using a second-order standard Gaussian function as the filter, the H matrix is obtained by computing the second-order partial derivatives through convolution with specific kernels:

H(X, t) = [ L_xx(X, t)  L_xy(X, t) ]
          [ L_xy(X, t)  L_yy(X, t) ]

L(X, t) = G(t) * I(X)

L(X, t) is the representation of the image at different resolutions, given by the convolution of a Gaussian kernel G(t) with the image function I(X); to balance the error between the exact and the approximate values, the discriminant of the H matrix becomes:

det(H_approx) = D_xx·D_yy - (0.9·D_xy)^2
detection of candidate extrema in a scale space, comprising: searching image positions on all scales, and identifying potential interest points which are invariable to the scales and the rotations through a Gaussian difference function;
feature point positioning, comprising: determining the position and scale at each candidate location by fitting a refined model, obtaining the ratio of the principal gradient direction to the other directions by computing the second-derivative Hessian matrix of the DoG, and keeping only local feature points whose ratio is smaller than a given value; the Hessian matrix is expressed as:

H = [ I_xx  I_xy ]
    [ I_xy  I_yy ]

where I_xx denotes the second partial derivative in the x direction, I_yy the second partial derivative in the y direction, and I_xy the second partial derivative in the xy direction;
feature point direction assignment, comprising: assigning one or more directions to each keypoint location based on the local gradient directions of the image; all subsequent operations on the image data are performed relative to the direction, scale and location of the keypoint, thereby providing invariance to these transformations; for the same original image, the larger the scale, the larger the window; conversely, if the window size is unchanged, an image at a larger scale contains less information; the gradient magnitude m(x, y) and gradient direction θ(x, y) of each sampling point (x, y) in the window are calculated as:

m(x, y) = sqrt( (L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2 )

θ(x, y) = tan^(-1)( (L(x, y+1) - L(x, y-1)) / (L(x+1, y) - L(x-1, y)) )
the characteristic point description comprises the following steps: measuring the local gradient of the image on a selected scale in a neighborhood around each key point;
and feature point matching, comprising: matching keypoints between two images amounts to identifying the nearest neighbour, but in some cases the second-nearest neighbour is not clearly separated from the nearest, possibly because of noise or other reasons; in this case the ratio of the nearest to the second-nearest distance is taken, and if this ratio is greater than 0.8 the match is rejected, which eliminates about 90% of false matches.
8. The method for intelligently labeling multi-modal experimental data according to claim 4, wherein the histogram of oriented gradients algorithm HOG specifically comprises:
standardizing the gamma space and color space: the whole image is normalized in order to reduce the influence of illumination; in the texture intensity of an image, local surface exposure contributes a large proportion, and because the color information contributes little, the image is generally first converted to a gray-scale map;

I(x, y) = I(x, y)^gamma

calculating the image gradient: the gradients of the image in the horizontal and vertical directions are computed, and the gradient magnitude and direction at each pixel position are computed from them; the derivative operation not only captures contours, shadows and some texture information, but also further weakens the influence of illumination; the gradient of pixel (x, y) in the image is:

G_x(x, y) = H(x+1, y) - H(x-1, y)

G_y(x, y) = H(x, y+1) - H(x, y-1)

where G_x(x, y), G_y(x, y) and H(x, y) respectively denote the horizontal gradient, the vertical gradient and the pixel value at pixel (x, y) of the input image; the gradient magnitude and gradient direction at pixel (x, y) are respectively:

G(x, y) = sqrt( G_x(x, y)^2 + G_y(x, y)^2 )

α(x, y) = tan^(-1)( G_y(x, y) / G_x(x, y) )
constructing a gradient direction histogram for each cell unit, providing a code for a local image area, and simultaneously keeping weak sensitivity to the posture and appearance of an object in an image;
the cell units are combined into a large block, the gradient histogram is normalized in the block, the variation range of the gradient intensity is very large due to the variation of local illumination and the variation of foreground-background contrast, the gradient intensity needs to be normalized, and the normalization can further compress illumination, shadow and edges;
and collecting HOG features, collecting all overlapped blocks in the detection window with the HOG features, and combining the overlapped blocks into a final feature vector for classification.
9. The method for intelligent labeling of multi-modal experimental data as claimed in claim 4, wherein the use of the support vector machine SVM in the step SS323 comprises: input: training data set T = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where x_i ∈ X = R^n, y_i ∈ {+1, -1}, i = 1, 2, ..., N;
output: the separating hyperplane and the classification decision function;
a suitable kernel function K(x, z) and a penalty parameter C > 0 are selected, and the following convex quadratic programming problem is constructed and solved:

min over α:  (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j y_i y_j K(x_i, x_j) - Σ_{i=1}^{N} α_i

s.t.  Σ_{i=1}^{N} α_i y_i = 0,  0 ≤ α_i ≤ C,  i = 1, 2, ..., N

giving the optimal solution α* = (α_1*, α_2*, ..., α_N*)^T.
10. The intelligent labeling method for multi-modal experimental data according to claim 1, wherein the video labeling method based on machine learning in the step SS3 specifically comprises:
assume there are M modalities; sample x_i is expressed in the M modalities as x_i^(1), x_i^(2), ..., x_i^(M);
assume there are D distance measures d_1(·,·), d_2(·,·), ..., d_D(·,·); then M × D graphs can be generated from the M modalities and the D distance measures, where W_(m-1)×D+k,ij denotes the weight between samples i and j in the graph generated from the m-th modality and the k-th distance metric;
for temporal continuity, C graphs can also be constructed; here two graphs are considered, i.e. C = 2; the first graph considers the relationship between every two adjacent samples, i.e. every sample is considered highly likely to have the same concept as its adjacent samples, which can be expressed as W_ij = 1 if |i - j| = 1 and W_ij = 0 otherwise;
the other graph considers the relationship between each sample and its 6 adjacent samples, with the weight determined by the position of the sample relative to the current one.
CN202210723506.XA 2022-06-24 2022-06-24 Intelligent labeling method for multi-modal test data Pending CN115203408A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210723506.XA CN115203408A (en) 2022-06-24 2022-06-24 Intelligent labeling method for multi-modal test data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210723506.XA CN115203408A (en) 2022-06-24 2022-06-24 Intelligent labeling method for multi-modal test data

Publications (1)

Publication Number Publication Date
CN115203408A true CN115203408A (en) 2022-10-18

Family

ID=83577331

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210723506.XA Pending CN115203408A (en) 2022-06-24 2022-06-24 Intelligent labeling method for multi-modal test data

Country Status (1)

Country Link
CN (1) CN115203408A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102222101A (en) * 2011-06-22 2011-10-19 北方工业大学 Method for video semantic mining
CN110929729A (en) * 2020-02-18 2020-03-27 北京海天瑞声科技股份有限公司 Image annotation method, image annotation device and computer storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WILLYAN D. ABILHOA et al.: "A keyword extraction method from twitter messages represented as graphs", Applied Mathematics and Computation, vol. 240, pages 308-325, XP028848232, DOI: 10.1016/j.amc.2014.04.090 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115599966A (en) * 2022-12-15 2023-01-13 杭州欧若数网科技有限公司(Cn) Data locality measurement method and system for distributed graph data
CN116939320A (en) * 2023-06-12 2023-10-24 南京邮电大学 Method for generating multimode mutually-friendly enhanced video semantic communication

Similar Documents

Publication Publication Date Title
EP2805262B1 (en) Image index generation based on similarities of image features
Tarawneh et al. Invoice classification using deep features and machine learning techniques
CN108664996A (en) A kind of ancient writing recognition methods and system based on deep learning
WO2015096565A1 (en) Method and device for identifying target object in image
CN106055573B (en) Shoe print image retrieval method and system under multi-instance learning framework
CN104778242A (en) Hand-drawn sketch image retrieval method and system on basis of image dynamic partitioning
CN115203408A (en) Intelligent labeling method for multi-modal test data
CN106909895B (en) Gesture recognition method based on random projection multi-kernel learning
US11600088B2 (en) Utilizing machine learning and image filtering techniques to detect and analyze handwritten text
CN110737788B (en) Rapid three-dimensional model index establishing and retrieving method
CN110008365B (en) Image processing method, device and equipment and readable storage medium
CN114287005A (en) Negative sampling algorithm for enhancing image classification
CN113343920A (en) Method and device for classifying face recognition photos, electronic equipment and storage medium
Chakraborty et al. Application of daisy descriptor for language identification in the wild
Pengcheng et al. Fast Chinese calligraphic character recognition with large-scale data
Naiemi et al. Scene text detection using enhanced extremal region and convolutional neural network
Úbeda et al. Pattern spotting in historical documents using convolutional models
CN117076455A (en) Intelligent identification-based policy structured storage method, medium and system
CN105844299B (en) A kind of image classification method based on bag of words
Devareddi et al. An edge clustered segmentation based model for precise image retrieval
Zanwar et al. A comprehensive survey on soft computing based optical character recognition techniques
Chen et al. Trademark image retrieval system based on SIFT algorithm
Devareddi et al. Interlinked feature query-based image retrieval model for content-based image retrieval
Sousa et al. Word indexing of ancient documents using fuzzy classification
Rao et al. Region division for large-scale image retrieval

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20221018