
CN118069840A - Domain knowledge guided air traffic control unsafe event classification method - Google Patents

Domain knowledge guided air traffic control unsafe event classification method

Info

Publication number
CN118069840A
CN118069840A (application CN202410156217.5A)
Authority
CN
China
Prior art keywords
text
domain knowledge
attention
deep
air traffic control
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410156217.5A
Other languages
Chinese (zh)
Inventor
曾维理
郭子逸
朱聃
江灏
周亚东
谭湘花
刘继新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics
Priority to CN202410156217.5A
Publication of CN118069840A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/08 Learning methods
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/02 Knowledge representation; Symbolic representation
    • G06N 5/022 Knowledge engineering; Knowledge acquisition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a domain-knowledge-guided method for classifying air traffic control (ATC) unsafe events. Pre-acquired unsafe event text data are categorized and preprocessed; terms are detected and classified in an unsupervised manner based on a byte pair encoding algorithm and a term-frequency method, and a domain knowledge base of unsafe events is constructed; a deep learning text classification model, comprising an embedding module, a wide module, a deep module, and a classifier, is built to classify ATC unsafe event texts under the guidance of domain knowledge; and historical unsafe event records from the ATC hazard-source database serve as input samples for model training, with model parameters updated iteratively, realizing accurate domain-knowledge-guided classification of ATC unsafe events. The method greatly reduces the difficulty of acquiring domain knowledge, reduces the loss in the interaction between domain knowledge and the input text, and facilitates accurate domain-knowledge-guided classification of ATC unsafe events.

Description

Domain knowledge guided air traffic control unsafe event classification method
Technical Field
The invention belongs to the technical field of digitization and intelligence of civil aviation air traffic control, and particularly relates to a domain-knowledge-guided air traffic control unsafe event classification method.
Background
In recent years, civil aviation transport demand has grown rapidly. To improve safety assurance capability, the collected civil aviation safety information must be fully exploited to assess safety conditions and trends, realizing information-driven safety management, which is essential for safe civil aviation transport. To reach this information-driven goal, automatic classification of the event data in safety information is critical. Automatic event classification is a precondition for applying statistical methods such as causal inference analysis to civil aviation safety management, and greatly helps to discover hidden safety hazards in time, control risks, and prevent civil aviation accidents. Air traffic control (ATC) is the central nervous system of civil aviation, so intelligent analysis of ATC unsafe events deserves attention first.
ATC unsafe event reports are typically recorded as short texts. Existing short text classification techniques fall into three main types. (1) Rule-based methods: such methods divide texts into categories by matching the text to be classified against a set of manually constructed linguistic rules. Rule-based text classification is useful when available training data are limited and offers a transparent, interpretable decision process; however, it relies on a large amount of manual work to create and maintain the rules. (2) Machine-learning-based methods: such methods usually decompose text classification into a two-step task: first design suitable feature extraction and dimensionality reduction according to the characteristics of the text, then train machine learning models to complete the regression from features to categories. For text classification, the greatest drawback of such methods is that the feature engineering is complex, depends heavily on domain knowledge, and generalizes poorly. (3) Deep-learning-based methods: manual feature extraction is replaced by introducing an embedding model that maps texts into low-dimensional continuous feature vectors. On top of the embedded vectors, deep learning models such as LSTM, CNN, and their combinations can serve as classifiers to complete the text classification task.
The prior art currently has the following problems. (1) In highly specialized domains such as air traffic control, performance degrades because of specialized domain knowledge: a deep learning model needs more training data than in general domains to fit domain causal relationships and learn embeddings of domain terms. (2) It is difficult to handle the domain drift caused by the diversity of training-sample expressions. For example, for the category "military-civil aviation conflict", most sample texts contain "military aircraft" or a near synonym, while a small portion never mention "military" but instead describe specific aircraft types such as "bomber". Both expressions actually predict the class through the hidden variable "military aircraft"; what is observed in a sample, however, is only one surface mapping of that hidden variable. Under this influence the model fails to learn the true hidden variable, leading to misjudgments.
Disclosure of Invention
The invention aims to: addressing the above problems in extracting key features of hazard sources for unsafe events, the invention provides a domain-knowledge-guided air traffic control unsafe event classification method, which realizes accurate classification of ATC unsafe events under the guidance of domain knowledge.
The technical scheme is as follows: the domain-knowledge-guided air traffic control unsafe event classification method of the invention comprises the following steps:
(1) Categorizing pre-acquired unsafe event text data and preprocessing the text data;
(2) Detecting and classifying terms in an unsupervised manner based on a byte pair encoding algorithm and a term-frequency method, and constructing a domain knowledge base of unsafe events;
(3) Constructing a deep learning text classification model to classify ATC unsafe event texts under the guidance of domain knowledge; the text classification deep learning model comprises an embedding module, a wide module, a deep module, and a classifier;
(4) Training the model with historical unsafe event records from the ATC hazard-source database as input samples, and iteratively updating the model parameters, thereby realizing accurate domain-knowledge-guided classification of ATC unsafe events.
Further, the categorization of the pre-acquired unsafe event text data in step (1) is implemented as follows:
The unsafe event text data are divided into hazard source number, location unit, specialty, hazard source description, trigger, outcome, likelihood, severity, initial risk level, major hazard source, existing control measures, mitigation measures, regulatory unit, expected severity, expected likelihood, expected risk level, residual or derived risk, safety performance goal, and control status.
Further, the preprocessing of the text data in step (1) is implemented as follows:
Invalid characters in the text data, including garbled characters, redundant spaces, and line breaks, are deleted; descriptions involving spatial entities are replaced with a uniform tag; specialized abbreviations in the hazard-source database are replaced with their corresponding full names; and the unsafe events corresponding to each unsafe-event outcome text are labeled, forming the ATC unsafe event classification dataset.
Further, step (2) is implemented as follows:
The civil aviation regulation corpus is traversed with a byte pair encoding algorithm: the most frequent pair of adjacent Chinese characters is found and replaced with a new word mark; the search then continues for the most frequent adjacent pair, including character-character and character-mark combinations; these operations are repeated until a preset stopping condition is met, yielding a frequency-annotated segmentation dictionary for regulation texts; the regulation corpus is then segmented with this dictionary;
if a term appears more frequently than other terms in the texts of some category, and less frequently in the texts of other categories, it is a term associated with that category; following this principle, the KF-IDF value is computed as:
KF-IDF(term, cat) = docs(term, cat) · log( K / (cats(term) + α) )
where docs(term, cat) is the number of documents in the category that contain the term, cats(term) is the number of categories in which the term appears, K is the number of categories, and α is a smoothing factor;
redundant items in the results obtained with KF-IDF values are further screened by the part of speech of the term phrases, and phrases in the list that do not meet the part-of-speech requirement are eliminated.
Further, the embedding module in step (3) converts the input text and the domain knowledge texts into embedded vectors; the specific implementation is as follows:
A text sequence input {w_1, w_2, ...} is first tokenized by a tokenizer into a token sequence of length N, {t_1, t_2, ..., t_N}; the word-level embedded sequence S_w is then computed by BERT:
S_w = {e_w1, e_w2, ..., e_wN}, e_wn ∈ R^h
where e_wn is the embedded vector of the n-th token and h is the hidden-layer dimension of BERT;
each domain knowledge text is represented by the mean of its token embedding vectors, so that all domain knowledge texts are represented as one complete sequence;
for the set of M domain knowledge texts L = {P_1, P_2, ..., P_M}, the same tokenizer yields for P_m a token sequence of length l(m), {t_1^(m), ..., t_l(m)^(m)}; the embedding of P_m is then computed as:
e_lm = (1/l(m)) · Σ_{i=1..l(m)} e(t_i^(m))
when a sequence length differs from the target length, it is padded or truncated to length M; L is converted into the domain knowledge embedded sequence S_p:
S_p = {e_l1, e_l2, ..., e_lM}
where e_li is the phrase-level embedded vector of the i-th domain knowledge text; the matrix forms of S_w and S_p are M_w ∈ R^{N×h} and M_p ∈ R^{M×h}, respectively.
Further, the wide module in step (3) memorizes and recalls the domain knowledge of each event label; the specific implementation is as follows:
A moving average is taken, token by token, over the embedded vectors of the input text; for the input text M_w and the domain knowledge embedded sequence M_p in matrix form, the query matrix Q_W and key matrix K_W are:
Q_W[n] = (1/(2r+1)) · Σ_{i=n-r..n+r} M_w[i], Q_W ∈ R^{N×h}
K_W = M_p
where r is the half window of the moving average and h is the dimension of the embedded vectors; the value matrix is the one-hot matrix of the categories of the domain knowledge texts:
V_W ∈ {0,1}^{M×K}
where K is the dimension of the one-hot label encoding, i.e., the number of categories; the n-th row of Q_W is the embedded vector of the n-th token of the input text, the m-th row of K_W is the phrase-level embedding of the m-th domain knowledge text, and the m-th row of V_W is the one-hot code of the category of the m-th domain knowledge text;
the attention score is computed with a regularized-compatibility attention mechanism:
B = (Q_W K_W^T)^T
A_W = softmax(B^T / √h) · V_W
where β_j, the j-th column vector of B, measures the element-wise compatibility between the token sequence and the domain knowledge text sequence;
max pooling is applied to the regularized-compatibility attention scores; from A_W, the probability that the input text belongs to the k-th class is:
w_k = max_n (A_W)_{n,k}
further, the deep module of step (3) includes a deep text attention block and a deep domain attention block; the deep text attention block trains a model to pay attention to important words in sentences through the self-attention module; the depth domain attention block trains a model to pay attention to important domain knowledge through a common attention module; and combining the outputs of the multi-layer deep text attention block and the deep field attention block to obtain the output of the deep module, thereby realizing the deep extraction of text information.
Further, the deep text attention block processes the input text with a self-attention module; learnable parameters W_Q^X, W_K^X, and W_V^X are added to the attention mechanism, and the self-attention of the input text is computed:
Q_X = M_w·W_Q^X, K_X = M_w·W_K^X, V_X = M_w·W_V^X
A_X = Attention(Q_X, K_X, V_X)
Residual connection and Layer Normalization (LN) are then applied to A_X:
A_X' = M_w + LayerNorm(A_X)
A_X' is then passed through a feed-forward layer for nonlinearity and enters the next deep text attention block; at the end of all deep text attention blocks, an average pooling is computed over the tokens of the input sequence:
d_X = (1/N) · Σ_{n=1..N} A_X'[n]
where d_X is the sentence-level embedding of the input text after self-attention processing, i.e., the output of the deep text attention block.
Further, the deep domain attention block adds a learnable parameter to the attention mechanism, with the values being the embedded vectors of the domain knowledge texts; the input embedded sequence is matrix-multiplied by the parameter and the attention is then computed:
Q_D = M_w·W_Q^D, K_D = M_p·W_K^D, V_D = M_p
A_D = Attention(Q_D, K_D, V_D)
A_D' = M_p + LayerNorm(A_D)
An average pooling is likewise computed over A_D':
d_D = (1/M) · Σ_{m=1..M} A_D'[m]
The vector d_D is the output of the deep domain attention block; d_D also passes through a feed-forward layer for nonlinearity before entering the next deep domain attention block.
Further, the classifier in step (3) is implemented as follows:
The output w of the wide module has dimension K, the number of labels; the deep module outputs the sentence-level deep text feature d, whose dimension is the embedding dimension h; the classifier integrates the two vectors and outputs the predicted probability of the category of the text; first, d is reduced in dimension and the two vectors are merged into one vector x:
d' = W_1·d + b_1
x = concat(w, d')
where W_1 and b_1 are the parameters of the linear layer that reduces d, and concat denotes vector concatenation; x is fed into a fully connected layer with tanh as activation function for classification, and the result is converted into predicted probabilities by a softmax function:
ŷ = softmax(tanh(W_2·x + b_2))
where W_2 and b_2 are the parameters of the fully connected layer;
the binary cross-entropy loss function between the predicted and true values is minimized:
L = - Σ_{k=1..K} [ y_k·log(ŷ_k) + (1 - y_k)·log(1 - ŷ_k) ]
where y is the true (multi-hot) class vector of the sample.
The beneficial effects are as follows: compared with the prior art, the invention has these advantages:
1. The invention represents prior domain knowledge in text form and maps it into the same embedding space as the input text, enabling interaction with the input text to guide classification; this approach records domain knowledge directly as terms (phrases) and therefore requires no structured data storage skills; it greatly reduces the difficulty of acquiring prior domain knowledge and reduces the loss in the interaction between domain knowledge and the input text;
2. The invention introduces a wide module and a deep module to separately memorize and apply prior knowledge and to mine latent classification patterns, thereby applying prior domain knowledge and solving the domain drift problem;
3. The invention extracts, in an unsupervised manner, the various expressions of ATC-related unsafe events from a large corpus of civil aviation regulation files as candidates for the domain knowledge dictionary; compared with experience-based expert knowledge acquisition, this helps obtain more comprehensive and accurate domain knowledge texts.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, the invention provides a domain-knowledge-guided air traffic control unsafe event classification method, which specifically comprises the following steps:
Step 1: classifying the pre-acquired unsafe event text data, and preprocessing the text data.
Firstly, reading in unsafe event text data, determining a classification basis, and then carrying out preprocessing of text denoising, space entity marking and abbreviation replacement on the text data.
The air-management hazard source database divides the data into 19 attribute stores of hazard source number, location unit, specialty, hazard source description, trigger, outcome, likelihood, severity, initial risk level, major hazard source, existing control measures, slow control measures, regulatory unit, expected severity, expected likelihood, expected risk level, remaining/derived risk, safety performance objective and control status. The invention classifies the empty pipe unsafe events according to the 'result' attribute data. The "outcome" is a description of the consequences that the reporting personnel may or have caused to the unsafe event, and is the best attribute for identifying the unsafe event class.
Preprocessing unsafe event text data:
Text denoising: the text data contain invalid characters such as garbled characters, redundant spaces, and line breaks, which interfere with training; these characters are deleted.
Spatial-entity labeling: some descriptions involve spatial entities, such as city names, airport names, and waypoint names. These entities are replaced with a uniform tag, eliminating bias caused by spatial factors.
Abbreviation substitution: the hazard-source database contains many highly specialized abbreviations, which are replaced with their corresponding full names.
The dataset is then labeled. 21 classes of standard unsafe events are defined according to the ICAO Safety Management Manual (Doc 9859) and the civil aviation industry standard Civil Aircraft Incidents (MH/T 2001-2015). An outcome description may correspond to multiple unsafe events, i.e., one data instance may carry multiple labels. The class of each unsafe-event outcome text is judged against the provisions of these two documents and labeled accordingly, forming the ATC unsafe event classification dataset.
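To make Step 1 concrete, the following Python sketch shows one possible preprocessing pipeline; the spatial-entity list and abbreviation dictionary here are illustrative placeholders, since the patent does not publish its actual tables:

```python
import re

# Hypothetical lookup tables; the patent's real entity list and
# abbreviation dictionary are not published, so these are placeholders.
SPATIAL_ENTITIES = ["南京禄口机场", "广州白云机场"]        # airports, cities, waypoints
ABBREVIATIONS = {"ATC": "空中交通管制"}                    # abbreviation -> full name

def preprocess(text: str) -> str:
    # Text denoising: drop control characters, collapse redundant whitespace.
    text = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Spatial-entity labeling: replace location mentions with a uniform tag.
    for entity in SPATIAL_ENTITIES:
        text = text.replace(entity, "[LOC]")
    # Abbreviation substitution: expand specialized abbreviations to full names.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return text
```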
Step 2: extracting terms containing category labels from massive regulation files, firstly adopting a byte pair coding algorithm to realize word segmentation, then adopting a method based on term word frequency, detecting the terms in an unsupervised mode and classifying the terms, and constructing an unsafe event field knowledge base. The method comprises the following specific steps:
Traversing the civil aviation regulation text corpus, searching a pair of adjacent Chinese characters with highest occurrence frequency, and replacing the adjacent Chinese characters with a new word mark; the adjacent Chinese characters with highest occurrence frequency (comprising the combination of Chinese characters and the combination of word marks) are continuously searched. And repeatedly iterating the above operation until the stopping condition is met, and obtaining the regulation text word segmentation dictionary containing the frequency. And segmenting the regulation file corpus according to the table.
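A minimal sketch of this character-level byte-pair merging, assuming the corpus is a list of strings and using "no pair repeats" as a stand-in stopping condition (the patent only calls its condition "preset"):

```python
from collections import Counter

def bpe_dictionary(corpus, num_merges=10000):
    """Greedily merge the most frequent adjacent pair into a new word mark,
    recording each merged mark with its frequency."""
    seqs = [list(doc) for doc in corpus]   # start from single Chinese characters
    vocab = Counter()
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:                        # assumed stopping condition
            break
        merged = a + b                      # new word mark for the pair
        vocab[merged] = freq
        for seq in seqs:                    # replace the pair everywhere
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [merged]
                else:
                    i += 1
    return vocab
```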
A Keyword Frequency-Inverse Document Frequency (KF-IDF) based method is adopted to extract knowledge terms of the unsafe event domain. A term is associated with a category if it appears more frequently than other terms in the texts of that category and less frequently in the texts of other categories. Following this principle, the KF-IDF value is computed as:
KF-IDF(term, cat) = docs(term, cat) · log( K / (cats(term) + α) )
where docs(term, cat) is the number of documents in the category that contain the term, cats(term) is the number of categories in which the term appears, K is the number of categories, and α is a smoothing factor. In the regulation files, each clause is treated as a document, and the unsafe events a clause describes give its categories, obtained by regular-expression matching of the clause's title and content.
Redundant items in the results obtained with KF-IDF values are further screened by the part of speech of the term phrases. Nouns or noun phrases (e.g., "fly-away") and combinations of noun phrases with verbs (e.g., "deviate from control instructions") have a high probability of forming valid terms; only such terms are retained, while phrases in the list that do not meet the part-of-speech requirement are eliminated. Finally, manual review with additions and deletions ensures the quality of the domain knowledge base.
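The KF-IDF computation can be sketched as follows, using the formula reconstructed above (the original formula image is lost, so this follows only the quantities the text defines; documents are represented as sets of segmented terms):

```python
import math
from collections import defaultdict

def kf_idf(docs_by_cat, alpha=1.0):
    """docs_by_cat maps category -> list of documents, each a set of terms.
    Returns a dict of (term, cat) -> KF-IDF score."""
    K = len(docs_by_cat)                       # number of categories
    cats_with_term = defaultdict(set)          # term -> categories containing it
    for cat, docs in docs_by_cat.items():
        for doc in docs:
            for term in doc:
                cats_with_term[term].add(cat)
    scores = {}
    for cat, docs in docs_by_cat.items():
        for term in {t for doc in docs for t in doc}:
            docs_term_cat = sum(term in doc for doc in docs)
            scores[(term, cat)] = docs_term_cat * math.log(
                K / (len(cats_with_term[term]) + alpha))
    return scores
```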
Step 3: constructing a text classification deep learning model to accurately classify the empty-management safe text event under the guidance of domain knowledge; the text classification deep learning model comprises an embedding module, a wide module, a deep module and a classifier.
The embedding module converts the input text and the domain knowledge text into embedded vectors, thereby enabling the deep learning model to understand the natural language text. For text sequence input { w 1,w2,. }, first, it is tokenized by a token analyzer to obtain a string of token sequences of length NNext, the word-level embedding sequence S w is calculated from BERT, namely:
wherein, And (3) representing an embedded vector of the nth word element, wherein h is the hidden layer dimension of the BERT.
For each domain knowledge text, the average value of the embedded vector of each word source is used for representing the domain knowledge text, so that all domain knowledge texts are represented as a complete sequence, and not a single sequence for each domain knowledge text. This approach can reduce the computational effort to the original few percent.
For a set of domain knowledge texts of number M, l= { P 1,P2,…,PM }, a representation of each of the domain knowledge texts is calculated using P m as an example. Obtaining a sequence of tokens of length l (m) using the same token analyzerThen calculate the embedding of P j:
when M is not equal to N, it needs to be filled or truncated so that its length becomes M. Converting L into a domain knowledge text embedding sequence S p, namely:
Where e li represents the phrase-level embedded vector of the i-th domain knowledge text. The matrix forms of S w and S p are respectively:
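A sketch of the embedding module with the HuggingFace transformers library; the checkpoint name bert-base-chinese is an assumption, since the patent only says BERT:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

@torch.no_grad()
def embed_tokens(text):
    """Token-level embedded sequence, i.e. M_w with shape (N, h)."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    return bert(**enc).last_hidden_state.squeeze(0)

@torch.no_grad()
def embed_domain_knowledge(phrases):
    """M_p with shape (M, h): each phrase is the mean of its token embeddings,
    so the M domain knowledge texts form one complete sequence."""
    return torch.stack([embed_tokens(p).mean(dim=0) for p in phrases])
```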
The wide module focuses on memorizing and recalling the domain knowledge of each event label. First, a moving average is taken, token by token, over the embedded vectors of the input text. Specifically, the input text M_w and the domain knowledge embedded sequence M_p in matrix form are represented as the query matrix Q_W and the key matrix K_W:
Q_W[n] = (1/(2r+1)) · Σ_{i=n-r..n+r} M_w[i], Q_W ∈ R^{N×h}
K_W = M_p
where r is the half window of the moving average and h is the dimension of the embedded vectors. The value matrix is the one-hot matrix of the categories of the domain knowledge texts:
V_W ∈ {0,1}^{M×K}
where K is the dimension of the one-hot label encoding, i.e., the number of categories. The intuitive reading of the three matrices is: the n-th row of Q_W is the embedded vector of the n-th token of the input text, the m-th row of K_W is the phrase-level embedding of the m-th domain knowledge text, and the m-th row of V_W is the one-hot code of the category of the m-th domain knowledge text.
Next, the attention score is computed with a regularized-compatibility attention mechanism:
B = (Q_W K_W^T)^T
A_W = softmax(B^T / √h) · V_W
where β_j, the j-th column vector of B, measures the element-wise compatibility between the token sequence and the domain knowledge text sequence. Since V_W uses one-hot category encodings, the result of this step actually represents the probability that each input token belongs to each category.
Max pooling is applied to the regularized-compatibility attention scores. From A_W, the probability that the input text belongs to the k-th class is:
w_k = max_n (A_W)_{n,k}
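A tensor-level sketch of the wide module under the reconstructed formulas above; the exact regularization of the compatibility scores is not recoverable from the published text, so a scaled softmax stands in for it:

```python
import torch
import torch.nn.functional as F

def wide_module(M_w, M_p, V_w, r=2):
    """M_w: (N, h) token embeddings; M_p: (M, h) domain-phrase embeddings;
    V_w: (M, K) one-hot categories. Returns w: (K,) class scores."""
    # Token-wise moving average with half window r -> query matrix Q_w.
    kernel = torch.ones(1, 1, 2 * r + 1) / (2 * r + 1)
    Q_w = F.conv1d(M_w.t().unsqueeze(1), kernel, padding=r).squeeze(1).t()
    K_w = M_p                                           # key matrix
    B = (Q_w @ K_w.t()).t()                             # (M, N) compatibility
    A_w = F.softmax(B.t() / M_w.shape[1] ** 0.5, dim=-1) @ V_w  # (N, K)
    return A_w.max(dim=0).values                        # max-pool over tokens
```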
The deep module exploits the generalization capability of deep networks to mine deep features of the input text. It consists of two kinds of blocks: the deep text attention block trains the model, through self-attention, to attend to important words in a sentence, and the deep domain attention block trains the model, through co-attention, to attend to important domain knowledge. Deep extraction of text information is achieved by stacking multiple blocks, and the results of the two parts are combined at the end of the deep module.
Deep text attention block: the input text is processed with a self-attention module. Unlike the category attention mechanism in the wide module, the attention module in the deep text attention block adds learnable parameters W_Q^X, W_K^X, and W_V^X and computes the self-attention of the input text:
Q_X = M_w·W_Q^X, K_X = M_w·W_K^X, V_X = M_w·W_V^X
A_X = Attention(Q_X, K_X, V_X)
Residual connection and Layer Normalization (LN) are then applied to A_X:
A_X' = M_w + LayerNorm(A_X)
A_X' is then passed through a feed-forward layer for nonlinearity before entering the next deep text attention block.
Since A_X' is token-wise, an average pooling over the tokens of the input sequence is computed at the end of all deep text attention blocks:
d_X = (1/N) · Σ_{n=1..N} A_X'[n]
where d_X is the sentence-level embedding of the input text after self-attention processing, and also the output of the deep text attention block.
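A sketch of one deep text attention block in PyTorch; the head count and feed-forward width are assumptions, and the residual is placed as the text describes (input plus LayerNorm of the attention output):

```python
import torch.nn as nn

class DeepTextAttentionBlock(nn.Module):
    def __init__(self, h, n_heads=8):
        super().__init__()
        # Learnable query/key/value projections live inside MultiheadAttention.
        self.attn = nn.MultiheadAttention(h, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(h)
        self.ffn = nn.Sequential(nn.Linear(h, 4 * h), nn.GELU(), nn.Linear(4 * h, h))

    def forward(self, M_w):                    # M_w: (batch, N, h)
        A_x, _ = self.attn(M_w, M_w, M_w)      # self-attention over the text
        A_x = M_w + self.norm(A_x)             # A_X' = M_w + LayerNorm(A_X)
        return self.ffn(A_x)                   # nonlinearity, to the next block

# After the last block: d_X = output.mean(dim=1), the sentence-level embedding.
```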
Deep domain attention block: in essence, the deep domain attention block lets the model learn the latent relevance between the input text and domain knowledge, attaching the corresponding domain knowledge text embeddings to the sentence embedding of the input text.
The deep domain attention block is closer in structure to the wide module. The difference is that its attention mechanism carries a learnable parameter, and its values are the embedded vectors of the domain knowledge texts rather than the one-hot matrix of the wide module. The input embedded sequence is matrix-multiplied by the parameter and the attention is then computed:
Q_D = M_w·W_Q^D, K_D = M_p·W_K^D, V_D = M_p
A_D = Attention(Q_D, K_D, V_D)
A_D' = M_p + LayerNorm(A_D)
An average pooling is likewise computed over A_D':
d_D = (1/M) · Σ_{m=1..M} A_D'[m]
The vector d_D is the output of the deep domain attention block; d_D also passes through a feed-forward layer for nonlinearity before entering the next deep domain attention block.
The output d of the deep module is obtained by combining the outputs of the stacked deep text attention blocks and deep domain attention blocks:
d = d_X + d_D
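The deep domain attention block can be sketched the same way, with queries from the text and keys/values from the domain knowledge embeddings. Because the attention output is token-aligned, the residual here is taken against the text side, a shape-consistent reading of the formulas above:

```python
import torch.nn as nn

class DeepDomainAttentionBlock(nn.Module):
    def __init__(self, h, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(h, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(h)
        self.ffn = nn.Sequential(nn.Linear(h, 4 * h), nn.GELU(), nn.Linear(4 * h, h))

    def forward(self, M_w, M_p):               # (batch, N, h), (batch, M, h)
        # Queries from the input text; keys and values from domain knowledge,
        # attaching relevant domain-text embeddings to each token.
        A_d, _ = self.attn(M_w, M_p, M_p)      # (batch, N, h)
        A_d = M_w + self.norm(A_d)             # shape-consistent residual
        return self.ffn(A_d)

# Deep-module output: d = d_X + d_D, where d_X and d_D are the mean-pooled
# outputs of the last text block and the last domain block, respectively.
```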
The three modules above yield two vectors: the output w of the wide module, with dimension equal to the number of labels K, and the sentence-level deep text feature d of the deep module, with dimension equal to the embedding dimension h. The classifier integrates the two vectors and outputs the predicted probability of the category of the text. First, d is reduced in dimension and the two vectors are merged into one vector x:
d' = W_1·d + b_1
x = concat(w, d')
where W_1 and b_1 are the parameters of the linear layer that reduces d, and concat denotes vector concatenation. x is fed into a fully connected layer with tanh as activation function for classification, and the result is converted into predicted probabilities by a softmax function:
ŷ = softmax(tanh(W_2·x + b_2))
where W_2 and b_2 are the parameters of the fully connected layer.
For the multi-label classification problem addressed in this patent, the Binary Cross Entropy (BCE) loss function between the predicted and true values is minimized:
L = - Σ_{k=1..K} [ y_k·log(ŷ_k) + (1 - y_k)·log(1 - ŷ_k) ]
where y is the true (multi-hot) class vector of the sample.
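A sketch of the classifier head; the reduced dimension of d' is an assumption (the patent does not state it), and the tanh-then-softmax ordering follows the text:

```python
import torch
import torch.nn as nn

class WideDeepClassifier(nn.Module):
    def __init__(self, h, K, reduced=64):      # `reduced` is an assumed size
        super().__init__()
        self.lin1 = nn.Linear(h, reduced)      # W_1, b_1: reduce d to d'
        self.lin2 = nn.Linear(K + reduced, K)  # W_2, b_2: fully connected layer

    def forward(self, w, d):
        x = torch.cat([w, self.lin1(d)], dim=-1)          # x = concat(w, d')
        return torch.softmax(torch.tanh(self.lin2(x)), dim=-1)

criterion = nn.BCELoss()   # binary cross entropy against multi-hot labels
```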
Step 4: and taking the historical unsafe event record of the empty pipe dangerous source database as a model input sample to carry out model training, and iteratively updating model parameters, thereby realizing accurate empty pipe unsafe event classification guided by field knowledge.
The training parameters, training mode, and other attributes of the text classification model are set, and the model is trained on the training set. AdamW from the transformers library serves as the optimizer, with different learning rates for BERT and the downstream modules: an initial learning rate of 2e-5 for BERT and 1e-4 for the other modules. The batch size is 16, the number of training epochs is 5, and the maximum token-sequence length is 256. The dataset is randomly split into a training set and a test set at a ratio of 8:2; training on a server with a 13th Gen Intel Core i9-13900K 3.00 GHz CPU, an NVIDIA GeForce RTX 4090 GPU, and 128 GB RAM takes about 7.5 hours.
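The two-learning-rate setup can be written with parameter groups; torch.optim.AdamW is used here in place of the deprecated transformers.AdamW, the attribute name model.bert is an assumption about how the model exposes its encoder, and model, criterion, and train_loader are assumed to exist from the sketches above:

```python
from torch.optim import AdamW

bert_params = list(model.bert.parameters())            # model.bert: assumed name
bert_ids = {id(p) for p in bert_params}
head_params = [p for p in model.parameters() if id(p) not in bert_ids]

optimizer = AdamW([
    {"params": bert_params, "lr": 2e-5},   # BERT initial learning rate
    {"params": head_params, "lr": 1e-4},   # downstream modules
])

for epoch in range(5):                      # 5 training epochs, batch size 16
    for texts, labels in train_loader:      # token sequences capped at 256
        optimizer.zero_grad()
        loss = criterion(model(texts), labels.float())
        loss.backward()
        optimizer.step()
```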
In the present embodiment, 8,000 hazard-source records of the civil aviation administration from 2012 to 2022 are taken as an example, and the following evaluation metrics are adopted:
Multi-label classification accuracy: the frequency with which the prediction exactly matches the true label set. With n the number of samples, the accuracy is:
Accuracy = (1/n) · Σ_{i=1..n} 1(ŷ_i = y_i)
where ŷ_i is the predicted class set of the i-th sample, y_i is the true class set of the i-th sample, and 1(x) is the indicator function: 1 if x is true, and 0 otherwise.
Top-k accuracy: the fraction of instances whose k highest-scoring predicted labels contain a correct label:
P@k = (1/n) · Σ_{i=1..n} 1(y_i ∩ {ŷ_i^(1), ..., ŷ_i^(k)} ≠ ∅)
where ŷ_i^(j) is the class with the j-th highest prediction score for the i-th sample. Common values of k are 3 (denoted P@3) and 5 (denoted P@5).
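Both metrics are straightforward to compute from multi-hot label matrices and per-class scores; a sketch:

```python
import numpy as np

def subset_accuracy(y_true, y_pred):
    """Exact-match accuracy: a prediction counts only if all labels match."""
    return float(np.mean(np.all(y_true == y_pred, axis=1)))

def precision_at_k(y_true, scores, k=3):
    """P@k: a sample is a hit if any of its k highest-scoring predicted
    labels is a true label."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = [y_true[i, top_k[i]].any() for i in range(len(y_true))]
    return float(np.mean(hits))
```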
TABLE 1 Comparative experimental results of the invention and mainstream methods
Table 1 compares the method of the invention with mainstream text classification methods. As the table shows, the prediction model of the invention achieves higher precision and better prediction performance than mainstream text classification models.

Claims (10)

1. A domain-knowledge-guided air traffic control unsafe event classification method, characterized by comprising the following steps:
(1) Categorizing pre-acquired unsafe event text data and preprocessing the text data;
(2) Detecting and classifying terms in an unsupervised manner based on a byte pair encoding algorithm and a term-frequency method, and constructing a domain knowledge base of unsafe events;
(3) Constructing a deep learning text classification model to classify ATC unsafe event texts under the guidance of domain knowledge; the text classification deep learning model comprises an embedding module, a wide module, a deep module, and a classifier;
(4) Training the model with historical unsafe event records from the ATC hazard-source database as input samples, and iteratively updating the model parameters, thereby realizing accurate domain-knowledge-guided classification of ATC unsafe events.
2. The domain-knowledge-guided air traffic control unsafe event classification method according to claim 1, characterized in that the categorization of the pre-acquired unsafe event text data in step (1) is implemented as follows:
The unsafe event text data are divided into hazard source number, location unit, specialty, hazard source description, trigger, outcome, likelihood, severity, initial risk level, major hazard source, existing control measures, mitigation measures, regulatory unit, expected severity, expected likelihood, expected risk level, residual or derived risk, safety performance goal, and control status.
3. The domain-knowledge-guided air traffic control unsafe event classification method according to claim 1, characterized in that the preprocessing of the text data in step (1) is implemented as follows:
Invalid characters in the text data, including garbled characters, redundant spaces, and line breaks, are deleted; descriptions involving spatial entities are replaced with a uniform tag; specialized abbreviations in the hazard-source database are replaced with their corresponding full names; and the unsafe events corresponding to each unsafe-event outcome text are labeled, forming the ATC unsafe event classification dataset.
4. The domain-knowledge-guided air traffic control unsafe event classification method according to claim 1, characterized in that step (2) is implemented as follows:
The civil aviation regulation corpus is traversed with a byte pair encoding algorithm: the most frequent pair of adjacent Chinese characters is found and replaced with a new word mark; the search then continues for the most frequent adjacent pair, including character-character and character-mark combinations; these operations are repeated until a preset stopping condition is met, yielding a frequency-annotated segmentation dictionary for regulation texts; the regulation corpus is segmented with this dictionary;
if a term appears more frequently than other terms in the texts of some category, and less frequently in the texts of other categories, it is a term associated with that category; following this principle, the KF-IDF value is computed as:
KF-IDF(term, cat) = docs(term, cat) · log( K / (cats(term) + α) )
where docs(term, cat) is the number of documents in the category that contain the term, cats(term) is the number of categories in which the term appears, K is the number of categories, and α is a smoothing factor;
redundant items in the results obtained with KF-IDF values are further screened by the part of speech of the term phrases, and phrases in the list that do not meet the part-of-speech requirement are eliminated.
5. The domain-knowledge-guided air traffic control unsafe event classification method according to claim 1, characterized in that the embedding module in step (3) converts the input text and the domain knowledge texts into embedded vectors; the specific implementation is as follows:
a text sequence input {w_1, w_2, ...} is first tokenized by a tokenizer into a token sequence of length N, {t_1, t_2, ..., t_N}; the word-level embedded sequence S_w is then computed by BERT:
S_w = {e_w1, e_w2, ..., e_wN}, e_wn ∈ R^h
where e_wn is the embedded vector of the n-th token and h is the hidden-layer dimension of BERT;
each domain knowledge text is represented by the mean of its token embedding vectors, so that all domain knowledge texts are represented as one complete sequence;
for the set of M domain knowledge texts L = {P_1, P_2, ..., P_M}, the same tokenizer yields for P_m a token sequence of length l(m), {t_1^(m), ..., t_l(m)^(m)}; the embedding of P_m is then computed as:
e_lm = (1/l(m)) · Σ_{i=1..l(m)} e(t_i^(m))
when a sequence length differs from the target length, it is padded or truncated to length M; L is converted into the domain knowledge embedded sequence S_p:
S_p = {e_l1, e_l2, ..., e_lM}
where e_li is the phrase-level embedded vector of the i-th domain knowledge text; the matrix forms of S_w and S_p are M_w ∈ R^{N×h} and M_p ∈ R^{M×h}, respectively.
6. The domain-knowledge-guided air traffic control unsafe event classification method according to claim 1, characterized in that the wide module in step (3) memorizes and recalls the domain knowledge of each event label; the specific implementation is as follows:
a moving average is taken, token by token, over the embedded vectors of the input text; for the input text M_w and the domain knowledge embedded sequence M_p in matrix form, the query matrix Q_W and key matrix K_W are:
Q_W[n] = (1/(2r+1)) · Σ_{i=n-r..n+r} M_w[i], Q_W ∈ R^{N×h}
K_W = M_p
where r is the half window of the moving average and h is the dimension of the embedded vectors; the value matrix is the one-hot matrix of the categories of the domain knowledge texts:
V_W ∈ {0,1}^{M×K}
where K is the dimension of the one-hot label encoding, i.e., the number of categories; the n-th row of Q_W is the embedded vector of the n-th token of the input text, the m-th row of K_W is the phrase-level embedding of the m-th domain knowledge text, and the m-th row of V_W is the one-hot code of the category of the m-th domain knowledge text;
the attention score is computed with a regularized-compatibility attention mechanism:
B = (Q_W K_W^T)^T
A_W = softmax(B^T / √h) · V_W
where β_j, the j-th column vector of B, measures the element-wise compatibility between the token sequence and the domain knowledge text sequence;
max pooling is applied to the regularized-compatibility attention scores; from A_W, the probability that the input text belongs to the k-th class is:
w_k = max_n (A_W)_{n,k}
7. The domain-knowledge-guided air traffic control unsafe event classification method according to claim 1, characterized in that the deep module in step (3) comprises a deep text attention block and a deep domain attention block; the deep text attention block trains the model to attend to important words in a sentence through a self-attention module; the deep domain attention block trains the model to attend to important domain knowledge through a co-attention module; and the outputs of the stacked deep text attention blocks and deep domain attention blocks are combined to give the output of the deep module, realizing deep extraction of text information.
8. The domain-knowledge-guided air traffic control unsafe event classification method according to claim 7, characterized in that the deep text attention block processes the input text with a self-attention module; learnable parameters W_Q^X, W_K^X, and W_V^X are added to the attention mechanism, and the self-attention of the input text is computed:
Q_X = M_w·W_Q^X, K_X = M_w·W_K^X, V_X = M_w·W_V^X
A_X = Attention(Q_X, K_X, V_X)
residual connection and Layer Normalization (LN) are then applied to A_X:
A_X' = M_w + LayerNorm(A_X)
A_X' is then passed through a feed-forward layer for nonlinearity and enters the next deep text attention block; at the end of all deep text attention blocks, an average pooling is computed over the tokens of the input sequence:
d_X = (1/N) · Σ_{n=1..N} A_X'[n]
where d_X is the sentence-level embedding of the input text after self-attention processing, i.e., the output of the deep text attention block.
9. The domain-knowledge-guided air traffic control unsafe event classification method according to claim 7, characterized in that the deep domain attention block adds a learnable parameter to the attention mechanism, with the values being the embedded vectors of the domain knowledge texts; the input embedded sequence is matrix-multiplied by the parameter and the attention is then computed:
Q_D = M_w·W_Q^D, K_D = M_p·W_K^D, V_D = M_p
A_D = Attention(Q_D, K_D, V_D)
A_D' = M_p + LayerNorm(A_D)
an average pooling is likewise computed over A_D':
d_D = (1/M) · Σ_{m=1..M} A_D'[m]
the vector d_D is the output of the deep domain attention block; d_D also passes through a feed-forward layer for nonlinearity before entering the next deep domain attention block.
10. The domain-knowledge-guided air traffic control unsafe event classification method according to claim 1, characterized in that the classifier in step (3) is implemented as follows:
the output w of the wide module has dimension K, the number of labels; the deep module outputs the sentence-level deep text feature d, whose dimension is the embedding dimension h; the classifier integrates the two vectors and outputs the predicted probability of the category of the text; first, d is reduced in dimension and the two vectors are merged into one vector x:
d' = W_1·d + b_1
x = concat(w, d')
where W_1 and b_1 are the parameters of the linear layer that reduces d, and concat denotes vector concatenation; x is fed into a fully connected layer with tanh as activation function for classification, and the result is converted into predicted probabilities by a softmax function:
ŷ = softmax(tanh(W_2·x + b_2))
where W_2 and b_2 are the parameters of the fully connected layer;
the binary cross-entropy loss function between the predicted and true values is minimized:
L = - Σ_{k=1..K} [ y_k·log(ŷ_k) + (1 - y_k)·log(1 - ŷ_k) ]
where y is the true (multi-hot) class vector of the sample.
CN202410156217.5A 2024-02-04 2024-02-04 Domain knowledge guided air traffic control unsafe event classification method Pending CN118069840A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410156217.5A CN118069840A (en) 2024-02-04 2024-02-04 Domain knowledge guided air traffic control unsafe event classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410156217.5A CN118069840A (en) 2024-02-04 2024-02-04 Domain knowledge guided air traffic control unsafe event classification method

Publications (1)

Publication Number Publication Date
CN118069840A true CN118069840A (en) 2024-05-24

Family

ID=91103385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410156217.5A Pending CN118069840A (en) 2024-02-04 2024-02-04 Domain knowledge guided empty pipe unsafe event classification method

Country Status (1)

Country Link
CN (1) CN118069840A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114897167A (en) * 2022-05-13 2022-08-12 内蒙古大学 Method and device for constructing knowledge graph in biological field
CN115204140A (en) * 2022-06-22 2022-10-18 西安交通大学 Legal provision prediction method based on attention mechanism and knowledge graph

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Liu Tao (刘桃), "Automatic extraction of domain terms and its application in text classification" (领域术语自动抽取及其在文本分类中的应用), Acta Electronica Sinica (电子学报), 28 February 2007 (2007-02-28) *
Zeng Weili (曾维理), "Hierarchical Method for Mining a Prevailing Flight Pattern in Airport Terminal Airspace", Journal of Aerospace Information Systems, 4 July 2023 (2023-07-04) *
Cai Zhipeng (蔡志鹏), "Key feature extraction of hazard sources in air traffic control" (空中交通管制中的危险源关键特征提取), Aeronautical Computing Technique (航空计算技术), 31 December 2023 (2023-12-31) *

Similar Documents

Publication Publication Date Title
Feng et al. A small samples training framework for deep Learning-based automatic information extraction: Case study of construction accident news reports analysis
CN109684642B (en) Abstract extraction method combining page parsing rule and NLP text vectorization
US10755045B2 (en) Automatic human-emulative document analysis enhancements
CN106066866A (en) A kind of automatic abstracting method of english literature key phrase and system
CN114091460B (en) Multitasking Chinese entity naming identification method
CN113191148A (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN112487293B (en) Method, device and medium for extracting structured information of security accident case
Li et al. A method for resume information extraction using bert-bilstm-crf
CN112069307B (en) Legal provision quotation information extraction system
CN112527961A (en) Automatic extraction method for emergency response level of emergency plan and responsibility of administrative unit
CN111709225B (en) Event causal relationship discriminating method, device and computer readable storage medium
CN115238040A (en) Steel material science knowledge graph construction method and system
Ribeiro et al. Discovering IMRaD structure with different classifiers
CN112215002A (en) Electric power system text data classification method based on improved naive Bayes
CN115859980A (en) Semi-supervised named entity identification method, system and electronic equipment
Troxler et al. Actuarial applications of natural language processing using transformers: Case studies for using text features in an actuarial context
KR102563539B1 (en) System for collecting and managing data of denial list and method thereof
Ahmad et al. Machine and deep learning methods with manual and automatic labelling for news classification in bangla language
CN117852541A (en) Entity relation triplet extraction method, system and computer equipment
CN111522945A (en) Poetry style analysis method based on chi-square test
CN118069840A (en) Domain knowledge guided air traffic control unsafe event classification method
CN115544213A (en) Method, device and storage medium for acquiring information in text
CN115238093A (en) Model training method and device, electronic equipment and storage medium
Xu et al. Entity recognition in the field of coal mine construction safety based on a pre-training language model
CN109635046B (en) Protein molecule name analysis and identification method based on CRFs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination