
CN114757178A - Core product word extraction method, device, equipment and medium - Google Patents

Core product word extraction method, device, equipment and medium

Info

Publication number
CN114757178A
CN114757178A
Authority
CN
China
Prior art keywords
word
model
product
product word
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210243423.0A
Other languages
Chinese (zh)
Inventor
曾思亮
蔡子哲
包智
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qizhidao Network Technology Co Ltd
Original Assignee
Qizhidao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qizhidao Network Technology Co Ltd filed Critical Qizhidao Network Technology Co Ltd
Priority to CN202210243423.0A priority Critical patent/CN114757178A/en
Publication of CN114757178A publication Critical patent/CN114757178A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0282 Rating or review of business operators or products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/0601 Electronic shopping [e-shopping]
    • G06Q30/0623 Item investigation
    • G06Q30/0625 Directed, with specific intent or strategy

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a core product word extraction method, device, equipment and medium, wherein the core product word extraction method comprises the following steps: acquiring an enterprise information text, wherein the enterprise information text comprises at least one effective sentence; processing the enterprise information text by a preprocessing method to obtain model input sentences conforming to a preset format; based on the trained product word entity extraction model, acquiring the predicted product word entities corresponding to the model input sentences, and cleaning the predicted product word entities to obtain effective product word entities; training at least two word classification models based on the effective product word entities and the product word feature dimension data corresponding to the effective product word entities to obtain a fused word scoring model, and extracting the core product words that meet a word score threshold through the scoring model. The method ensures that real and effective core product words are extracted while improving the recognition accuracy and recognition efficiency of core product words.

Description

Core product word extraction method, device, equipment and medium
Technical Field
The invention relates to the technical field of language information processing, in particular to a method, a device, equipment and a medium for extracting core product words.
Background
An enterprise portrait can be used to construct a multi-level enterprise knowledge graph for scenarios such as smart cities, financial supervision, enterprise intelligence and enterprise assessment, and to deeply mine the complex network relationships among enterprises, senior management, legal representatives, products and industrial chains, thereby providing enterprises with comprehensive services such as public opinion monitoring and precision marketing. Most enterprises use products as the medium for exchanging value with users and thereby creating commercial value, so an enterprise's main products are an important reference basis for constructing its enterprise portrait.
With the vigorous development of the internet, enterprise-related information of varying authenticity increasingly appears across multiple internet channels. How to extract real and effective core product information from the various pieces of information related to an enterprise has become an urgent problem.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a medium for extracting core product words, which are used for solving the problem of extracting real and effective core product information from various information related to enterprises.
A core product word extraction method comprises the following steps:
acquiring an enterprise information text, and processing the enterprise information text by adopting a preprocessing method to acquire a model input statement conforming to a preset format;
labeling the model input sentences to train a deep-learning-based product word entity extraction model;
based on the product word entity extraction model, acquiring the predicted product word entities corresponding to the model input sentences, and cleaning the predicted product word entities to obtain effective product word entities;
training at least two word classification models based on the effective product word entities and the product word feature dimension data corresponding to the effective product word entities to obtain a fused word score model, and extracting core product words meeting a word score threshold value through the score model.
A core product word extraction device, comprising:
the enterprise information text acquisition module is used for acquiring an enterprise information text and processing the enterprise information text by a preprocessing method to obtain model input sentences conforming to a preset format;
the entity extraction model acquisition module is used for labeling the model input sentences to train a deep-learning-based product word entity extraction model;
the effective product word entity acquisition module is used for acquiring, based on the product word entity extraction model, the predicted product word entities corresponding to the model input sentences, and cleaning the predicted product word entities to obtain effective product word entities;
and the core product word obtaining module is used for training at least two word classification models based on the effective product word entities and the product word characteristic dimension data corresponding to the effective product word entities so as to obtain a fused word scoring model, and extracting the core product words meeting a word score threshold value through the scoring model.
An apparatus comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the above-described core product word extraction method when executing the computer program.
A computer-readable medium, in which a computer program is stored, which computer program, when being executed by a processor, carries out the above-mentioned core product word extraction method.
According to the core product word extraction method, device, equipment and medium, a deep learning model is used to extract predicted product word entities from enterprise information texts, and the extracted predicted product word entities are then cleaned and scored, so that the core product words related to an enterprise are finally obtained. This ensures that real and effective core product words are extracted, improves the recognition accuracy and efficiency of core product word extraction, and helps to build a reliable enterprise portrait from accurate core product words.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic diagram of an application environment of a core product word extraction method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for extracting core product words according to an embodiment of the present invention;
FIG. 3 is a first flowchart of a method for extracting core product words according to an embodiment of the present invention;
FIG. 4 is a second flowchart of a method for extracting core product words according to an embodiment of the present invention;
FIG. 5 is a third flowchart of a method for extracting core product words according to an embodiment of the present invention;
FIG. 6 is a fourth flowchart of a method for extracting core product words according to an embodiment of the present invention;
FIG. 7 is a diagram of a core product word extraction apparatus according to an embodiment of the present invention;
fig. 8 is a schematic diagram of an apparatus in an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
The core product word extraction method provided by the embodiment of the invention can be applied to the application environment shown in fig. 1, in a core product word extraction system comprising a client and a server that communicate through a network. The client, also called the user terminal, refers to the program corresponding to the server that provides local services for the user. The client can be installed on devices such as, but not limited to, personal computers, notebook computers, smart phones, tablet computers and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.
In an embodiment, as shown in fig. 2, a method for extracting core product words is provided, which is described by taking the server in fig. 1 as an example, and specifically includes the following steps:
S10, acquiring an enterprise information text, and processing the enterprise information text by a preprocessing method to obtain model input sentences conforming to a preset format.
Enterprise information refers to information related to the registration and operation of an enterprise, and generally includes: the company name, company location, legal representative, registered capital, business scope, business qualifications, number of staff, company website, contact information, company introductions published through open channels, and so on. The enterprise information text is the text formed from the above information.
In this embodiment, the enterprise information text is used as the input for training the deep learning model, and sentences conforming to the preset format allow model training to be completed quickly and effectively.
The preset format covers Chinese word segmentation, lemmatization, stemming, part-of-speech tagging, stop word removal and vector space representation of the enterprise information text. The methods used to make the sentences conform to the preset format are also preprocessing methods, for example a BERT (Bidirectional Encoder Representations from Transformers) text classification model for filtering invalid texts, NLPIR Chinese word segmentation software, Jieba word segmentation, and the like, which are not specifically limited herein.
A model input sentence is a sentence that conforms to the preset format, obtained by processing the enterprise information text with the preprocessing method.
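As a concrete illustration of such a preprocessing step, the following is a minimal sketch, not the patented implementation: it splits raw enterprise text into candidate sentences, segments them with the Jieba tokenizer and removes stop words so that each model input sentence follows one uniform space-separated format. The stop-word set and the separator list are placeholder assumptions.

```python
# Minimal preprocessing sketch (illustrative only, not the patent's implementation).
# Assumes the `jieba` tokenizer is installed; the stop-word set is a placeholder.
import re
import jieba

STOP_WORDS = {"的", "了", "和", "是", "在"}  # placeholder stop-word list

def preprocess(raw_text: str) -> list[str]:
    """Turn raw enterprise text into tokenized sentences in a uniform format."""
    # Normalize by removing whitespace noise from the extracted text.
    text = re.sub(r"\s+", "", raw_text)
    # Split into candidate sentences on Chinese end-of-sentence punctuation.
    sentences = [s for s in re.split(r"[。！？；]", text) if s]
    model_inputs = []
    for sent in sentences:
        tokens = [t for t in jieba.lcut(sent) if t not in STOP_WORDS]
        if tokens:                      # drop empty or garbage fragments
            model_inputs.append(" ".join(tokens))
    return model_inputs

# Example: preprocess("某某公司主营LED电视、液晶显示器的研发与销售。")
```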
Specifically, with the rapid development of the information age, users can acquire information related to a given enterprise through various online channels, but such information is accompanied by erroneous and interfering content, which makes it difficult to extract the enterprise's real core products from it.
This embodiment focuses on obtaining the real core product words related to an enterprise, so unordered text that does not contain core product words, contains many sentence pattern errors, or contains no punctuation marks can be filtered out as interference sentence patterns by the preprocessing method.
Sentence patterns other than the interference sentence patterns include the main-business sentence pattern, the product word sentence pattern and the like, which can be used as effective sentence patterns and further optimized into high quality model input sentences.
S20, labeling the model input sentences to train a deep-learning-based product word entity extraction model.
In this embodiment, the deep-learning-based product word entity extraction model is a model used to extract product word entities from the enterprise information text. It can be obtained by training any of several mainstream deep learning models, including CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory network), RNTN (Recursive Neural Tensor Network), BERT, GAN (Generative Adversarial Network) and the like, which are not specifically limited herein.
Specifically, labeling the model input sentences is essentially the process of dividing each model input sentence into smaller units, i.e. single characters or terms, based on part-of-speech tagging, a word segmentation algorithm, an entity recognition algorithm and the like. Entity recognition identifies entities with specific meanings in the text, mainly including person names, place names, institution names, proper nouns, product words and so on.
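For concreteness, product word entity labeling of this kind is commonly expressed with a character-level BIO tagging scheme. The sketch below is an illustration rather than the patent's labeling tool; the tag names and the example sentence are assumptions.

```python
# Illustrative BIO annotation for product word entities (assumed tag scheme).
# B-PROD marks the first character of a product word, I-PROD the rest, O everything else.
sentence = "本公司主要生产LED电视和液晶显示器"
labels   = ["O", "O", "O", "O", "O", "O", "O",
            "B-PROD", "I-PROD", "I-PROD", "I-PROD", "I-PROD",
            "O",
            "B-PROD", "I-PROD", "I-PROD", "I-PROD", "I-PROD"]

assert len(sentence) == len(labels)  # one tag per character for Chinese NER
```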
S30, based on the product word entity extraction model, obtaining a predicted product word entity corresponding to the model input sentence, and cleaning the predicted product word entity to obtain an effective product word entity.
The predicted product word entity is a product word set obtained by performing information extraction on the enterprise information text through the trained product word entity extraction model.
The effective product word entities are the set of effective product words obtained by filtering out, from the predicted product word entities, those that do not satisfy word sense integrity or word sense correctness or that contain other errors.
Specifically, determining semantic integrity may be accomplished using mainstream semantic disambiguation algorithms such as dictionary-based word sense disambiguation, rule-based word sense disambiguation, supervised word sense disambiguation, and unsupervised and semi-supervised word sense disambiguation.
In this embodiment, the effective product word entities are obtained by cleaning the predicted product word entities, which effectively reduces the interference of erroneous product words on the core product words and improves the accuracy of the finally obtained core product words. It also reduces the time spent processing invalid product word entities, which improves the efficiency of the subsequent further optimization of the effective product word entities and the overall performance.
And S40, training at least two word classification models based on the effective product word entities and the product word feature dimension data corresponding to the effective product word entities to obtain a fused word score model, and extracting core product words meeting a word score threshold value through the score model.
The product word feature dimension data is obtained by performing multi-dimensional feature mining on the effective product word entities, for example mining word component features, word criticality features and word source credibility features, and then constructing features from the mined characteristics to form multi-dimensional feature information about the effective product word entities.
The word classification models include the LR (Logistic Regression) classifier, the SVM (Support Vector Machine) and classification algorithms such as random forest, which are not specifically limited herein.
The word score threshold is the lowest score, calculated by the fused word scoring model, at which a word can be classified as a core product word.
Step S40 of this embodiment scores the effective product word entities through the fused word scoring model and finally obtains the core product words that meet the word score threshold, further improving the accuracy of the obtained core product words.
According to this core product word extraction method, a deep learning model is used to extract predicted product word entities from the enterprise information texts, and the extracted predicted product word entities are then cleaned and scored, so that the core product words related to an enterprise are finally obtained; this ensures that real and effective core product words are extracted, improves the recognition accuracy and efficiency of core product word extraction, and helps to build a reliable enterprise portrait from accurate core product words.
In a specific embodiment, as shown in fig. 3, before step S10, that is, before acquiring the enterprise information text, the method further specifically includes the following steps:
S101, segmenting the enterprise information text at sentence granularity to obtain at least one sentence to be processed.
The sentences to be processed are the single sentences obtained by segmenting the enterprise information text with a sentence segmentation algorithm, taking the sentence as the unit.
S102, acquiring the invalid sentence patterns among all sentences to be processed based on a sentence pattern analysis model.
Each segmented single sentence to be processed corresponds to a sentence pattern category, for example a main-business sentence pattern, a product word sentence pattern or an interference sentence pattern. In this embodiment, a BERT-based text classification model may be used as the sentence pattern analysis model to obtain the sentence pattern category of each single sentence, which improves the training speed and precision of the subsequent NER (Named Entity Recognition) model.
Further, the NER model is mainly obtained by training a deep learning model on a large-scale labeled corpus; the trained model then performs sequence decoding on the test corpus to obtain the named entities, that is, the predicted product word entities in this embodiment.
An invalid sentence pattern is unordered text in the sentences to be processed that contains many sentence pattern errors or lacks punctuation marks.
Specifically, the commonly used text classification models are the Tf-Idf-based bag-of-words model, the Word2Vec-based word embedding model, the CNN-based deep learning model, the BERT-based pre-trained language model and the like. Since its appearance, BERT has performed excellently on a variety of downstream tasks, and an optimized prediction effect can be achieved simply by fine-tuning the model.
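As an illustration of how such a sentence pattern classifier could be applied once fine-tuned, the snippet below is a sketch under assumed label ids and a placeholder checkpoint path, not the patent's trained model: it loads a BERT sequence classification checkpoint and keeps only the sentences whose predicted class is not the interference pattern.

```python
# Sketch: filter interference sentence patterns with a fine-tuned BERT classifier.
# Assumes HuggingFace `transformers` and a locally fine-tuned checkpoint (placeholder path).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "path/to/sentence-pattern-classifier",  # placeholder: fine-tuned on labeled sentence patterns
    num_labels=3,                           # assumed: 0=main-business, 1=product-word, 2=interference
)
model.eval()

INTERFERENCE = 2  # assumed label id for interference sentence patterns

def filter_valid(sentences: list[str]) -> list[str]:
    valid = []
    for sent in sentences:
        inputs = tokenizer(sent, return_tensors="pt", truncation=True, max_length=128)
        with torch.no_grad():
            logits = model(**inputs).logits
        if int(logits.argmax(dim=-1)) != INTERFERENCE:
            valid.append(sent)          # keep main-business and product-word sentences
    return valid
```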
S103, filtering the invalid sentences in the enterprise information text so that each remaining sentence to be processed is an effective sentence.
The effective sentences are the sentences of the enterprise information text from which the interference sentence patterns have been removed.
Through steps S101 to S103, the enterprise information text is finally processed into text containing only effective sentences, which greatly improves the efficiency of the subsequent processing of the effective sentences.
In one embodiment, the enterprise information text includes at least one effective sentence, with the invalid sentence patterns filtered out. In step S10, processing the enterprise information text with a preprocessing method to obtain model input sentences conforming to a preset format specifically includes the following step:
segmenting the effective sentences with a length grouping segmentation algorithm and a semantic integrity algorithm to obtain model input sentences that conform to the preset length and retain semantic integrity.
The length grouping segmentation algorithm is an algorithm used to meet the training text length requirement when constructing the training data set, for example a short message segmentation algorithm or an automatic text segmentation algorithm, which are not limited herein.
The semantic integrity algorithm is an algorithm for keeping the semantics of the model input sentences complete during construction of the training data set; whether a model input sentence is semantically complete is judged through semantic analysis, lexical analysis, syntactic analysis, contextual analysis and the like.
The model input sentence is the sentence obtained after the effective sentence has been processed by the length grouping segmentation algorithm and the semantic integrity algorithm, so that it conforms to the preset length and retains semantic integrity, which improves the processing efficiency of the deep-learning-based product word entity extraction model.
Specifically, in this embodiment, several specified separators are set to segment the text into minimal text segmentation units. While keeping the order of the preceding and following sentences, the model input sentences are recombined so that the length of a recombined sentence does not exceed the maximum length supported by the model, which prevents part of the prediction information from being lost; meanwhile, the semantic integrity of the model input sentences is kept through a syntactic analysis algorithm. In particular, when the length restriction conflicts with the semantic integrity algorithm, a text compression algorithm (which removes part of the content without destroying the original semantics) is used to compress the minimal text segmentation units.
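A minimal sketch of this length-grouping step is given below, assuming a maximum model length and a set of Chinese clause separators (both assumptions, and the text-compression fallback is not implemented): the effective sentence is split into minimal units on the specified separators and consecutive units are greedily regrouped without exceeding the length limit, which preserves the original order.

```python
# Sketch: split on clause separators, then greedily regroup under a max length
# while preserving sentence order (illustrative; MAX_LEN is an assumed model limit).
import re

MAX_LEN = 126                       # assumed usable length, e.g. BERT's 128 minus [CLS]/[SEP]
SEPARATORS = r"([，、；：。！？])"    # assumed clause-level separators (kept in the output)

def regroup(effective_sentence: str) -> list[str]:
    parts = [p for p in re.split(SEPARATORS, effective_sentence) if p]
    chunks, current = [], ""
    for part in parts:
        if len(current) + len(part) <= MAX_LEN:
            current += part
        else:
            if current:
                chunks.append(current)
            current = part[:MAX_LEN]  # crude fallback if a single clause is over-long
    if current:
        chunks.append(current)
    return chunks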
In an embodiment, in step S20, training the deep-learning-based product word entity extraction model by labeling the model input sentences includes the following step:
training a BERT-BiLSTM-CRF model with the model input sentences and their corresponding labeled model input sentences, and obtaining the product word entity extraction model, wherein the BERT-BiLSTM-CRF model comprises: a BERT pre-training model layer, a BiLSTM (Bi-directional Long Short-Term Memory) network layer and a CRF (Conditional Random Field) probability distribution layer.
Specifically, there are many mainstream methods for entity extraction; the combination of BERT + BiLSTM + CRF improves the model training result. BERT + bidirectional LSTM + CRF is used as the sequence tagger, and entity extraction is essentially implemented through sequence labeling:
BERT pre-training model layer: it greatly improves the efficiency of natural language processing tasks and addresses the polysemy problem in text feature representation. During pre-training, the BERT layer captures word-level and sentence-level representations with the Masked LM and Next Sentence Prediction objectives, respectively.
BiLSTM network layer: in the sequence labeling task, effective historical information and future information can be mined, so the model uses a bidirectional LSTM network structure. The BiLSTM model can fully utilize historical information through the forward states at each time step and future information through the backward states. The learning parameters of the BiLSTM model are updated with the back-propagation through time (BPTT) algorithm; the difference from a general model in the forward and backward passes is that the hidden layer is computed over all time steps.
Because the BiLSTM independently classifies the scores at each position in the sequence, it predicts the probability that each token belongs to each label and then takes the label with the highest Softmax probability as the prediction for that position, but it cannot take the information between adjacent labels into account.
CRF layer (output layer): the CRF addresses this shortcoming of the BiLSTM. The last layer of the model uses a conditional random field for sentence-level sequence labeling, so that the model can consider the correlation between class labels.
The CRF layer can add constraints to the final predicted labels to ensure that they are valid; these constraints are learned automatically by the CRF layer from the training data set during training. The CRF probability distribution layer considers the information between adjacent labels at the sentence level. The correlation among labels is the transition matrix in the CRF, which represents the probability of transitioning from one state to another; a globally optimal label sequence is obtained so that the words most likely to be predicted product word entities can be effectively extracted from enterprise information texts from different source channels.
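The following is a compact sketch of such a BERT-BiLSTM-CRF tagger in PyTorch. It illustrates the architecture described above rather than the patent's trained model, and it assumes the HuggingFace `transformers` package for BERT and the third-party `pytorch-crf` package (import name `torchcrf`) for the CRF layer.

```python
# Sketch of a BERT-BiLSTM-CRF sequence tagger (illustrative architecture only).
# Assumes: `transformers` for BERT, `pytorch-crf` for the CRF layer.
import torch
import torch.nn as nn
from transformers import BertModel
from torchcrf import CRF

class BertBiLSTMCRF(nn.Module):
    def __init__(self, num_tags: int, lstm_hidden: int = 256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * lstm_hidden, num_tags)   # emission scores per tag
        self.crf = CRF(num_tags, batch_first=True)       # learns the tag transition matrix

    def forward(self, input_ids, attention_mask, tags=None):
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.lstm(hidden)
        emissions = self.fc(lstm_out)
        mask = attention_mask.bool()
        if tags is not None:                  # training: negative log-likelihood loss
            return -self.crf(emissions, tags, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)   # inference: globally optimal tag sequence
```

At inference time, `decode` performs Viterbi decoding over the learned transition matrix, which is exactly the "globally optimal sequence" role the CRF layer plays in the description above.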
In an embodiment, as shown in fig. 4, in step S30, cleaning the predicted product word entities to obtain the effective product word entities includes the following steps:
S31, taking the predicted product word entities with missing semantics as target repair words, and repairing the target repair words by comparison with the preceding and following text to obtain the effective product word entities.
The effective product word entities are the product word entities, essentially free of missing semantics, obtained by cleaning the predicted product word entities.
Specifically, in this embodiment, after comparing the preceding and following paragraphs, a semantic integrity algorithm is used to judge, through semantic analysis, lexical analysis, syntactic analysis, contextual analysis and the like, whether a predicted product word entity is a semantically complete word; invalid product words are removed from the set of predicted product words, and effective words with missing semantics are repaired, which improves the efficiency of the subsequent optimization of the effective product word entities. In particular, words with abbreviation or out-of-order input errors can be corrected by comparison with other related words in the context, and spelling input errors can be corrected with a spelling correction algorithm that compares the words with the preceding and following text.
S32, based on the position of the predicted product word entity in the model input sentence, removing the predicted product word entities that are misaligned or erroneous by comparison with a preset word bank.
The preset word bank is a collection of common enterprise products, used to check whether a predicted product word entity is misaligned or erroneous.
Specifically, when the input text contains an incomplete sentence or a sentence on which the model's recognition effect is unexpected, the predicted product word entity result may contain words with word-breaking errors; such product words with misalignment errors need to be removed or repaired by comparison with the preset word bank, which improves the efficiency of the subsequent optimization of the effective product word entities.
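A simplified sketch of this cleaning step is shown below; the preset word bank contents, the 0.8 similarity cutoff and the repair strategy are all illustrative assumptions. Predicted product words are kept if they appear in the word bank, repaired to the closest entry when they are near misses (e.g. truncated by a word-breaking error), and dropped otherwise.

```python
# Sketch: clean predicted product word entities against a preset word bank of products.
# The word bank contents and the 0.8 similarity cutoff are illustrative assumptions.
from difflib import get_close_matches

PRESET_WORD_BANK = ["LED电视", "液晶显示器", "智能手机", "平板电脑"]  # placeholder word bank

def clean_entities(predicted: list[str]) -> list[str]:
    valid = []
    for word in predicted:
        if word in PRESET_WORD_BANK:
            valid.append(word)                       # already a known product word
            continue
        repaired = get_close_matches(word, PRESET_WORD_BANK, n=1, cutoff=0.8)
        if repaired:
            valid.append(repaired[0])                # repair near-miss / truncated words
        # otherwise: treat as misaligned or erroneous and drop it
    return valid

# clean_entities(["LED电", "液晶显示器", "咨询服"])  ->  ["LED电视", "液晶显示器"]
```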
In an embodiment, after the step S30, that is, after the valid product word entity is obtained, the method further includes the following steps:
analyzing the dimension characteristics of the effective product word entities and constructing the product word feature dimension data corresponding to the effective product word entities, wherein the dimension characteristics include at least one of word components, word criticality, word source credibility, word original-text position information and word aggregation degree.
Word components: the general arrangement of components in a complete Chinese sentence is attributive (modifying the subject) + subject + predicate + complement + attributive (modifying the object) + object, and so on.
Word criticality: the tf-idf (term frequency-inverse document frequency) value of a word in the corpus, used to evaluate the importance of the word to a document in the corpus or to the corpus as a whole.
Word source credibility: the comprehensive credibility score of the enterprise information text in which the word appears.
Word original-text position information: the original position of the word in the enterprise information text.
Word aggregation degree: the average distance of each word from the centers of the clusters formed over the word set.
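As one deliberately simplified illustration of building such dimension features, the sketch below computes the word-criticality feature with scikit-learn's TfidfVectorizer and attaches an assumed source-credibility score and a relative original-text position; the remaining dimensions would be added analogously. The credibility table and function signature are assumptions, not the patent's feature set.

```python
# Sketch: build a few product word feature dimensions (illustrative only).
# Assumes scikit-learn; SOURCE_CREDIBILITY values are placeholder scores.
from sklearn.feature_extraction.text import TfidfVectorizer

SOURCE_CREDIBILITY = {"official_site": 1.0, "news": 0.8, "forum": 0.4}  # assumed scores

def build_features(word, documents, doc_index, source, text):
    """Return a small feature dict for one effective product word entity."""
    vectorizer = TfidfVectorizer(analyzer=lambda d: d.split())  # documents are pre-tokenized
    tfidf = vectorizer.fit_transform(documents)
    vocab = vectorizer.vocabulary_
    criticality = tfidf[doc_index, vocab[word]] if word in vocab else 0.0
    return {
        "word": word,
        "criticality": float(criticality),                       # tf-idf importance
        "source_credibility": SOURCE_CREDIBILITY.get(source, 0.5),
        "position": text.find(word) / max(len(text), 1),          # relative original position
    }
```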
Specifically, this embodiment also describes the process of constructing the product word feature dimension data corresponding to the effective product word entities, i.e. a feature engineering process. Feature engineering refers to screening better data features from the raw data through a series of engineering steps to improve the training effect of the model. Good product word feature dimension data helps the model and the algorithm play a greater role.
Specifically, extracting product word feature dimension data (hereinafter referred to as data) corresponding to an effective product word entity mainly includes the following steps:
1. Understanding the data: understand the data by visualizing it, for example: whether the data follows a certain distribution, whether data is missing or abnormal, and so on.
2. Data cleaning: if missing or abnormal data is encountered, decide whether to filter it out or fill it in; check whether the sample sizes of the classes are balanced, since unbalanced sample sizes may cause overfitting; finally check whether the units are consistent, for example: if both meters and centimeters appear, unify them as centimeters.
3. Feature extraction: after the data is cleaned, extract its quantitative features, which mainly fall into four types:
1) numerical data; 2) label and description data (label quantization), for example: sex (male-1, female-0); 3) unstructured data, i.e. text, from which user portrait features and sentiment are extracted through natural language processing; 4) network relationship data (social networks), used to mark relationships between users, such as family, friends, colleagues and so on.
4. Feature screening: screen the extracted features using indicators such as coverage, information value (IV) and stability.
5. Generating the data set: divide the prepared data into a training set, a test set and a validation set.
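A minimal way to perform the final split in step 5, assuming scikit-learn and an 8:1:1 ratio (the ratio is an assumption; the method only requires a preset proportion), is sketched below.

```python
# Sketch: split prepared feature data into training / test / validation sets (assumed 8:1:1).
from sklearn.model_selection import train_test_split

def three_way_split(X, y, seed=42):
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=seed)
    X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=seed)
    return (X_train, y_train), (X_test, y_test), (X_val, y_val)
```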
In a specific embodiment, as shown in fig. 5, in step S40, that is, based on the valid product word entities and the product word feature data corresponding to the valid product word entities, training at least two word classification models to obtain a fused word score model specifically includes the following steps:
S41, labeling the word types of the effective product word entities, and using the word types as the training labels of each word classification model.
The word type is a category of product relevance, for example a primary core product, a secondary core product, a general product, a secondary product, a non-core product or an interference product. For example, when words such as 'television', 'LED television', 'computer' and 'consultation service' are scored by the scoring model, 'television' and 'LED television' receive higher scores and are core product words, 'computer' is a general product word, and 'consultation service' is an interference product word.
S42, dividing the product word feature data into a training data set and a test data set according to a preset proportion, feeding the training data into each word classification model, and obtaining the trained classification models based on the training labels, wherein the word classification models comprise at least two of an LR model, an SVM model and a random forest model.
The preset proportion can be set, for example, as follows: when the data volume is small, split the training and test data 7:3; typically about 2/3 to 4/5 of the sample data is used for training and the remaining samples for testing, which is not specifically limited herein.
A trained classification model is obtained by training a word classification model on the training data set drawn from the product word feature data.
The LR model is a logistic regression model; assuming the data obeys a Bernoulli distribution, it solves the model parameters by gradient descent via maximum likelihood, thereby achieving binary classification of the data.
The SVM (support vector machine) is a binary classification model whose basic form is the linear classifier with the largest margin in feature space; with kernel tricks it becomes, in essence, a non-linear classifier. The learning algorithm of the SVM is an optimization algorithm for solving convex quadratic programming.
The random forest model is a classifier that trains multiple decision trees and combines them to predict a sample.
S43, testing the accuracy of each trained classification model by using the test set data to determine the score weight of each trained classification model.
And S44, forming a fused word scoring model based on each trained classification model and the corresponding score weight.
The fused word scoring model is a model for scoring effective product word entities by fusing a plurality of trained word classification models.
Specifically, this embodiment can determine, from the product word feature data, the conditional probability that each candidate product word in the candidate product word set belongs to a core product word, and determine the candidate product words whose conditional probability exceeds the word score threshold as the core product words related to the enterprise.
According to the embodiment, the core product words with high scores can be extracted by training the score model fused with classification algorithms such as LR, SVM and random forest, and the noise product words with low scores are removed.
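The sketch below illustrates one way such a fused scoring model could be assembled. It is an interpretation of steps S41 to S44 under assumed data shapes (feature matrix X, binary labels y) and an assumed threshold, not the patented implementation: an LR, an SVM and a random forest classifier are each trained, weighted by their test accuracy, and their weighted average core-product probability is compared with the word score threshold.

```python
# Sketch: fuse LR, SVM and random forest into a weighted word scoring model
# (illustrative; feature matrix X, binary labels y and THRESHOLD are assumptions).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

THRESHOLD = 0.7  # assumed word score threshold for core product words

def build_fused_scorer(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    models = [
        LogisticRegression(max_iter=1000),
        SVC(probability=True),               # probability=True enables predict_proba
        RandomForestClassifier(n_estimators=200),
    ]
    weights = []
    for m in models:
        m.fit(X_tr, y_tr)
        weights.append(m.score(X_te, y_te))  # test accuracy becomes the score weight
    weights = np.array(weights) / sum(weights)

    def score(X_new):
        probs = np.stack([m.predict_proba(X_new)[:, 1] for m in models], axis=1)
        return probs @ weights               # weighted probability of being a core product word

    return score

# core_words = [w for w, s in zip(words, build_fused_scorer(X, y)(X_words)) if s >= THRESHOLD]
```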
According to this core product word extraction method, a deep learning model is used to extract predicted product word entities from the enterprise information texts, and the extracted predicted product word entities are then cleaned and scored, so that the core product words related to an enterprise are finally obtained; this ensures that real and effective core product words are extracted, improves the recognition accuracy and efficiency of core product word extraction, and helps to build a reliable enterprise portrait from accurate core product words.
Further, as shown in fig. 6, in the core product word extraction method provided by this embodiment, the enterprise information text is finally processed into text containing only effective sentences, which greatly improves the efficiency of subsequent processing; the effective sentences are segmented and regrouped while maintaining the correct order and semantic integrity of the preceding and following text; with the BERT-BiLSTM-CRF model, the words most likely to be product word entities can be effectively extracted from enterprise information texts from different source channels; good product word feature dimension data allows the model and the algorithm to play a greater role; and by training the scoring model that fuses classification algorithms such as LR, SVM and random forest, high-scoring core product words are extracted and low-scoring noise product words are removed.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by functions and internal logic of the process, and should not limit the implementation process of the embodiments of the present invention in any way.
In an embodiment, a core product word extraction device is provided, and the core product word extraction device corresponds one-to-one to the core product word extraction method in the above embodiment. As shown in fig. 7, the core product word extraction device includes an enterprise information text acquisition module 10, an entity extraction model acquisition module 20, an effective product word entity acquisition module 30 and a core product word acquisition module 40. The detailed description of each functional module is as follows:
The enterprise information text acquisition module 10 is configured to acquire an enterprise information text and process the enterprise information text with a preprocessing method to obtain model input sentences conforming to a preset format.
The entity extraction model acquisition module 20 is configured to train the deep-learning-based product word entity extraction model through the labeled model input sentences.
The effective product word entity acquisition module 30 is configured to acquire, based on the product word entity extraction model, the predicted product word entities corresponding to the model input sentences, and to clean the predicted product word entities to obtain the effective product word entities.
And the core product word obtaining module 40 is configured to train at least two word classification models based on the effective product word entities and the product word feature dimension data corresponding to the effective product word entities to obtain a fused word scoring model, and extract the core product words meeting a word score threshold through the scoring model.
Preferably, the core product word extraction device further comprises a to-be-processed sentence acquisition module, an invalid sentence acquisition module and an invalid sentence filtering module, and the detailed description of each functional module is as follows:
The to-be-processed sentence acquisition module is used for segmenting the enterprise information text at sentence granularity to obtain at least one sentence to be processed.
And the invalid sentence pattern acquisition module is used for acquiring invalid sentence patterns in all the sentences to be processed based on the sentence pattern analysis model.
And the invalid sentence filtering module is used for filtering the invalid sentences in the enterprise information text so as to enable each remaining sentence to be processed to be an effective sentence.
Preferably, the module for obtaining model input sentences 10 includes a sub-module for obtaining model input sentences, and the functional modules are described in detail as follows:
and the obtaining model input statement submodule is used for adopting a length grouping cutting algorithm and a semantic integrity algorithm to cut the effective statements so as to obtain the model input statements which accord with the preset length and keep the semantic integrity.
Preferably, the module for obtaining an entity extraction model 20 includes a sub-module for training a BERT-BiLSTM-CRF model, and the functional modules are detailed as follows:
The BERT-BiLSTM-CRF model training sub-module is used for training a BERT-BiLSTM-CRF model with the model input sentences and their corresponding word labeling data, and obtaining the product word entity extraction model, wherein the BERT-BiLSTM-CRF model comprises: a BERT pre-training model layer, a BiLSTM network layer and a CRF probability distribution layer.
Preferably, the module 30 for obtaining valid product words includes a sub-module for obtaining valid product words and a sub-module for removing predicted product words, and the functional modules are described in detail as follows:
The effective product word entity obtaining sub-module is used for taking the predicted product word entities with missing semantics as target repair words and repairing the target repair words by comparison with the preceding and following text to obtain the effective product word entities.
And the predicted product word entity removing submodule is used for removing the predicted product word entities with dislocation or error by comparing the preset word bank based on the positions of the predicted product word entities in the model input sentences.
Preferably, the core product word extraction device further comprises an analysis dimension feature module, and the functional modules are described in detail as follows:
The dimension characteristic analysis module is used for analyzing the dimension characteristics of the effective product word entities and constructing the product word feature dimension data corresponding to the effective product word entities, wherein the dimension characteristics include at least one of word components, word criticality, word source credibility, word original-text position information and word aggregation degree.
Preferably, the module 40 for obtaining core product words includes a sub-module for labeling word types, a sub-module for obtaining trained classification models, a sub-module for determining score weights, and a sub-module for forming word score models, and the functional modules are described in detail as follows:
The word type labeling submodule is used for labeling the word types of the effective product word entities and using the word types as the training labels of each word classification model.
The trained classification model obtaining submodule is used for dividing the product word feature data into a training data set and a test data set according to a preset proportion, feeding the training data into each word classification model, and obtaining the trained classification models based on the training labels, wherein the word classification models include at least two of an LR model, an SVM model and a random forest model.
The score weight determining sub-module is used for testing the accuracy of each trained classification model with the test data to determine the score weight of each trained classification model.
The word scoring model forming submodule is used for forming the fused word scoring model based on each trained classification model and its corresponding score weight.
For specific limitations of the core product word extraction device, reference may be made to the above limitations of the core product word extraction method, which are not described herein again. All or part of each module in the core product word extraction device can be realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the device, and can also be stored in a memory in the device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a device is provided, which may be a server, and the internal structure thereof may be as shown in fig. 8. The device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the device is configured to provide computing and control capabilities. The memory of the device includes a non-volatile medium and an internal memory. The non-volatile medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile medium. The database of the device is used to store data related to the core product word extraction method. The network interface of the device is used to communicate with an external terminal through a network connection. The computer program is executed by the processor to implement the core product word extraction method.
In one embodiment, an apparatus is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements the core product word extraction method of the foregoing embodiments, for example, S10 to S40 shown in fig. 2. Alternatively, the processor, when executing the computer program, implements the functions of the modules/units of the core product word extraction apparatus in the above embodiments, such as the functions of the modules 10 to 40 shown in fig. 7. To avoid repetition, the description is omitted here.
In one embodiment, a computer readable medium is provided, on which a computer program is stored, and the computer program is executed by a processor to implement the core product word extraction method of the foregoing embodiments, such as S10 to S40 shown in fig. 2. Alternatively, the computer program, when executed by the processor, implements the functions of each module/unit in the core product word extraction apparatus in the above-described apparatus embodiments, such as the functions of modules 10 to 40 shown in fig. 7. To avoid repetition, the description is omitted here.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer readable medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. Any reference to memory, storage, database, or other medium used in the embodiments of the present application may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It should be clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional units and modules is only used for illustration, and in practical applications, the above function distribution may be performed by different functional units and modules as needed, that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the above described functions.
The above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being included therein.

Claims (10)

1. A method for extracting core product words is characterized by comprising the following steps:
acquiring an enterprise information text, and processing the enterprise information text by adopting a preprocessing method to acquire a model input statement conforming to a preset format;
training a product word entity extraction model based on deep learning by labeling the model input sentence;
based on the product word entity extraction model, acquiring a predicted product word entity corresponding to the model input sentence, and cleaning the predicted product word entity to acquire an effective product word entity;
training at least two word classification models based on the effective product word entities and product word feature dimension data corresponding to the effective product word entities to obtain a fused word score model, and extracting core product words meeting a word score threshold value through the score model.
2. The method for extracting core product words according to claim 1, further comprising, before the obtaining the enterprise information text:
segmenting the enterprise information text by taking a sentence pattern as granularity to obtain at least one sentence to be processed;
acquiring invalid sentence patterns in all the sentences to be processed based on a sentence pattern analysis model;
and filtering invalid sentences in the enterprise information text so as to enable each remaining sentence to be processed to be valid sentences.
3. The method for extracting core product words according to claim 1, wherein the enterprise information text includes at least one valid sentence with invalid sentence patterns filtered out;
The processing the effective sentences by adopting a preprocessing method to obtain model input sentences conforming to a preset format comprises the following steps:
and adopting a length grouping cutting algorithm and a semantic integrity algorithm to cut the effective sentences so as to obtain model input sentences which accord with preset length and keep semantic integrity.
4. The method for extracting core product words according to claim 1, wherein training a deep learning-based product word entity extraction model by labeling the model input sentences comprises:
training a BERT-BiLSTM-CRF model by using the model input sentence and the word labeling data corresponding to the model input sentence, and acquiring the product word entity extraction model, wherein the BERT-BiLSTM-CRF model comprises: a BERT pre-training model layer, a BiLSTM network layer and a CRF probability distribution layer.
5. The method as claimed in claim 1, wherein the step of cleaning the predicted product word entities to obtain valid product word entities comprises:
taking the predicted product word entity with missing semantics as a target repair word, and repairing the target repair word by comparison with the preceding and following text to obtain the effective product word entity;
and based on the position of the predicted product word entity in the model input sentence, removing the predicted product word entity with dislocation or error by comparing with a preset word bank.
6. The method for extracting core product words according to claim 1, further comprising, after the obtaining valid product word entities:
analyzing the dimension characteristics of the effective product word entity, and constructing product word characteristic dimension data corresponding to the effective product word entity, wherein the dimension characteristics comprise at least one of word components, word criticality, word source credibility, word original text position information and word polymerization degree.
7. The method for extracting core product words according to claim 1, wherein the training of at least two word classification models based on the valid product word entities and the product word feature data corresponding to the valid product word entities to obtain a fused word score model comprises:
marking the word type of the effective product word entity, and taking the word type as a training result of each word classification model;
dividing the product word feature data into a training data set and a testing data set according to a preset proportion, inputting the training set data into each word classification model, and acquiring a trained classification model based on the training result, wherein the word classification model comprises at least two of an LR model, an SVM model and a random forest model;
testing the accuracy of each of the trained classification models using the test set data to determine a score weight for each of the trained classification models;
forming the fused word scoring model based on each of the trained classification models and its corresponding score weight.
8. A core product word extraction device, comprising:
the enterprise information text acquisition module is used for acquiring an enterprise information text and processing the enterprise information text by adopting a preprocessing method to acquire a model input statement conforming to a preset format;
the entity extraction model acquisition module is used for training a product word entity extraction model based on deep learning by labeling the model input sentence;
an effective product word entity obtaining module, configured to obtain a predicted product word entity corresponding to the model input statement based on the product word entity extraction model, and clean the predicted product word entity to obtain an effective product word entity;
and the core product word obtaining module is used for training at least two word classification models based on the effective product word entity and the product word feature dimension data corresponding to the effective product word entity so as to obtain a fused word scoring model, and extracting the core product words meeting a word score threshold value through the scoring model.
9. An apparatus comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the core product word extraction method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable medium, in which a computer program is stored, which, when being executed by a processor, carries out the core product word extraction method according to any one of claims 1 to 7.
CN202210243423.0A 2022-03-11 2022-03-11 Core product word extraction method, device, equipment and medium Withdrawn CN114757178A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210243423.0A CN114757178A (en) 2022-03-11 2022-03-11 Core product word extraction method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210243423.0A CN114757178A (en) 2022-03-11 2022-03-11 Core product word extraction method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN114757178A true CN114757178A (en) 2022-07-15

Family

ID=82327625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210243423.0A Withdrawn CN114757178A (en) 2022-03-11 2022-03-11 Core product word extraction method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114757178A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115618877A (en) * 2022-12-20 2023-01-17 北京仁科互动网络技术有限公司 User portrait label determination method and device and electronic equipment
CN115619290A (en) * 2022-12-02 2023-01-17 北京视野智慧数字科技有限公司 Method, device and equipment for determining product service of enterprise
CN115774994A (en) * 2022-12-09 2023-03-10 企知道网络技术有限公司 Keyword screening method, device and storage medium
CN115860005A (en) * 2022-12-29 2023-03-28 企知道网络技术有限公司 Method and device for mounting enterprise in industrial chain based on semantic matching and related components


Similar Documents

Publication Publication Date Title
US10558746B2 (en) Automated cognitive processing of source agnostic data
US20220050967A1 (en) Extracting definitions from documents utilizing definition-labeling-dependent machine learning background
US20200111023A1 (en) Artificial intelligence (ai)-based regulatory data processing system
WO2018028077A1 (en) Deep learning based method and device for chinese semantics analysis
US10796084B2 (en) Methods, systems, and articles of manufacture for automatic fill or completion for application software and software services
CN114757178A (en) Core product word extraction method, device, equipment and medium
US10776583B2 (en) Error correction for tables in document conversion
US20190317986A1 (en) Annotated text data expanding method, annotated text data expanding computer-readable storage medium, annotated text data expanding device, and text classification model training method
US11003950B2 (en) System and method to identify entity of data
CN113934909A (en) Financial event extraction method based on pre-training language and deep learning model
CN114840685A (en) Emergency plan knowledge graph construction method
CN114490937A (en) Comment analysis method and device based on semantic perception
CN115269833A (en) Event information extraction method and system based on deep semantics and multitask learning
CN117291192B (en) Government affair text semantic understanding analysis method and system
US11989677B2 (en) Framework for early warning of domain-specific events
CN110941713A (en) Self-optimization financial information plate classification method based on topic model
Nishiwaki et al. A consideration of evaluation method of sentiment analysis on social listening
Basha et al. Natural Language Processing: Practical Approach
CN113191160A (en) Emotion analysis method for knowledge perception
DeVille et al. Text as Data: Computational Methods of Understanding Written Expression Using SAS
Mihaylov et al. Predicting the Resolution Time and Priority of Bug Reports: A Deep Learning Approach
Khadilkar et al. A Knowledge Graph Based Approach for Automatic Speech and Essay Summarization
Yadao et al. A semantically enhanced deep neural network framework for reputation system in web mining for Covid-19 Twitter dataset
CN117573956B (en) Metadata management method, device, equipment and storage medium
US20240232539A1 (en) Systems and Methods for Insights Extraction Using Semantic Search

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
CB02: Change of applicant information
    Address after: 518051 2201, block D, building 1, bid section 1, Chuangzhi Yuncheng, Liuxian Avenue, Xili community, Xili street, Nanshan District, Shenzhen, Guangdong
    Applicant after: Qizhi Technology Co.,Ltd.
    Address before: 518051 2201, block D, building 1, bid section 1, Chuangzhi Yuncheng, Liuxian Avenue, Xili community, Xili street, Nanshan District, Shenzhen, Guangdong
    Applicant before: Qizhi Network Technology Co.,Ltd.
WW01: Invention patent application withdrawn after publication (application publication date: 20220715)