
CN117633518B - Industrial chain construction method and system - Google Patents

Info

Publication number
CN117633518B
Authority
CN
China
Prior art keywords
product name
candidate
product names
standard product
candidate product
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410105299.0A
Other languages
Chinese (zh)
Other versions
CN117633518A (en
Inventor
王冉冉
莫冰莹
秦秀磊
张洋浩
邓礼馨
刘乙蒙
张琼
张瀚文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Big Data Research Institute Of Peking University
Peking University
Original Assignee
Chongqing Big Data Research Institute Of Peking University
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Big Data Research Institute Of Peking University, Peking University
Priority to CN202410105299.0A
Publication of CN117633518A
Application granted
Publication of CN117633518B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/018Certifying business or products
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • Development Economics (AREA)
  • Computational Linguistics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Accounting & Taxation (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses an industrial chain construction method and system. The method comprises: obtaining candidate product names corresponding to a target industry, and determining a first mapping relation between the candidate product names and enterprises; determining the literal similarity and the semantic similarity between the candidate product names and the standard product names corresponding to the target industry, and determining a second mapping relation between the candidate product names and the standard product names according to the literal similarity and the semantic similarity; determining, according to the first mapping relation and the second mapping relation, a third mapping relation representing the correspondence between the standard product names and the enterprises; and constructing a target industry chain corresponding to the target industry according to the third mapping relation and the association relation corresponding to the standard product names. The application solves the technical problems of low construction efficiency and poor accuracy that arise in the related art because industrial chain maps are constructed manually.

Description

Industrial chain construction method and system
Technical Field
The application relates to the technical field of data processing, in particular to an industrial chain construction method and system.
Background
The industrial chain analysis can dynamically monitor and analyze the economic operation condition of the important industry in the region, and provides a decision basis for deep understanding of the development condition of the industry, finding out the weak links of the industry, making the supporting policy of the industry and leading the industrial gathering and upgrading.
Industrial chain analysis depends on establishing the upstream and downstream relationships of the industry. Enterprises upstream and downstream of an industrial chain exchange value closely, so an industrial chain map lets each enterprise see clearly where it sits in the industry. However, most industrial chain maps in the related art are constructed manually, which leads to technical problems such as long construction periods, heavy workloads, low accuracy, and incomplete data.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the application provides an industrial chain construction method and system, which at least solve the technical problems of low industrial chain construction efficiency and poor accuracy caused by the fact that an industrial chain map is constructed in a manual mode in the related technology.
According to an aspect of an embodiment of the present application, there is provided an industrial chain construction method including: obtaining candidate product names corresponding to target industries, and determining a first mapping relation between the candidate product names and enterprises; determining the literal similarity and the semantic similarity between the candidate product name and the standard product name corresponding to the target industry, and determining a second mapping relation between the candidate product name and the standard product name according to the literal similarity and the semantic similarity; determining a third mapping relation for representing the corresponding relation between the standard product name and the enterprise according to the first mapping relation and the second mapping relation; and constructing a target industry chain corresponding to the target industry according to the third mapping relation and the association relation corresponding to the standard product name, wherein the association relation is used for representing the upstream and downstream relation of the product corresponding to the standard product name in production and life.
Optionally, determining the second mapping relationship between the candidate product name and the standard product name according to the literal similarity and the semantic similarity includes: determining the character similarity characteristics between the candidate product names and the standard product names by adopting a first matching model, and determining the standard product names matched with the candidate product names according to the character similarity characteristics to obtain a first matching relation; calculating semantic similarity parameters between the candidate product names which are not successfully matched after being processed by the first matching model and the standard product names by adopting the second matching model, and determining the standard product names matched with the candidate product names according to the semantic similarity parameters to obtain a second matching relation; and obtaining a second mapping relation according to the first matching relation and the second matching relation.
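The two-stage matching described above can be sketched as a simple cascade. This is a minimal illustration under stated assumptions, not the patented implementation: `literal_match` and `semantic_match` are hypothetical callables standing in for the first and second matching models, each returning a standard product name or `None` when no match is found.

```python
from typing import Callable, Dict, Iterable, List, Optional

# A matcher takes a candidate product name and returns the matched
# standard product name, or None when it finds no match (assumed interface).
Matcher = Callable[[str], Optional[str]]


def build_second_mapping(
    candidates: Iterable[str],
    literal_match: Matcher,
    semantic_match: Matcher,
) -> Dict[str, str]:
    """Map candidate product names to standard product names in two stages."""
    first_relation: Dict[str, str] = {}
    unmatched: List[str] = []

    # Stage 1: literal (character-level) matching on every candidate.
    for name in candidates:
        standard = literal_match(name)
        if standard is not None:
            first_relation[name] = standard
        else:
            unmatched.append(name)

    # Stage 2: semantic matching only on candidates that stage 1 left unmatched.
    second_relation: Dict[str, str] = {}
    for name in unmatched:
        standard = semantic_match(name)
        if standard is not None:
            second_relation[name] = standard

    # The second mapping relation is the union of both matching relations.
    return {**first_relation, **second_relation}
```

Candidates that neither stage matches are simply left out of the resulting mapping in this sketch.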
Optionally, the literal similarity feature comprises: a surface text similarity feature and a character relevance feature, wherein the surface text similarity feature is used for representing the matching degree and distance of characters and/or words between candidate product names and standard product names, and the character relevance feature comprises at least one of the following: the length of the longest common substring between the candidate product name and the standard product name, the length of the longest common subsequence, the length of the candidate product name, the length of the standard product name; the first matching model is obtained through training by the following steps: acquiring a first training data set, wherein the first training data set comprises a plurality of candidate product names and the standard product names corresponding to the candidate product names; training a weak classifier in the first initial model according to the first training data set to obtain a predicted value output by the weak classifier; and determining a target residual according to the predicted value and the true value in the first training data set, training the remaining weak classifiers in the first initial model according to the target residual, and repeating the training process until all the weak classifiers in the first initial model are trained, so as to obtain the first matching model.
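The character relevance features listed above can be computed with standard dynamic programming. The following is a sketch only; the function names and the dictionary keys are illustrative choices, not names taken from the application.

```python
from typing import Dict


def longest_common_substring_len(a: str, b: str) -> int:
    """Length of the longest contiguous run of characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def longest_common_subsequence_len(a: str, b: str) -> int:
    """Length of the longest (not necessarily contiguous) common subsequence."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    return prev[len(b)]


def character_relevance_features(candidate: str, standard: str) -> Dict[str, int]:
    """Collect the four character relevance features for one name pair."""
    return {
        "common_substring_len": longest_common_substring_len(candidate, standard),
        "common_subsequence_len": longest_common_subsequence_len(candidate, standard),
        "candidate_len": len(candidate),
        "standard_len": len(standard),
    }
```

A feature dictionary of this kind could then be passed, together with surface text similarity scores, to the classifier that makes the literal matching decision.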
Optionally, calculating, by using the second matching model, the semantic similarity parameter between the candidate product name and the standard product name that are not successfully matched after being processed by the first matching model includes: converting the candidate product names into corresponding candidate word embedding sequences and converting the standard product names into corresponding standard word embedding sequences by adopting an input layer in the second matching model; converting the candidate word embedded sequence into a corresponding candidate semantic representation vector by adopting a representation layer in the second matching model, and converting the standard word embedded sequence into a corresponding standard semantic representation vector; and calculating cosine similarity between the candidate semantic representation vector and the standard semantic representation vector, and taking the cosine similarity as a semantic similarity parameter.
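The final step above, scoring a pair of semantic representation vectors by cosine similarity, can be sketched as follows. The vectors are assumed to have been produced already by the representation layer; the `threshold` value and the `best_semantic_match` helper are hypothetical and only show how the similarity parameter might be used to pick a standard product name.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Convention chosen for this sketch: a zero vector matches nothing.
    return dot / (norm_u * norm_v)


def best_semantic_match(
    candidate_vec: Sequence[float],
    standard_vecs: Dict[str, Sequence[float]],
    threshold: float = 0.8,  # Assumed cut-off, not specified in the application.
) -> Optional[Tuple[str, float]]:
    """Return the most similar standard name and its score, or None below threshold."""
    best: Optional[Tuple[str, float]] = None
    for name, vec in standard_vecs.items():
        score = cosine_similarity(candidate_vec, vec)
        if best is None or score > best[1]:
            best = (name, score)
    if best is not None and best[1] >= threshold:
        return best
    return None
```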
Optionally, obtaining the candidate product names corresponding to the target industry includes: acquiring an initial industry data set corresponding to a target industry, and screening the initial industry data set according to preset parameters to obtain the target industry data set, wherein the initial industry data set comprises at least one of the following: the management scope information, qualification certificate information, intellectual property information and invoice transaction information corresponding to the target industry, and the preset parameters comprise at least one of the following: service authority parameters, service description granularity parameters, coverage parameters and availability parameters, wherein the service authority parameters are used for representing the credibility degree of the data for proving that enterprises have the capability of producing certain products, the service description granularity is used for representing the fineness degree of the range of the products produced by the enterprises described by the data, the coverage parameters are used for representing the coverage degree of the data on the enterprises corresponding to the target industry, and the availability parameters are used for representing the difficulty degree of data acquisition; and performing word segmentation processing on the information in the target industry data set to obtain candidate product names.
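One way to picture the screening step above is as a threshold filter over per-source scores. The application does not state how the four preset parameters are combined, so the sketch below is an assumption throughout: the `SourceProfile` type, the 0-to-1 scoring scale, and the minimum-threshold rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class SourceProfile:
    """Scores (assumed to lie in 0..1) describing one category of industry data."""
    name: str
    authority: float     # How credibly the data proves an enterprise can make a product.
    granularity: float   # How finely the data describes the enterprise's product range.
    coverage: float      # Share of the target industry's enterprises the data covers.
    availability: float  # How easy the data is to obtain.


def screen_sources(
    sources: Iterable[SourceProfile],
    minimums: Dict[str, float],
) -> List[str]:
    """Keep the names of sources whose scores meet every given minimum."""
    kept: List[str] = []
    for source in sources:
        if all(getattr(source, field) >= floor for field, floor in minimums.items()):
            kept.append(source.name)
    return kept
```

Different industries could use different minimums, which would select different combinations of data categories as input.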
Optionally, performing word segmentation processing on the information in the target industry data set to obtain candidate product names includes: splitting the information in the target industry data set at preset splitting characters to obtain a plurality of first fields, wherein the preset splitting characters comprise punctuation characters and hyphen characters; deleting each first field that contains a word in a preset stop-word lexicon to obtain second fields; identifying the words in the second fields whose part of speech is verb, and deleting those words from the second fields to obtain noun fields; and determining the field length of each noun field, and taking each noun field whose length is greater than a preset field length as a candidate product name.
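The four segmentation steps above can be sketched in a few lines. This is an illustration under assumptions: the set of splitting characters is an example, the preset word list used to delete fields is passed in as `stop_words`, and a plain set of known verbs (`verbs`) stands in for a real part-of-speech tagger.

```python
import re
from typing import Iterable, List

# Punctuation and hyphen characters used as split points (illustrative set).
SPLIT_PATTERN = re.compile(r"[,.;:!?、，。；：！？()（）/\-]+")


def extract_candidate_names(
    text: str,
    stop_words: Iterable[str],
    verbs: Iterable[str],
    min_length: int,
) -> List[str]:
    """Split text into fields and keep the noun fields longer than min_length."""
    # Step 1: split at the preset characters to get the first fields.
    first_fields = [f.strip() for f in SPLIT_PATTERN.split(text) if f.strip()]

    # Step 2: drop every field that contains a word from the preset list.
    stop_list = list(stop_words)
    second_fields = [f for f in first_fields if not any(w in f for w in stop_list)]

    # Step 3: delete verbs from each remaining field (longest verbs first so
    # that a short verb never cuts a longer one in half).
    verb_list = sorted(verbs, key=len, reverse=True)
    noun_fields: List[str] = []
    for field in second_fields:
        for verb in verb_list:
            field = field.replace(verb, "")
        field = field.strip()
        if field:
            noun_fields.append(field)

    # Step 4: keep noun fields strictly longer than the preset length.
    return [f for f in noun_fields if len(f) > min_length]
```

In practice the verb set would be replaced by the output of a part-of-speech tagger for the language of the source data.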
Optionally, after determining the third mapping relationship for characterizing the correspondence between the standard product name and the enterprise, the method further includes: determining a technical side evaluation index and a service side evaluation index of each third mapping relation, wherein the technical side evaluation index comprises: the number of candidate product names/standard product names/enterprises corresponding to the third mapping relation respectively accounts for the proportion of all candidate product names/standard product names/enterprises corresponding to the target industry, and the service side evaluation index is used for representing the accuracy degree of matching of the candidate product names/standard product names corresponding to the third mapping relation.
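The technical side evaluation index described above is a set of coverage ratios. A minimal sketch follows, assuming the second mapping is a dictionary from candidate name to standard name and the third mapping is a dictionary from standard name to a set of enterprises; these data shapes and the function names are assumptions made for illustration.

```python
from typing import Dict, Iterable, Set


def coverage_ratio(matched: Iterable[str], universe: Iterable[str]) -> float:
    """Share of the universe that appears among the matched items."""
    universe_set = set(universe)
    if not universe_set:
        return 0.0
    return len(set(matched) & universe_set) / len(universe_set)


def technical_side_indices(
    second_mapping: Dict[str, str],
    third_mapping: Dict[str, Set[str]],
    all_candidates: Iterable[str],
    all_standards: Iterable[str],
    all_enterprises: Iterable[str],
) -> Dict[str, float]:
    """Coverage of candidate names, standard names and enterprises by the third mapping."""
    # Candidate names count as covered when their standard name reached the third mapping.
    matched_candidates = {c for c, s in second_mapping.items() if s in third_mapping}
    matched_standards = set(third_mapping)
    matched_enterprises: Set[str] = set().union(*third_mapping.values())
    return {
        "candidate_coverage": coverage_ratio(matched_candidates, all_candidates),
        "standard_coverage": coverage_ratio(matched_standards, all_standards),
        "enterprise_coverage": coverage_ratio(matched_enterprises, all_enterprises),
    }
```

The service side index, which measures matching accuracy, would need labelled or manually reviewed pairs and is not covered by this sketch.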
According to another aspect of the embodiment of the present application, there is also provided an industrial chain construction system, including: the candidate product name acquisition module is used for acquiring candidate product names corresponding to target industries and determining a first mapping relation between the candidate product names and enterprises; the second mapping relation determining module is used for determining the literal similarity and the semantic similarity between the candidate product names and the standard product names corresponding to the target industry, and determining the second mapping relation between the candidate product names and the standard product names according to the literal similarity and the semantic similarity; the third mapping relation determining module is used for determining a third mapping relation for representing the corresponding relation between the standard product name and the enterprise according to the first mapping relation and the second mapping relation; the target industry chain construction module is used for constructing a target industry chain corresponding to the target industry according to the third mapping relation and the association relation corresponding to the standard product name, wherein the association relation is used for representing the upstream-downstream relation of the product corresponding to the standard product name in production and life.
According to still another aspect of the embodiment of the present application, there is also provided an electronic device including a memory and a processor, wherein the processor is configured to run a program stored in the memory, and the program, when run, executes the industrial chain construction method.
According to still another aspect of the embodiments of the present application, there is also provided a non-volatile storage medium, where the non-volatile storage medium includes a stored computer program, and a device in which the non-volatile storage medium is located executes an industrial chain construction method by running the computer program.
In the embodiment of the application, candidate product names corresponding to the target industry are obtained, and a first mapping relation between the candidate product names and enterprises is determined; the literal similarity and the semantic similarity between the candidate product names and the standard product names corresponding to the target industry are determined, and a second mapping relation between the candidate product names and the standard product names is determined according to the literal similarity and the semantic similarity; a third mapping relation representing the correspondence between the standard product names and the enterprises is determined according to the first mapping relation and the second mapping relation; and a target industry chain corresponding to the target industry is constructed according to the third mapping relation and the association relation corresponding to the standard product names, wherein the association relation is used for representing the upstream and downstream relation, in production and life, of the products corresponding to the standard product names. Screening and combining data sets of different categories from multiple dimensions as input lays a foundation for accurate matching between enterprises and products. On this basis, fusing literal and semantic similarity algorithms achieves accurate matching between enterprises and products, which reduces the matching cost and improves matching accuracy and coverage, thereby solving the technical problems of low construction efficiency and poor accuracy that arise in the related art because industrial chain maps are constructed manually.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a block diagram of a hardware structure of a computer terminal (or electronic device) for implementing a method of industrial chain construction according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a method flow for industrial chain construction according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an overall construction flow of an industrial chain upstream and downstream according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a process for preprocessing a candidate product name dataset according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a data matching process according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a process flow for enterprise-product relationship construction provided in accordance with an embodiment of the present application;
FIG. 7 is a tree-like schematic diagram of an industrial chain upstream and downstream according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a network upstream and downstream of an industrial chain according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an industrial chain upstream and downstream fusing enterprise information according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of an industrial chain construction system according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art may better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of protection of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In order to facilitate a better understanding of the embodiments of the present application, some technical terms or terms related to the embodiments of the present application will be explained below:
Named Entity Recognition (NER): a natural language processing task that identifies named entities in text, such as person names, place names, and organization names.
BERT-CRF: a combination of the BERT (Bidirectional Encoder Representations from Transformers) and CRF (Conditional Random Field) models for named entity recognition tasks. The BERT model provides contextual representations, and the CRF model assigns the sequence labels; jointly optimizing the two improves named entity recognition performance.
ERNIE (Enhanced Representation through Knowledge Integration): a Chinese pre-trained language model that enhances its representations through knowledge integration. ERNIE uses more pre-training tasks and pre-training data, improves the representation of Chinese text by learning semantic representations at the word, sentence, and discourse levels, and is suitable for various natural language processing tasks, including named entity recognition.
LightGBM (Light Gradient Boosting Machine): a machine learning algorithm based on the gradient boosting decision tree (GBDT) algorithm.
SimNet (Similarity Net): a model for calculating text similarity. Trained on a large number of text samples, it measures the similarity between two texts and is used for tasks such as question answering and information retrieval.
GPT-2 (Generative Pre-trained Transformer 2): a deep-learning-based language model. It is a pre-trained neural network model that can generate text and can be applied to tasks such as text generation, text summarization, and dialogue systems.
Industrial chain analysis uses technologies such as natural language processing, knowledge graphs, big data, and artificial intelligence to construct an industrial chain map and an enterprise map from multidimensional data sources (such as business registration, bidding, intellectual property, financing, and public opinion data). On the basis of these maps, it dynamically monitors and analyzes the economic operation of key regional industries, and provides a decision basis for understanding the development of an industry in depth, finding its weak links, formulating industry support policies, and guiding industrial clustering and upgrading.
On the one hand, industrial chain analysis uses the industrial chain map to survey the local supply of upstream and downstream basic parts, core components, key basic materials, key generic technologies, advanced basic processes, and the like. It rapidly identifies technology 'blocking points' and enterprise 'breakpoints', and automatically derives lists of enterprises that can jointly tackle technical bottlenecks and of enterprises that need to be attracted through investment promotion, so as to break through 'blocking point' technologies, fill 'breakpoint' enterprise gaps, and help build large-scale, competitive regional industry clusters. On the other hand, it can help a financial institution manage risk across the full life cycle of its corporate clients, covering the pre-loan, in-loan, and post-loan stages, improve the early-warning mechanism for risk conduction along the industrial chain, develop customer acquisition in an orderly way, and screen high-quality clients by combining industrial chain and enterprise map information, thereby improving conversion rates and reducing marketing costs.
Industrial chain analysis depends on establishing the upstream and downstream relationships of the industry. Enterprises upstream and downstream of an industrial chain exchange value closely, so an industrial chain map lets each enterprise see clearly where it sits in the industry. However, most industrial chain maps in the related art are constructed manually, which leads to technical problems such as long construction periods, heavy workloads, low accuracy, and incomplete data.
In order to solve the above problems, related solutions are provided in the embodiments of the present application, and are described in detail below.
In accordance with an embodiment of the present application, there is provided an embodiment of a method of industrial chain construction, it being noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
The method embodiments provided by the embodiments of the present application may be performed in a mobile terminal, a computer terminal, or a similar computing device. FIG. 1 shows a hardware block diagram of a computer terminal (or electronic device) for implementing an industrial chain construction method. As shown in FIG. 1, the computer terminal 10 (or electronic device) may include one or more processors 102 (shown as 102a, 102b, ..., 102n in the figure), where the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA), a memory 104 for storing data, and a transmission device 106 for communication functions. In addition, the computer terminal may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in FIG. 1 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
It should be noted that the one or more processors 102 and/or other data processing circuits described above may be referred to generally herein as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any other combination. Furthermore, the data processing circuit may be a single stand-alone processing module, or incorporated, in whole or in part, into any of the other elements in the computer terminal 10 (or electronic device). As referred to in embodiments of the application, the data processing circuit acts as a processor control (e.g., selection of the path of the variable resistor termination connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the industrial chain construction method in the embodiment of the present application, and the processor 102 executes the software programs and modules stored in the memory 104 to perform various functional applications and data processing, that is, implement the industrial chain construction method described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission means 106 is arranged to receive or transmit data via a network. The specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or electronic device).
In the above operating environment, the embodiment of the application provides an industrial chain construction method for solving the problem of how to quickly and accurately construct the upstream and downstream of an industrial chain. Fig. 2 is a schematic flowchart of the industrial chain construction method according to the embodiment of the application. As shown in fig. 2, the method includes the following steps:
Step S202, obtaining candidate product names corresponding to target industries, and determining a first mapping relation between the candidate product names and enterprises;
Step S204, determining the literal similarity and the semantic similarity between the candidate product name and the standard product name corresponding to the target industry, and determining a second mapping relation between the candidate product name and the standard product name according to the literal similarity and the semantic similarity;
Step S206, determining a third mapping relation for representing the corresponding relation between the standard product name and the enterprise according to the first mapping relation and the second mapping relation;
Step S208, a target industry chain corresponding to the target industry is constructed according to the third mapping relation and the association relation corresponding to the standard product name, wherein the association relation is used for representing the upstream and downstream relation of the product corresponding to the standard product name in production and life.
According to the scheme, the specific characteristics of different industries are comprehensively considered, and data sets of different categories are screened and combined along dimensions such as service authority, service description granularity, coverage and availability to serve as input, which lays a foundation for accurate matching between enterprises and products. On this basis, the literal similarity algorithm and the semantic similarity algorithm are fused so that they complement each other, which can effectively reduce the matching cost between enterprises and products and improve matching accuracy and coverage. For literal matching, a classification model is constructed on the basis of multiple similarity algorithms, which improves the accuracy of decision making. The scheme thereby solves the technical problems of low construction efficiency and poor accuracy that arise in the related art because the industrial chain map is constructed manually.
The method for constructing an industrial chain in steps S202 to S208 according to the embodiment of the present application is further described below.
Fig. 3 is a schematic diagram of the overall flow for constructing the upstream and downstream of an industrial chain according to an embodiment of the present application. As shown in fig. 3, the flow includes the following steps:
Step 1, screening and preparing a data set;
Firstly, aiming at specific characteristics of different target industries, the initial industry data set is screened from different dimensions according to preset parameters, and the specific steps are as follows.
In some embodiments of the present application, obtaining candidate product names corresponding to the target industry includes: acquiring an initial industry data set corresponding to a target industry, and screening the initial industry data set according to preset parameters to obtain the target industry data set, wherein the initial industry data set comprises at least one of the following: the management scope information, qualification certificate information, intellectual property information and invoice transaction information corresponding to the target industry, and the preset parameters comprise at least one of the following: service authority parameters, service description granularity parameters, coverage parameters and availability parameters, wherein the service authority parameters are used for representing the credibility degree of the data for proving that enterprises have the capability of producing certain products, the service description granularity is used for representing the fineness degree of the range of the products produced by the enterprises described by the data, the coverage parameters are used for representing the coverage degree of the data on the enterprises corresponding to the target industry, and the availability parameters are used for representing the difficulty degree of data acquisition; and performing word segmentation processing on the information in the target industry data set to obtain candidate product names.
Specifically, the initial industry data set is screened from dimensions such as service authority parameters, service description granularity parameters, coverage parameters, availability parameters and the like to obtain a target industry data set, wherein the initial industry data set comprises: business administration scope information, qualification certificate information, intellectual property patent information, regulatory certification information, invoice information and the like.
The service authority parameter is used to evaluate whether the data can reasonably prove that an enterprise is capable of producing certain products. The service description granularity parameter is used to evaluate how fine-grained the data is in describing which products an enterprise produces. The coverage parameter is used to evaluate how well the data covers the enterprises of the industry; if the coverage is too low, introducing other data sets as a supplement should be considered. The availability parameter is used to evaluate how easy the data is to acquire.
The automobile manufacturing industry is taken as an example of the target industry.
For the automobile manufacturing industry, after the initial industry data set is screened based on the preset parameters, the obtained target industry data set comprises: regulatory authority certification data, qualification certificate information, intellectual property patent data, and business scope data from industrial and commercial registration.
The embodiment of the application can adopt different processing modes for different data sources. Specifically, for structured data in the target industry data set (for example, regulatory authority certification data), the enterprise-product relationship can be constructed directly by extracting it from an enterprise-product two-dimensional table. Unstructured data (for example, business scope data) must be processed by natural language processing. Because most of the data involved in this embodiment is unstructured, this embodiment focuses on constructing enterprise-product relationships based on natural language processing.
In addition, when screening and preparing the data set, the standard product names corresponding to the target industry and the association relationship corresponding to the standard product names can be determined. Specifically, the association relationship can be determined as follows: first, based on natural language processing methods, data such as industry research reports, enterprise financial reports and industry cluster books are used, and technologies such as NER (Named Entity Recognition), BERT-CRF (Bidirectional Encoder Representations from Transformers-Conditional Random Fields) and ERNIE (Enhanced Representation through Knowledge Integration) are adopted to extract the name expressions and upstream-downstream relationships of industry chain standard products in an industry (such as the automobile manufacturing industry); the results are then corrected using domain knowledge; and finally they are reviewed according to expert experience to obtain the final industry chain standard product names and association relationships.
Step 2, preprocessing data, namely preprocessing the target industry data set obtained in the step 1 to obtain candidate product names corresponding to target industries and a first mapping relation between the candidate product names and enterprises;
Most of the data in the target industry data set in the embodiment of the present application share the same unstructured characteristics, for example: short text, no word order, and many punctuation marks, all of which require natural language processing. The target industry data set can therefore be preprocessed by word segmentation, and the specific steps are as follows.
In some embodiments of the present application, word segmentation is performed on information in a target industry data set to obtain candidate product names, including the steps of: splitting information in the target industry data set according to preset splitting characters to obtain a plurality of first fields, wherein the preset splitting characters comprise: punctuation characters and hyphen characters; deleting a first field containing words in a preset disabling word stock to obtain a second field; determining words with parts of speech being verbs in the second field, and deleting the words with parts of speech being verbs from the second field to obtain noun fields; and determining the field length of each noun field, and determining the noun field with the field length larger than the preset field length as a candidate product name.
Specifically, fig. 4 is a schematic flowchart of preprocessing a candidate product name data set according to an embodiment of the present application. The flow shown in fig. 4 is as follows.
Firstly, rough clauses and rough word cutting are carried out on information in the target industry data set according to preset segmentation characters (punctuation marks and words), for example: processing according to the segmentation characters such as "," "and", "or" and the like to obtain a plurality of phrases (first fields); then, according to a preset stop word stock, deleting phrases containing a certain word in the preset stop word stock, deleting a certain phrase appearing at the beginning, deleting a certain phrase appearing at the end and the like, and processing a plurality of phrases obtained by the rough clauses and the rough word cutting to obtain a second field; and identifying verbs in the phrases left after the processing and deleting the verbs to obtain noun fields, so that the noun fields are closer to the expression mode of the product name. Finally, deleting part of noun fields according to the word length setting rule, for example deleting the noun fields of single words, and obtaining candidate product names.
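The four preprocessing stages described above (rough splitting, stop-word filtering, verb deletion and length filtering) can be illustrated with a minimal sketch. The split pattern, the stop-word list, the verb list (a stand-in for a real part-of-speech tagger) and the length threshold are all hypothetical values chosen for illustration; the embodiment does not disclose its actual lexicons, and English text is used here only for readability.

```python
import re

# Hypothetical resources: the embodiment does not list its actual lexicons.
SPLIT_PATTERN = re.compile(r"[,;、，；。/]|\band\b|\bor\b")
STOP_WORDS = {"consulting", "services"}
VERBS = {"manufacturing", "sales", "processing"}  # stand-in for POS tagging
MIN_LENGTH = 1  # keep noun fields longer than this many characters


def extract_candidates(business_scope):
    """Turn one business-scope string into de-duplicated candidate product names."""
    # 1) Rough splitting on punctuation and connective words -> first fields.
    first_fields = [f.strip() for f in SPLIT_PATTERN.split(business_scope) if f.strip()]
    # 2) Drop every field that contains a stop word -> second fields.
    second_fields = [
        f for f in first_fields
        if not any(w in f.lower().split() for w in STOP_WORDS)
    ]
    # 3) Delete verb-like tokens so the field reads like a product name.
    noun_fields = [
        " ".join(t for t in f.split() if t.lower() not in VERBS)
        for f in second_fields
    ]
    # 4) Keep fields longer than the minimum length, removing duplicates in order.
    candidates = []
    for field in noun_fields:
        if len(field) > MIN_LENGTH and field not in candidates:
            candidates.append(field)
    return candidates
```

In a real pipeline the verb list would presumably be replaced by a part-of-speech tagger suited to the source language.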
For example, duplicate values are removed from the segmented approximate product expressions. Phrase rules related to "and" are set to avoid missing words, for example, splitting "automobile chassis and related parts" into "automobile chassis" and "related parts". Phrase rules for deleting verb expressions are also set so that the result approximates a product expression, for example, deleting "manufacturing" from "manufacturing of automobile parts" and retaining "automobile parts". In this way, the candidate product names and the correspondence between enterprises and candidate product names are constructed. Taking part of the automobile manufacturing industry as an example, the correspondence between the original business scope and the candidate product names obtained by word segmentation is shown in the following table.
Step 3, product matching, namely combining two methods of literal matching and semantic matching, and constructing an industry chain standard product name-candidate product name mapping table (namely the second mapping relation);
After the candidate product names are obtained, a mapping table of the industry chain standard product names and the candidate product names, namely the second mapping relation, can be constructed by carrying out similarity calculation on the candidate product names and the standard product names, and the specific steps are as follows.
In some embodiments of the present application, determining a second mapping relation between candidate product names and standard product names based on the literal similarity and the semantic similarity comprises the steps of: determining the literal similarity features between the candidate product names and the standard product names by adopting a first matching model, and determining the standard product names matched with the candidate product names according to the literal similarity features to obtain a first matching relation; calculating, by adopting a second matching model, semantic similarity parameters between the standard product names and the candidate product names that were not successfully matched by the first matching model, and determining the standard product names matched with those candidate product names according to the semantic similarity parameters to obtain a second matching relation; and obtaining the second mapping relation according to the first matching relation and the second matching relation.
Fig. 5 is a schematic flowchart of data matching according to an embodiment of the present application. The flow shown in fig. 5 is as follows.
In this embodiment, the process of determining the second mapping relationship is divided into two key sub-steps, namely, literal similarity matching and semantic similarity matching. In consideration of that the cost of model training, calculation cost and the like of the literal similarity matching are lower than those of the semantic similarity matching, the literal similarity matching is placed before the semantic similarity matching in the process design.
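The ordering described above, with the cheaper literal stage first and the semantic stage used only for the remainder, can be sketched as a simple cascade. The two matcher functions are hypothetical placeholders for the first and second matching models; each is assumed to return a matched standard product name or `None`.

```python
def cascade_match(candidates, standards, literal_match, semantic_match):
    """Map candidate names to standard names, trying the cheap matcher first.

    literal_match and semantic_match are hypothetical callables of the form
    f(candidate, standards) -> matched standard name or None.
    """
    mapping = {}
    for name in candidates:
        hit = literal_match(name, standards)
        if hit is None:
            # Only unmatched candidates reach the more expensive semantic stage.
            hit = semantic_match(name, standards)
        if hit is not None:
            mapping[name] = hit
    return mapping


def exact_literal(name, standards):
    """Toy literal matcher used for illustration: exact string equality."""
    return name if name in standards else None
```

Because only the unmatched remainder reaches the second stage, the cost of the semantic model scales with the literal stage's miss rate rather than with the full candidate set.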
The following describes the process of calculating the similarity of words first, specifically as follows.
In some embodiments of the application, the literal similarity feature comprises: a surface text similarity feature and a character relevance feature, wherein the surface text similarity feature is used for representing the matching degree and distance of characters and/or words between candidate product names and standard product names, and the character relevance feature comprises at least one of the following: the length of the largest common substring between the candidate product name and the standard product name, the length of the largest common subsequence, the length of the candidate product name, the length of the standard product name; the first matching model is obtained through training the following steps: acquiring a first training data set, wherein the first training data set comprises a plurality of candidate product names and standard product names corresponding to the candidate product names; training the weak classifier in the first initial model according to the first training data set to obtain a predicted value output by the weak classifier; and determining a target residual error according to the predicted value and the true value in the first training data set, training the rest weak classifiers in the first initial model according to the target residual error, and repeating the training process until all the weak classifiers in the first initial model are trained, so as to obtain a first matching model.
Specifically, a part of samples can be marked manually according to candidate product names and industry chain standard product names to form a first training data set for subsequent model training. And then, according to modeling requirements of a literal similarity matching model (namely the first matching model), combining the candidate product names and characteristics of industry chain standard product names, developing index derivation, namely determining literal similarity characteristics between the candidate product names and the standard product names. The index derivation can be subdivided into two ways, and a part of indexes are surface text similarity characteristics, specifically, the matching degree or distance of characters or words between two texts of candidate product names and industry chain standard product names is taken as a similarity measurement standard, and A, B in the table respectively represents a set corresponding to the candidate product names and a set corresponding to the industry chain standard product names as shown in the following table.
The other part of indexes are character correlation characteristics (such as the maximum public substring, the number of words in candidate product names, the number of words in industry chain standard product names and the like), and the characters are specifically shown in the following table.
The index features generated in the above two ways (surface text similarity features and character correlation features) are then used as the input of the classification model (the first matching model).
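As a minimal sketch of how such index features might be derived for one pair of names, the functions below compute the character correlation features named in the text (longest common substring, longest common subsequence, and the two name lengths). The Jaccard coefficient over character sets is included as one assumed example of a surface text similarity feature; the embodiment's actual feature table is not reproduced here.

```python
def longest_common_substring(a, b):
    """Length of the longest contiguous run of characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def longest_common_subsequence(a, b):
    """Length of the longest (not necessarily contiguous) shared subsequence."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            curr[j] = prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1])
        prev = curr
    return prev[len(b)]


def jaccard(a, b):
    """Assumed surface feature: overlap of the two character sets A and B."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def literal_features(candidate, standard):
    """Feature dictionary for one (candidate, standard) pair."""
    return {
        "jaccard": jaccard(candidate, standard),
        "lc_substring": longest_common_substring(candidate, standard),
        "lc_subsequence": longest_common_subsequence(candidate, standard),
        "len_candidate": len(candidate),
        "len_standard": len(standard),
    }
```

Each labelled pair then becomes one row of such features for the classifier.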
In the embodiment of the application, the LightGBM (Light Gradient Boosting Machine) algorithm can be adopted to construct the first matching model, in which a strong classifier or regressor is built by integrating a plurality of weak classifiers. Specifically, a first weak classifier is trained first; then the difference between the true value and the predicted value of each sample (i.e., the target residual) is calculated and used as the training target of the next weak classifier, so that a plurality of weak classifiers are trained step by step and iteratively and are combined into a stronger classifier or regressor. This training method has the advantages of high training speed, low memory usage and strong scalability.
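The residual-fitting idea can be illustrated with a deliberately small sketch. This is not LightGBM itself: it uses one-split decision stumps, a squared-error residual and a fixed learning rate, all of which are simplifying assumptions made so that the example stays self-contained.

```python
def fit_stump(X, residuals):
    """Find the single-feature threshold split that minimises squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    if best is None:  # no valid split exists: predict the mean residual everywhere
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, row):
    j, t, left_value, right_value = stump
    return left_value if row[j] <= t else right_value


def fit_boosting(X, y, n_rounds=20, learning_rate=0.5):
    """Train weak learners one by one, each on the residual left by the others."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # Target residual: true value minus current ensemble prediction.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, row)
                 for p, row in zip(preds, X)]
    return base, stumps, learning_rate


def predict(model, row):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)
```

A score above 0.5 can be read as "match" when the labels are 0 and 1; a production system would presumably rely on the library's own classifier rather than this toy loop.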
Specifically, the manually labeled first training data set may be divided into a training set and a test set at a ratio of 7:3; the Top-k features (surface text similarity features or character correlation features) are selected by feature importance and fed into the model; the model parameters are optimized by grid search; and the model is evaluated on the training set and the test set respectively, with the specific evaluation indexes shown in the following table.
TP is the number of true positives, that is, positive-class samples correctly predicted by the model; FP is the number of false positives, that is, negative-class samples that the model erroneously predicts as positive; FN is the number of false negatives, that is, positive-class samples that the model erroneously predicts as negative; TN is the number of true negatives, that is, negative-class samples correctly predicted by the model; TPR is the true positive rate, also called recall, that is, the proportion of all actually positive samples that the model predicts as positive; FPR is the false positive rate, that is, the proportion of all actually negative samples that the model erroneously predicts as positive.
The calculation logic of the confusion matrix in the above table is shown in the following table.
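The definitions above translate directly into code. The following sketch assumes binary labels encoded as 1 (match) and 0 (no match); the function names are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels encoded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def rates(y_true, y_pred):
    """True positive rate (recall) and false positive rate."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"TPR": tpr, "FPR": fpr}
```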
After training to obtain the first matching model, unlabeled data can be predicted based on the derived indicators and the first matching model to expand the range of literal matching. Thus, the literal matching between the full candidate product names of the first stage and the standard product names of the industry chain is completed. And for candidate product names which are not matched yet, entering a semantic matching stage for further matching.
Semantic matching aims to solve the problem that matches fail or are omitted because literal similarity matching cannot recognize semantic similarity, thereby further improving matching coverage and accuracy. The selection principles for the semantic similarity matching model are as follows:
1) The time cost and accuracy are taken into account comprehensively. Generally, the time cost of the self-building model is higher, but the flexibility is better, higher precision can be realized, and the pre-training model is opposite. If time cost is a priority, a pre-training model may be considered.
2) For a pre-training model, the following dimensions need to be considered: whether the data of the bottom layer pre-training is fit with the service requirement, whether the pre-training model is fit with the actual matching scene (for example, the actual scene belongs to a short text, and a pre-training model related to the short text should be adopted); accuracy comparison of the pre-trained model tests on the published dataset; the time overhead of the pre-trained model on the public dataset is compared.
3) For a self-building model, the following dimensions need to be considered: selecting a proper data source, wherein the data source can be derived from research report, enterprise annual report, field data and the like; when selecting a model, similar to a pre-training model, the consistency of the model and an actual scene, the accuracy of the model on a public data set and the time consumption need to be considered.
In the embodiment of the application, low-time cost factors are preferentially considered, so that a pre-training model is adopted as a second matching model to carry out semantic similarity matching, and for example, pre-training models such as SimNet, GPT-2 and the like can be adopted.
The process of semantic similarity calculation is further described below, and the specific steps are as follows.
In some embodiments of the present application, calculating a semantic similarity parameter between a candidate product name that is not successfully matched after processing by the first matching model and a standard product name using the second matching model comprises the steps of: converting the candidate product names into corresponding candidate word embedding sequences and converting the standard product names into corresponding standard word embedding sequences by adopting an input layer in the second matching model; converting the candidate word embedded sequence into a corresponding candidate semantic representation vector by adopting a representation layer in the second matching model, and converting the standard word embedded sequence into a corresponding standard semantic representation vector; and calculating cosine similarity between the candidate semantic representation vector and the standard semantic representation vector, and taking the cosine similarity as a semantic similarity parameter.
In an embodiment, the second matching model is a supervised neural network model capable of calculating the semantic similarity of short texts, and its framework includes an input layer, a representation layer and a matching layer. The input layer converts a text word sequence into a word embedding sequence (the candidate word embedding sequence and the standard word embedding sequence) by means of a lookup table. The representation layer converts the isolated word embedding sequence into a semantic representation vector carrying global information (the candidate semantic representation vector and the standard semantic representation vector), and further fully connected layers can be stacked to improve the representation. The matching layer performs interactive computation on the semantic representation vectors to construct a matching score, generally using cosine similarity, whose value lies in [-1, 1]. The closer the cosine similarity is to 1, the more similar the two texts are. The cosine similarity is calculated as follows:

$$\cos(X, Y) = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$$

wherein X and Y represent the candidate semantic representation vector and the standard semantic representation vector, respectively.
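The matching-layer score can be sketched as follows, assuming the two semantic representation vectors are available as equal-length lists of floats. Returning 0.0 for a zero vector is a convention chosen here for illustration, not something the embodiment specifies.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between vectors x and y, in the range [-1, 1]."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention for degenerate vectors
    return dot / (norm_x * norm_y)
```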
Further, the similarities between the candidate product names and the industry chain standard product names are sorted, and the Top-K candidate product names are selected. On this basis, a suitable similarity threshold Cut is selected in light of the specific results: if the similarity is greater than the threshold, the similarity value and the candidate product name are retained; otherwise, the similarity is set to 0. The related formula is as follows:

$$\mathrm{Sim}' = \begin{cases} \mathrm{Sim}, & \mathrm{Sim} > \mathrm{Cut} \\ 0, & \text{otherwise} \end{cases}$$
Further, if a candidate product name matches with a plurality of industry chain standard product names, sorting the similarity between the candidate product name and the plurality of industry chain standard product names, and reserving the matching with the highest similarity.
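The selection logic of the last two paragraphs can be sketched as below. The sketch assumes that Top-K is taken per standard product name, which is one reading of the text; the default values of `top_k` and `cut` are arbitrary illustrative choices.

```python
def select_matches(scores, top_k=2, cut=0.8):
    """Pick one standard product name per candidate from similarity scores.

    scores[standard][candidate] is the similarity of that pair.
    Returns {candidate: (standard, similarity)}.
    """
    best = {}
    for standard, by_candidate in scores.items():
        # Rank candidates for this standard name and keep the Top-K.
        ranked = sorted(by_candidate.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        for candidate, sim in ranked:
            sim = sim if sim > cut else 0.0  # thresholding with Cut
            if sim == 0.0:
                continue
            # A candidate matched to several standard names keeps the best one.
            if candidate not in best or sim > best[candidate][1]:
                best[candidate] = (standard, sim)
    return best
```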
And finally, integrating the results of the literal similarity matching and the semantic similarity matching, and marking the candidate product names on the matching to obtain an industry chain standard product name-candidate product name mapping table, namely the second mapping relation.
For example, the automotive industry can obtain a mapping table of industry chain standard product names to candidate product names through literal matching and semantic matching, as shown in the following table.
| Standard product name | Candidate product name |
| Automobile injection molding part | Injection molding part of automobile |
| Sound-deadening catalytic converter | Catalytic converter |
| …… | …… |
| Filter device | Gasoline filter |
Step 4, integrating the enterprise-candidate product name relation mapping table (i.e. the first mapping relation) constructed in the step 2 with the industry chain standard product name-candidate product name mapping table (i.e. the second mapping relation) constructed in the step 3, and constructing an enterprise-product relation (i.e. the third mapping relation) by the result that the enterprise name hits the industry chain standard product name;
Specifically, fig. 6 is a schematic flowchart of establishing an enterprise-product relationship according to an embodiment of the present application. As shown in fig. 6, the enterprise-industry chain standard product name relationship is established based on the enterprise-candidate product name mapping table obtained in step 2 and the industry chain standard product name-candidate product name mapping table obtained in step 3; the industry chain standard product names hit by each enterprise are then aggregated per enterprise and de-duplicated to obtain the final enterprise-product relationship (i.e. the third mapping relationship). The hit mapping table of industry chain standard product name-enterprise name is shown in the following table:
| Industry chain standard product name | Enterprise name |
| Transmission shaft | Some transmission shaft factory |
| Radiator | Some radiator Limited |
| …… | …… |
| Shock absorber | Some shock absorber Limited liability company |
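Composing the first and second mapping relations into the third can be sketched as a simple join on the candidate product name. The dictionary shapes and the enterprise names used in the usage note are hypothetical.

```python
def build_third_mapping(enterprise_to_candidates, candidate_to_standard):
    """Join enterprise->candidates with candidate->standard on the candidate name.

    Returns {standard product name: set of enterprises}; using sets removes
    duplicates when several candidates of one enterprise hit the same standard.
    """
    standard_to_enterprises = {}
    for enterprise, candidates in enterprise_to_candidates.items():
        for candidate in candidates:
            standard = candidate_to_standard.get(candidate)
            if standard is None:
                continue  # candidate was never matched to a standard name
            standard_to_enterprises.setdefault(standard, set()).add(enterprise)
    return standard_to_enterprises
```

For example, if a hypothetical "Enterprise A" lists both "gasoline filter" and "oil filter" and both map to "Filter device", the enterprise appears once under that standard product name.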
Step 5, evaluating the enterprise-product relationship (namely the third mapping relationship) output by step 4 from the service side and the technology side;
After the third mapping relationship is obtained, it may be evaluated, specifically as follows.
In some embodiments of the present application, after determining the third mapping relationship for characterizing the correspondence between standard product names and enterprises, the method further includes the steps of: determining a technical side evaluation index and a service side evaluation index of each third mapping relation, wherein the technical side evaluation index comprises: the number of candidate product names/standard product names/enterprises corresponding to the third mapping relation respectively accounts for the proportion of all candidate product names/standard product names/enterprises corresponding to the target industry, and the service side evaluation index is used for representing the accuracy degree of matching of the candidate product names/standard product names corresponding to the third mapping relation.
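The technical side evaluation index, read as a set of coverage ratios, can be sketched as follows. The argument names and dictionary shapes are assumptions made for illustration; the service side accuracy index depends on manual review and is not modelled here.

```python
def coverage(matched, universe):
    """Share of the universe that appears among the matched items."""
    universe = set(universe)
    return len(set(matched) & universe) / len(universe) if universe else 0.0


def technical_indicators(standard_to_enterprises, candidate_to_standard,
                         all_candidates, all_standards, all_enterprises):
    """Coverage of candidate names, standard names and enterprises."""
    matched_enterprises = set().union(*standard_to_enterprises.values())
    return {
        "candidate_coverage": coverage(candidate_to_standard.keys(), all_candidates),
        "standard_coverage": coverage(standard_to_enterprises.keys(), all_standards),
        "enterprise_coverage": coverage(matched_enterprises, all_enterprises),
    }
```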
Specifically, a series of indexes of the technical side and the service side are adopted to evaluate the matching result, namely the third mapping relation, and if the coverage or accuracy is too low, further optimization needs to be considered as shown in the following table.
Step 6, combining the enterprise-product relationship (namely the third mapping relationship) with the association relationship corresponding to the industry chain standard product names to construct the upstream-downstream relationship of the industrial chain (the target industrial chain).
Specifically, the industry chain standard product names and association relationships from step 1 and the enterprise-product relationship (i.e. the third mapping relationship) obtained in step 4 are used to build the upstream-downstream relationship of the target industry chain. In this embodiment, the method may be used to build target industry chains for multiple types of products, including at least tree-structured and network-structured upstream-downstream industry chains; taking the complete-vehicle downstream segment of the automobile manufacturing industry and antibiotics in the pharmaceutical manufacturing industry as respective examples, these are shown in fig. 7 and fig. 8. Taking the automobile chassis system node of the automobile industry chain as an example, the form of an industry node in which the constructed industry chain is integrated with the enterprise map information may be as shown in fig. 9.
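A minimal sketch of this final assembly step is given below. It represents the association relationship as a list of (upstream product, downstream product) pairs and attaches the enterprises from the third mapping relation to each product node; the data structure and the example product relations are hypothetical.

```python
def build_chain(upstream_edges, standard_to_enterprises):
    """Build product nodes with their downstream products and enterprises.

    upstream_edges: iterable of (upstream_product, downstream_product) pairs.
    standard_to_enterprises: {standard product name: set of enterprises}.
    """
    chain = {}
    for upstream, downstream in upstream_edges:
        for product in (upstream, downstream):
            # Create each node once and attach the enterprises that produce it.
            chain.setdefault(product, {
                "downstream": [],
                "enterprises": sorted(standard_to_enterprises.get(product, ())),
            })
        chain[upstream]["downstream"].append(downstream)
    return chain
```

A node with several downstream entries, or several nodes sharing one downstream product, yields the network-structured case; a strict tree arises when every product has at most one downstream product.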
According to the scheme, characteristics of different industries are comprehensively considered, and data sets of different categories are screened and combined from service authority, granularity, coverage, availability and the like of service description to serve as input, so that a foundation is laid for accurate matching between enterprises and products; the non-semantic and semantic similarity algorithms are fused and are mutually complemented, so that the matching cost can be effectively reduced, and the matching precision and coverage are improved; based on various non-semantic similarity algorithms, a classification model is constructed, and the accuracy of decision making is improved.
According to the embodiment of the application, an embodiment of an industrial chain construction system is also provided. Fig. 10 is a schematic structural diagram of an industrial chain construction system according to an embodiment of the present application. As shown in fig. 10, the system includes:
The candidate product name acquisition module 1000 is configured to acquire a candidate product name corresponding to a target industry, and determine a first mapping relationship between the candidate product name and an enterprise;
Optionally, obtaining the candidate product names corresponding to the target industry includes: acquiring an initial industry data set corresponding to a target industry, and screening the initial industry data set according to preset parameters to obtain the target industry data set, wherein the initial industry data set comprises at least one of the following: the management scope information, qualification certificate information, intellectual property information and invoice transaction information corresponding to the target industry, and the preset parameters comprise at least one of the following: service authority parameters, service description granularity parameters, coverage parameters and availability parameters, wherein the service authority parameters are used for representing the credibility degree of the data for proving that enterprises have the capability of producing certain products, the service description granularity is used for representing the fineness degree of the range of the products produced by the enterprises described by the data, the coverage parameters are used for representing the coverage degree of the data on the enterprises corresponding to the target industry, and the availability parameters are used for representing the difficulty degree of data acquisition; and performing word segmentation processing on the information in the target industry data set to obtain candidate product names.
Optionally, performing word segmentation processing on the information in the target industry data set to obtain candidate product names includes: splitting the information in the target industry data set according to preset splitting characters to obtain a plurality of first fields, wherein the preset splitting characters comprise: punctuation characters and hyphen characters; deleting each first field that contains a word in a preset stop-word lexicon to obtain second fields; determining the words in each second field whose part of speech is verb, and deleting those words from the second field to obtain noun fields; and determining the field length of each noun field, and determining each noun field whose field length is greater than a preset field length as a candidate product name.
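The four steps above (split, drop stop-word fields, strip verbs, filter by length) can be sketched as follows. This is only an illustrative sketch: the stop-word set, the verb set, the minimum length and the splitting pattern are hypothetical placeholders, and whitespace tokenization stands in for the word segmenter and part-of-speech tagger that real business scope text would need.

```python
import re

# Hypothetical resources: the application does not list the stop words,
# the verbs, the splitting characters, or the preset field length.
STOP_WORDS = {"technology", "service", "consulting"}
VERBS = {"sell", "produce", "manufacture"}
MIN_LENGTH = 3
SPLIT_PATTERN = re.compile(r"[,;.:()\-，；。、：（）]+")


def extract_candidates(text):
    """Return candidate product names extracted from one text record."""
    # Step 1: split on punctuation and hyphen characters -> first fields.
    first_fields = [f.strip() for f in SPLIT_PATTERN.split(text) if f.strip()]
    # Step 2: drop every first field that contains a stop word -> second fields.
    second_fields = [
        f for f in first_fields
        if not any(w in STOP_WORDS for w in f.lower().split())
    ]
    candidates = []
    for field in second_fields:
        # Step 3: delete words tagged as verbs -> noun field.
        noun_field = " ".join(w for w in field.split() if w.lower() not in VERBS)
        # Step 4: keep noun fields longer than the preset field length.
        if len(noun_field) > MIN_LENGTH:
            candidates.append(noun_field)
    return candidates


print(extract_candidates("produce lithium batteries; technology consulting; sell PV"))
```

In this toy input the second field is dropped because it contains a stop word, and the third is dropped because only a two-character noun field remains after the verb is removed.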
A second mapping relation determining module 1002, configured to determine a literal similarity and a semantic similarity between the candidate product name and a standard product name corresponding to the target industry, and determine a second mapping relation between the candidate product name and the standard product name according to the literal similarity and the semantic similarity;
Optionally, determining the second mapping relationship between the candidate product name and the standard product name according to the literal similarity and the semantic similarity includes: determining character similarity features between the candidate product names and the standard product names by adopting a first matching model, and determining the standard product names matched with the candidate product names according to the character similarity features to obtain a first matching relation; calculating, by adopting a second matching model, semantic similarity parameters between the standard product names and the candidate product names that were not successfully matched after being processed by the first matching model, and determining the standard product names matched with those candidate product names according to the semantic similarity parameters to obtain a second matching relation; and obtaining the second mapping relation according to the first matching relation and the second matching relation.
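The two-stage flow described above can be sketched as a simple cascade: a character-level matcher runs first, and only the candidates it leaves unmatched are passed to a semantic matcher. The scoring functions, the synonym table and both thresholds below are hypothetical stand-ins, not the first and second matching models of the application.

```python
from difflib import SequenceMatcher


def literal_score(candidate, standard):
    # Stand-in for the first matching model: a plain character-overlap ratio.
    return SequenceMatcher(None, candidate, standard).ratio()


# Hypothetical synonym table standing in for the second (semantic) model.
SYNONYMS = {"Li-ion cell": "lithium battery"}


def semantic_score(candidate, standard):
    return 1.0 if SYNONYMS.get(candidate) == standard else 0.0


def best_match(candidate, standards, score_fn, threshold):
    """Return the best-scoring standard name, or None if below threshold."""
    best = max(standards, key=lambda s: score_fn(candidate, s))
    return best if score_fn(candidate, best) >= threshold else None


def second_mapping(candidates, standards,
                   literal_threshold=0.8, semantic_threshold=0.7):
    first_matching, second_matching = {}, {}
    for candidate in candidates:
        # Stage 1: character-level matching.
        match = best_match(candidate, standards, literal_score, literal_threshold)
        if match is not None:
            first_matching[candidate] = match
            continue
        # Stage 2: semantic matching, only for names stage 1 could not match.
        match = best_match(candidate, standards, semantic_score, semantic_threshold)
        if match is not None:
            second_matching[candidate] = match
    # The second mapping relation is the union of both matching relations.
    return {**first_matching, **second_matching}


mapping = second_mapping(
    ["lithium battery", "Li-ion cell", "catering"],
    ["lithium battery", "solar panel"],
)
print(mapping)
```

Running the cheaper character-level stage first is what keeps the matching cost down: the semantic stage only sees the residue.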
Optionally, the character similarity features comprise: a surface text similarity feature and a character relevance feature, wherein the surface text similarity feature is used to characterize the degree and distance of matching of characters and/or words between the candidate product name and the standard product name, and the character relevance feature comprises at least one of the following: the length of the longest common substring between the candidate product name and the standard product name, the length of the longest common subsequence, the length of the candidate product name, and the length of the standard product name. The first matching model is obtained by training through the following steps: acquiring a first training data set, wherein the first training data set comprises a plurality of candidate product names and the standard product names corresponding to the candidate product names; training a weak classifier in a first initial model according to the first training data set to obtain a predicted value output by the weak classifier; and determining a target residual according to the predicted value and the true value in the first training data set, training the remaining weak classifiers in the first initial model according to the target residual, and repeating this training process until all the weak classifiers in the first initial model have been trained, so as to obtain the first matching model.
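As a rough illustration of the character relevance features and the residual-driven training loop described above, the sketch below computes four of the listed features and fits a small boosted ensemble on them. The choice of depth-one regression stumps as weak learners, squared-error residuals, the learning rate, the number of rounds and the toy name pairs are all assumptions made for illustration; the application does not specify them.

```python
def longest_common_substring(a, b):
    best, prev = 0, [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def longest_common_subsequence(a, b):
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            cur[j] = prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], cur[j - 1])
        prev = cur
    return prev[-1]


def character_features(candidate, standard):
    return [longest_common_substring(candidate, standard),
            longest_common_subsequence(candidate, standard),
            len(candidate), len(standard)]


def fit_stump(rows, residuals):
    """Best single-feature threshold split by squared error (weak learner)."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [e for r, e in zip(rows, residuals) if r[f] <= threshold]
            right = [e for r, e in zip(rows, residuals) if r[f] > threshold]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((e - lm) ** 2 for e in left) + sum((e - rm) ** 2 for e in right)
            if best is None or err < best[0]:
                best = (err, f, threshold, lm, rm)
    return best[1:] if best else None


def stump_predict(stump, row):
    f, threshold, lm, rm = stump
    return lm if row[f] <= threshold else rm


def fit_boosted(rows, labels, rounds=10, lr=0.5):
    base = sum(labels) / len(labels)
    preds, stumps = [base] * len(rows), []
    for _ in range(rounds):
        # Target residual: true value minus the current prediction.
        residuals = [y - p for y, p in zip(labels, preds)]
        stump = fit_stump(rows, residuals)
        if stump is None:
            break
        stumps.append(stump)
        preds = [p + lr * stump_predict(stump, r) for p, r in zip(preds, rows)]
    return base, stumps, lr


def predict(model, row):
    base, stumps, lr = model
    return base + lr * sum(stump_predict(s, row) for s in stumps)


# Hypothetical labelled pairs: 1 = same product, 0 = different product.
pairs = [("lithium cell", "lithium battery", 1), ("solar panel", "lithium battery", 0),
         ("PV panel", "solar panel", 1), ("steel pipe", "solar panel", 0)]
rows = [character_features(c, s) for c, s, _ in pairs]
model = fit_boosted(rows, [y for _, _, y in pairs])
print(predict(model, character_features("lithium cell", "lithium battery")))
```

A production system would more likely use an established gradient boosting library; the hand-written loop is shown only to make the "fit each new weak learner to the residual" step visible.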
Optionally, calculating, by using the second matching model, the semantic similarity parameter between the standard product name and the candidate product name that was not successfully matched after being processed by the first matching model includes: converting, by an input layer in the second matching model, the candidate product name into a corresponding candidate word embedding sequence and the standard product name into a corresponding standard word embedding sequence; converting, by a representation layer in the second matching model, the candidate word embedding sequence into a corresponding candidate semantic representation vector and the standard word embedding sequence into a corresponding standard semantic representation vector; and calculating the cosine similarity between the candidate semantic representation vector and the standard semantic representation vector, and taking the cosine similarity as the semantic similarity parameter.
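The input layer, representation layer and cosine step can be illustrated with a deliberately small sketch. The embedding table below is hypothetical toy data, and mean pooling is used as a stand-in for the representation layer; the application does not state which encoder the second matching model uses.

```python
import math

# Hypothetical toy word embeddings; a real input layer would use learned vectors.
EMBEDDINGS = {
    "lithium": [1.0, 0.0, 0.2], "li-ion": [0.9, 0.1, 0.2],
    "battery": [0.0, 1.0, 0.1], "cell": [0.1, 0.9, 0.1],
    "solar": [0.0, 0.0, 1.0], "panel": [0.2, 0.1, 0.9],
}


def embed_sequence(name):
    # Input layer: product name -> word embedding sequence.
    return [EMBEDDINGS[t] for t in name.lower().split() if t in EMBEDDINGS]


def represent(sequence):
    # Representation layer stand-in: mean pooling over the sequence.
    if not sequence:
        return None
    dim = len(sequence[0])
    return [sum(v[i] for v in sequence) / len(sequence) for i in range(dim)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_similarity(candidate, standard):
    u, v = represent(embed_sequence(candidate)), represent(embed_sequence(standard))
    if u is None or v is None:
        return 0.0  # no known words: treat as no semantic evidence
    return cosine_similarity(u, v)


print(semantic_similarity("Li-ion cell", "lithium battery"))
print(semantic_similarity("solar panel", "lithium battery"))
```

The first pair shares no word at the character level yet scores far higher than the second, which is the situation the semantic stage is meant to catch.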
Optionally, after determining the third mapping relationship for characterizing the correspondence between the standard product name and the enterprise, the second mapping relationship determining module 1002 is further configured to: determine a technical-side evaluation index and a business-side evaluation index of each third mapping relationship, wherein the technical-side evaluation index comprises: the proportions that the numbers of candidate product names, standard product names and enterprises corresponding to the third mapping relationship respectively account for among all candidate product names, standard product names and enterprises corresponding to the target industry, and the business-side evaluation index is used to characterize how accurately the candidate product names and standard product names corresponding to the third mapping relationship are matched.
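One possible reading of the technical-side evaluation index is a set of three coverage ratios, sketched below. The dictionary layout of the mappings, the function name and the sample data are assumptions made for illustration; the business-side index would require human-labelled correctness judgments and is not sketched.

```python
def technical_indices(first_mapping, second_mapping, all_standard_names):
    """Coverage ratios for candidate names, standard names and enterprises.

    first_mapping:  candidate product name -> set of enterprises
    second_mapping: candidate product name -> matched standard product name
    """
    all_candidates = set(first_mapping)
    all_enterprises = set().union(*first_mapping.values())
    all_standards = set(all_standard_names)

    matched_candidates = all_candidates & set(second_mapping)
    matched_standards = {second_mapping[c] for c in matched_candidates} & all_standards
    matched_enterprises = set().union(*(first_mapping[c] for c in matched_candidates))

    def ratio(part, whole):
        return len(part) / len(whole) if whole else 0.0

    return {
        "candidate_coverage": ratio(matched_candidates, all_candidates),
        "standard_coverage": ratio(matched_standards, all_standards),
        "enterprise_coverage": ratio(matched_enterprises, all_enterprises),
    }


# Hypothetical sample data.
first = {"Li-ion cell": {"A", "B"}, "PV panel": {"C"}, "catering": {"D"}}
second = {"Li-ion cell": "lithium battery", "PV panel": "solar panel"}
standards = ["lithium battery", "solar panel", "inverter", "wafer"]
print(technical_indices(first, second, standards))
```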
A third mapping relationship determining module 1004, configured to determine a third mapping relationship for characterizing a correspondence between a standard product name and an enterprise according to the first mapping relationship and the second mapping relationship;
the target industry chain construction module 1006 is configured to construct a target industry chain corresponding to a target industry according to the third mapping relationship and an association relationship corresponding to the standard product name, where the association relationship is used to characterize an upstream-downstream relationship of a product corresponding to the standard product name in production and life.
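How the last two modules fit together can be sketched as a composition of the first and second mappings followed by a small graph build. The data structures, function names and sample upstream/downstream pairs are hypothetical; the application does not prescribe a representation for the chain.

```python
from collections import defaultdict


def compose_third_mapping(first_mapping, second_mapping):
    """Standard product name -> enterprises, via the shared candidate names."""
    third = defaultdict(set)
    for candidate, standard in second_mapping.items():
        third[standard] |= first_mapping.get(candidate, set())
    return dict(third)


def build_chain(third_mapping, upstream_downstream):
    """Attach enterprises to product nodes and keep edges between known nodes.

    upstream_downstream: iterable of (upstream_product, downstream_product).
    """
    nodes = {std: sorted(ents) for std, ents in third_mapping.items()}
    edges = [(u, d) for u, d in upstream_downstream if u in nodes and d in nodes]
    return {"nodes": nodes, "edges": edges}


# Hypothetical sample data.
first = {"Li-ion cell": {"A", "B"}, "lithium cell": {"C"}, "EV pack": {"D"}}
second = {"Li-ion cell": "lithium battery", "lithium cell": "lithium battery",
          "EV pack": "battery pack"}
third = compose_third_mapping(first, second)
chain = build_chain(third, [("lithium battery", "battery pack"),
                            ("battery pack", "electric vehicle")])
print(chain)
```

In this toy run the edge to "electric vehicle" is dropped because no enterprise was mapped to that product; a real system might instead keep such nodes and flag them as uncovered.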
Note that each module in the above industrial chain construction system may be a program module (for example, a set of program instructions for implementing a specific function) or a hardware module; in the latter case, the modules may take, but are not limited to, the following forms: each module is implemented by its own processor, or the functions of all the modules are implemented by a single processor.
It should be noted that the industrial chain construction system provided in this embodiment may be used to execute the industrial chain construction method shown in fig. 2, so the explanation of the industrial chain construction method also applies to this embodiment and is not repeated here.
The embodiment of the application also provides a nonvolatile storage medium, which comprises a stored computer program, wherein equipment where the nonvolatile storage medium is located executes the following industrial chain construction method by running the computer program: obtaining candidate product names corresponding to target industries, and determining a first mapping relation between the candidate product names and enterprises; determining the literal similarity and the semantic similarity between the candidate product name and the standard product name corresponding to the target industry, and determining a second mapping relation between the candidate product name and the standard product name according to the literal similarity and the semantic similarity; determining a third mapping relation for representing the corresponding relation between the standard product name and the enterprise according to the first mapping relation and the second mapping relation; and constructing a target industry chain corresponding to the target industry according to the third mapping relation and the association relation corresponding to the standard product name, wherein the association relation is used for representing the upstream and downstream relation of the product corresponding to the standard product name in production and life.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present application, the description of each embodiment has its own emphasis; for a part that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be through some interfaces, units or modules, and may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied, in essence or in whole or in part, in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations are also intended to fall within the scope of protection of the present application.

Claims (8)

1. An industrial chain construction method, characterized by comprising the following steps:
Obtaining candidate product names corresponding to target industries, and determining a first mapping relation between the candidate product names and enterprises, wherein obtaining the candidate product names corresponding to the target industries comprises: acquiring an initial industry data set corresponding to the target industry, and screening the initial industry data set according to preset parameters to obtain the target industry data set, wherein the initial industry data set comprises at least one of the following: the business scope information, qualification certificate information, intellectual property information and invoice transaction information corresponding to the target industry, and the preset parameters comprise at least one of the following: a business authority parameter, a business description granularity parameter, a coverage parameter and an availability parameter, wherein the business authority parameter is used for representing the credibility of the data proving that an enterprise has the capability of producing a certain product, the business description granularity is used for representing the fineness degree of the range of the enterprise production product described by the data, the coverage parameter is used for representing the coverage degree of the data on the enterprise corresponding to the target industry, and the availability parameter is used for representing the difficulty degree of the data acquisition; word segmentation processing is carried out on the information in the target industry data set to obtain the candidate product names;
Determining the literal similarity and the semantic similarity between the candidate product names and the standard product names corresponding to the target industry, and determining a second mapping relation between the candidate product names and the standard product names according to the literal similarity and the semantic similarity, which comprises: determining character similarity features between the candidate product names and the standard product names by adopting a first matching model, and determining the standard product names matched with the candidate product names according to the character similarity features to obtain a first matching relation; calculating, by adopting a second matching model, semantic similarity parameters between the standard product names and the candidate product names which are not successfully matched after being processed by the first matching model, and determining the standard product names matched with the candidate product names according to the semantic similarity parameters to obtain a second matching relation; and obtaining the second mapping relation according to the first matching relation and the second matching relation, wherein the character similarity features comprise: a surface text similarity feature and a character relevance feature, wherein the surface text similarity feature is used to characterize the degree and distance of matching of characters and/or words between the candidate product name and the standard product name, and comprises: features in a set and edit distance dimension and features in a vector space model dimension, each of which comprises character-based and/or word-based features; and the character relevance feature comprises at least one of the following: the length of the longest common substring between the candidate product name and the standard product name, the length of the longest common subsequence, the length of the candidate product name, the length of the standard product name, the ratio of the length of the longest common substring to the length of the longest common subsequence, the ratio of the length of the longest common subsequence to the length of the longest common substring, the ratio of the length of the candidate product name to the length of the standard product name, the ratio of the length of the standard product name to the length of the candidate product name, the number of words in the standard product name, and the ratio of the number of words in the standard product name to the number of words in the candidate product name;
determining a third mapping relation for representing the corresponding relation between the standard product name and the enterprise according to the first mapping relation and the second mapping relation;
And constructing a target industry chain corresponding to the target industry according to the third mapping relation and the association relation corresponding to the standard product name, wherein the association relation is used for representing the upstream and downstream relation of the product corresponding to the standard product name in production and life.
2. The industrial chain construction method according to claim 1, wherein the first matching model is trained by:
Acquiring a first training data set, wherein the first training data set comprises a plurality of candidate product names and the standard product names corresponding to the candidate product names;
training a weak classifier in a first initial model according to the first training data set to obtain a predicted value output by the weak classifier;
And determining a target residual error according to the predicted value and the true value in the first training data set, training the rest weak classifiers in the first initial model according to the target residual error, and repeating the training process until all the weak classifiers in the first initial model are trained, so as to obtain the first matching model.
3. The method according to claim 1, wherein calculating semantic similarity parameters between the candidate product names that were not successfully matched after processing by the first matching model and the standard product names using a second matching model comprises:
Converting the candidate product names into corresponding candidate word embedding sequences and converting the standard product names into corresponding standard word embedding sequences by adopting an input layer in the second matching model;
Converting the candidate word embedded sequence into a corresponding candidate semantic representation vector by adopting a representation layer in the second matching model, and converting the standard word embedded sequence into a corresponding standard semantic representation vector;
And calculating cosine similarity between the candidate semantic representation vector and the standard semantic representation vector, and taking the cosine similarity as the semantic similarity parameter.
4. The method of claim 1, wherein performing word segmentation on the information in the target industry data set to obtain the candidate product names comprises:
Segmenting information in the target industry data set according to preset segmentation characters to obtain a plurality of first fields, wherein the preset segmentation characters comprise: punctuation characters and hyphen characters;
Deleting each first field that contains a word in a preset stop-word lexicon to obtain second fields;
Determining words with parts of speech being verbs in the second field, and deleting the words with parts of speech being verbs from the second field to obtain noun fields;
And determining the field length of each noun field, and determining the noun field with the field length larger than a preset field length as the candidate product name.
5. The industrial chain construction method according to claim 1, wherein after determining a third mapping relationship for characterizing a correspondence between the standard product name and the enterprise, the method further comprises:
Determining a technical-side evaluation index and a business-side evaluation index of each third mapping relation, wherein the technical-side evaluation index comprises: the proportions that the numbers of candidate product names, standard product names and enterprises corresponding to the third mapping relation respectively account for among all candidate product names, standard product names and enterprises corresponding to the target industry, and the business-side evaluation index is used to characterize how accurately the candidate product names and standard product names corresponding to the third mapping relation are matched.
6. An industrial chain construction system, comprising:
the candidate product name acquisition module is used for acquiring a candidate product name corresponding to a target industry and determining a first mapping relation between the candidate product name and an enterprise, wherein the candidate product name acquisition module is used for acquiring the candidate product name corresponding to the target industry and comprises the following steps: acquiring an initial industry data set corresponding to the target industry, and screening the initial industry data set according to preset parameters to obtain the target industry data set, wherein the initial industry data set comprises at least one of the following: the business scope information, qualification certificate information, intellectual property information and invoice transaction information corresponding to the target industry, and the preset parameters comprise at least one of the following: a business authority parameter, a business description granularity parameter, a coverage parameter and an availability parameter, wherein the business authority parameter is used for representing the credibility of the data proving that an enterprise has the capability of producing a certain product, the business description granularity is used for representing the fineness degree of the range of the enterprise production product described by the data, the coverage parameter is used for representing the coverage degree of the data on the enterprise corresponding to the target industry, and the availability parameter is used for representing the difficulty degree of the data acquisition; word segmentation processing is carried out on the information in the target industry data set to obtain the candidate product names;
A second mapping relation determining module, configured to determine the literal similarity and the semantic similarity between the candidate product name and a standard product name corresponding to the target industry, and determine a second mapping relation between the candidate product name and the standard product name according to the literal similarity and the semantic similarity, which comprises: determining character similarity features between the candidate product names and the standard product names by adopting a first matching model, and determining the standard product names matched with the candidate product names according to the character similarity features to obtain a first matching relation; calculating, by adopting a second matching model, semantic similarity parameters between the standard product names and the candidate product names which are not successfully matched after being processed by the first matching model, and determining the standard product names matched with the candidate product names according to the semantic similarity parameters to obtain a second matching relation; and obtaining the second mapping relation according to the first matching relation and the second matching relation, wherein the character similarity features comprise: a surface text similarity feature and a character relevance feature, wherein the surface text similarity feature is used to characterize the degree and distance of matching of characters and/or words between the candidate product name and the standard product name, and comprises: features in a set and edit distance dimension and features in a vector space model dimension, each of which comprises character-based and/or word-based features; and the character relevance feature comprises at least one of the following: the length of the longest common substring between the candidate product name and the standard product name, the length of the longest common subsequence, the length of the candidate product name, the length of the standard product name, the ratio of the length of the longest common substring to the length of the longest common subsequence, the ratio of the length of the longest common subsequence to the length of the longest common substring, the ratio of the length of the candidate product name to the length of the standard product name, the ratio of the length of the standard product name to the length of the candidate product name, the number of words in the standard product name, and the ratio of the number of words in the standard product name to the number of words in the candidate product name;
the third mapping relation determining module is used for determining a third mapping relation for representing the corresponding relation between the standard product name and the enterprise according to the first mapping relation and the second mapping relation;
The target industry chain construction module is used for constructing a target industry chain corresponding to the target industry according to the third mapping relation and the association relation corresponding to the standard product name, wherein the association relation is used for representing the upstream and downstream relation of the product corresponding to the standard product name in production and life.
7. An electronic device, comprising: a memory and a processor for executing a program stored in the memory, wherein the program is executed to perform the industrial chain construction method according to any one of claims 1 to 5.
8. A non-volatile storage medium, characterized in that the non-volatile storage medium comprises a stored computer program, wherein a device in which the non-volatile storage medium is located executes the industrial chain construction method according to any one of claims 1 to 5 by running the computer program.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410105299.0A CN117633518B (en) 2024-01-25 2024-01-25 Industrial chain construction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410105299.0A CN117633518B (en) 2024-01-25 2024-01-25 Industrial chain construction method and system

Publications (2)

Publication Number Publication Date
CN117633518A CN117633518A (en) 2024-03-01
CN117633518B true CN117633518B (en) 2024-04-26

Family

ID=90020277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410105299.0A Active CN117633518B (en) 2024-01-25 2024-01-25 Industrial chain construction method and system

Country Status (1)

Country Link
CN (1) CN117633518B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118690210B (en) * 2024-08-28 2024-10-29 西南科技大学 Industrial chain correction method, system and equipment based on evidence theory

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021164199A1 (en) * 2020-02-20 2021-08-26 齐鲁工业大学 Multi-granularity fusion model-based intelligent semantic chinese sentence matching method, and device
WO2021196520A1 (en) * 2020-03-30 2021-10-07 西安交通大学 Tax field-oriented knowledge map construction method and system
CN111598702A (en) * 2020-04-14 2020-08-28 徐佳慧 Knowledge graph-based method for searching investment risk semantics
CN114880486A (en) * 2022-05-13 2022-08-09 江苏省联合征信有限公司 Industry chain identification method and system based on NLP and knowledge graph
CN114911999A (en) * 2022-05-24 2022-08-16 中国电信股份有限公司 Name matching method and device
CN115438674A (en) * 2022-11-08 2022-12-06 腾讯科技(深圳)有限公司 Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment
CN116628172A (en) * 2023-07-24 2023-08-22 北京酷维在线科技有限公司 Dialogue method for multi-strategy fusion in government service field based on knowledge graph
CN117390198A (en) * 2023-10-24 2024-01-12 北京中电普华信息技术有限公司 Method, device, equipment and medium for constructing scientific and technological knowledge graph in electric power field

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
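Entity linking method based on BERT and TextRank keyword extraction; Zhan Fei; Zhu Yanhui; Liang Wentong; Ji Xiangbing; Journal of Hunan University of Technology; 20200715 (No. 04); full text *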

Also Published As

Publication number Publication date
CN117633518A (en) 2024-03-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant