CN112308388A

CN112308388A - Electric power engineering overhaul project risk auditing method based on semantic analysis

Info

Publication number: CN112308388A
Application number: CN202011135566.7A
Authority: CN
Inventors: 崔霞; 程子华; 戴斐斐; 孙常鹏; 李伯让; 徐征; 李博; 冯伟; 张耀心; 季忠俊; 刘德玉
Original assignee: State Grid Corp of China SGCC; State Grid Tianjin Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Tianjin Electric Power Co Ltd
Priority date: 2020-10-22
Filing date: 2020-10-22
Publication date: 2021-02-02

Abstract

The invention relates to the technical field of power grids, and in particular to a semantic-analysis-based risk auditing method for electric power engineering overhaul projects. The method comprises the following steps: S1, data collection; S2, training a Chinese word segmentation model; S3, data cleaning; S4, feature extraction from the word segmentation results; S5, model application. Python web crawler technology is used to collect specified field information on annual overhaul projects. This information is combined with the data stored in the PMS2.0 system of State Grid Tianjin Electric Power Company and held in a data warehouse, creating an independent audit analysis environment. Within this environment the audit data is further processed to improve its quality and is stored by audit theme, which improves the extensibility of audit analysis.

Description

Electric power engineering overhaul project risk auditing method based on semantic analysis

Technical Field

The invention relates to the technical field of power grids, in particular to a power engineering overhaul project risk auditing method based on semantic analysis.

Background

Electric power engineering plays a very important role in national development. Auditing, as a supervision mechanism, examines and supervises according to law the financial revenues and expenditures of government departments at all levels, financial institutions, enterprises and public institutions, so as to restrain improper economic activity, promote the stable operation of the economy, and ultimately support the healthy development of the national economy. At the current stage, however, electric power engineering audits still have a number of shortcomings: auditing in the early stage of a project is insufficient, too little audit attention is paid to the construction process, completion settlement audit materials are not prepared in time, and staged auditing leaves the audit work disconnected. These problems seriously hinder the development of electric power engineering auditing, prevent audits from discovering and disclosing problems in time, and ultimately prevent them from ensuring that electric power engineering projects are completed smoothly. To address these shortcomings, a semantic-analysis-based risk auditing method for electric power engineering overhaul projects is studied. Natural language processing technology is applied to project risk auditing so that a large part of the manual audit work is performed by computer, which greatly reduces the consumption of manpower and material resources and improves audit efficiency.

In information retrieval, natural language processing (NLP) technology can be divided into a word level and a level above the word. At the word level, the NLP techniques used in information retrieval mainly include word segmentation, compound phrase identification, and proper noun recognition. Since automatic word segmentation was first proposed in the field of Chinese information processing in the early 1980s, many experts and scholars have made substantial progress in this field, many word segmentation methods have been proposed, and some of the more mature techniques have been applied in commercial products. However, most audit research is limited to the analysis of structured audit data, and few scholars have studied unstructured audit data in depth. A report published by International Data Corporation (IDC) indicates that at most 5% of the data in an enterprise is structured and the remainder is mostly unstructured, and that 88% of business managers consider the unstructured data stored outside databases to be their best means of accessing and understanding the business.

Disclosure of Invention

The invention aims to overcome the defects of the technology and provides a power engineering overhaul project risk auditing method based on semantic analysis.

To achieve this purpose, the invention adopts the following technical scheme: a semantic-analysis-based risk auditing method for electric power engineering overhaul projects, characterized by comprising the following steps: S1, data collection; S2, training a Chinese word segmentation model; S3, data cleaning; S4, feature extraction from the word segmentation results; S5, model application.

Preferably, step S1 includes: (1) acquiring company audit data from different sources by using web crawler technology, and establishing a data warehouse; (2) analyzing the data structure of the audit data in the target systems; (3) capturing the target data by using Python web crawler software.

Preferably, establishing the data warehouse comprises: capturing the project plan files of the planning and plan management system; capturing the service data files of PMS2.0 of the planning and plan management system; and capturing publicly available professional data files related to the power business from the network.

Preferably, the target data capture comprises:

Step 1: building a Python web crawler environment;

Step 2: running a Python program to crawl the target data;

Step 3: preliminarily screening the crawled target data as required, retaining the useful field information, and establishing the audit warehouse files.

Preferably, step S2 includes: constructing the lexicon required for auditing; and performing word segmentation on the audit warehouse target files by using jieba, an open-source Chinese word segmentation tool.

Preferably, in step S3, the data cleaning includes stop-word removal and Chinese error correction.

Preferably, in step S4, the feature extraction from the word segmentation results includes: feature selection, feature processing, sample set establishment, and model building.

Preferably, the feature selection comprises: selecting 5 fields from the project plan table, namely 'project code, project name, project content, project start time and project end time'; and selecting 7 fields from the work ticket information file, namely 'ticket type, work content, work place, work place description, ticket ID, planned start time and planned end time';

the feature processing includes: Step 1: feature preprocessing, in which a different preprocessing mode is selected for each type of feature; Step 2: feature standardization;

the sample set is established as follows: based on the features of the overhaul project plan information and the work ticket information, professional power business personnel select correlated sample pairs, and a sample set for model training is established.

The modeling comprises the following steps:

(1) randomly ordering the sample set;

(2) dividing the sample set into a training set, a validation set and a test set, which account for 70%, 10% and 20% of the total number of samples, respectively;

(3) training the SVM classifier on the training set, fine-tuning its parameters on the validation set, and finally verifying the effectiveness of the model on the test set.

Preferably, in step S5, the model application includes similarity analysis and tag cloud visualization. The similarity analysis specifically includes: predicting new samples with the trained SVM classifier, obtaining the degree of association between a given work ticket and every project plan, sorting the degrees of association by magnitude, and taking the 5 project plans with the highest degree of association as the final result. The tag cloud visualization specifically includes: performing tag cloud visualization analysis on the audited text data so as to grasp its main content as a whole.

The invention has the following beneficial effects: (1) Python web crawler technology is used to collect specified field information on annual overhaul projects. This information is combined with the data stored in the PMS2.0 system of State Grid Tianjin Electric Power Company and held in a data warehouse, creating an independent audit analysis environment. Within this environment the audit data is further processed to improve its quality and is stored by audit theme, which improves the extensibility of audit analysis.

(2) For different audit analysis requirements, semantic recognition technology is used to identify the project construction content in the planning and plan management system and the work ticket information in the PMS system. Work ticket information matching the construction content is searched for according to the overhaul project list: if a match is found, the project is deemed to have been implemented; if no match is found, the project is listed as a suspicious point for focused verification.
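The matching rule in effect (2) can be illustrated with a minimal Python sketch. It assumes the semantic matching step has already produced a list of (project code, ticket ID) pairs; the function name and the sample identifiers are hypothetical and not taken from the original text.

```python
def flag_suspicious_projects(project_codes, matched_pairs):
    """Split projects into implemented ones and suspicious points.

    project_codes: codes from the overhaul project list.
    matched_pairs: (project code, ticket ID) pairs judged to match
                   by an upstream semantic matching step (assumed given).
    """
    matched = {project for project, _ticket in matched_pairs}
    implemented = [p for p in project_codes if p in matched]
    suspicious = [p for p in project_codes if p not in matched]
    return implemented, suspicious


# Hypothetical project list and matching results, for illustration only.
projects = ["P001", "P002", "P003"]
matches = [("P001", "T100"), ("P003", "T205"), ("P003", "T206")]
implemented, suspicious = flag_suspicious_projects(projects, matches)
```

In this sketch a project counts as implemented as soon as one work ticket matches it; projects left in `suspicious` would be handed to auditors for focused verification.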

Drawings

FIG. 1 is a schematic diagram of the present invention employing the python web crawler software.

Detailed Description

As shown in FIG. 1, a semantic-analysis-based risk auditing method for electric power engineering overhaul projects includes the following steps: S1, data collection; S2, training a Chinese word segmentation model; S3, data cleaning; S4, feature extraction from the word segmentation results; S5, model application. Wherein:

3.1 Data collection

3.1.1 Use web crawler technology to obtain company audit data from different sources and build a data warehouse.

(1) Capturing the project plan files of the planning and plan management system, comprising:

a) Production overhaul professional project planning report

b) Production overhaul professional project proposal

c) Production overhaul professional project planning approval file

d) Production overhaul professional project planning project library list

e) Production overhaul professional project completion report

(2) Capturing the service data files of PMS2.0 of the planning and plan management system, comprising:

a) Work ticket document

b) Work permit report

c) Completion report

(3) Capturing publicly available professional data files related to the power business from the network, comprising:

a) Common lexicon of the power industry

b) Professional lexicon of the power industry

c) Names of electric power equipment and facilities, such as Tianjin power transmission and transformation stations

3.1.2 Analyze the data structure of the audit data in the target systems

(1) Based on the project plan files of the planning and plan management system, a production overhaul professional project plan information table is established. It comprises fields such as project code, project name, construction management unit, unit to which the project belongs, voltage level (kV), project classification, professional category, professional subdivision, release status, project content (limited to 300 words), project start time, project end time, annual plan, pre-arrangement, feasibility study approval number, State Grid batch, provincial batch, Tianjin project ID, low-voltage project, remarks, project code, and ranking number.

(2) Based on the service data files of PMS2.0 of the planning and plan management system, a work ticket information file is established. It comprises fields such as ticket type, work content, work place, work place description, ticket-making department, operation and maintenance unit, person in charge of the work, work ticket issuer, ticket number, ticket status, planned start time, planned end time, permitted work time, work permitter, end time, completion permitter, ticket type, associated task list, return, completion status, ticket maker, delay time, feeder name, city name, number of work team members, and ticket ID.

(3) Based on the professional data files related to the power business, a power-related lexicon file is established. It comprises information such as common power industry terms, specialized power industry terms, power stations, power transmission stations, transformer substations, transformers, a directory of power equipment enterprises, and a directory of State Grid companies in Tianjin.

3.1.3 As shown in FIG. 1, target data capture is implemented with Python web crawler software

Step 1: building a Python web crawler environment;

Step 2: running a Python program to crawl the target data.
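As a rough illustration of the capture and screening steps, the following Python sketch parses an HTML table and keeps only the fields needed for the audit warehouse. The page layout, the sample rows and the names `TableRowParser` and `extract_records` are assumptions made for this example; the original text does not describe the target pages, and the fetching step itself is omitted.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_records(html, keep_fields):
    """Parse a table whose first row is the header; keep only useful fields."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [
        {name: value for name, value in zip(header, row) if name in keep_fields}
        for row in body
    ]


# Hypothetical page content standing in for a crawled response.
SAMPLE_PAGE = """
<table>
  <tr><th>project code</th><th>project name</th><th>remarks</th></tr>
  <tr><td>P001</td><td>Switch cabinet overhaul</td><td>n/a</td></tr>
</table>
"""
records = extract_records(SAMPLE_PAGE, {"project code", "project name"})
```

Only the standard library is used here so that the sketch stays self-contained; a production crawler would add request handling, authentication and error handling appropriate to the target system.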

3.2 Chinese word segmentation

A Chinese word segmentation model is trained to segment the audit data into words.

3.2.1 Building the lexicon required for auditing

Professional lexicons of the power industry, place-name lexicons, and lexicons specific to the State Grid companies are downloaded from relevant websites, and a dedicated lexicon for word segmentation is established.

3.2.2 Word segmentation

The open-source Chinese word segmentation tool jieba is used to segment the audit warehouse target files:

(1) segmenting fields such as 'project name, project content, project classification and annual plan' in the production overhaul professional project plan information table, and counting word frequencies;

(2) segmenting documents such as the project planning report, the project proposal and the project completion report, and counting word frequencies;

(3) segmenting fields such as 'ticket type, work content, work place, work place description and feeder name' in the work tickets, and counting word frequencies;

(4) where the segmentation result is inaccurate, manually adjusting the word frequencies and segmenting again to obtain a more accurate result.

Sample word segmentation results:

[ "national grid", "Tianjin", "Diwu", "Zhongliangzhuang", "converting station", "switch cabinet", "insulation", "overhaul" ]

[ "national grid", "Tianjin", "Diwu", "development area", "urban area", "distribution box station", "opening and closing station", "basic maintenance", "engineering" ]

[ "national grid", "Tianjin", "Diwu", "Lingyu", "distribution line", "foundation reinforcement", "engineering" ]

[ "national grid", "Tianjin", "treasure", "Baoan", "line", "three spans", "strain", "string change", "double hanging points", "drainage line", "overhaul" ]

[ "national grid", "Tianjin", "Diwu", "double King temple", "converting station", "switch cabinet", "overhaul" ]

[ "national grid", "Tianjin", "Diwu", "big mouth Tun", "substation", "Yi Jia Tun", "line", "cement", "pier", "attachment" ]
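To show how a domain lexicon drives segmentation and word frequency counting, here is a minimal Python sketch that uses forward maximum matching. This is a deliberately simple stand-in and not the algorithm jieba uses; the lexicon entries and the sample sentence are illustrative assumptions.

```python
from collections import Counter


def forward_max_match(text, lexicon):
    """Greedy dictionary segmentation: take the longest lexicon word at each position."""
    max_len = max((len(word) for word in lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no lexicon word matches.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical power-industry lexicon and project name.
lexicon = {"国网", "天津", "变电站", "开关柜", "大修"}
tokens = forward_max_match("国网天津变电站开关柜大修", lexicon)
word_freq = Counter(tokens)
```

Adding a missing term to `lexicon` and segmenting again mirrors, in a simplified way, the manual adjustment described in item (4) above.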

3.3 Data cleaning

3.3.1 Stop-word removal: removing useless tags, punctuation marks and special symbols that appear in the segmentation results;

3.3.2 Chinese error correction: bad-case analysis is performed on the corpus to check how erroneous text affects the results. If the effect is negligible, no correction is made; if the erroneous text has a large effect, it is corrected by statistical methods.
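The stop-word removal step in 3.3.1 can be sketched as a simple token filter. The sketch below assumes the segmentation results are lists of strings and that a stop-word list is available; the list contents and function names are hypothetical.

```python
import unicodedata


def is_symbol_only(token):
    """True if every character is punctuation, a symbol, a separator or a control character."""
    return all(unicodedata.category(ch)[0] in "PSZC" for ch in token)


def remove_stop_words(tokens, stop_words):
    """Drop stop words and tokens made up solely of punctuation or special symbols."""
    return [t for t in tokens if t not in stop_words and not is_symbol_only(t)]


# Hypothetical segmentation output and stop-word list.
raw_tokens = ["national grid", ",", "of", "switch cabinet", "(", "overhaul", "、"]
cleaned = remove_stop_words(raw_tokens, stop_words={"of"})
```

Unicode general categories are used so that both ASCII and full-width Chinese punctuation are filtered without listing each mark explicitly.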

3.4 Feature extraction from the word segmentation results

Keyword extraction is based on TF-IDF (Term Frequency-Inverse Document Frequency). The importance of a word is calculated from how often the word occurs in a given text and how often it occurs across the entire text corpus. A word or phrase is considered highly representative of a text if its term frequency (TF) in that text is high while it rarely appears in other texts.
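A minimal Python sketch of TF-IDF scoring follows. It uses one common variant, TF = count / document length and IDF = ln(N / document frequency); the original text does not state which variant or smoothing is used, so this choice is an assumption, as are the sample documents.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical segmented project names.
docs = [
    ["national grid", "Tianjin", "switch cabinet", "overhaul"],
    ["national grid", "Tianjin", "distribution line", "engineering"],
    ["national grid", "Tianjin", "converting station", "overhaul"],
]
scores = tf_idf(docs)
top_keyword = max(scores[0], key=scores[0].get)
```

Terms such as "national grid" that occur in every document receive a score of zero, while a term unique to one document, such as "switch cabinet", ranks highest for that document.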

3.4.1 Feature selection:

(1) From the project plan table, 5 fields are selected: 'project code, project name, project content, project start time and project end time'.

(2) From the work ticket information file, 7 fields are selected: 'ticket type, work content, work place, work place description, ticket ID, planned start time and planned end time'.

3.4.2 Feature processing

Step 1: feature preprocessing, in which a different preprocessing mode is selected for each type of feature:

(1) 'project code' and 'ticket ID' are unique numeric features and are retained as they are;

(2) 'ticket type' and 'work place' are categorical features and are processed by one-hot encoding;

There are 24 ticket categories, which are represented by a 6-bit binary number: the first two bits represent the work site, the middle two bits represent the work category, and the last two bits represent the work ticket category. Specific information about the substation or transmission station, the line, and the pole is extracted from the three fields 'work content', 'work place' and 'work place description' to form the work place feature representation.

Feature representation of ticket categories

(3) 'project name', 'project content' and 'work place description' are text features and are processed with a word vector model;

The project name content is represented by a 96-dimensional word vector:

Project content feature extraction and representation:

Work content feature extraction and representation:

(4) 'project start time', 'project end time', 'planned start time' and 'planned end time' are time features; the data is converted to the datetime type and then to timestamp information.

Project start time	Time information (year, month, day)	Feature representation (timestamp)
2018-01-01	(2018,1,1)	1514736000
2017-01-01	(2017,1,1)	1483200000
2016-01-01	(2016,1,1)	1451577600

Project end time	Time information (year, month, day)	Feature representation (timestamp)
2018-12-31	(2018,12,31)	1546185600
2017-12-31	(2017,12,31)	1514649600
2016-12-31	(2016,12,31)	1483113600

Planned end time	Time information (year, month, day, hour, minute, second)	Feature representation (timestamp)
2018/4/2 15:00:00	(2018,4,2,15,0,0)	1522652400
2017/3/27 16:00:00	(2017,3,27,16,0,0)	1490601600
2016/8/24 17:00:00	(2016,8,24,17,0,0)	1472029200
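The conversion from time strings to timestamps can be sketched in Python as follows. The tabulated values are consistent with interpreting the times as UTC+8, so the sketch assumes that time zone; the function name and the list of accepted formats are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: times are local to UTC+8, which reproduces the tabulated values.
UTC_PLUS_8 = timezone(timedelta(hours=8))

# The two string layouts that appear in the tables above.
FORMATS = ("%Y-%m-%d", "%Y/%m/%d %H:%M:%S")


def to_timestamp(text):
    """Parse a date or date-time string and return the Unix timestamp in seconds."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        return int(parsed.replace(tzinfo=UTC_PLUS_8).timestamp())
    raise ValueError(f"unrecognized time format: {text!r}")


project_start = to_timestamp("2018-01-01")
planned_end = to_timestamp("2018/4/2 15:00:00")
```

Attaching an explicit time zone keeps the result independent of the machine on which the code runs.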

Step 2: feature standardization

To eliminate the influence of differing dimensions (units and scales) between indicators, the data must be standardized so that the indicators become comparable. Data standardization mainly comprises two aspects: same-trend processing and dimensionless processing.

Same-trend processing addresses data of different natures. Directly summing indicators of different natures cannot correctly reflect the combined result of the different effects, so the inverse indicators are first transformed so that all indicators act on the evaluation in the same direction; only then can they be summed to obtain a correct result.

Dimensionless processing addresses the comparability of data. Through standardization, all raw data is converted into dimensionless index evaluation values, that is, all index values (all features) are on the same order of magnitude and can be compared and analyzed together.
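Both aspects of standardization can be sketched together in Python. The original text does not name a specific scaling formula, so min-max scaling to the range [0, 1] is assumed here, and inverse indicators are flipped so that larger always means the same direction; the function names are hypothetical.

```python
def min_max_scale(values):
    """Dimensionless processing: map values linearly onto [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def standardize(values, inverse=False):
    """Scale a feature column; flip inverse indicators (same-trend processing)."""
    scaled = min_max_scale(values)
    return [1.0 - s for s in scaled] if inverse else scaled


forward = standardize([0, 5, 10])
flipped = standardize([0, 5, 10], inverse=True)
```

After this step a timestamp feature and a count feature, for example, both lie in [0, 1] and can be fed to the same classifier without one dominating the other.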

3.4.3 Establishing the sample set:

Based on the features of the overhaul project plan information and the work ticket information, professional power business personnel select correlated sample pairs and establish a sample set for model training. The sample format is as follows:

3.4.4 Modeling:

(1) Randomly shuffle the sample set.

(2) Divide the sample set into a training set, a validation set and a test set, accounting for 70%, 10% and 20% of the total number of samples, respectively.
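The shuffle and 70/10/20 split can be sketched in Python as follows. The fixed random seed and the function name are assumptions added for reproducibility; the classifier training that consumes these subsets is not shown.

```python
import random


def split_samples(samples, seed=0):
    """Shuffle the samples and split them 70% / 10% / 20%."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * 0.7)
    n_val = round(len(shuffled) * 0.1)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, about 20%
    return train, validation, test


# Hypothetical sample identifiers standing in for project/ticket pairs.
train, validation, test = split_samples(range(100))
```

Taking the test set as the remainder guarantees that every sample lands in exactly one subset even when the total is not a multiple of ten.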

3.5 Model application

(1) Similarity analysis

New samples are predicted with the trained SVM classifier, giving the degree of association between a given work ticket and every project plan. The degrees of association are sorted by magnitude, and the 5 project plans with the highest degree of association are taken as the final result.
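The ranking step can be sketched in a few lines of Python. The sketch assumes the classifier has already produced one association score per project plan for the work ticket in question; the scores and project codes below are invented for illustration.

```python
def top_matches(association, k=5):
    """Return the k (project code, score) pairs with the highest association."""
    ranked = sorted(association.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical association degrees between one work ticket and seven project plans.
association = {
    "P001": 0.91, "P002": 0.15, "P003": 0.78, "P004": 0.66,
    "P005": 0.42, "P006": 0.30, "P007": 0.05,
}
best = top_matches(association)
best_codes = [code for code, _score in best]
```

Returning the scores alongside the codes lets an auditor see how strong each candidate match is rather than only its rank.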

(2) Tag cloud visualization

Tag cloud visualization analysis of the audited text data makes it possible to grasp the main content of the text as a whole. A tag cloud is a text visualization method that consists of a group of related tags and their corresponding weights; the tags are arranged alphabetically or in another order and presented, optionally with varying color depth, for the user to browse. The weight determines the font size, color or other visual effects of each tag. Here the color depth and font size are set automatically according to the word frequencies of the segmentation results, and the result is displayed visually.
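How word frequency might be mapped to font size can be sketched as below. A linear mapping between a minimum and maximum size is assumed; the size limits, function name and sample frequencies are hypothetical, and color depth could be derived from the same weight in the same way.

```python
def tag_sizes(word_freq, min_size=12, max_size=48):
    """Map each word's frequency linearly onto a font size in points."""
    if not word_freq:
        return {}
    low, high = min(word_freq.values()), max(word_freq.values())
    span = high - low
    sizes = {}
    for word, freq in word_freq.items():
        # With a single distinct frequency, every tag gets the maximum size.
        weight = (freq - low) / span if span else 1.0
        sizes[word] = round(min_size + weight * (max_size - min_size))
    return sizes


# Hypothetical word frequencies from the segmentation results.
sizes = tag_sizes({"overhaul": 10, "switch cabinet": 4, "insulation": 1})
```

The most frequent word is drawn at the largest size and the rarest at the smallest, so the dominant topics of the audited text stand out at a glance.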

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A semantic-analysis-based risk auditing method for electric power engineering overhaul projects, characterized by comprising the following steps: S1, data collection; S2, training a Chinese word segmentation model; S3, data cleaning; S4, feature extraction from the word segmentation results; S5, model application.

2. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 1, characterized in that: step S1 includes: (1) acquiring company audit data from different sources by using web crawler technology, and establishing a data warehouse; (2) analyzing the data structure of the audit data in the target systems; (3) capturing the target data by using Python web crawler software.

3. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 2, characterized in that: establishing the data warehouse comprises: capturing the project plan files of the planning and plan management system; capturing the service data files of PMS2.0 of the planning and plan management system; and capturing publicly available professional data files related to the power business from the network.

4. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 2, characterized in that: the target data capture comprises:

Step 1: building a Python web crawler environment;

Step 2: running a Python program to crawl the target data;

5. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 1, characterized in that: step S2 includes: constructing the lexicon required for auditing; and performing word segmentation on the audit warehouse target files by using jieba, an open-source Chinese word segmentation tool.

6. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 1, characterized in that: in step S3, the data cleaning includes stop-word removal and Chinese error correction.

7. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 1, characterized in that: in step S4, the feature extraction from the word segmentation results includes: feature selection, feature processing, sample set establishment, and model building.

8. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 7, characterized in that: the feature selection comprises: selecting 5 fields from the project plan table, namely 'project code, project name, project content, project start time and project end time'; and selecting 7 fields from the work ticket information file, namely 'ticket type, work content, work place, work place description, ticket ID, planned start time and planned end time';

The modeling comprises the following steps:

(1) randomly ordering the sample set;

9. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 1, characterized in that: in step S5, the model application includes similarity analysis and tag cloud visualization, wherein the similarity analysis specifically includes predicting new samples with the trained SVM classifier, obtaining the degree of association between a given work ticket and every project plan, sorting the degrees of association by magnitude, and taking the 5 project plans with the highest degree of association as the final result, and the tag cloud visualization specifically includes performing tag cloud visualization analysis on the audited text data so as to grasp its main content as a whole.