CN112308388A - Electric power engineering overhaul project risk auditing method based on semantic analysis - Google Patents
Electric power engineering overhaul project risk auditing method based on semantic analysis
- Publication number
- CN112308388A CN202011135566.7A
- Authority
- CN
- China
- Prior art keywords
- data
- project
- electric power
- audit
- method based
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0635—Risk analysis of enterprise or organisation activities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
- G06Q10/103—Workflow collaboration or project management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Business, Economics & Management (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Human Resources & Organizations (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Strategic Management (AREA)
- Economics (AREA)
- Entrepreneurship & Innovation (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- General Business, Economics & Management (AREA)
- Marketing (AREA)
- Tourism & Hospitality (AREA)
- Databases & Information Systems (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Computational Biology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Quality & Reliability (AREA)
- Operations Research (AREA)
- Development Economics (AREA)
- Game Theory and Decision Science (AREA)
- Probability & Statistics with Applications (AREA)
- Educational Administration (AREA)
- Public Health (AREA)
- Water Supply & Treatment (AREA)
- Primary Health Care (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to the technical field of power grids, and in particular to a semantic-analysis-based risk auditing method for electric power engineering overhaul projects. The method comprises the following steps: S1, data collection; S2, training a Chinese word segmentation model; S3, data cleaning; S4, feature extraction from the word segmentation results; S5, model application. Python web crawler technology is used to collect the specified field information of annual overhaul projects. Combined with the data stored in the PMS2.0 system of State Grid Tianjin Electric Power Company, a data warehouse stores the information collected by the web crawler, creating an independent audit analysis environment. In this environment the quality-improved audit data are further processed and stored by audit topic, which improves the extensibility of audit analysis.
Description
Technical Field
The invention relates to the technical field of power grids, in particular to a power engineering overhaul project risk auditing method based on semantic analysis.
Background
Electric power engineering plays a very important role in national development. Auditing, as a supervision mechanism, examines and supervises according to law the financial revenues and expenditures of major items of government departments at all levels, state financial institutions, enterprises and public institutions, so as to restrain negative economic activities, promote the stable operation of the social economy and ultimately keep the national economy developing healthily. At the current stage, however, electric power engineering audits still have defects and problems: the audit in the early stage of a project is insufficient, the construction process receives too little audit attention, completion settlement audit materials are not prepared in time, and staged auditing leaves the audit work disconnected. These problems seriously hinder the development of electric power engineering auditing, prevent the audit from discovering and disclosing problems in time, and ultimately impede the smooth completion of electric power engineering projects. To address these defects and problems, a semantic-analysis-based risk auditing method for electric power engineering overhaul projects is studied. Natural language processing technology is applied to project risk auditing so that a large part of the manual audit work is done by computer, which greatly reduces the consumption of manpower and material resources and improves audit efficiency.
In information retrieval, natural language processing technology can be divided into a word level and an above-word level. At the first level, the NLP techniques used in information retrieval mainly include word segmentation, compound phrase recognition and proper noun recognition. Since automatic word segmentation was proposed in the field of Chinese information processing in the early 1980s, many experts and scholars have made encouraging progress in this field, many word segmentation methods have been proposed, and some of the more mature techniques have been applied in commercial products. However, most research is limited to the analysis of structured audit data, and few scholars have studied unstructured audit data in depth. A report published by International Data Corporation (IDC) shows that at most 5% of the data in an enterprise is structured and most of the remainder is unstructured, and that 88% of enterprise managers consider the unstructured data stored outside databases to be the best object through which to approach and understand the enterprise.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a power engineering overhaul project risk auditing method based on semantic analysis.
In order to achieve this purpose, the invention adopts the following technical scheme: a power engineering overhaul project risk auditing method based on semantic analysis, characterized by comprising the following steps: S1, data collection; S2, training a Chinese word segmentation model; S3, data cleaning; S4, feature extraction from the word segmentation results; S5, model application.
Preferably, step S1 includes: (1) acquiring company audit data from different sources by using web crawler technology, and establishing a data warehouse; (2) analyzing the data structure of the audit data in the target systems; (3) capturing the target data with Python web crawler software.
Preferably, establishing the data warehouse comprises: capturing project plan files of the planning and plan management system; capturing service data files of PMS2.0 of the planning and plan management system; and capturing publicly available professional data files related to the power business from the network.
Preferably, the target data capture comprises:
Step 1: building a Python web crawler environment;
Step 2: running a Python program to crawl the target data;
Step 3: preliminarily screening the crawled target data as needed, retaining the useful field information, and establishing the audit warehouse files.
Preferably, step S2 includes: constructing the word bank required for auditing; and performing word segmentation on the audit warehouse target files by using jieba, an open-source Chinese word segmentation tool.
Preferably, in step S3, the data cleaning includes stop-word removal and Chinese error correction.
Preferably, in step S4, the feature extraction from the word segmentation results includes: feature selection, feature processing, sample group establishment and model establishment.
Preferably, the feature selection comprises: selecting from the project plan table the 5 fields 'project code, project name, project content, project start time and project end time'; and selecting from the work ticket information file the 7 fields 'ticket type, work content, work place description, ticket ID, planned start time and planned end time';
the feature processing includes: Step 1: feature preprocessing, in which a different preprocessing mode is selected for each type of feature; Step 2: feature standardization;
the sample group is established as follows: based on the features of the overhaul project plan information and the work ticket information, professional power business personnel select correlated sample combinations and establish the sample group for model training.
The model establishment comprises the following steps:
(1) randomly shuffling the sample set;
(2) dividing the sample set into a training set, a verification set and a test set, which account for 70%, 10% and 20% of the total number of samples respectively;
(3) training the SVM classifier with the training set, fine-tuning the parameters with the verification set, and finally verifying the effectiveness of the model with the test set.
Preferably, in step S5, the model application includes similarity analysis and tag cloud visualization. In the similarity analysis, new samples are predicted with the trained SVM classifier, the association degrees between a given work ticket and all project plans are computed and sorted by size, and the 5 project plans with the highest association degrees are taken as the final result. In the tag cloud visualization, tag cloud analysis is performed on the audited text data so that its main content can be grasped as a whole.
The invention has the following beneficial effects: (1) Python web crawler technology is used to collect the specified field information of annual overhaul projects. Combined with the data stored in the PMS2.0 system of State Grid Tianjin Electric Power Company, a data warehouse stores the information collected by the web crawler, creating an independent audit analysis environment. In this environment the quality-improved audit data are further processed and stored by audit topic, which improves the extensibility of audit analysis.
(2) For different audit analysis requirements, semantic recognition technology is used to identify the project construction content in the planning and plan management system and the work ticket information in the PMS. Work ticket information matching the construction content is searched for according to the overhaul project list. If a match is found, the project is considered to have been implemented; if not, the project is listed as a suspicious point for focused verification.
Drawings
FIG. 1 is a schematic diagram of target data capture using Python web crawler software according to the present invention.
Detailed Description
As shown in FIG. 1, a semantic-analysis-based risk auditing method for electric power engineering overhaul projects includes the following steps: S1, data collection; S2, training a Chinese word segmentation model; S3, data cleaning; S4, feature extraction from the word segmentation results; S5, model application. Wherein:
3.1 Data collection
3.1.1 Use web crawler technology to obtain company audit data from different sources and build a data warehouse.
(1) Capturing project plan files of the planning and plan management system, comprising:
a) Production overhaul professional project planning report
b) Production overhaul professional project proposal
c) Production overhaul professional project planning approval file
d) Production overhaul professional project planning project library list
e) Production overhaul professional project completion report
(2) Capturing service data files of PMS2.0 of the planning and plan management system, comprising:
a) Work ticket document
b) Work permit report
c) Completion report
(3) Capturing publicly available professional data files related to the power business from the network, comprising:
a) Common word bank of the power industry
b) Professional word bank of the power industry
c) Names of electric power equipment and facilities, such as Tianjin power transmission and transformation stations
3.1.2 Analyze the data structure of the audit data in the target systems
(1) According to the project plan files of the planning and plan management system, a production overhaul professional project plan information table is established. It comprises fields such as project code, project name, construction management unit, unit to which the project belongs, voltage level (kV), project classification, professional category, professional subdivision, issuing status, project content (limited to 300 words), project start time, project end time, annual plan, pre-arrangement, feasibility study approval number, State Grid sub-batch, provincial sub-batch, Tianjin project ID, low-voltage project, remarks, project code and sequence number.
(2) According to the service data files of PMS2.0 of the planning and plan management system, a work ticket information file is established. It comprises fields such as ticket type, work content, work place, work place description, ticket-making department, operation and maintenance unit, person in charge of the work, work ticket issuer, ticket number, ticket status, planned start time, planned end time, permitted work time, work licensor, end time, completion licensor, ticket type, associated task list, return, completion status, ticket maker, delay time, affiliated feeder name, affiliated city name, number of work group members and ticket ID.
(3) According to the professional data files related to the power business, a power-related word bank file is established. It comprises information such as common power industry words, professional power industry terms, power stations, power transmission stations, substations, transformers, a directory of power equipment enterprises and a directory of State Grid companies in Tianjin.
3.1.3 As shown in FIG. 1, target data capture is implemented with Python web crawler software
Step 1: build a Python web crawler environment;
Step 2: run a Python program to crawl the target data;
Step 3: preliminarily screen the crawled target data as needed, retain the useful field information, and establish the audit warehouse files.
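The screening in the third step can be illustrated with a minimal sketch. It assumes the crawler has already produced one dictionary per record; the function names, the choice of retained fields and the sample values are hypothetical and are not taken from the patent.

```python
import csv
import io

# Hypothetical set of fields to retain; the names follow the field list in 3.1.2.
USEFUL_FIELDS = ["project code", "project name", "project content",
                 "project start time", "project end time"]

def screen_records(records, useful_fields=USEFUL_FIELDS):
    """Keep only the useful fields and drop records that lack a project code."""
    screened = []
    for record in records:
        if not record.get("project code"):
            continue  # a record without its key field cannot be matched later
        screened.append({name: record.get(name, "") for name in useful_fields})
    return screened

def to_warehouse_csv(records, useful_fields=USEFUL_FIELDS):
    """Serialize the screened records as CSV text for an audit warehouse file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=useful_fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

# Illustrative crawled records (made-up values, not data from the audited systems).
crawled = [
    {"project code": "P001", "project name": "switch cabinet overhaul",
     "project content": "insulation overhaul", "project start time": "2018-01-01",
     "project end time": "2018-12-31", "remarks": "n/a"},
    {"project code": "", "project name": "incomplete row"},
]
screened = screen_records(crawled)
print(to_warehouse_csv(screened))
```

Keeping the screening separate from the crawling step means the same filter can be reused for each of the three data sources.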
3.2 Chinese word segmentation
A Chinese word segmentation model is trained to segment the audit data.
3.2.1 Building the word bank required for auditing
Professional word banks of the power industry, local place-name word banks and word banks specific to State Grid companies are downloaded from relevant websites, and a word bank dedicated to word segmentation is established.
3.2.2 Word segmentation
The open-source Chinese word segmentation tool jieba is used to segment the audit warehouse target files:
(1) segmenting fields such as 'project name, project content, project classification and annual plan' in the production overhaul professional project plan information table, and counting word frequencies;
(2) segmenting documents such as the project planning report, project proposal and project completion report, and counting word frequencies;
(3) segmenting fields such as 'ticket type, work content, work place, work place description and affiliated feeder name' in the work tickets, and counting word frequencies;
(4) where the segmentation result is inaccurate, the word frequency can be adjusted manually and the text segmented again to obtain a more accurate result.
Example word segmentation results:
["national grid", "Tianjin", "Diwu", "Zhongliangzhuang", "converting station", "switch cabinet", "insulation", "overhaul"]
["national grid", "Tianjin", "Diwu", "development area", "urban area", "distribution box station", "opening and closing station", "basic maintenance", "engineering"]
["national grid", "Tianjin", "Diwu", "Lingyu", "distribution line", "foundation reinforcement", "engineering"]
["national grid", "Tianjin", "treasure", "Baoan", "line", "three spans", "strain", "string change", "double hanging points", "drainage line", "overhaul"]
["national grid", "Tianjin", "Diwu", "double King temple", "converting station", "switch cabinet", "overhaul"]
["national grid", "Tianjin", "Diwu", "big mouth Tun", "substation", "Yi Jia Tun", "line", "cement", "pier", "attachment"]
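The word frequency counting mentioned in items (1) to (3) can be sketched as follows. The sketch assumes the text has already been segmented (in practice the token lists would come from jieba, for example from `jieba.lcut`, which is not called here) and reuses three of the token lists shown above; the function name is hypothetical.

```python
from collections import Counter

# Token lists copied from three of the example segmentation results.
segmented = [
    ["national grid", "Tianjin", "Diwu", "Zhongliangzhuang", "converting station",
     "switch cabinet", "insulation", "overhaul"],
    ["national grid", "Tianjin", "Diwu", "double King temple", "converting station",
     "switch cabinet", "overhaul"],
    ["national grid", "Tianjin", "Diwu", "Lingyu", "distribution line",
     "foundation reinforcement", "engineering"],
]

def count_word_frequency(token_lists):
    """Accumulate how often each token occurs across all segmented texts."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts

word_freq = count_word_frequency(segmented)
for word, freq in word_freq.most_common(5):
    print(word, freq)
```

The resulting counts are the kind of input that a manual frequency adjustment, as in item (4), or the tag cloud in section 3.5 would work from.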
3.3 Data cleaning
3.3.1 Stop-word removal: removing useless tags, punctuation marks and special symbols that appear in the segmentation results;
3.3.2 Chinese error correction: bad-case analysis is performed on the corpus to check how erroneous corpus entries affect the result. If the influence is negligible, no processing is done; if the erroneous entries have a large influence on the problem, they are corrected by statistical methods.
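A minimal sketch of the stop-word removal in 3.3.1 is given below. The stop-word list is a hypothetical placeholder (a real list would be curated for the audit corpus), and punctuation and special symbols are detected through Unicode character categories, which is an implementation choice not stated in the source.

```python
import unicodedata

# Hypothetical stop-word list: leftover markup tags and one common function word.
STOP_WORDS = {"<br>", "&nbsp;", "的"}

def is_punct_or_symbol(token):
    """True if every character is punctuation (P*), a symbol (S*) or a separator (Z*)."""
    return all(unicodedata.category(ch)[0] in ("P", "S", "Z") for ch in token)

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop empty tokens, listed stop words, punctuation and special symbols."""
    cleaned = []
    for token in tokens:
        token = token.strip()
        if not token or token in stop_words or is_punct_or_symbol(token):
            continue
        cleaned.append(token)
    return cleaned

tokens = ["national grid", ",", "Tianjin", "<br>", "。", "switch cabinet", "★", " ", "overhaul"]
print(remove_stop_words(tokens))
# ['national grid', 'Tianjin', 'switch cabinet', 'overhaul']
```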
3.4 Feature extraction from the word segmentation results
Keyword extraction is based on TF-IDF (Term Frequency-Inverse Document Frequency). The importance of a word is calculated from how often it occurs in a given text and how often it occurs across the entire text corpus. A word or phrase is considered highly representative of a text if its term frequency (TF) in that text is high while it rarely appears in other texts.
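The TF-IDF weighting described above can be sketched as follows. The source does not state which TF-IDF variant is used, so this sketch assumes one common form: TF is the share of a text's tokens taken up by the word, and IDF is the natural logarithm of the number of texts divided by the number of texts containing the word. The sample documents are illustrative.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: score} dict per document.

    tf  = occurrences of the word in the document / number of tokens in it
    idf = log(number of documents / number of documents containing the word)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

docs = [
    ["Tianjin", "converting station", "switch cabinet", "overhaul"],
    ["Tianjin", "distribution line", "foundation reinforcement", "engineering"],
    ["Tianjin", "line", "drainage line", "overhaul"],
]
scores = tf_idf(docs)
for word, score in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{word}: {score:.4f}")
```

In this example "Tianjin" occurs in every text and therefore scores zero, while "switch cabinet" occurs in only one text and scores highest there, which matches the intuition stated above.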
3.4.1 Feature selection:
(1) From the project plan table, the 5 fields 'project code, project name, project content, project start time and project end time' are selected.
(2) From the work ticket information file, the 7 fields 'ticket type, work content, work place description, ticket ID, planned start time and planned end time' are selected.
3.4.2 Feature processing
Step 1: feature preprocessing, in which a different preprocessing mode is selected for each type of feature:
(1) 'project code' and 'ticket ID' are unique numeric features and are retained as they are;
(2) 'ticket type' and 'work place' are categorical features and are processed with one-hot encoding;
There are 24 ticket categories, which are represented by a 6-bit binary number: the first two bits represent the work site, the middle two bits the work category, and the last two bits the work ticket category. The specific content concerning transformer/transmission stations, lines and poles is extracted from the three fields 'work content', 'work place' and 'work place description' to form the work place feature representation.
Feature representation of ticket categories
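The 6-bit layout described above (two bits each for work site, work category and work ticket category) can be sketched as follows. The source does not list the actual category values, so the three code tables here are hypothetical; they are sized 4, 3 and 2 only so that the combinations add up to the 24 categories mentioned.

```python
from itertools import product

# Hypothetical code tables: the source does not list the actual categories.
WORK_SITES = ["station", "line", "pole", "other"]        # first two bits
WORK_CATEGORIES = ["overhaul", "maintenance", "test"]     # middle two bits
TICKET_CATEGORIES = ["first type", "second type"]         # last two bits

def encode_ticket(site, work_category, ticket_category):
    """Pack three 2-bit codes into a 6-character binary string."""
    codes = (
        WORK_SITES.index(site),
        WORK_CATEGORIES.index(work_category),
        TICKET_CATEGORIES.index(ticket_category),
    )
    return "".join(format(code, "02b") for code in codes)

def decode_ticket(bits):
    """Inverse of encode_ticket: recover the three category labels."""
    return (
        WORK_SITES[int(bits[0:2], 2)],
        WORK_CATEGORIES[int(bits[2:4], 2)],
        TICKET_CATEGORIES[int(bits[4:6], 2)],
    )

print(encode_ticket("line", "test", "second type"))  # 011001
all_codes = {encode_ticket(*combo)
             for combo in product(WORK_SITES, WORK_CATEGORIES, TICKET_CATEGORIES)}
print(len(all_codes))  # 24
```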
(3) 'project name', 'project content' and 'work place description' are text features and are processed with a word vector model;
The project name content is represented by a 96-dimensional word vector:
Project content feature extraction and representation:
Work content feature extraction and representation:
(4) 'project start time', 'project end time', 'planned start time' and 'planned end time' are time features; the data are converted to the datetime type and then to timestamps.
Project start time | Time information (year, month, day) | Feature representation (timestamp) |
2018-01-01 | (2018,1,1) | 1514736000 |
2017-01-01 | (2017,1,1) | 1483200000 |
2016-01-01 | (2016,1,1) | 1451577600 |
Project end time | Time information (year, month, day) | Feature representation (timestamp) |
2018-12-31 | (2018,12,31) | 1546185600 |
2017-12-31 | (2017,12,31) | 1514649600 |
2016-12-31 | (2016,12,31) | 1483113600 |
Planned end time | Time information (year, month, day, hour, minute, second) | Feature representation (timestamp) |
2018/4/2 15:00:00 | (2018,4,2,15,0,0) | 1522652400 |
2017/3/27 16:00:00 | (2017,3,27,16,0,0) | 1490601600 |
2016/8/24 17:00:00 | (2016,8,24,17,0,0) | 1472029200 |
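The conversion from a time string to a timestamp can be sketched as follows. The source does not state a time zone; the sketch assumes Beijing time (UTC+8), because that assumption reproduces the tabulated values. The function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: times are Beijing time (UTC+8), which reproduces the tabulated values.
BEIJING = timezone(timedelta(hours=8))

def to_timestamp(text):
    """Parse 'YYYY-MM-DD' or 'YYYY/M/D H:M:S' and return Unix seconds."""
    for fmt in ("%Y-%m-%d", "%Y/%m/%d %H:%M:%S"):
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next supported format
        return int(parsed.replace(tzinfo=BEIJING).timestamp())
    raise ValueError(f"unsupported time format: {text!r}")

print(to_timestamp("2018-01-01"))         # 1514736000
print(to_timestamp("2018/4/2 15:00:00"))  # 1522652400
```

Attaching an explicit time zone keeps the result independent of the machine on which the conversion runs.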
Step 2: feature standardization
To eliminate the influence of different dimensions (units) among the indexes, the data must be standardized so that the indexes become comparable. Data standardization mainly covers two aspects: same-trend processing and dimensionless processing.
Same-trend processing addresses data of different natures. Directly summing indexes of different natures cannot correctly reflect the combined result of different forces, so the data nature of inverse indexes is changed first, so that all indexes act on the evaluation scheme in the same direction; only then are they summed to obtain a correct result.
Dimensionless processing addresses the comparability of the data. Through standardization, all raw data are converted into dimensionless index evaluation values, that is, all index values (all features) are on the same order of magnitude and can be compared and analyzed together.
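The two aspects of standardization can be illustrated with a short sketch. The source does not name a specific method, so min-max scaling to the range [0, 1] is assumed here for the dimensionless step, and reversing the scaled value is assumed for bringing inverse indexes into the same direction as the others.

```python
def min_max_scale(values, inverse=False):
    """Scale values to [0, 1]; with inverse=True, flip them so larger becomes smaller."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # a constant feature carries no information
    scaled = [(v - lo) / (hi - lo) for v in values]
    if inverse:
        scaled = [1.0 - s for s in scaled]  # same-trend processing for inverse indexes
    return scaled

# Timestamps from the project start time table, brought to a common scale.
start_times = [1451577600, 1483200000, 1514736000]
print(min_max_scale(start_times))

# An index for which a smaller raw value is better, flipped to match the others.
print(min_max_scale([1, 2, 3], inverse=True))  # [1.0, 0.5, 0.0]
```

After scaling, a timestamp in the billions and a small categorical code contribute on the same order of magnitude to the classifier input.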
3.4.3 Establishing the sample group:
Based on the features of the overhaul project plan information and the work ticket information, professional power business personnel select correlated sample combinations and establish the sample group for model training. The sample format is as follows:
3.4.4 Model establishment:
(1) Randomly shuffle the sample set.
(2) Divide the sample set into a training set, a verification set and a test set, which account for 70%, 10% and 20% of the total number of samples respectively.
(3) Train the SVM classifier with the training set, fine-tune the parameters with the verification set, and finally verify the effectiveness of the model with the test set.
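Steps (1) and (2) can be sketched with the standard library alone. The function name and the fixed seed are assumptions made for reproducibility; the SVM training in step (3) is not shown.

```python
import random

def split_samples(samples, seed=0, train_ratio=0.7, val_ratio=0.1):
    """Shuffle the samples and split them into training, verification and test sets."""
    shuffled = list(samples)               # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_ratio)
    n_val = round(len(shuffled) * val_ratio)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # the remainder, about 20%
    return train, val, test

samples = list(range(100))  # stand-ins for (project plan, work ticket) sample pairs
train, val, test = split_samples(samples)
print(len(train), len(val), len(test))  # 70 10 20
```

The three subsets could then be passed to an SVM implementation such as scikit-learn's `sklearn.svm.SVC`, fitting on the training set, selecting parameters on the verification set and reporting accuracy on the test set.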
3.5 Model application
(1) Similarity analysis
New samples are predicted with the trained SVM classifier, the association degrees between a given work ticket and all project plans are computed and sorted by size, and the 5 project plans with the highest association degrees are taken as the final result.
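The ranking step can be sketched as follows, assuming the classifier has already produced one association degree per project plan for a given work ticket. The scores, project codes and function names are illustrative, and the 0.5 threshold used to flag a ticket with no convincing match is a hypothetical value that the source does not specify.

```python
import heapq

def top_matches(association, k=5):
    """association maps project code -> association degree for one work ticket."""
    return heapq.nlargest(k, association.items(), key=lambda item: item[1])

def is_suspicious(matches, threshold=0.5):
    """Hypothetical rule: flag the ticket if no project plan reaches the threshold."""
    return not matches or matches[0][1] < threshold

# Illustrative association degrees (made-up values).
scores = {"P001": 0.12, "P002": 0.91, "P003": 0.47, "P004": 0.05,
          "P005": 0.78, "P006": 0.66, "P007": 0.33}
top5 = top_matches(scores)
print(top5)
print(is_suspicious(top5))  # False
```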
(2) Tag cloud visualization
Tag cloud visualization analysis of the audited text data gives an overall grasp of its main content. A tag cloud is a text visualization method consisting of a group of related tags and their corresponding weights; the tags are arranged alphabetically or in another order, possibly combined with color depth, for the user to browse. The weight determines the font size, color or other visual effects of each tag. Here the color depth and font size are set automatically according to the word frequencies of the segmentation results, and the result is displayed visually.
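How word frequency could drive font size and color depth is sketched below. The linear mapping, the size range of 12 to 48 and the gray-scale colors are assumptions chosen for illustration; the source only states that both are set automatically from word frequency.

```python
def tag_styles(word_freq, min_size=12, max_size=48):
    """Map each word's frequency linearly to a font size and a gray level."""
    if not word_freq:
        return {}
    lo, hi = min(word_freq.values()), max(word_freq.values())
    span = hi - lo
    styles = {}
    for word, freq in word_freq.items():
        weight = (freq - lo) / span if span else 1.0
        size = round(min_size + weight * (max_size - min_size))
        gray = round(200 * (1 - weight))  # 0 is black (most frequent), 200 is light gray
        styles[word] = {"font_size": size,
                        "color": f"#{gray:02x}{gray:02x}{gray:02x}"}
    return styles

styles = tag_styles({"national grid": 6, "overhaul": 3, "cement": 1})
for word, style in styles.items():
    print(word, style)
```

The most frequent word is drawn largest and darkest, so the dominant topics of the audited text stand out at a glance.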
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and such modifications and refinements should also be regarded as falling within the protection scope of the present invention.
Claims (9)
1. A power engineering overhaul project risk auditing method based on semantic analysis is characterized by comprising the following steps: the method comprises the following steps: s1, collecting data; s2, training a whole word segmentation model; s3, data cleaning; s4, extracting word segmentation result features; s5, model application.
2. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 1, characterized in that: step S1 includes: (1) acquiring company audit data from different paths by using a web crawler technology, and establishing a data warehouse; (2) analyzing a data structure of audit data in a target system; (3) and adopting python web crawler software to realize target data capture.
3. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 2, characterized in that: establishing the data warehouse comprises: capturing project plan files from the planning plan management system; capturing service data files from PMS2.0 of the planning plan management system; and capturing professional data files related to the power business that are published on the network.
4. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 2, characterized in that: the target data capture comprises:
a first step of building a Python web crawler environment;
a second step of running a Python program to crawl the target data;
a third step of preliminarily screening the crawled target data as required, retaining the useful field information, and establishing an audit warehouse file.
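The screening in the third step can be sketched in Python as follows, assuming the crawled records are already available as dictionaries. The field names, the sample records and the function name `screen_records` are hypothetical; the crawling itself is omitted because the patent does not describe the target pages.

```python
def screen_records(records, keep_fields):
    """Keep only the listed fields and drop records that lack any of them."""
    screened = []
    for record in records:
        # A record is useful only if every required field is present and non-empty.
        if all(record.get(field) for field in keep_fields):
            screened.append({field: record[field] for field in keep_fields})
    return screened

# Hypothetical crawled records; "html_id" stands for page residue to be discarded.
crawled = [
    {"project_code": "A001", "project_name": "Substation overhaul", "html_id": "row-1"},
    {"project_code": "", "project_name": "Unnamed", "html_id": "row-2"},
]
kept = screen_records(crawled, ["project_code", "project_name"])
print(kept)
```

The screened records could then be written out as the audit warehouse file, for example with the standard `csv` module.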
5. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 1, characterized in that: step S2 comprises: constructing the word bank required for auditing; and performing word segmentation on the audit warehouse target files with jieba, an open-source Chinese word segmentation tool.
6. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 1, characterized in that: in step S3, the data cleaning includes stop-word removal and Chinese error correction.
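Stop-word removal can be sketched in Python as below, assuming the text has already been segmented into tokens. The stop-word set and the sample tokens are hypothetical, and Chinese error correction is not covered by this sketch.

```python
# Hypothetical stop-word set; a real audit lexicon would be much larger.
STOP_WORDS = {"的", "了", "和", "对"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words and whitespace-only tokens from a segmented sentence."""
    return [token for token in tokens if token.strip() and token not in stop_words]

# Hypothetical segmentation result of a work ticket description.
tokens = ["对", "主变压器", "进行", "检修", "和", "试验"]
cleaned = remove_stop_words(tokens)
print(cleaned)
```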
7. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 1, characterized in that: in step S4, the feature extraction from the word segmentation results comprises: feature selection, feature processing, establishing a sample group, and establishing a model.
8. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 7, characterized in that: the feature selection comprises: selecting 5 fields of information, 'project code, project name, project content, project start time and project end time', from the project schedule; and selecting 7 fields of information, 'ticket type, work content, work place description, ticket ID, planned start time and planned end time', from the work ticket information file;
the feature processing comprises: a first step of preprocessing the features, with a different preprocessing mode selected for each type of feature; and a second step of standardizing the features;
the establishing of the sample group comprises: based on the features of the overhaul project plan information and the work ticket information, professional power business personnel select correlated sample combinations, which form the sample group for model training;
the establishing of the model comprises:
(1) randomly shuffling the sample set;
(2) dividing the sample set into a training set, a verification set and a test set, which account for 70%, 10% and 20% of the total number of samples, respectively;
(3) training the SVM classifier with the training set, fine-tuning the parameters with the verification set, and finally verifying the effectiveness of the model with the test set.
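Steps (1) and (2) of the modeling can be sketched in Python as follows. The fixed seed and the function name `split_samples` are assumptions added for reproducibility of the illustration; SVM training in step (3) is left out because it depends on a machine learning library the patent does not name.

```python
import random

def split_samples(samples, seed=0):
    """Shuffle the samples and split them 70% / 10% / 20%."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # step (1): random ordering
    n = len(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n * 7 // 10
    n_val = n // 10
    train = shuffled[:n_train]                  # 70% training set
    val = shuffled[n_train:n_train + n_val]     # 10% verification set
    test = shuffled[n_train + n_val:]           # remaining 20% test set
    return train, val, test

# Hypothetical sample group of 100 labelled (project plan, work ticket) pairs.
train, val, test = split_samples(range(100))
print(len(train), len(val), len(test))
```

When the sample count is not a multiple of 10, the remainder falls into the test set, so the three sets always cover every sample exactly once.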
9. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 1, characterized in that: in step S5, the model application comprises similarity analysis and tag cloud visualization, wherein the similarity analysis comprises predicting a new sample with the trained SVM classifier, giving the association degree between a given work ticket and every project plan, sorting the association degrees in descending order, and taking the 5 project plans with the highest association degrees as the final result; and the tag cloud visualization comprises performing tag cloud visualization analysis on the text data to be audited, so as to grasp the main content of that text data as a whole.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011135566.7A CN112308388A (en) | 2020-10-22 | 2020-10-22 | Electric power engineering overhaul project risk auditing method based on semantic analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011135566.7A CN112308388A (en) | 2020-10-22 | 2020-10-22 | Electric power engineering overhaul project risk auditing method based on semantic analysis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112308388A (en) | 2021-02-02 |
Family
ID=74328345
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011135566.7A Pending CN112308388A (en) | 2020-10-22 | 2020-10-22 | Electric power engineering overhaul project risk auditing method based on semantic analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112308388A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113469555A (en) * | 2021-07-19 | 2021-10-01 | 国网冀北电力有限公司唐山供电公司 | AI technology-based power production management method |
CN113743108A (en) * | 2021-09-03 | 2021-12-03 | 国网经济技术研究院有限公司 | Distribution network engineering technology economic information division method |
CN117874565A (en) * | 2023-11-27 | 2024-04-12 | 国网江苏省电力有限公司扬州供电分公司 | Work ticket accuracy detection method based on neural network |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105160038A (en) * | 2015-10-10 | 2015-12-16 | 广东卓维网络有限公司 | Data analysis method and system based on audit database |
CN107832429A (en) * | 2017-11-14 | 2018-03-23 | 广州供电局有限公司 | audit data processing method and system |
CN107977789A (en) * | 2017-12-05 | 2018-05-01 | 国网河南省电力公司南阳供电公司 | Based on the audit work method under big data information |
CN109299879A (en) * | 2018-09-30 | 2019-02-01 | 广东电网有限责任公司 | A kind of statistical method, device and the equipment of power grid audit issues |
CN110032607A (en) * | 2019-04-17 | 2019-07-19 | 成都市审计局 | A kind of auditing method based on big data |
CN111275409A (en) * | 2020-02-28 | 2020-06-12 | 国网上海市电力公司 | Power grid overhaul audit data processing system and processing method |
Non-Patent Citations (4)
Title |
---|
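Wu Yang et al.: "Research on short-text classification technology for the audit domain", Microelectronics & Computer * |
Li Lihua et al.: "Text sentiment analysis based on deep learning", Journal of Hubei University * |
Jiang Yuwei: "Research on audit methods based on visualization technology in the big data environment", Northern Economy and Trade * |
Chen Wei et al.: "Research on big data audit methods based on text data analysis", The Chinese Certified Public Accountant * |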
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN112308388A (en) | Electric power engineering overhaul project risk auditing method based on semantic analysis | |
CN110334212A (en) | A kind of territoriality audit knowledge mapping construction method based on machine learning | |
CN108491438A (en) | A kind of technology policy retrieval analysis method | |
CN111401040B (en) | Keyword extraction method suitable for word text | |
CN102779143B (en) | Visualizing method for knowledge genealogy | |
CN104462216B (en) | Occupy committee's standard code converting system and method | |
CN106844527B (en) | Road surface disease identification and management decision-making method and system based on internet big data | |
CN110704577A (en) | Method and system for searching power grid scheduling data | |
CN111737421A (en) | Intellectual property big data information retrieval system and storage medium | |
CN106934054A (en) | The accurate analysis method of enterprise's segmented industry and its system based on big data | |
CN110334904A (en) | Key message types of infrastructures unit based on LightGBM belongs to determination method | |
CN111008215B (en) | Expert recommendation method combining label construction and community relation avoidance | |
CN115794803A (en) | Engineering audit problem monitoring method and system based on big data AI technology | |
CN115796797A (en) | Power grid science and technology project evaluation system and method based on two-dimensional cloud picture | |
CN113421037A (en) | Multi-source collaborative construction planning compilation method and device | |
CN113129188A (en) | Provincial education teaching evaluation system based on artificial intelligence big data | |
CN113538011B (en) | Method for associating non-booked contact information with booked user in electric power system | |
CN113590684A (en) | Non-tax payment big data analysis system | |
CN111666378A (en) | Chinese yearbook title classification method based on word vectors | |
CN118051612B (en) | Industry classification system and method | |
Szczech-Pietkiewicz et al. | Smart and sustainable city management in Asia and Europe: A bibliometric analysis | |
Wang | Analysis and evaluation of engineering job demand based on big data technology | |
ASCHERI et al. | Online Job Advertisements for Labour Market Statistics using R. | |
Zhang | Intelligent Mining Method of Massive Digital Archives Based on Artificial Intelligence | |
Wu | China’s metals industry (II) |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210202 |