
CN112308388A - Electric power engineering overhaul project risk auditing method based on semantic analysis - Google Patents

Electric power engineering overhaul project risk auditing method based on semantic analysis

Info

Publication number
CN112308388A
Authority
CN
China
Prior art keywords
data
project
electric power
audit
method based
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011135566.7A
Other languages
Chinese (zh)
Inventor
崔霞
程子华
戴斐斐
孙常鹏
李伯让
徐征
李博
冯伟
张耀心
季忠俊
刘德玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Tianjin Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Tianjin Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Tianjin Electric Power Co Ltd
Priority to CN202011135566.7A
Publication of CN112308388A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Energy or water supply

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Development Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Educational Administration (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of power grids, and in particular to a risk auditing method for electric power engineering overhaul projects based on semantic analysis. The method comprises the following steps: S1, data collection; S2, training a Chinese word segmentation model; S3, data cleaning; S4, feature extraction from the word segmentation results; S5, model application. Python web crawler technology is used to collect specified field information on the overhaul projects of a given year. Combined with the data stored in the PMS2.0 system of State Grid Tianjin Electric Power Company, a data warehouse stores the information collected by the web crawler and provides an independent audit analysis environment, in which the audit data is further processed to improve its quality and stored by audit theme, improving the extensibility of audit analysis.

Description

Electric power engineering overhaul project risk auditing method based on semantic analysis
Technical Field
The invention relates to the technical field of power grids, in particular to a power engineering overhaul project risk auditing method based on semantic analysis.
Background
Electric power engineering plays a very important role in national development. Auditing, as a supervision mechanism, examines and supervises by law the financial revenue and expenditure of important matters in government departments at all levels, financial institutions, enterprises and public institutions, so as to restrain harmful economic activity, promote the stable operation of the economy and ultimately support the healthy development of the national economy. At the current stage, however, electric power engineering audits still have defects: auditing in the early stage of a project is insufficient, the construction process receives too little audit attention, completion settlement audit materials are not prepared in time, and staged auditing leaves the audit work disconnected. These problems seriously hinder the development of electric power engineering auditing, prevent audits from discovering and disclosing problems in time, and ultimately stand in the way of the smooth completion of electric power engineering projects. To address these defects, a risk auditing method for electric power engineering overhaul projects based on semantic analysis is studied. Natural language processing technology is applied to project risk auditing so that a computer replaces a large part of the manual audit work, which greatly reduces the consumption of manpower and material resources and improves audit efficiency.
In information retrieval, natural language processing (NLP) technology can be divided into a word level and a level above the word. At the word level, the NLP techniques used in information retrieval mainly include word segmentation, compound phrase recognition and proper noun recognition. Since automatic word segmentation was first proposed in Chinese information processing in the early 1980s, many experts and scholars have made good progress in this field, many word segmentation methods have been proposed, and some of the more mature techniques have been applied in commercial products. Most existing research, however, is limited to the analysis of structured audit data, and few scholars have studied unstructured audit data in depth. A report published by International Data Corporation (IDC) indicates that at most 5% of the data in an enterprise is structured and most of the remainder is unstructured, and that 88% of business managers consider the unstructured data stored outside databases to be the best source for reaching and understanding the business.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a risk auditing method for electric power engineering overhaul projects based on semantic analysis.
To achieve this purpose, the invention adopts the following technical scheme: a risk auditing method for electric power engineering overhaul projects based on semantic analysis, characterized by comprising the following steps: S1, data collection; S2, training a Chinese word segmentation model; S3, data cleaning; S4, feature extraction from the word segmentation results; S5, model application.
Preferably, step S1 includes: (1) obtaining company audit data from different channels using web crawler technology and establishing a data warehouse; (2) analyzing the data structure of the audit data in the target systems; (3) capturing the target data using Python web crawler software.
Preferably, establishing the data warehouse comprises: capturing the project plan files of the planning and plan management system; capturing the business data files of PMS2.0 of the planning and plan management system; and capturing publicly available professional data files related to the power business from the network.
Preferably, the target data capture comprises:
Step 1: building a Python web crawler environment;
Step 2: running the Python program to crawl the target data;
Step 3: preliminarily screening the crawled target data as needed, retaining the useful field information, and establishing the audit warehouse files.
Preferably, step S2 includes: constructing the word bank required for auditing; and performing word segmentation on the audit warehouse target files using jieba, an open-source Chinese word segmentation tool.
Preferably, in step S3, the data cleaning includes stop-word removal and Chinese error correction.
Preferably, in step S4, the feature extraction from the word segmentation results includes: feature selection, feature processing, establishing a sample set and building a model.
Preferably, the feature selection comprises: selecting 5 fields from the project schedule, namely 'project code, project name, project content, project start time and project end time'; and selecting 7 fields from the work ticket information file, namely 'ticket type, work content, work place, work place description, ticket ID, planned start time and planned end time';
the feature processing includes: step 1, feature preprocessing, in which a different preprocessing mode is selected for each type of feature; step 2, feature standardization;
the sample set is established on the basis of the features of the major repair project plan information and the work ticket information: power business professionals select correlated sample combinations to form the sample set for model training.
The model building comprises the following steps:
(1) randomly shuffling the sample set;
(2) dividing the sample set into a training set, a validation set and a test set, which account for 70%, 10% and 20% of the total number of samples respectively;
(3) training the SVM classifier on the training set, fine-tuning the parameters on the validation set, and finally verifying the effectiveness of the model on the test set.
Preferably, in step S5, the model application includes similarity analysis and tag cloud visualization. The similarity analysis predicts new samples with the trained SVM classifier, gives the association degree between a given work ticket and every project plan, sorts the association degrees by size, and takes the 5 largest as the final result. The tag cloud visualization performs tag cloud analysis on the audited text data so that its main content can be grasped as a whole.
The beneficial effects of the method are as follows. (1) Python web crawler technology is used to collect specified field information on the overhaul projects of a given year. Combined with the data stored in the PMS2.0 system of State Grid Tianjin Electric Power Company, a data warehouse stores the information collected by the web crawler and provides an independent audit analysis environment, in which the audit data is further processed to improve its quality and stored by audit theme, improving the extensibility of audit analysis.
(2) For different audit analysis requirements, semantic recognition technology is used to identify the project construction content in the planning and plan management system and the work ticket information in the PMS2.0 system. Work ticket information matching the construction content is searched for according to the overhaul project list. If a match is found, the project is deemed to have been implemented; if not, the project is listed as a suspicious point for focused verification.
Drawings
FIG. 1 is a schematic diagram of the Python web crawler software employed by the present invention.
Detailed Description
As shown in FIG. 1, a risk auditing method for electric power engineering overhaul projects based on semantic analysis includes the following steps: S1, data collection; S2, training a Chinese word segmentation model; S3, data cleaning; S4, feature extraction from the word segmentation results; S5, model application. Specifically:
3.1 Data collection
3.1.1 Use web crawler technology to obtain company audit data from different channels and build a data warehouse.
(1) Capture the project plan files of the planning and plan management system, comprising:
a) Production major repair professional project planning report
b) Production major repair professional project proposal
c) Production major repair professional project planning approval file
d) Production major repair professional project planning project library list
e) Production major repair professional project completion report
(2) Capture the business data files of PMS2.0 of the planning and plan management system, comprising:
a) Work ticket document
b) Work permit report
c) Completion report
(3) Capture publicly available professional data files related to the power business from the network, comprising:
a) Common word bank of the power industry
b) Professional word bank of the power industry
c) Names of power facilities such as Tianjin power transmission and transformation stations
3.1.2 Analyze the data structure of the audit data in the target systems
(1) Based on the project plan files of the planning and plan management system, a production major repair professional project plan information table is established. It contains fields such as project code, project name, construction management unit, unit to which the project belongs, voltage level (kV), project classification, professional category, professional subdivision, issuing status, project content (limited to 300 words), project start time, project end time, annual plan, prearrangement, feasibility study approval number, State Grid batch, provincial batch, Tianjin project ID, low-voltage project, remarks, project code and sequence number.
(2) Based on the business data files of PMS2.0 of the planning and plan management system, a work ticket information file is established. It contains fields such as ticket type, work content, work place, work place description, ticket-making department, operation and maintenance unit, person in charge of the work, work ticket issuer, ticket number, ticket status, planned start time, planned end time, permitted work time, work licensor, end time, completion licensor, ticket type, associated task list, return, completion status, ticket maker, delay time, feeder name, city name, number of work group members and ticket ID.
(3) Based on the professional data files related to the power business, an electric power word bank file is established. It contains information such as common words of the power industry, specialized terms of the power industry, power stations, transmission stations, substations, transformers, a directory of power equipment enterprises and a directory of State Grid companies in Tianjin.
3.1.3 As shown in FIG. 1, target data capture is implemented with Python web crawler software
Step 1: build a Python web crawler environment;
Step 2: run the Python program to crawl the target data;
Step 3: preliminarily screen the crawled target data as needed, retain the useful field information, and establish the audit warehouse files.
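The preliminary screening in the third step might look roughly like the following sketch. It assumes the crawler has already returned records as dictionaries; the field names are the English renderings used in section 3.1.2, and the sample records and the rule of dropping records without a project code are illustrative assumptions, not part of the patent.

```python
import csv
import io

# Fields kept for the audit warehouse. The names follow the English renderings
# in section 3.1.2 and are assumptions about how the real columns are labelled.
KEEP_FIELDS = ["project code", "project name", "project content",
               "project start time", "project end time"]

def screen_records(records, keep_fields=KEEP_FIELDS):
    """Keep only the useful fields and drop records that lack a project code."""
    screened = []
    for record in records:
        if not record.get("project code"):
            continue  # assumed rule: a record without its key field is discarded
        screened.append({field: record.get(field, "") for field in keep_fields})
    return screened

def to_csv_text(rows, fields=KEEP_FIELDS):
    """Serialize the screened rows as CSV text, standing in for a warehouse file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical crawled records, for illustration only.
crawled = [
    {"project code": "P001", "project name": "Switch cabinet overhaul",
     "project content": "Insulation overhaul", "project start time": "2018-01-01",
     "project end time": "2018-12-31", "remarks": "n/a"},
    {"project code": "", "project name": "Incomplete record"},
]
rows = screen_records(crawled)
csv_text = to_csv_text(rows)
```

Writing to an in-memory buffer keeps the sketch independent of any file layout; a real audit warehouse would presumably persist the result to a file or database.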
3.2 Chinese word segmentation
A Chinese word segmentation model is trained to segment the audit data.
3.2.1 Build the word bank required for auditing
Professional word banks of the power industry, place-name word banks and State Grid-specific word banks are downloaded from relevant websites to build the word bank used for domain-specific word segmentation.
3.2.2 Word segmentation
jieba, an open-source Chinese word segmentation tool, is used to segment the audit warehouse target files:
(1) segment fields such as 'project name, project content, project classification and annual plan' in the production major repair professional project plan information table, and count word frequencies;
(2) segment documents such as the project planning report, project proposal and project completion report, and count word frequencies;
(3) segment fields such as 'ticket type, work content, work place, work place description and feeder name' in the work tickets, and count word frequencies;
(4) where the segmentation result is inaccurate, the word frequency can be adjusted manually and the text segmented again to obtain a more accurate result.
Example word segmentation results:
[ "national grid", "Tianjin", "Diwu", "Zhongliangzhuang", "converting station", "switch cabinet", "insulation", "overhaul" ]
[ "national grid", "Tianjin", "Diwu", "development area", "urban area", "distribution box station", "opening and closing station", "basic maintenance", "engineering" ]
[ "national grid", "Tianjin", "Diwu", "Lingyu", "distribution line", "foundation reinforcement", "engineering" ]
[ "national grid", "Tianjin", "treasure", "Baoan", "line", "three spans", "strain", "string change", "double hanging points", "drainage line", "overhaul" ]
[ "national grid", "Tianjin", "Diwu", "double King temple", "converting station", "switch cabinet", "overhaul" ]
[ "national grid", "Tianjin", "Diwu", "big mouth Tun", "substation", "Yi Jia Tun", "line", "cement", "pier", "attachment" ]
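To illustrate how a domain word bank changes the segmentation and the word frequencies, the sketch below uses a simple forward maximum matching segmenter written in plain Python. This is a stand-in chosen so that the example has no dependencies; it is not jieba's algorithm. The lexicon entries and the two sample strings are hypothetical.

```python
from collections import Counter

def forward_max_match(text, lexicon, max_len=4):
    """Segment text greedily, preferring the longest lexicon entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no lexicon entry matches.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens

def word_frequencies(texts, lexicon):
    """Count how often each token occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(forward_max_match(text, lexicon))
    return counts

# Hypothetical word bank: State Grid, Tianjin, substation, switch cabinet, overhaul.
lexicon = {"国网", "天津", "变电站", "开关柜", "大修"}
texts = ["国网天津变电站开关柜大修", "国网天津开关柜绝缘大修"]

# "绝缘" (insulation) is missing from the word bank, so it is split into characters.
before = forward_max_match(texts[1], lexicon)
# Manual adjustment as in step (4): add the missing term and segment again.
lexicon.add("绝缘")
after = forward_max_match(texts[1], lexicon)
freq = word_frequencies(texts, lexicon)
```

With jieba itself, the corresponding adjustments are usually made through a user dictionary (`jieba.load_userdict`) or through `jieba.add_word` and `jieba.suggest_freq`, followed by re-running the segmentation.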
3.3 Data cleaning
3.3.1 Stop-word removal: remove the useless tags, punctuation marks and special symbols that appear in the segmentation results;
3.3.2 Chinese error correction: perform bad-case analysis on the corpus and check how much the erroneous text affects the results. If the impact is negligible, no processing is needed; if the erroneous text has a large impact, it is corrected by a statistical method.
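A minimal sketch of the stop-word removal in 3.3.1 is given below. The stop-word list is a hypothetical example, and treating every token made up only of Unicode punctuation or symbol characters as noise is an assumption about what "useless" means here. The statistical error correction in 3.3.2 is not shown.

```python
import unicodedata

# Hypothetical stop-word list; a real one would be loaded from a file.
STOP_WORDS = {"的", "及", "等", "和"}

def is_symbol(char):
    """True for Unicode punctuation (P*) and symbol (S*) characters."""
    return unicodedata.category(char).startswith(("P", "S"))

def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Drop blank tokens, stop words and tokens made only of punctuation or symbols."""
    cleaned = []
    for token in tokens:
        token = token.strip()
        if not token or token in stop_words:
            continue
        if all(is_symbol(char) for char in token):
            continue
        cleaned.append(token)
    return cleaned

tokens = ["国网", "，", "天津", "的", "开关柜", "（", "#", " "]
cleaned = clean_tokens(tokens)
```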
3.4 Feature extraction from word segmentation results
Keyword extraction is based on TF-IDF (Term Frequency-Inverse Document Frequency). The importance of a word is calculated from the frequency with which it occurs in a given text and the frequency with which it occurs across the entire text corpus. A word or phrase is considered highly representative of a text if its term frequency (TF) in that text is high and it rarely appears in other texts.
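The TF-IDF weighting described above can be sketched in plain Python as follows. The patent does not state which TF-IDF variant is used; this sketch assumes TF = count / document length and IDF = ln(N / document frequency), with no smoothing. The three token lists are shortened from the example segmentation results and serve only as illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: score} dict per tokenized document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

def top_keywords(score, k=3):
    """Words with the highest scores; ties are broken alphabetically."""
    return sorted(score, key=lambda word: (-score[word], word))[:k]

documents = [
    ["national grid", "Tianjin", "converting station", "switch cabinet",
     "insulation", "overhaul"],
    ["national grid", "Tianjin", "distribution line", "foundation reinforcement",
     "engineering"],
    ["national grid", "Tianjin", "converting station", "switch cabinet", "overhaul"],
]
scores = tf_idf(documents)
keywords = top_keywords(scores[0], k=1)
```

Words that occur in every document, such as "national grid" here, receive a score of zero under this variant, which is why they drop out as keywords.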
3.4.1 Feature selection:
(1) 5 fields are selected from the project schedule: 'project code, project name, project content, project start time and project end time'.
(2) 7 fields are selected from the work ticket information file: 'ticket type, work content, work place, work place description, ticket ID, planned start time and planned end time'.
3.4.2 Feature processing
Step 1: feature preprocessing, in which a different preprocessing mode is selected for each type of feature:
(1) 'project code' and 'ticket ID' are unique numeric features and are retained as they are;
(2) 'ticket type' and 'work place' are categorical features and are processed with one-hot coding;
There are 24 'ticket categories', which are represented by a 6-bit binary number: the first two bits represent the work site, the middle two bits the work category, and the last two bits the work ticket category. Specific content about substations/transmission stations, lines and poles is extracted from the three fields 'work content', 'work place' and 'work place description' to form the work place feature representation.
Feature representation of ticket categories
[Table image not reproduced in this text]
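The 6-bit layout described above (two bits each for work site, work category and work ticket category) can be sketched as follows. The actual category values are given only in the table image, so the three vocabularies below are hypothetical placeholders.

```python
# Hypothetical vocabularies: the actual categories appear only in the patent's
# table image, which is not available in this text.
WORK_SITES = ["substation", "transmission line", "distribution line", "other"]
WORK_CATEGORIES = ["overhaul", "inspection", "test", "other"]
TICKET_CATEGORIES = ["first type", "second type", "emergency repair", "other"]

def two_bits(options, value):
    """Return the 2-bit binary code of value's position in options."""
    return format(options.index(value), "02b")

def encode_ticket(site, category, ticket):
    """Concatenate three 2-bit codes: work site, work category, ticket category."""
    return (two_bits(WORK_SITES, site)
            + two_bits(WORK_CATEGORIES, category)
            + two_bits(TICKET_CATEGORIES, ticket))

code = encode_ticket("transmission line", "overhaul", "second type")
```

This is a compact binary code with one 2-bit group per attribute. A strict one-hot encoding would instead use one bit per category value, so how the two descriptions in the text fit together is left open here.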
(3) 'project name', 'project content' and 'work place description' are text features and are processed with a word vector model;
The project name is represented by a 96-dimensional word vector:
[Figure not reproduced: word vector representation of the project name]
Project content feature extraction and representation:
[Figure not reproduced: project content features]
Work content feature extraction and representation:
[Figure not reproduced: work content features]
(4) 'project start time', 'project end time', 'planned start time' and 'planned end time' are time features; the data are converted to the datetime type and then to timestamps.
Project start time   Time information (year, month, day)   Feature representation (timestamp)
2018-01-01           (2018,1,1)                            1514736000
2017-01-01           (2017,1,1)                            1483200000
2016-01-01           (2016,1,1)                            1451577600
Project end time     Time information (year, month, day)   Feature representation (timestamp)
2018-12-31           (2018,12,31)                          1546185600
2017-12-31           (2017,12,31)                          1514649600
2016-12-31           (2016,12,31)                          1483113600
[Table image not reproduced in this text]
Planned end time     Time information (year, month, day, hour, minute, second)   Feature representation (timestamp)
2018/4/2 15:00:00    (2018,4,2,15,0,0)                                           1522652400
2017/3/27 16:00:00   (2017,3,27,16,0,0)                                          1490601600
2016/8/24 17:00:00   (2016,8,24,17,0,0)                                          1472029200
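The conversion from date strings to timestamps can be reproduced with the standard library, as in the sketch below. The patent does not name a time zone; the tabulated values are consistent with local time at UTC+8, so that offset is assumed here. The function name and the two accepted formats are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the tabulated timestamps correspond to local time at UTC+8.
BEIJING = timezone(timedelta(hours=8))

def to_timestamp(text):
    """Parse a date or date-time string and return Unix seconds."""
    for fmt in ("%Y-%m-%d", "%Y/%m/%d %H:%M:%S"):
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next format
        return int(parsed.replace(tzinfo=BEIJING).timestamp())
    raise ValueError(f"unrecognized time format: {text!r}")

project_start = to_timestamp("2018-01-01")
planned_end = to_timestamp("2018/4/2 15:00:00")
```

Fixing the offset explicitly keeps the result independent of the time zone of the machine that runs the conversion.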
Step 2: feature standardization
To eliminate the influence of differing dimensions (units) between indicators, the data must be standardized so that the indicators are comparable. Data standardization mainly covers two aspects: same-trend processing and dimensionless processing.
Same-trend processing deals with indicators of different natures. Directly summing indicators of different natures cannot correctly reflect the combined result of their different effects, so the inverse indicators are first transformed so that all indicators act on the evaluation in the same direction, and only then are they summed to obtain a correct result.
Dimensionless processing deals with the comparability of the data. After standardization, all raw data are converted into dimensionless evaluation values, that is, all indicator values (all features) are on the same order of magnitude, so that comprehensive comparison and analysis can be performed.
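The two aspects of standardization can be sketched as follows. The patent does not specify the formulas; this sketch assumes that an inverse indicator is flipped by subtracting it from its maximum and that dimensionless processing is min-max scaling to the range [0, 1].

```python
def same_trend(values, inverse=False):
    """Flip an inverse indicator so that larger always means 'more' (assumed rule)."""
    if not inverse:
        return list(values)
    high = max(values)
    return [high - value for value in values]

def min_max(values):
    """Scale values to [0, 1]; a constant column maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]

def standardize(values, inverse=False):
    """Same-trend processing followed by dimensionless (min-max) processing."""
    return min_max(same_trend(values, inverse))

# Timestamp features from the tables above end up on the same scale as other features.
start_times = standardize([1514736000, 1483200000, 1451577600])
# A hypothetical inverse indicator, where a smaller raw value is better.
flipped = standardize([10, 20, 30], inverse=True)
```

Z-score standardization would be an equally plausible reading of "standardization"; min-max scaling is used here only because it makes the common scale easy to see.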
3.4.3 Establishing the sample set:
Based on the features of the major repair project plan information and the work ticket information, power business professionals select correlated sample combinations and establish the sample set for model training. The sample format is as follows:
[Table image not reproduced: sample format]
3.4.4 Model building:
(1) Randomly shuffle the sample set.
(2) Divide the sample set into a training set, a validation set and a test set, which account for 70%, 10% and 20% of the total number of samples respectively.
(3) Train the SVM classifier on the training set, fine-tune the parameters on the validation set, and finally verify the effectiveness of the model on the test set.
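Steps (1) and (2) can be sketched with the standard library as shown below. The function name, the fixed seed and the use of integers as stand-in samples are assumptions for illustration. Training the SVM classifier in step (3) is not shown, since the patent does not specify the kernel, the parameters or the library.

```python
import random

def split_samples(samples, seed=0, train=0.7, validation=0.1):
    """Shuffle the samples, then cut them into train / validation / test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    train_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    test_set = shuffled[n_train + n_validation:]  # the remaining 20%
    return train_set, validation_set, test_set

# 100 hypothetical sample indices stand in for the labelled sample set.
train_set, validation_set, test_set = split_samples(range(100))
```

The validation part would then be used to compare candidate SVM parameters, and the test part only once, for the final check of the chosen model.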
3.5 Model application
(1) Similarity analysis
New samples are predicted with the trained SVM classifier, which gives the association degree between a given work ticket and every project plan. The association degrees are sorted by size and the 5 largest are taken as the final result.
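Selecting the five project plans with the largest association degrees can be sketched as below. The scores are hypothetical numbers standing in for the classifier output for one work ticket, and the project codes are invented for illustration.

```python
import heapq

def top_matches(scores, k=5):
    """Return the k (project code, association degree) pairs with the largest degrees."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])

# Hypothetical association degrees between one work ticket and seven project plans.
scores = {"P001": 0.91, "P002": 0.15, "P003": 0.78, "P004": 0.66,
          "P005": 0.42, "P006": 0.88, "P007": 0.30}
top5 = top_matches(scores)
```

`heapq.nlargest` returns the results already sorted in descending order, so no separate sort is needed.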
(2) Tag cloud visualization
Tag cloud analysis of the audited text data gives an overall grasp of its main content. A tag cloud is a text visualization method consisting of a group of related tags and their corresponding weights; the tags are arranged in alphabetical or another order, optionally combined with color depth, for the user to browse. The weight determines the font size, color or other visual effects of each tag. Here the color depth and font size are set automatically from the word frequencies of the segmentation results and displayed visually.
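The mapping from word frequency to font size and color depth can be sketched as follows. The size range, the gray scale and the linear interpolation are assumptions; the patent only states that both are set automatically from word frequency. Rendering the cloud itself is left to a plotting or web front end.

```python
from collections import Counter

def tag_cloud_styles(tokens, min_size=12, max_size=48):
    """Map each word to a font size and a gray level derived from its frequency."""
    freq = Counter(tokens)
    low, high = min(freq.values()), max(freq.values())
    span = (high - low) or 1  # avoid division by zero when all counts are equal
    styles = {}
    for word, count in freq.items():
        weight = (count - low) / span
        size = round(min_size + weight * (max_size - min_size))
        gray = round(200 * (1 - weight))  # more frequent words are drawn darker
        styles[word] = {"count": count, "font_size": size,
                        "color": f"#{gray:02x}{gray:02x}{gray:02x}"}
    return styles

tokens = ["overhaul"] * 3 + ["switch cabinet"] * 2 + ["insulation"]
styles = tag_cloud_styles(tokens)
```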
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements shall also fall within the protection scope of the present invention.

Claims (9)

1. A power engineering overhaul project risk auditing method based on semantic analysis is characterized by comprising the following steps: the method comprises the following steps: s1, collecting data; s2, training a whole word segmentation model; s3, data cleaning; s4, extracting word segmentation result features; s5, model application.
2. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 1, characterized in that: step S1 includes: (1) acquiring company audit data from different paths by using a web crawler technology, and establishing a data warehouse; (2) analyzing a data structure of audit data in a target system; (3) and adopting python web crawler software to realize target data capture.
3. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 2, characterized in that: establishing a data warehouse comprises: capturing a project plan file of a planning plan management system; capturing a service data file of a PMS2.0 of a planning plan management system; and capturing professional data files related to the power business disclosed on the network.
4. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 2, characterized in that: the target data capture comprises:
the first step is as follows: building a python web crawler environment;
the second step is that: running a python program to crawl target data;
the third step: and primarily screening the crawled target data according to the needs, reserving useful field information, and establishing an audit warehouse file.
5. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 1, characterized in that: in step S2, the method includes: constructing a word bank required by auditing; and performing word segmentation operation on the audit warehouse target file by using Chinese word segmentation software jieba with an open source on the network.
6. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 1, characterized in that: in step S3, the data cleansing includes de-stop word and Chinese error correction.
7. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 1, characterized in that: in step S4, the word segmentation result feature extraction includes: selecting characteristics, processing the characteristics, establishing a sample group and establishing a model.
8. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 7, characterized in that: the feature selection comprises: selecting 5 fields of information, 'project code, project name, project content, project starting time and project ending time', from the project schedule; and selecting 7 fields of information, 'ticket type, work content, work place description, ticket ID, planned start time and planned end time', from the work ticket information file;
the feature processing includes: step 1: preprocessing the features, with a different preprocessing mode selected for each type of feature; step 2: standardizing the features;
the sample group is established as follows: based on the features of the overhaul project plan information and the work ticket information, professional power business personnel select sample combinations that are correlated, and the sample group for model training is established;
the establishing of the model comprises the following steps:
(1) randomly shuffling the sample set;
(2) dividing the sample set into a training set, a validation set and a test set, which account for 70%, 10% and 20% of the total number of samples respectively;
(3) training the SVM classifier on the training set, fine-tuning the parameters on the validation set, and finally verifying the effectiveness of the model on the test set.
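The feature standardization and the 70%/10%/20% split in claim 8 can be sketched with the standard library alone. The claim does not name a standardization formula, so z-score scaling is an assumption here, as are the fixed seed and the integer sample data. Training the SVM itself needs a third-party library and is left out.

```python
import random
import statistics

def standardize(values):
    """Z-score scaling (an assumed choice): zero mean, unit population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]

def split_samples(samples, seed=0, train_pct=70, val_pct=10):
    """Shuffle a copy of the samples, then cut it into training, validation and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

samples = list(range(100))  # stand-ins for labelled (work ticket, project plan) pairs
train_set, val_set, test_set = split_samples(samples)
scaled = standardize([2.0, 4.0, 6.0])
```

Integer percentages avoid floating-point rounding when the set sizes are computed, and the test set simply receives whatever remains after the first two cuts.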
9. The electric power engineering overhaul project risk auditing method based on semantic analysis according to claim 1, characterized in that: in step S5, the model application includes similarity analysis and tag cloud visualization, wherein the similarity analysis comprises using the trained SVM classifier to predict on a new sample, giving the relevance of a given work ticket to every project plan, sorting the relevance values by size, and taking the 5 largest values as the final result; and the tag cloud visualization comprises performing tag cloud visualization analysis on the text data to be audited, so that the main content of that text data can be grasped as a whole.
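The ranking step of the similarity analysis, and the term counts that a tag cloud is typically drawn from, can be sketched as follows. The relevance scores are made-up numbers standing in for classifier output, the project codes are hypothetical, and the rendering of the cloud itself is omitted.

```python
import heapq
from collections import Counter

def top_k_plans(relevance_by_plan, k=5):
    """Return the k (plan, score) pairs with the highest relevance, largest first."""
    return heapq.nlargest(k, relevance_by_plan.items(), key=lambda item: item[1])

def tag_weights(tokens, limit=20):
    """Term frequencies that a tag cloud renderer could map to font sizes."""
    return Counter(tokens).most_common(limit)

# Hypothetical relevance of one work ticket to seven project plans.
scores = {"P001": 0.91, "P002": 0.12, "P003": 0.55, "P004": 0.78,
          "P005": 0.33, "P006": 0.67, "P007": 0.05}
best = top_k_plans(scores)
weights = tag_weights(["大修", "变压器", "大修", "检查", "大修", "变压器"])
```

Here `best` holds the five most relevant plans in descending order, which mirrors the "top 5" rule in the claim, and `weights` lists each term with its count.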
CN202011135566.7A 2020-10-22 2020-10-22 Electric power engineering overhaul project risk auditing method based on semantic analysis Pending CN112308388A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011135566.7A CN112308388A (en) 2020-10-22 2020-10-22 Electric power engineering overhaul project risk auditing method based on semantic analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011135566.7A CN112308388A (en) 2020-10-22 2020-10-22 Electric power engineering overhaul project risk auditing method based on semantic analysis

Publications (1)

Publication Number Publication Date
CN112308388A true CN112308388A (en) 2021-02-02

Family

ID=74328345

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011135566.7A Pending CN112308388A (en) 2020-10-22 2020-10-22 Electric power engineering overhaul project risk auditing method based on semantic analysis

Country Status (1)

Country Link
CN (1) CN112308388A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469555A (en) * 2021-07-19 2021-10-01 国网冀北电力有限公司唐山供电公司 AI technology-based power production management method
CN113743108A (en) * 2021-09-03 2021-12-03 国网经济技术研究院有限公司 Distribution network engineering technology economic information division method
CN117874565A (en) * 2023-11-27 2024-04-12 国网江苏省电力有限公司扬州供电分公司 Work ticket accuracy detection method based on neural network

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105160038A (en) * 2015-10-10 2015-12-16 广东卓维网络有限公司 Data analysis method and system based on audit database
CN107832429A (en) * 2017-11-14 2018-03-23 广州供电局有限公司 audit data processing method and system
CN107977789A (en) * 2017-12-05 2018-05-01 国网河南省电力公司南阳供电公司 Based on the audit work method under big data information
CN109299879A (en) * 2018-09-30 2019-02-01 广东电网有限责任公司 A kind of statistical method, device and the equipment of power grid audit issues
CN110032607A (en) * 2019-04-17 2019-07-19 成都市审计局 A kind of auditing method based on big data
CN111275409A (en) * 2020-02-28 2020-06-12 国网上海市电力公司 Power grid overhaul audit data processing system and processing method

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
伍洋等: "面向审计领域的短文本分类技术研究", 《微电子学与计算机》 *
李丽华 等: "基于深度学习的文本情感分析", 《湖北大学学报》 *
蒋雨薇: "大数据环境下基于可视化技术的审计方法研究", 《北方经贸》 *
陈伟 等: "基于文本数据分析的大数据审计方法研究", 《中国注册会计师》 *

Similar Documents

Publication Publication Date Title
CN112308388A (en) Electric power engineering overhaul project risk auditing method based on semantic analysis
CN110334212A (en) A kind of territoriality audit knowledge mapping construction method based on machine learning
CN108491438A (en) A kind of technology policy retrieval analysis method
CN111401040B (en) Keyword extraction method suitable for word text
CN102779143B (en) Visualizing method for knowledge genealogy
CN104462216B (en) Occupy committee's standard code converting system and method
CN106844527B (en) Road surface disease identification and management decision-making method and system based on internet big data
CN110704577A (en) Method and system for searching power grid scheduling data
CN111737421A (en) Intellectual property big data information retrieval system and storage medium
CN106934054A (en) The accurate analysis method of enterprise's segmented industry and its system based on big data
CN110334904A (en) Key message types of infrastructures unit based on LightGBM belongs to determination method
CN111008215B (en) Expert recommendation method combining label construction and community relation avoidance
CN115794803A (en) Engineering audit problem monitoring method and system based on big data AI technology
CN115796797A (en) Power grid science and technology project evaluation system and method based on two-dimensional cloud picture
CN113421037A (en) Multi-source collaborative construction planning compilation method and device
CN113129188A (en) Provincial education teaching evaluation system based on artificial intelligence big data
CN113538011B (en) Method for associating non-booked contact information with booked user in electric power system
CN113590684A (en) Non-tax payment big data analysis system
CN111666378A (en) Chinese yearbook title classification method based on word vectors
CN118051612B (en) Industry classification system and method
Szczech-Pietkiewicz et al. Smart and sustainable city management in Asia and Europe: A bibliometric analysis
Wang Analysis and evaluation of engineering job demand based on big data technology
ASCHERI et al. Online Job Advertisements for Labour Market Statistics using R.
Zhang Intelligent Mining Method of Massive Digital Archives Based on Artificial Intelligence
Wu China’s metals industry (II)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210202