CN116483697A - Method and system for evaluating artificial intelligence model - Google Patents
- Publication number
- CN116483697A, CN202310322860.6A
- Authority
- CN
- China
- Prior art keywords
- test
- target artificial
- artificial intelligent
- model
- intelligent model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Prevention of errors by analysis, debugging or testing of software
- G06F11/3668—Testing of software
- G06F11/3672—Test management
- G06F11/3684—Test management for test design, e.g. generating new test cases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Prevention of errors by analysis, debugging or testing of software
- G06F11/3668—Testing of software
- G06F11/3672—Test management
- G06F11/3688—Test management for test execution, e.g. scheduling of test suites
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Prevention of errors by analysis, debugging or testing of software
- G06F11/3668—Testing of software
- G06F11/3672—Test management
- G06F11/3692—Test management for test results analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
- G06Q10/06393—Score-carding, benchmarking or key performance indicator [KPI] analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Business, Economics & Management (AREA)
- Human Resources & Organizations (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- General Engineering & Computer Science (AREA)
- Educational Administration (AREA)
- Development Economics (AREA)
- Economics (AREA)
- Entrepreneurship & Innovation (AREA)
- Strategic Management (AREA)
- Game Theory and Decision Science (AREA)
- Marketing (AREA)
- Operations Research (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a method and a system for evaluating an artificial intelligence model, and belongs to the technical field of artificial intelligence. The method comprises the following steps: for the performance of a target artificial intelligence model, determining a test evaluation index for evaluating the target artificial intelligence model, and establishing a test framework for evaluating the target artificial intelligence model based on the test evaluation index; generating a test strategy for evaluating the target artificial intelligence model according to the test framework and the test evaluation index of the target artificial intelligence model to be tested; and selecting a preset test technique, and evaluating the target artificial intelligence model through the test strategy based on the test framework. Because the test evaluation index is determined according to the performance of the target artificial intelligence model, the test framework is formulated according to the test evaluation index, and the test strategy is derived from the test framework, the performance of the artificial intelligence model can be tested comprehensively.
Description
Technical Field
The present invention relates to the field of artificial intelligence technology, and more particularly, to a method and system for evaluating an artificial intelligence model.
Background
Artificial intelligence models are widely applied in power grid analysis. Typical scenarios fall into three main directions: power flow adjustment, rapid stability assessment, and preventive control decision support. Power flow adjustment includes specific tasks such as achieving power flow convergence and adjusting to a designated target; rapid stability assessment involves indexes such as transient stability and small-signal stability; and preventive control decision support concerns adjusting the operation mode in response to different stability problems. Because the functions and goals of these three business scenarios differ, their emphasis on the performance and functional indexes of the artificial intelligence model also differs. Second, because power business scenarios are diverse, evaluating an artificial intelligence model requires not only primary indexes such as accuracy but also application indexes such as model scale, storage space, and training and application speed. Finally, the effectiveness of an artificial intelligence model is closely tied to its data set, so evaluation must take the data set into account; that is, the model and the sample data are evaluated separately and then analyzed together.
After long-term operation, the power system has accumulated a large amount of business and simulation data. Based on data from the actual business systems and on simulation data, the conditions are in place for applying advanced artificial intelligence methods to rapid assessment of the power system. However, many models perform well on test sets but deliver unsatisfactory results in practical application.
Because online data is often concentrated in a partial region of the operating space, the adaptability of the original model declines when the operation mode of the power grid changes due to faults or over time. On the other hand, if samples are generated randomly, the input space of the power grid has such high dimension that supplementing samples is inefficient and the requirements are difficult to meet. Therefore, evaluation that relies solely on the accuracy of the model on the test set is insufficient, and the artificial intelligence model needs to be evaluated comprehensively.
Disclosure of Invention
In view of the above problems, the present invention proposes a method for evaluating an artificial intelligence model, comprising:
for the performance and functional requirements of the target artificial intelligence model, determining a test evaluation index for evaluating the target artificial intelligence model, and establishing a test framework for evaluating the target artificial intelligence model based on the test evaluation index;
generating a test strategy for evaluating the target artificial intelligence model according to the test framework and the test evaluation index of the target artificial intelligence model to be tested;
and selecting a preset test technique, and evaluating the target artificial intelligence model through the test strategy based on the test framework.
Optionally, the test evaluation index includes at least one of: a correctness and effectiveness index of the functions of the target artificial intelligence model, a correctness and security index of the code of the target artificial intelligence model, an adaptability index of the target artificial intelligence model to the data set, an adversarial robustness and attack resistance index of the target artificial intelligence model, and a dependence index of the target artificial intelligence model on the software and hardware platform.
Optionally, the test framework includes:
an evaluation parameter configuration module, used for adjusting parameters of the test framework according to the test evaluation index;
a function test module, used for testing the function of the target artificial intelligence model;
a robustness test module, used for testing the robustness of the target artificial intelligence model;
a security test module, used for testing the security of the target artificial intelligence model;
an efficiency test module, used for testing the time performance and the space performance of the target artificial intelligence model;
a comprehensive evaluation module, used for calculating the score of each test evaluation index according to the test results of the function test module, the robustness test module, the security test module and the efficiency test module, calculating the weight coefficient of each index based on the weight coefficients provided by a user or on user configuration parameters, and determining the comprehensive evaluation score of the target artificial intelligence model from the score and the weight coefficient of each test evaluation index;
and an interaction control module, used for importing the target artificial intelligence model, its parameters, and a data set for evaluating the target artificial intelligence model into the test framework.
Optionally, a data set for evaluating the target artificial intelligence model is selected according to a test evaluation index of the target artificial intelligence model to be tested.
Optionally, the test strategy includes:
determining the test evaluation index of the target artificial intelligence model to be tested; adjusting the test framework parameters according to the determined test evaluation index; importing the target artificial intelligence model, its parameters, and a data set for evaluating the target artificial intelligence model into the test framework through the interaction control module; selecting at least one of the function test module, the robustness test module, the security test module and the efficiency test module of the test framework; and testing the target artificial intelligence model with the selected modules and the comprehensive evaluation module.
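The sequence of steps in this test strategy can be sketched in code. The following is a minimal illustration only: the patent does not define a programming interface, so `TestStrategy`, `functional_module`, `toy_model` and the dictionary keys are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type: a test module maps (model, dataset) to a score in [0, 1].
TestModule = Callable[[Callable, list], float]


@dataclass
class TestStrategy:
    indices: List[str]                  # selected test evaluation indexes
    weights: Dict[str, float]           # relative importance of each index
    modules: Dict[str, TestModule] = field(default_factory=dict)

    def run(self, model, datasets):
        """Run each selected module on its own data set; return per-index scores."""
        return {name: self.modules[name](model, datasets[name])
                for name in self.indices}


def functional_module(model, dataset):
    """Fraction of samples whose prediction matches the label."""
    correct = sum(1 for x, y in dataset if model(x) == y)
    return correct / len(dataset)


def toy_model(x):
    # Stand-in for the imported target model (assumption for illustration).
    return 1 if x >= 0.5 else 0


strategy = TestStrategy(indices=["function"],
                        weights={"function": 1.0},
                        modules={"function": functional_module})
scores = strategy.run(toy_model,
                      {"function": [(0.9, 1), (0.1, 0), (0.7, 0), (0.2, 0)]})
print(scores)  # {'function': 0.75}
```

Keeping one data set per index mirrors the statement that the data set is selected according to the test evaluation index being tested.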
Optionally, the preset test technique includes at least one of the following: metamorphic testing, fuzz testing, mutation testing, and adversarial testing;
testing the function of the target artificial intelligence model based on the function test module through the metamorphic testing technique;
generating, through the fuzz testing technique, a large number of random data sets for the target artificial intelligence model according to the test evaluation indexes to be tested, and evaluating the artificial intelligence model with the random data sets;
verifying the quality of the data set for evaluating the target artificial intelligence model through the mutation testing technique;
testing the security of the target artificial intelligence model based on the security test module through the adversarial testing technique.
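As an illustration of the metamorphic technique named above, the sketch below checks a metamorphic relation: a known transformation of the input should leave the model output unchanged, so no labeled expected output is needed. The models, the transformation (reversing the feature order) and the relation are assumptions chosen for the example, not details taken from the patent.

```python
def check_metamorphic_relation(model, inputs, transform, relation):
    """Return the inputs for which relation(model(x), model(transform(x))) fails."""
    violations = []
    for x in inputs:
        if not relation(model(x), model(transform(x))):
            violations.append(x)
    return violations


def sum_model(features):
    # Hypothetical model: flags an overload when the total exceeds a threshold.
    return sum(features) > 10.0


def first_feature_model(features):
    # Hypothetical faulty model: depends on the position of a feature.
    return features[0] > 4.0


def reverse_order(features):
    return list(reversed(features))


def same_output(a, b):
    return a == b


inputs = [[1.0, 2.0, 3.0], [5.0, 6.0, 0.5], [4.0, 4.0, 4.0]]
print(check_metamorphic_relation(sum_model, inputs, reverse_order, same_output))
# []
print(check_metamorphic_relation(first_feature_model, inputs, reverse_order, same_output))
# [[5.0, 6.0, 0.5]]
```

An empty list means the relation held on every input; any returned input is a concrete counterexample that can be reported by the function test module.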
In yet another aspect, the present invention also provides a system for evaluating an artificial intelligence model, comprising:
an initial unit, used for determining, for the performance of the target artificial intelligence model, a test evaluation index for evaluating the target artificial intelligence model, and establishing a test framework for evaluating the target artificial intelligence model based on the test evaluation index;
a strategy making unit, used for generating a test strategy for evaluating the target artificial intelligence model according to the test framework and the test evaluation index of the target artificial intelligence model to be tested;
and an evaluation unit, used for selecting a preset test technique and evaluating the target artificial intelligence model through the test strategy based on the test framework.
Optionally, the test evaluation index determined by the initial unit includes at least one of the following: a correctness and effectiveness index of the functions of the target artificial intelligence model, a correctness and security index of the code of the target artificial intelligence model, an adaptability index of the target artificial intelligence model to the data set, an adversarial robustness and attack resistance index of the target artificial intelligence model, and a dependence index of the target artificial intelligence model on the software and hardware platform.
Optionally, the test framework established by the initial unit includes:
an evaluation parameter configuration module, used for adjusting parameters of the test framework according to the test evaluation index;
a function test module, used for testing the function of the target artificial intelligence model;
a robustness test module, used for testing the robustness of the target artificial intelligence model;
a security test module, used for testing the security of the target artificial intelligence model;
an efficiency test module, used for testing the time performance and the space performance of the target artificial intelligence model;
a comprehensive evaluation module, used for calculating the score of each test evaluation index according to the test results of the function test module, the robustness test module, the security test module and the efficiency test module, calculating the weight coefficient of each index based on the weight coefficients provided by a user or on user configuration parameters, and determining the comprehensive evaluation score of the target artificial intelligence model from the score and the weight coefficient of each test evaluation index;
and an interaction control module, used for importing the target artificial intelligence model, its parameters, and a data set for evaluating the target artificial intelligence model into the test framework.
Optionally, a data set for evaluating the target artificial intelligence model is selected according to a test evaluation index of the target artificial intelligence model to be tested.
Optionally, the test strategy generated by the strategy making unit includes:
determining the test evaluation index of the target artificial intelligence model to be tested; adjusting the test framework parameters according to the determined test evaluation index; importing the target artificial intelligence model, its parameters, and a data set for evaluating the target artificial intelligence model into the test framework through the interaction control module; selecting at least one of the function test module, the robustness test module, the security test module and the efficiency test module of the test framework; and testing the target artificial intelligence model with the selected modules and the comprehensive evaluation module.
Optionally, the preset test technique selected by the evaluation unit includes at least one of the following: metamorphic testing, fuzz testing, mutation testing, and adversarial testing;
testing the function of the target artificial intelligence model based on the function test module through the metamorphic testing technique;
generating, through the fuzz testing technique, a large number of random data sets for the target artificial intelligence model according to the test evaluation indexes to be tested, and evaluating the artificial intelligence model with the random data sets;
verifying the quality of the data set for evaluating the target artificial intelligence model through the mutation testing technique;
testing the security of the target artificial intelligence model based on the security test module through the adversarial testing technique.
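The technique of generating a large number of random data sets and evaluating the model on them (commonly called fuzz testing) might look like the following minimal sketch. The input range, the validity check and both toy models are assumptions made for illustration; a real evaluation would draw the ranges from the test evaluation indexes under test.

```python
import random


def fuzz_model(model, n_features, n_cases, low, high, is_valid_output, seed=0):
    """Feed random inputs to the model; collect crashes and invalid outputs."""
    rng = random.Random(seed)  # fixed seed so a failing case can be reproduced
    failures = []
    for _ in range(n_cases):
        x = [rng.uniform(low, high) for _ in range(n_features)]
        try:
            y = model(x)
        except Exception as exc:
            failures.append((x, repr(exc)))
            continue
        if not is_valid_output(y):
            failures.append((x, f"invalid output {y!r}"))
    return failures


def clipped_mean_model(x):
    # Hypothetical model: a score that is always clipped to [0, 1].
    return min(1.0, max(0.0, sum(x) / len(x)))


def fragile_model(x):
    # Hypothetical faulty model: raises ZeroDivisionError when |x[0]| < 1.
    return 1.0 / int(x[0])


def in_unit_interval(y):
    return 0.0 <= y <= 1.0


robust_failures = fuzz_model(clipped_mean_model, 4, 1000, -5.0, 5.0, in_unit_interval)
fragile_failures = fuzz_model(fragile_model, 4, 1000, -5.0, 5.0, in_unit_interval)
print(len(robust_failures), len(fragile_failures) > 0)
```

Each recorded failure keeps the offending input, so it can be replayed against the model later.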
In yet another aspect, the present invention also provides a computing device comprising: one or more processors;
a processor for executing one or more programs;
the method as described above is implemented when the one or more programs are executed by the one or more processors.
In yet another aspect, the present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed, implements a method as described above.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides a systematic method for evaluating an artificial intelligence model, which comprises the following steps: for the performance of the target artificial intelligence model, determining a test evaluation index for evaluating the target artificial intelligence model, and establishing a test framework for evaluating the target artificial intelligence model based on the test evaluation index; generating a test strategy for evaluating the target artificial intelligence model according to the test framework and the test evaluation index of the target artificial intelligence model to be tested; and selecting a preset test technique, and evaluating the target artificial intelligence model through the test strategy based on the test framework. Because the test evaluation index is determined according to the performance of the target artificial intelligence model, the test framework is formulated according to the test evaluation index, and the test strategy is derived from the test framework, the indexes of the artificial intelligence model in all aspects, such as function, performance, security, robustness and efficiency, can be tested comprehensively based on the test strategy of the invention.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a method test framework of the present invention;
FIG. 3 is a flow chart of the test performed by the method test framework of the present invention;
FIG. 4 is a schematic flow chart of the method test of the present invention;
FIG. 5 is a block diagram of the system of the present invention.
Detailed Description
The exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, however, the present invention may be embodied in many different forms and is not limited to the examples described herein, which are provided to fully and completely disclose the present invention and fully convey the scope of the invention to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to be limiting of the invention. In the drawings, like elements/components are referred to by like reference numerals.
Unless otherwise indicated, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. In addition, it will be understood that terms defined in commonly used dictionaries should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.
Example 1:
The invention provides a method for evaluating an artificial intelligence model which, as shown in FIG. 1, comprises the following steps:
step 1, for the performance of a target artificial intelligence model, determining a test evaluation index for evaluating the target artificial intelligence model, and establishing a test framework for evaluating the target artificial intelligence model based on the test evaluation index;
step 2, generating a test strategy for evaluating the target artificial intelligence model according to the test framework and the test evaluation index of the target artificial intelligence model to be tested;
step 3, selecting a preset test technique, and evaluating the target artificial intelligence model through the test strategy based on the test framework.
The test evaluation index comprises at least one of the following: a correctness and effectiveness index of the functions of the target artificial intelligence model, a correctness and security index of the code of the target artificial intelligence model, an adaptability index of the target artificial intelligence model to the data set, an adversarial robustness and attack resistance index of the target artificial intelligence model, and a dependence index of the target artificial intelligence model on the software and hardware platform.
The test framework includes:
an evaluation parameter configuration module, used for adjusting parameters of the test framework according to the test evaluation index;
a function test module, used for testing the function of the target artificial intelligence model;
a robustness test module, used for testing the robustness of the target artificial intelligence model;
a security test module, used for testing the security of the target artificial intelligence model;
an efficiency test module, used for testing the time performance and the space performance of the target artificial intelligence model;
a comprehensive evaluation module, used for calculating the score of each test evaluation index according to the test results of the function test module, the robustness test module, the security test module and the efficiency test module, calculating the weight coefficient of each index based on the weight coefficients provided by a user or on user configuration parameters, and determining the comprehensive evaluation score of the target artificial intelligence model from the score and the weight coefficient of each test evaluation index;
and an interaction control module, used for importing the target artificial intelligence model, its parameters, and a data set for evaluating the target artificial intelligence model into the test framework.
A data set for evaluating the target artificial intelligence model is selected according to the test evaluation index of the target artificial intelligence model to be tested.
The test strategy comprises:
determining the test evaluation index of the target artificial intelligence model to be tested; adjusting the test framework parameters according to the determined test evaluation index; importing the target artificial intelligence model, its parameters, and a data set for evaluating the target artificial intelligence model into the test framework through the interaction control module; selecting at least one of the function test module, the robustness test module, the security test module and the efficiency test module of the test framework; and testing the target artificial intelligence model with the selected modules and the comprehensive evaluation module.
The preset test technique comprises at least one of the following: metamorphic testing, fuzz testing, mutation testing, and adversarial testing;
testing the function of the target artificial intelligence model based on the function test module through the metamorphic testing technique;
generating, through the fuzz testing technique, a large number of random data sets for the target artificial intelligence model according to the test evaluation indexes to be tested, and evaluating the artificial intelligence model with the random data sets;
verifying the quality of the data set for evaluating the target artificial intelligence model through the mutation testing technique;
testing the security of the target artificial intelligence model based on the security test module through the adversarial testing technique.
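One common way to use mutation to judge the quality of an evaluation data set is a mutation score: deliberately altered copies of the model ("mutants") are run on the data set, and the data set is considered stronger the more mutants it distinguishes from the original. The patent does not state how its data set verification is computed, so the sketch below is an assumption, with a hypothetical threshold model and hand-picked mutants.

```python
def make_threshold_model(threshold):
    """Hypothetical one-feature classifier: predicts 1 when x >= threshold."""
    return lambda x: 1 if x >= threshold else 0


def mutation_score(original, mutants, inputs):
    """Fraction of mutants whose predictions differ from the original
    on at least one sample of the data set (such a mutant is 'killed')."""
    reference = [original(x) for x in inputs]
    killed = sum(1 for m in mutants if [m(x) for x in inputs] != reference)
    return killed / len(mutants)


original = make_threshold_model(0.5)
mutants = [make_threshold_model(t) for t in (0.3, 0.45, 0.7, 0.9)]

sparse_inputs = [0.1, 0.4, 0.6, 0.8]
dense_inputs = [0.1, 0.4, 0.47, 0.6, 0.8]

print(mutation_score(original, mutants, sparse_inputs))  # 0.75
print(mutation_score(original, mutants, dense_inputs))   # 1.0
```

The mutant with threshold 0.45 survives the sparse data set because no sample lies between 0.45 and 0.5; adding the sample 0.47 kills it, which is the kind of gap such a score is meant to expose.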
The test evaluation index is determined according to the performance and function of the target artificial intelligence model, the test framework is formulated according to the test evaluation index, and the test strategy is derived from the test framework, so that the performance and function of the artificial intelligence model can be tested comprehensively based on the test strategy.
The invention is further illustrated by the following examples:
First, for the test evaluation of an artificial intelligence model, the invention proposes the following test evaluation indexes:
Correctness and effectiveness of function: whether the model can perform its intended function, such as classification or prediction. In addition, how effectively these functions are performed must be measured, for example by classification accuracy and false alarm rate.
Code correctness and security: whether the code follows coding standards and whether it contains vulnerabilities.
Adaptability to data sets: including the proportion of different samples in the data set, the size of the data set, the number of annotations in the data set, and the contamination level of the data set.
Adversarial robustness: measures the model's resistance to adversarial samples generated in different ways.
Dependency on software and hardware platform: evaluates the performance differences of the deep model on different software and hardware platforms, and its adaptability to and dependence on those platforms, including deep learning/machine learning framework dependencies, operating system dependencies, and hardware platform architecture dependencies.
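Some of the data set properties listed under adaptability (sample proportions, data set size, number of annotations) are simple to compute. The sketch below is illustrative only; the label values and the convention that `None` marks a sample without annotation are assumptions, and the contamination level is not computed because the text does not define how it is measured.

```python
from collections import Counter


def dataset_statistics(labels):
    """labels: one entry per sample; None marks a sample without annotation."""
    annotated = [y for y in labels if y is not None]
    counts = Counter(annotated)
    proportions = ({k: v / len(annotated) for k, v in counts.items()}
                   if annotated else {})
    return {
        "size": len(labels),
        "annotated": len(annotated),
        "class_proportions": proportions,
    }


stats = dataset_statistics(["stable", "stable", "unstable", None, "stable"])
print(stats)
# {'size': 5, 'annotated': 4, 'class_proportions': {'stable': 0.75, 'unstable': 0.25}}
```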
Second, for the main index evaluation system of typical artificial intelligence and machine learning models, the following test framework is provided:
As shown in FIG. 2, the overall test framework includes seven core modules: an evaluation parameter configuration module, a function test module, a robustness test module, a security test module, an efficiency test module, a comprehensive evaluation module and an interaction control module.
Based on the test framework, a specific method for evaluating and testing an artificial intelligence model is provided.
Evaluation parameter setting: the evaluator selects the index features of the different aspects of the artificial intelligence model to be tested according to the power business scenario and deployment mode, and sets the relative importance of each index feature. For example, a functional index, a robustness index and a security index may be selected.
Model and parameter import: the model structure information and model parameter information of the artificial intelligence model to be tested are imported into the evaluation system. The model structure information is the hypothesis function describing the model; it is generally provided as code and exposes a call interface. The evaluation system calls the function describing the model under test through this interface. The model parameter information is the set of best-performing parameters obtained during the model training stage.
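The split between model structure (code exposing a call interface) and model parameters (values obtained in training) can be illustrated as follows. Everything here is hypothetical: the `ModelUnderTest` interface, the `LinearClassifier` structure and the JSON parameter format are assumptions for the sketch, not an interface defined by the patent.

```python
import json
from typing import Protocol, Sequence


class ModelUnderTest(Protocol):
    """Call interface the evaluation system expects (hypothetical)."""

    def load_parameters(self, params: dict) -> None: ...

    def predict(self, features: Sequence[float]) -> int: ...


class LinearClassifier:
    """Toy model structure: predicts 1 when w.x + b >= 0, otherwise 0."""

    def __init__(self) -> None:
        self.weights = []
        self.bias = 0.0

    def load_parameters(self, params: dict) -> None:
        self.weights = [float(w) for w in params["weights"]]
        self.bias = float(params["bias"])

    def predict(self, features: Sequence[float]) -> int:
        z = sum(w * x for w, x in zip(self.weights, features)) + self.bias
        return 1 if z >= 0 else 0


def import_model(model: ModelUnderTest, params_json: str) -> ModelUnderTest:
    """Load trained parameters (here a JSON string) into the model structure."""
    model.load_parameters(json.loads(params_json))
    return model


model = import_model(LinearClassifier(), '{"weights": [1.0, -2.0], "bias": 0.5}')
print(model.predict([1.0, 0.5]))  # 1.0 - 1.0 + 0.5 = 0.5 >= 0, so prints 1
```

Because the evaluation system only depends on the interface, any model structure that provides the same two methods could be evaluated without changing the test modules.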
Test data set import: the user provides the selected file path, and the module parses the file path and the picture information of the test set to obtain the test data set. Different test data sets need to be imported for testing different aspects of the artificial intelligence model.
Specific testing of the evaluation indexes: depending on the specific scenario and the index requirements for the model, either the indexes of a particular aspect can be tested, or the indexes of different aspects of the artificial intelligence model under test can be tested comprehensively.
Functional test: the functional indexes corresponding to the functional requirements that a specific business scenario places on the artificial intelligence model, such as classification or prediction, are tested. The function test module uses a normal test data set to compute the performance indexes. Because the test data set has already been imported in the parameter configuration module, the corresponding functional indexes, such as test accuracy and false alarm rate, can be obtained simply by running the artificial intelligence model on that test data set.
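For a binary classifier, the two functional indexes named above can be computed from the counts of true and false positives and negatives. The sketch below uses the standard definitions (accuracy as the fraction of correct predictions, false alarm rate as false positives over all actual negatives); the label encoding and sample data are assumptions for the example.

```python
def accuracy_and_false_alarm_rate(y_true, y_pred, positive=1):
    """Return (accuracy, false alarm rate) for binary labels."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(y_true)
    # False alarm rate: share of actual negatives flagged as positive.
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, false_alarm_rate


y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]
print(accuracy_and_false_alarm_rate(y_true, y_pred))  # (0.75, 0.2)
```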
Robustness test: this module tests the robustness of the model under plausible abnormal data. Abnormal data therefore needs to be prepared in a targeted manner before testing, so that the data set more closely resembles the complex data of an actual business system scenario; the model is then run and the indexes are computed.
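A minimal sketch of this flow follows, assuming two kinds of abnormal data that the text does not itself specify: Gaussian measurement noise and features that drop out to zero. The function names, noise level and toy model are likewise assumptions.

```python
import random


def accuracy(model, dataset):
    return sum(1 for x, y in dataset if model(x) == y) / len(dataset)


def perturb(dataset, sigma, missing_rate, rng):
    """Add Gaussian noise and zero out some features to mimic abnormal data."""
    abnormal = []
    for x, y in dataset:
        noisy = [0.0 if rng.random() < missing_rate else v + rng.gauss(0.0, sigma)
                 for v in x]
        abnormal.append((noisy, y))
    return abnormal


def robustness_report(model, dataset, sigma=0.1, missing_rate=0.05, seed=0):
    """Compare accuracy on the clean data set and on its perturbed copy."""
    rng = random.Random(seed)
    clean = accuracy(model, dataset)
    abnormal = accuracy(model, perturb(dataset, sigma, missing_rate, rng))
    return {"clean_accuracy": clean,
            "abnormal_accuracy": abnormal,
            "drop": clean - abnormal}


def toy_model(x):
    # Hypothetical classifier used only to exercise the report.
    return 1 if sum(x) >= 1.0 else 0


dataset = [([0.9, 0.8], 1), ([0.1, 0.2], 0), ([0.6, 0.7], 1), ([0.3, 0.1], 0)]
report = robustness_report(toy_model, dataset, sigma=0.2, missing_rate=0.1)
print(report)
```

The accuracy drop between the clean and perturbed data sets is one possible robustness index; a smaller drop indicates a more robust model.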
Safety test: the module mainly tests the model's ability to resist adversarial sample attacks. Its specific functions comprise three parts: constructing adversarial samples, running the model, and computing the index statistics.
Efficiency test: the module tests the time performance and space performance indexes of the running model. Time performance indexes are collected by measuring the run time of the code dynamically, in real time. Because the model's run time is short and the operating system state is uncertain, a single time measurement carries some error, so the time is measured multiple times and averaged. The space performance test mainly measures the consumption of CPU, memory, electric energy and the like while the model runs; these indexes can be collected with existing IT equipment test tools.
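The repeated-measurement idea above can be sketched as follows, using only the Python standard library. This is an assumption-laden illustration: `toy_inference` stands in for model inference, `measure_efficiency` is a hypothetical helper, and `tracemalloc` only reports memory allocated by Python itself, not CPU load or energy consumption.

```python
import statistics
import time
import tracemalloc


def measure_efficiency(run_model, inputs, repeats=20):
    """Average wall-clock time over `repeats` runs plus peak Python memory of one run."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_model(inputs)
        durations.append(time.perf_counter() - start)
    # Memory is traced in a separate run because tracing slows execution.
    tracemalloc.start()
    run_model(inputs)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "mean_seconds": statistics.mean(durations),
        "stdev_seconds": statistics.stdev(durations) if repeats > 1 else 0.0,
        "peak_bytes": peak_bytes,
    }


def toy_inference(batch):
    """Hypothetical stand-in for model inference."""
    return [sum(v * v for v in row) for row in batch]


report = measure_efficiency(toy_inference, [[0.1] * 64 for _ in range(256)])
print(report)
```

Reporting the standard deviation alongside the mean gives a sense of how much the operating system state disturbed the individual measurements.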
Comprehensive evaluation module: based on weight coefficients provided by the user, or weight coefficients calculated from user configuration parameters, the module synthesizes the evaluation results of each test module, calculates the index scores for each aspect of the model, and calculates the final comprehensive quality score as the final comprehensive evaluation index.
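The patent does not give the aggregation formula, so the following is only one plausible reading: a weighted average of per-aspect scores with the weights normalised to sum to 1. The aspect names, the 0-100 scale and all numbers are hypothetical.

```python
def comprehensive_score(scores, weights):
    """Weighted average of per-aspect scores; weights are normalised to sum to 1."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same aspects")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[k] * weights[k] / total for k in scores)


# Hypothetical per-aspect scores (0-100) and user-provided weights.
scores = {"function": 92.0, "robustness": 80.0, "security": 70.0, "efficiency": 88.0}
weights = {"function": 4, "robustness": 2, "security": 2, "efficiency": 2}
overall = comprehensive_score(scores, weights)
print(round(overall, 2))  # 84.4
```

Normalising inside the function lets the user supply weights on any scale, which matches the idea of deriving the coefficients from user configuration parameters.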
Interaction control module: the module provides a visual interface for the user, with operation and viewing functions, and realizes dynamic interaction between the user and the system so that the relevant parameters of the model test can be configured and the test results can be retrieved and displayed.
The module calls the interfaces of the individual characteristic test modules to complete the whole evaluation flow, including control of the system flow such as creating a new task, importing the model, importing the test set and executing the task. According to the user's input instruction, the corresponding background module is called for processing, and the processing result is displayed to the tester; the result includes index data such as the P-R curve, the ROC curve and the confusion matrix, as well as the comprehensive evaluation score.
Finally, aiming at the test requirements of the characteristics of different aspects of the artificial intelligence model, the following specific test schemes are provided:
Metamorphic testing technique: mainly used to evaluate the functional correctness of the artificial intelligence system. Using known test cases of the model under test, a metamorphic relation between the inputs and outputs of multiple executions is constructed. Follow-up test cases are generated from the source test cases based on this relation, and the output expected under the relation serves as the test oracle; when the actual output differs from the expected result, the system contains an error.
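A minimal sketch of a check built on a metamorphic relation, under stated assumptions: `toy_classifier` is a hypothetical bias-free linear model, and the chosen relation (scaling the input by a positive factor must not change the predicted class) is just one example of a relation that should hold for such a model.

```python
import random


def toy_classifier(x):
    """Hypothetical model under test: argmax over two bias-free linear scores."""
    w = [[0.8, -0.3, 0.5], [-0.2, 0.9, 0.1]]
    scores = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    return scores.index(max(scores))


def scale_relation(x, factor=2.0):
    """Build the follow-up case: positive scaling should keep the predicted class."""
    return [factor * xi for xi in x]


def metamorphic_test(model, source_cases, transform):
    """Return the source cases whose follow-up case yields a different output."""
    violations = []
    for x in source_cases:
        expected = model(x)            # source output serves as the oracle
        actual = model(transform(x))   # output on the follow-up test case
        if actual != expected:
            violations.append(x)
    return violations


rng = random.Random(1)
sources = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(100)]
violations = metamorphic_test(toy_classifier, sources, scale_relation)
print(len(violations))
```

No labelled expected outputs are needed: the relation itself supplies the oracle, which is why the technique suits models whose correct output is hard to specify in advance.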
Fuzz testing technique: the adequacy of deep neural network testing is improved by generating a large amount of random data. A constraint-based generation strategy is first formulated to generate test cases, which are then used for a large number of tests. When a test execution fails, a potential vulnerability of the system has been found; with enough test cases and sufficient randomness, the probability of discovering deeply hidden software vulnerabilities is greatly improved.
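The random-generation approach above (commonly called fuzz testing) can be sketched as below. The model wrapper, its planted defect and the generation constraints are all hypothetical, chosen so that random inputs have a realistic chance of triggering the failure.

```python
import random


def model_under_test(x):
    """Hypothetical model wrapper with a hidden defect: it divides by the feature sum."""
    total = sum(x)
    return [v / total for v in x]  # raises ZeroDivisionError when total == 0


def generate_case(rng, length=3, low=-2, high=2):
    """Constraint-based generation: fixed length, integer features in [low, high]."""
    return [rng.randint(low, high) for _ in range(length)]


def fuzz(model, n_cases, seed=0):
    """Run the model on random cases and collect the inputs that make it fail."""
    rng = random.Random(seed)
    failures = []
    for _ in range(n_cases):
        case = generate_case(rng)
        try:
            model(case)
        except Exception as exc:  # any crash counts as a finding
            failures.append((case, type(exc).__name__))
    return failures


failures = fuzz(model_under_test, n_cases=1000)
print(len(failures), failures[:1])
```

Recording the failing input together with the exception type makes each finding reproducible, and fixing the seed makes the whole campaign repeatable.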
Mutation testing: mainly used to verify the quality of the test set. Variants of the artificial intelligence model are generated through mutation operators, and the original model and the variant models are then tested separately to check whether a variant model behaves differently from the original model. When the execution of a variant differs from that of the original model, the input validation data set is shown to be effective.
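The variant-based check above (mutation testing in common usage) can be sketched as a mutation score: the fraction of variants that a test set manages to tell apart from the original model. The linear threshold model, the weight-negation operator and both test sets are assumptions made for illustration only.

```python
import random


def predict(weights, x):
    """Hypothetical linear threshold model."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else 0


def mutate(weights, rng):
    """Mutation operator: negate one randomly chosen weight."""
    mutant = list(weights)
    i = rng.randrange(len(mutant))
    mutant[i] = -mutant[i]
    return mutant


def mutation_score(weights, test_inputs, n_mutants=50, seed=0):
    """Fraction of mutants whose predictions differ from the original on some input."""
    rng = random.Random(seed)
    killed = 0
    for _ in range(n_mutants):
        mutant = mutate(weights, rng)
        if any(predict(mutant, x) != predict(weights, x) for x in test_inputs):
            killed += 1
    return killed / n_mutants


weights = [1.0, -2.0, 0.5]
weak_set = [[0.0, 0.0, 0.0]]  # cannot distinguish any mutant
strong_set = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(mutation_score(weights, weak_set), mutation_score(weights, strong_set))  # 0.0 1.0
```

A test set with a low score lets faulty variants pass unnoticed, which is the sense in which the score measures test set quality rather than model quality.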
Adversarial testing: adversarial samples are used to test whether the artificial intelligence model can still work effectively under an adversarial sample attack, or whether its expected functionality and performance are compromised. An adversarial sample is an input sample to which a tiny perturbation has been added; compared with the original sample, the change is imperceptible to the naked eye, but it can cause the machine learning model to make a wrong judgment or prediction.
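The small-perturbation attack described above can be illustrated with a sign-gradient step in the style of the fast gradient sign method. The patent does not name a specific attack, so this choice is an assumption, as are the logistic-regression model, its weights and the perturbation budget `eps`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Hypothetical logistic-regression model: probability of class 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(g):
    return (g > 0) - (g < 0)


def fgsm(w, b, x, y, eps):
    """One sign-gradient step that increases the log loss for the true label y."""
    p = predict_proba(w, b, x)
    # For logistic regression, d(loss)/d(x_i) = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


w, b = [2.0, -1.0], 0.0
x, y = [0.3, 0.2], 1
x_adv = fgsm(w, b, x, y, eps=0.25)
print(predict_proba(w, b, x) > 0.5, predict_proba(w, b, x_adv) > 0.5)  # True False
```

Each feature moves by at most `eps`, yet the predicted class flips; the share of samples for which this happens is one possible security index.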
The flow of the test of the present invention is further described below in conjunction with fig. 3 and 4:
The test flow of the invention is shown in fig. 3. First, a test plan for the artificial intelligence model is formulated. At this stage, the requirements that the artificial intelligence application scenario places on the model and the indexes of the artificial intelligence model to be tested must be considered, so that the test plan can be formulated accordingly; the output of this stage is the test plan.
Based on the test plan, the artificial intelligence test cases are written. A test case is written for each test item and comprises the preconditions, the test case input, the test case output and the specific test steps; the output of this stage is the test cases.
In order to execute the test case, a test environment is built based on special hardware equipment or cloud computing facilities and system software, and an output result is a test platform.
To perform the various characteristic tests on the artificial intelligence model, multiple data sets need to be prepared, such as test data sets and validation data sets, robustness test data sets containing abnormal data, and security test data sets containing adversarial samples.
And executing artificial intelligent model test according to the test case, and recording test results of the evaluation indexes.
And writing an artificial intelligent model test report, and comprehensively analyzing each index result of the test record.
Based on the test flow, objective and comprehensive evaluation of the artificial intelligent model can be completed.
For different indexes, the test flow, as shown in fig. 4, includes:
The first step: first, a validation test data set is prepared for the typical index tests of the artificial intelligence model. Then, as needed, a data set containing abnormal samples is prepared, as well as adversarial sample data sets generated using different methods. The model's adaptability to the data set can also be tested by adjusting the proportions of validation data, adversarial data and abnormal data in the test data set.
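Adjusting the proportions of the three kinds of data can be sketched as a simple sampling routine. The function name `build_test_set`, the source names and the placeholder samples are hypothetical; sampling with replacement is an assumption made so that small pools can still fill their share.

```python
import random


def build_test_set(validation, adversarial, abnormal, proportions, size, seed=0):
    """Sample a mixed test set; `proportions` maps source name to fraction of `size`."""
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    pools = {"validation": validation, "adversarial": adversarial, "abnormal": abnormal}
    rng = random.Random(seed)
    mixed = []
    for name, fraction in proportions.items():
        count = round(fraction * size)
        # Sampling with replacement so that small pools can still fill their share.
        mixed.extend((name, rng.choice(pools[name])) for _ in range(count))
    rng.shuffle(mixed)
    return mixed


# Hypothetical placeholder samples for the three data sources.
validation = [f"val_{i}" for i in range(100)]
adversarial = [f"adv_{i}" for i in range(20)]
abnormal = [f"abn_{i}" for i in range(20)]
mixed = build_test_set(
    validation, adversarial, abnormal,
    {"validation": 0.6, "adversarial": 0.2, "abnormal": 0.2}, size=50,
)
counts = {name: sum(1 for source, _ in mixed if source == name)
          for name in ("validation", "adversarial", "abnormal")}
print(counts)
```

Keeping the source name with each sample makes it possible to report the indexes per data type after a run on the mixed set.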
The second step: the test environment is built. It can be built on a dedicated hardware platform or a cloud computing platform, combined with an open-source machine learning/deep learning framework. After the test platform environment is built, the artificial intelligence model to be tested and the corresponding optimal model description parameters/configuration are imported.
The third step: based on the test platform, tests of the index characteristics of different aspects of the artificial intelligence model are executed by inputting different test data sets, and the test results are compiled to obtain index data for each aspect of the model. For example, indexes for the functionality, performance and efficiency of the artificial intelligence model may be obtained from tests on the validation data set. Indexes for the robustness of the artificial intelligence model may be obtained from tests on a data set containing abnormal samples. Indexes for the security of the artificial intelligence model may be obtained from tests on a data set containing adversarial samples. The overall evaluation of the artificial intelligence model is obtained through comprehensive calculation over the indexes of all aspects.
The invention designs a set of index system aiming at artificial intelligent model evaluation, designs a test framework for artificial intelligent model evaluation, and provides a specific test flow based on the framework.
The method provided by the invention can realize comprehensive evaluation of various aspects of indexes of the artificial intelligence model, and obtain comprehensive and accurate measurement of different characteristics of the artificial intelligence. Based on the specific test flow, the comprehensive test of the artificial intelligent model function, performance, robustness, safety and engineering efficiency can be realized.
Example 2:
in yet another aspect, the present invention further proposes a system 200 for evaluating an artificial intelligence model, as shown in fig. 5, comprising:
an initial unit 201, configured to determine a test evaluation index for evaluating the target artificial intelligence model with respect to performance of the target artificial intelligence model, and establish a test framework for evaluating the target artificial intelligence model based on the test evaluation index;
the policy making unit 202 is configured to generate a test policy for evaluating the target artificial intelligent model according to the test frame and the test evaluation index required to be tested by the target artificial intelligent model;
and the evaluation unit 203 is configured to select a preset test technology, and evaluate the target artificial intelligence model based on the test framework through the test strategy.
Wherein the test evaluation index determined by the initial unit includes at least one of the following: the correctness and effectiveness indexes of the functions of the target artificial intelligent model, the correctness and safety indexes of the codes of the target artificial intelligent model, the adaptability indexes of the target artificial intelligent model to the data set, the resistance and attack resistance indexes of the target artificial intelligent model, and the dependence indexes of the target artificial intelligent model to the software and hardware platform.
Wherein, the test framework established by the initial unit comprises:
the evaluation parameter configuration module is used for adjusting parameters of the test frame according to the test evaluation indexes;
the function test module is used for testing the function of the target artificial intelligent model;
the robustness testing module is used for testing the robustness of the target artificial intelligent model;
the safety testing module is used for testing the safety of the target artificial intelligent model;
the efficiency testing module is used for testing the time performance and the space performance of the target artificial intelligent model;
the comprehensive evaluation module is used for calculating the score of the test evaluation index according to the test results of the functional test module, the robustness test module, the safety test module and the efficiency test module, calculating the weight coefficient of each index based on the score of each test evaluation index and the weight coefficient provided by a user or the user configuration parameter, and determining the comprehensive evaluation score of the target artificial intelligent model;
and the interaction control module is used for importing the target artificial intelligent model and parameters and a data set for evaluating the target artificial intelligent model into the test framework.
And selecting a data set for evaluating the target artificial intelligent model according to the test evaluation index of the target artificial intelligent model to be tested.
The test strategy generated by the strategy making unit comprises the following steps:
determining a test evaluation index of the target artificial intelligence model to be tested, adjusting the test framework parameters according to the determined test evaluation index, importing the target artificial intelligence model and parameters and a data set for evaluating the target artificial intelligence model into the test framework through the interaction control module, selecting at least one of the function test module, the robustness test module, the safety test module and the efficiency test module of the test framework, and evaluating the target artificial intelligence model through the comprehensive evaluation module.
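The module-selection part of this strategy can be sketched as a small registry that runs only the chosen test modules. The class `EvaluationFramework`, its methods and the example modules are hypothetical names introduced here; the patent describes the modules but not their programming interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EvaluationFramework:
    """Hypothetical container for the test modules named in the text."""
    modules: Dict[str, Callable[..., float]] = field(default_factory=dict)

    def register(self, name: str, module: Callable[..., float]) -> None:
        self.modules[name] = module

    def run(self, model, dataset, selected: List[str]) -> Dict[str, float]:
        """Run only the selected modules and return their scores by name."""
        unknown = [n for n in selected if n not in self.modules]
        if unknown:
            raise KeyError(f"unregistered modules: {unknown}")
        return {name: self.modules[name](model, dataset) for name in selected}


def function_module(model, dataset):
    """Score = accuracy on (input, label) pairs, scaled to 0-100."""
    return 100.0 * sum(model(x) == y for x, y in dataset) / len(dataset)


framework = EvaluationFramework()
framework.register("function", function_module)
framework.register("robustness", lambda model, dataset: 0.0)  # placeholder module
results = framework.run(lambda x: int(x > 0), [(1, 1), (-1, 0), (2, 0)], ["function"])
print(results)
```

The per-module scores returned by `run` are the kind of input that a comprehensive evaluation step would then combine into a single score.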
The test technology preset by the evaluation unit comprises at least one of the following: metamorphic testing techniques, fuzz testing techniques, mutation testing techniques, and adversarial testing techniques;
testing the function of the target artificial intelligent model based on the function test module through the metamorphic testing technique;
generating a large number of random data sets for the target artificial intelligent model according to the test evaluation indexes required to be tested by the target artificial intelligent model through the fuzz testing technique, and evaluating the artificial intelligent model by using the random data sets;
verifying the quality of a data set for evaluating the target artificial intelligent model through the mutation testing technique;
testing the security of the target artificial intelligence model based on the safety test module through the adversarial testing technique.
According to the performance of the target artificial intelligent model, the test evaluation index is determined, the test framework is formulated according to the test evaluation index, the test strategy is provided based on the test framework, and the performance of the artificial intelligent model can be comprehensively tested based on the test strategy.
Example 3:
Based on the same inventive concept, the invention also provides a computer device comprising a processor and a memory, the memory being used for storing a computer program comprising program instructions, and the processor being used for executing the program instructions stored in the computer storage medium. The processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The processor is the computational core and control core of the terminal and is adapted to implement one or more instructions, in particular to load and execute one or more instructions in a computer storage medium so as to implement the corresponding method flow or functions and thereby the steps of the method in the embodiments described above.
Example 4:
based on the same inventive concept, the present invention also provides a storage medium, in particular, a computer readable storage medium (Memory), which is a Memory device in a computer device, for storing programs and data. It is understood that the computer readable storage medium herein may include both built-in storage media in a computer device and extended storage media supported by the computer device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. The computer readable storage medium herein may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory. One or more instructions stored in a computer-readable storage medium may be loaded and executed by a processor to implement the steps of the methods in the above-described embodiments.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The scheme in the embodiment of the invention can be implemented in various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (14)
1. A method for evaluating an artificial intelligence model, the method comprising:
aiming at the performance and functional requirements of the target artificial intelligent model, determining a test evaluation index for evaluating the target artificial intelligent model, and establishing a test framework for evaluating the target artificial intelligent model based on the test evaluation index;
generating a test strategy for evaluating the target artificial intelligent model according to the test frame and the test evaluation index of the target artificial intelligent model to be tested;
and selecting a preset test technology, and evaluating the target artificial intelligent model through the test strategy based on a test framework.
2. The method of claim 1, wherein the test evaluation index comprises at least one of: the correctness and effectiveness indexes of the functions of the target artificial intelligent model, the correctness and safety indexes of the codes of the target artificial intelligent model, the adaptability indexes of the target artificial intelligent model to the data set, the resistance and attack resistance indexes of the target artificial intelligent model, and the dependence indexes of the target artificial intelligent model to the software and hardware platform.
3. The method of claim 1, wherein the test framework comprises:
the evaluation parameter configuration module is used for adjusting parameters of the test frame according to the test evaluation indexes;
the function test module is used for testing the function of the target artificial intelligent model;
the robustness testing module is used for testing the robustness of the target artificial intelligent model;
the safety testing module is used for testing the safety of the target artificial intelligent model;
the efficiency testing module is used for testing the time performance and the space performance of the target artificial intelligent model;
the comprehensive evaluation module is used for calculating the score of the test evaluation index according to the test results of the functional test module, the robustness test module, the safety test module and the efficiency test module, calculating the weight coefficient of each index based on the score of each test evaluation index and the weight coefficient provided by a user or the user configuration parameter, and determining the comprehensive evaluation score of the target artificial intelligent model;
and the interaction control module is used for importing the target artificial intelligent model and parameters and a data set for evaluating the target artificial intelligent model into the test framework.
4. A method according to claim 3, wherein the data set for evaluating the target artificial intelligence model is selected based on a test evaluation index for which the target artificial intelligence model is to be tested.
5. The method of claim 1, wherein the test strategy comprises:
determining a test evaluation index of the target artificial intelligent model to be tested, adjusting the test framework parameters according to the determined test evaluation index, importing the target artificial intelligent model and parameters and a data set for evaluating the target artificial intelligent model into the test framework through the interaction control module, selecting at least one of the function test module, the robustness test module, the safety test module and the efficiency test module of the test framework, and evaluating the target artificial intelligent model through the comprehensive evaluation module.
6. The method of claim 1, wherein the predetermined testing technique comprises at least one of: metamorphic testing techniques, fuzz testing techniques, mutation testing techniques, and adversarial testing techniques;
testing the function of the target artificial intelligent model based on the function test module through the metamorphic testing technique;
generating a large number of random data sets for the target artificial intelligent model according to the test evaluation indexes required to be tested by the target artificial intelligent model through the fuzz testing technique, and evaluating the artificial intelligent model by using the random data sets;
verifying the quality of a data set for evaluating the target artificial intelligent model through the mutation testing technique;
testing the security of the target artificial intelligence model based on the safety test module through the adversarial testing technique.
7. A system for evaluating an artificial intelligence model, the system comprising:
the initial unit is used for determining a test evaluation index for evaluating the target artificial intelligent model aiming at the performance of the target artificial intelligent model, and establishing a test framework for evaluating the target artificial intelligent model based on the test evaluation index;
the strategy making unit is used for generating a test strategy for evaluating the target artificial intelligent model according to the test evaluation index of the test framework and the target artificial intelligent model to be tested;
and the evaluation unit is used for selecting a preset test technology and evaluating the target artificial intelligent model through the test strategy based on a test framework.
8. The system of claim 7, wherein the test evaluation index determined by the initial unit comprises at least one of: the correctness and effectiveness indexes of the functions of the target artificial intelligent model, the correctness and safety indexes of the codes of the target artificial intelligent model, the adaptability indexes of the target artificial intelligent model to the data set, the resistance and attack resistance indexes of the target artificial intelligent model, and the dependence indexes of the target artificial intelligent model to the software and hardware platform.
9. The system of claim 7, wherein the test framework established by the initial unit comprises:
the evaluation parameter configuration module is used for adjusting parameters of the test frame according to the test evaluation indexes;
the function test module is used for testing the function of the target artificial intelligent model;
the robustness testing module is used for testing the robustness of the target artificial intelligent model;
the safety testing module is used for testing the safety of the target artificial intelligent model;
the efficiency testing module is used for testing the time performance and the space performance of the target artificial intelligent model;
the comprehensive evaluation module is used for calculating the score of the test evaluation index according to the test results of the functional test module, the robustness test module, the safety test module and the efficiency test module, calculating the weight coefficient of each index based on the score of each test evaluation index and the weight coefficient provided by a user or the user configuration parameter, and determining the comprehensive evaluation score of the target artificial intelligent model;
and the interaction control module is used for importing the target artificial intelligent model and parameters and a data set for evaluating the target artificial intelligent model into the test framework.
10. The system of claim 9, wherein the data set for evaluating the target artificial intelligence model is selected based on a test evaluation index for which the target artificial intelligence model is to be tested.
11. The system of claim 7, wherein the test policy generated by the policy-making unit comprises:
determining a test evaluation index of the target artificial intelligent model to be tested, adjusting the test framework parameters according to the determined test evaluation index, importing the target artificial intelligent model and parameters and a data set for evaluating the target artificial intelligent model into the test framework through the interaction control module, selecting at least one of the function test module, the robustness test module, the safety test module and the efficiency test module of the test framework, and evaluating the target artificial intelligent model through the comprehensive evaluation module.
12. The system of claim 7, wherein the test technique preset by the evaluation unit comprises at least one of: metamorphic testing techniques, fuzz testing techniques, mutation testing techniques, and adversarial testing techniques;
testing the function of the target artificial intelligent model based on the function test module through the metamorphic testing technique;
generating a large number of random data sets for the target artificial intelligent model according to the test evaluation indexes required to be tested by the target artificial intelligent model through the fuzz testing technique, and evaluating the artificial intelligent model by using the random data sets;
verifying the quality of a data set for evaluating the target artificial intelligent model through the mutation testing technique;
testing the security of the target artificial intelligence model based on the safety test module through the adversarial testing technique.
13. A computer device, comprising:
one or more processors;
a processor for executing one or more programs;
the method of any of claims 1-6 is implemented when the one or more programs are executed by the one or more processors.
14. A computer readable storage medium, characterized in that a computer program is stored thereon, which computer program, when executed, implements the method according to any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310322860.6A CN116483697A (en) | 2023-03-29 | 2023-03-29 | Method and system for evaluating artificial intelligence model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116483697A true CN116483697A (en) | 2023-07-25 |
Family
ID=87222308
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310322860.6A Pending CN116483697A (en) | 2023-03-29 | 2023-03-29 | Method and system for evaluating artificial intelligence model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116483697A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118014392A (en) * | 2024-02-01 | 2024-05-10 | 中国铁塔股份有限公司 | Algorithm evaluation method, device, electronic equipment and storage medium |
CN118035110A (en) * | 2024-03-05 | 2024-05-14 | 北京壹玖捌捌电力科技发展有限公司 | Informationized model evaluation method and system based on big data |
CN118035110B (en) * | 2024-03-05 | 2024-08-06 | 北京壹玖捌捌电力科技发展有限公司 | Informationized model evaluation method and system based on big data |
CN118332304A (en) * | 2024-06-13 | 2024-07-12 | 天云融创数据科技(北京)有限公司 | Method and system for evaluating artificial intelligence model |
CN118332304B (en) * | 2024-06-13 | 2024-08-27 | 天云融创数据科技(北京)有限公司 | Method and system for evaluating artificial intelligence model |
CN118626402A (en) * | 2024-08-14 | 2024-09-10 | 中国工业互联网研究院 | AI framework test method, apparatus, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||