
WO2022095523A1 - Machine learning model management method, apparatus and system - Google Patents

Machine learning model management method, apparatus and system

Info

Publication number
WO2022095523A1
Authority
WO
WIPO (PCT)
Prior art keywords
machine learning
learning model
federated
model
server
Prior art date
Application number
PCT/CN2021/110111
Other languages
English (en)
French (fr)
Inventor
Jiang Tao (江涛)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to EP21888222.3A (EP4224369A4)
Priority to JP2023526866A (JP7574438B2)
Publication of WO2022095523A1
Priority to US18/309,583 (US20230267326A1)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/40Software arrangements specially adapted for pattern recognition, e.g. user interfaces or toolboxes therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/098Distributed learning, e.g. federated learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/10Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/16Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks using machine learning or artificial intelligence
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/34Network arrangements or protocols for supporting network services or applications involving the movement of software or configuration parameters 

Definitions

  • the present application relates to the technical field of machine learning, and in particular, to a method, apparatus and system for managing a machine learning model.
  • Federated learning is a distributed machine learning technique.
  • each federated learning client, such as federated learning clients 1, 2, 3, ..., k, uses its local computing resources and local network service data for model training, and the model parameter update information generated during local training (one update per client, for clients 1, 2, 3, ..., k) is sent to the federated learning server (FLS).
  • the federated learning server aggregates the models based on these model parameter updates using an aggregation algorithm, and obtains an aggregated machine learning model, which serves as the initial model for the next round of model training on the federated learning clients.
  • the federated learning clients and the federated learning server repeat the above model training process multiple times, and stop training once the resulting machine learning model converges, that is, meets the preset conditions.
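  • As an illustration of the aggregation step described above, the following is a minimal sketch of a sample-count-weighted average aggregation (the weighted average algorithm named later in this text); the function name, the dictionary layout of the updates, and the example numbers are assumptions for illustration only.

```python
import numpy as np

def aggregate_weighted_average(client_updates, client_sample_counts):
    """Aggregate per-client model parameter updates into one global update.

    client_updates: list of dicts mapping parameter name -> numpy array
                    (the update each federated learning client reports).
    client_sample_counts: list of ints, number of local training samples
                          per client, used as aggregation weights.
    """
    total = float(sum(client_sample_counts))
    weights = [n / total for n in client_sample_counts]
    aggregated = {}
    for name in client_updates[0]:
        aggregated[name] = sum(
            w * update[name] for w, update in zip(weights, client_updates)
        )
    return aggregated

# Example: three clients, each reporting an update for two parameters.
updates = [
    {"w": np.array([0.1, 0.2]), "b": np.array([0.01])},
    {"w": np.array([0.3, 0.1]), "b": np.array([0.02])},
    {"w": np.array([0.2, 0.2]), "b": np.array([0.03])},
]
print(aggregate_weighted_average(updates, [100, 300, 600]))
```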
  • the application identification (service awareness) service is a basic value-added service of the network.
  • the telecom operator network can analyze an application packet, or the statistical data of the application packet, to determine which application category the application packet belongs to (for example, application A or application B), and can subsequently apply different processing to different applications, such as billing, rate limiting, and bandwidth guarantee.
  • Taking the machine learning model corresponding to the application identification service (that is, an application identification machine learning model) as an example, a federated learning client is deployed in the network equipment of a telecom operator network, and the federated learning client performs local training based on the application packets of the network device, or on the statistical data of those application packets, to obtain an intermediate machine learning model.
  • A third party can, based on the intermediate machine learning model or the model parameter update information, reverse-infer the application packets or the statistical data of the application packets of the network device. Since the application packets and their statistical data are sensitive data, this brings security risks to the telecom operator's network.
  • the present application provides a machine learning model management method, device and system, which saves computing resources as a whole and helps to improve the adaptability of the federated learning model.
  • a first aspect provides a machine learning model management method, which is applied to a federated learning server, and the federated learning server belongs to a first management domain and is connected to a machine learning model management center.
  • the method includes: first, obtaining a first machine learning model from the machine learning model management center; second, performing federated learning with multiple federated learning clients in the first management domain, based on the first machine learning model and the local network service data of the first management domain, to obtain a second machine learning model; and then sending the second machine learning model to the machine learning model management center, so that the second machine learning model can be used by devices in a second management domain. A sketch of this sequence follows.
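  • The three steps of this first aspect could be orchestrated on the federated learning server roughly as in the sketch below. The `ModelManagementCenter` class and `run_federated_learning` function are hypothetical stand-ins for the interactions described in the text, not interfaces defined by this application.

```python
class ModelManagementCenter:
    """Stand-in for the machine learning model management center (hypothetical API)."""

    def __init__(self):
        # One stored model per model business; parameters kept as a simple dict.
        self._models = {"application-identification": {"w0": 0.0, "w1": 0.0}}

    def fetch_model(self, requirement_info):
        # Step 1: hand out the first machine learning model on demand.
        return dict(self._models[requirement_info["model_business"]])

    def upload_model(self, business, model):
        # Step 3: replace the stored model with the second machine learning model.
        self._models[business] = model


def run_federated_learning(initial_model, client_updates):
    # Step 2 (greatly simplified): fold the clients' updates into the initial model.
    model = dict(initial_model)
    for update in client_updates:
        for name, delta in update.items():
            model[name] += delta / len(client_updates)
    return model


center = ModelManagementCenter()
requirement = {"model_business": "application-identification"}
first_model = center.fetch_model(requirement)                                 # step 1
second_model = run_federated_learning(first_model, [{"w0": 0.2, "w1": 0.4},
                                                    {"w0": 0.4, "w1": 0.2}])  # step 2
center.upload_model("application-identification", second_model)              # step 3
print(second_model)  # the aggregated second machine learning model
```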
  • the machine learning model obtained in one management domain can be used by devices in other management domains. In this way, there is no need to repeatedly train machine learning models between different management domains, which saves computing resources from the perspective of the whole society.
  • In addition, the machine learning model maintained by the machine learning model management center can draw on the strengths of the network service data of multiple management domains (that is, the machine learning model is indirectly trained based on the network service data of multiple management domains). Compared with a machine learning model obtained based only on the network service data of a single management domain, its adaptability can be greatly improved, so that even when the subsequently input network service data is newer and more complex, the corresponding model business can still achieve good results.
  • In this technical solution, each management domain trains the machine learning model independently, and the federated learning server obtains the initial machine learning model from the machine learning model management center. Therefore, even if the federated learning server in one management domain fails, once the failure is recovered the federated learning server can still obtain the latest shared machine learning model from the machine learning model management center (that is, the updated machine learning model that the management center obtained from other federated learning servers during the failure) and use it as the initial machine learning model, which helps to reduce the number of federated learning rounds and accelerate the convergence of the machine learning model. Compared with the traditional technique, this technical solution converges faster and recovers more strongly after the failure of the federated learning server is resolved; in other words, it is more robust.
  • obtaining the first machine learning model from the machine learning model management center includes: sending machine learning model requirement information to the machine learning model management center, and receiving the first machine learning model determined by the machine learning model management center according to the machine learning model requirement information.
  • the federated learning server "takes the first machine learning model as needed", which helps to save the storage space of the device where the federated learning server is located.
  • the machine learning model requirement information includes model business information corresponding to the machine learning model and/or machine learning model training requirements.
  • the training requirements include at least one of the following: training environment, algorithm type, network structure, training framework, aggregation algorithm, or security mode. An illustrative requirement message is sketched below.
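  • A machine learning model requirement information message could, for example, be represented as follows; the field names and values are illustrative assumptions rather than a format defined by this application.

```python
# Hypothetical machine learning model requirement information sent by the
# federated learning server to the machine learning model management center.
model_requirement_info = {
    "model_business": {                 # model business information
        "type": "application-identification",
        "id": "app-ident-001",
    },
    "training_requirements": {          # training requirements (all optional)
        "training_environment": "uCPE",
        "algorithm_type": "CNN",
        "network_structure": {"input_dim": 64, "hidden_dims": [128, 64], "output_dim": 10},
        "training_framework": "Tensorflow",
        "aggregation_algorithm": "weighted-average",
        "security_mode": "SHA-256",
    },
}
```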
  • the method further includes: sending the access permission information of the second machine learning model to the machine learning model management center.
  • the federated learning server can independently determine the access rights of the machine learning model trained by itself, that is, which federated learning servers can use the machine learning model. Subsequently, the machine learning model management center may provide the second machine learning model to the federated learning server with access rights.
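  • The access permission information reported for the second machine learning model might look like the following; the representation is again only an assumption.

```python
# Hypothetical access permission (sharing) information for the second
# machine learning model, reported by the federated learning server.
access_permission_info = {
    "model_id": "app-ident-001-v2",
    "shareable": True,
    "allowed_federated_learning_servers": ["fls-domain-2", "fls-domain-3"],
    "denied_federated_learning_servers": ["fls-domain-4"],
}
```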
  • the method further includes: sending the second machine learning model to the plurality of federated learning clients. Subsequently, the multiple federated learning clients may execute model services corresponding to the second machine learning model based on the second machine learning model.
  • This possible design provides an example of the application of the second machine learning model.
  • the multiple federated learning clients may perform application identification based on the second machine learning model.
  • sending the second machine learning model to the machine learning model management center includes: if the application effect of the second machine learning model satisfies a preset condition, sending the second machine learning model to the machine learning model management center .
  • the application effect of the second machine learning model meets the preset condition, which can be understood as: the application effect of the second machine learning model reaches the preset target.
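  • The check that the application effect satisfies a preset condition could be performed before uploading, for example by comparing an evaluation metric against a preset target, as in the sketch below; the metric (accuracy) and the threshold are assumptions.

```python
def maybe_upload(upload_fn, model, test_samples, test_labels, accuracy_target=0.95):
    """Send the second machine learning model only if its application effect
    (here: accuracy on local test samples) reaches the preset target."""
    correct = sum(1 for x, y in zip(test_samples, test_labels) if model(x) == y)
    accuracy = correct / len(test_samples)
    if accuracy >= accuracy_target:
        upload_fn(model)
    return accuracy

# Toy usage: a "model" that classifies a value as application A (>= 0.5) or B (< 0.5).
toy_model = lambda x: "A" if x >= 0.5 else "B"
samples, labels = [0.9, 0.2, 0.7, 0.1], ["A", "B", "A", "B"]
acc = maybe_upload(lambda m: print("uploading second model"), toy_model, samples, labels)
print("accuracy:", acc)
```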
  • performing federated learning with multiple federated learning clients in the first management domain to obtain the second machine learning model includes: sending the first machine learning model to the multiple federated learning clients in the first management domain, so that each of the multiple federated learning clients performs federated learning based on the first machine learning model and the network service data it obtains, producing its own intermediate machine learning model; and then obtaining the multiple intermediate machine learning models produced by the multiple federated learning clients and aggregating them to obtain the second machine learning model. A client-side sketch follows.
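  • The local training step performed by each federated learning client could look roughly like the following sketch: plain gradient descent on a toy linear model over local network service data, starting from the received first model. Everything here (model form, data, learning rate) is illustrative, not part of the application.

```python
import numpy as np

def local_training(initial_weights, local_features, local_labels, lr=0.1, epochs=5):
    """Train an intermediate machine learning model from the first machine
    learning model using only this client's local network service data
    (a toy least-squares linear model trained by gradient descent)."""
    w = np.array(initial_weights, dtype=float)
    X = np.asarray(local_features, dtype=float)
    y = np.asarray(local_labels, dtype=float)
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w  # the intermediate model reported back to the federated learning server

# Two clients start from the same first model but train on their own local data.
first_model = [0.0, 0.0]
client1 = local_training(first_model, [[1, 0], [0, 1]], [1.0, 2.0])
client2 = local_training(first_model, [[1, 1], [2, 0]], [3.0, 2.0])
second_model = (client1 + client2) / 2  # server-side aggregation (simple average)
print(second_model)
```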
  • a machine learning model management method is provided, which is applied to a machine learning model management center, where the machine learning model management center is connected to a first federated learning server, and the first federated learning server belongs to a first management domain.
  • the method includes: first, sending a first machine learning model to the first federated learning server; second, receiving a second machine learning model from the first federated learning server, where the second machine learning model is obtained by the first federated learning server performing federated learning with multiple federated learning clients in the first management domain, based on the first machine learning model and the local network service data of the first management domain.
  • In other words, the second machine learning model is obtained by the first federated learning server using the first machine learning model as the initial machine learning model and performing federated learning with multiple federated learning clients in the first management domain based on the local network service data of the first management domain.
  • the first machine learning model is replaced with the second machine learning model, so that the second machine learning model can be used by devices in the second management domain.
  • before sending the first machine learning model to the first federated learning server, the method further includes: receiving machine learning model requirement information sent by the first federated learning server, and determining the first machine learning model according to the machine learning model requirement information.
  • the machine learning model requirement information includes model business information corresponding to the machine learning model and/or machine learning model training requirements.
  • the training requirements include at least one of the following: training environment, algorithm type, network structure, training framework, aggregation algorithm, or security mode.
  • the second machine learning model is a machine learning model based on the first training framework.
  • the method further includes: converting the second machine learning model into a third machine learning model, where the third machine learning model is a machine learning model based on a second training framework, and the third machine learning model and the second machine learning model correspond to the same model business information.
  • the machine learning model management center also stores a fifth machine learning model, where the fifth machine learning model is a machine learning model based on the second training framework, and the fifth machine learning model and the first machine learning model correspond to the same model business information.
  • the method may further include: replacing the fifth machine learning model with the third machine learning model. This helps to ensure that, for the same model business information, the machine learning models under other training frameworks are also kept up to date. A conversion sketch follows.
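  • One way a management center could convert a model trained under one training framework into an equivalent model for another framework is through a registry of converters (in practice an intermediate exchange format such as ONNX is often used for this). The converter functions below are hypothetical placeholders, not real library calls.

```python
# Hypothetical converter registry used by the machine learning model management
# center: (source framework, target framework) -> conversion function.
CONVERTERS = {}

def register_converter(src, dst):
    def wrap(fn):
        CONVERTERS[(src, dst)] = fn
        return fn
    return wrap

@register_converter("Tensorflow", "Pytorch")
def tf_to_pytorch(model_file):
    # Placeholder: a real implementation might export to an intermediate
    # format (e.g. ONNX) and re-import it under the second training framework.
    return {"framework": "Pytorch", "converted_from": model_file}

def convert_model(model_file, src_framework, dst_framework):
    """Convert the second machine learning model (first training framework) into
    a third machine learning model (second training framework) corresponding to
    the same model business information."""
    if src_framework == dst_framework:
        return model_file
    return CONVERTERS[(src_framework, dst_framework)](model_file)

print(convert_model("app_ident_v2.pb", "Tensorflow", "Pytorch"))
```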
  • the method further includes: receiving access permission information of the second machine learning model sent by the first federated learning server.
  • the method further includes: sending the second machine learning model to the second federated learning server; wherein the second federated learning server belongs to the second management domain.
  • the method provided in the second aspect corresponds to the method provided in the first aspect; therefore, for the beneficial effects that can be achieved, reference may be made to the beneficial effects of the corresponding method, which are not repeated here.
  • a federated learning system including: a federated learning server and multiple federated learning clients.
  • the federated learning server and the plurality of federated learning clients belong to the first management domain, and the federated learning server is connected to the machine learning model management center.
  • the federated learning server is used to obtain the first machine learning model from the machine learning model management center, and send the first machine learning model to the plurality of federated learning clients.
  • Each federated learning client in the plurality of federated learning clients is configured to perform federated learning based on the first machine learning model and the network service data obtained respectively, to obtain a respective intermediate machine learning model.
  • the federated learning server is further configured to obtain the multiple intermediate machine learning models produced by the multiple federated learning clients, aggregate the multiple intermediate machine learning models to obtain a second machine learning model, and send the second machine learning model to the machine learning model management center, such that the second machine learning model can be used by devices in the second management domain.
  • the federated learning server is also used to send the second machine learning model to multiple federated learning clients.
  • the multiple federated learning clients are further configured to execute model services corresponding to the second machine learning model based on the second machine learning model.
  • a network system including: a machine learning model management center, a federated learning server and multiple federated learning clients.
  • the federated learning server and the plurality of federated learning clients belong to the first management domain, and the federated learning server is connected to the machine learning model management center.
  • the machine learning model management center is used to send the first machine learning model to the federated learning server.
  • the federated learning server is configured to send the first machine learning model to the multiple federated learning clients.
  • Each federated learning client in the plurality of federated learning clients is configured to perform federated learning based on the first machine learning model and the network service data obtained respectively, to obtain a respective intermediate machine learning model.
  • the federated learning server is further configured to obtain the multiple intermediate machine learning models produced by the multiple federated learning clients, aggregate the multiple intermediate machine learning models to obtain a second machine learning model, and send the second machine learning model to the machine learning model management center, such that the second machine learning model can be used by devices in the second management domain.
  • the machine learning model management center is further configured to replace the first machine learning model with the second machine learning model.
  • the federated learning server is also used to send machine learning model requirement information to the machine learning model management center.
  • the machine learning model management center is also used for sending the first machine learning model to the federated learning server according to the machine learning model requirement information.
  • the machine learning model requirement information includes model business information corresponding to the machine learning model and/or machine learning model training requirements.
  • the training requirements include at least one of the following: training environment, algorithm type, network structure, training framework, aggregation algorithm, or security mode.
  • the second machine learning model is a machine learning model based on the first training framework.
  • the machine learning model management center is further configured to convert the second machine learning model into a third machine learning model, where the third machine learning model is a machine learning model based on a second training framework, and the third machine learning model and the second machine learning model correspond to the same model business information.
  • the machine learning model management center also stores a fifth machine learning model, where the fifth machine learning model is a machine learning model based on the second training framework, and the fifth machine learning model and the first machine learning model correspond to the same model business information.
  • the machine learning model management center is also used to replace the fifth machine learning model with the third machine learning model.
  • the federated learning server is also used to send the second machine learning model to multiple federated learning clients.
  • the multiple federated learning clients are further configured to execute model services corresponding to the second machine learning model based on the second machine learning model.
  • the present application provides an apparatus for managing a machine learning model.
  • the machine learning model management apparatus is configured to execute any one of the methods provided in the first aspect.
  • the machine learning model management apparatus may specifically be a federated learning server.
  • the present application may divide the functional modules of the machine learning model management apparatus according to any of the methods provided in the first aspect.
  • each function module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
  • the present application may divide the machine learning model management apparatus into a transceiver unit, a processing unit, and the like according to functions.
  • the machine learning model management device includes: a memory and a processor, and the memory and the processor are coupled.
  • the memory is used for storing computer instructions
  • the processor is used for invoking the computer instructions to execute any one of the methods provided by the first aspect and any possible design manners thereof.
  • the present application provides an apparatus for managing a machine learning model.
  • the machine learning model management apparatus is configured to execute any one of the methods provided in the second aspect above.
  • the machine learning model management apparatus may specifically be a machine learning model management center.
  • the present application may divide the functional modules of the machine learning model management apparatus according to any of the methods provided in the second aspect.
  • each function module may be divided corresponding to each function, or two or more functions may be integrated into one processing module.
  • the present application may divide the machine learning model management apparatus into a receiving unit, a sending unit, a processing unit, and the like according to functions.
  • the machine learning model management device includes: a memory and a processor, and the memory and the processor are coupled.
  • the memory is used for storing computer instructions
  • the processor is used for invoking the computer instructions to execute any one of the methods provided by the second aspect and any possible design manners thereof.
  • the present application provides a computer-readable storage medium, such as a non-transitory computer-readable storage medium.
  • a computer program (or instructions) is stored thereon, and when the computer program (or instructions) is run on a computer device, the computer device is caused to perform any one of the methods provided by any possible implementation of the first aspect or the second aspect.
  • the present application provides a computer program product that, when executed on a computer device, enables any one of the methods provided by any one of the possible implementations of the first aspect or the second aspect to be executed.
  • the present application provides a chip system, including a processor, where the processor is configured to call, from a memory, a computer program stored in the memory and run it, so as to execute any one of the methods provided by any implementation of the first aspect or the second aspect.
  • the sending action in the first aspect or the second aspect may specifically be replaced by sending under the control of the processor, and the receiving action in the first aspect or the second aspect may specifically be replaced by receiving under the control of the processor.
  • any of the systems, apparatuses, computer storage media, computer program products or chip systems provided above can be applied to the corresponding methods provided in the first aspect or the second aspect; therefore, for the beneficial effects that can be achieved, reference may be made to the beneficial effects of the corresponding method, which are not repeated here.
  • FIG. 1 is a schematic structural diagram of a federated learning system applicable to an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a network system provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a machine learning model management system provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a system for applying a machine learning model management system to a network system according to an embodiment of the present application
  • FIG. 5 is a schematic diagram of a logical structure of a public cloud according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a logical structure of a management and control system provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a logical structure of a network device according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of another system for applying a machine learning model management system to a network system provided by an embodiment of the present application;
  • FIG. 9 is a schematic diagram of a hardware structure of a computer device provided by an embodiment of the present application.
  • FIG. 10 is an interactive schematic diagram of a machine learning model management method provided by an embodiment of the present application.
  • FIG. 11 is a schematic flowchart of a federated learning process provided by an embodiment of the present application.
  • FIG. 12 is an interactive schematic diagram of another machine learning model management method provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a machine learning model management center provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a federated learning server provided by an embodiment of the present application.
  • Network services refer to the communication services that can be provided based on the network or network equipment. For example, broadband services, network slicing services, virtual network services, etc.
  • Network service data refers to the data generated during the operation of a network service, or data related to that generated data, for example, the application packets themselves, statistical data of the application packets (such as the packet loss rate), fault alarm information, and so on.
  • Model business refers to the business that can be provided based on machine learning models and network business data, such as application identification business, fault tracking and prediction business, key performance indicator (KPI) abnormal detection business, etc.
  • For example, if the model service corresponding to the machine learning model is the application identification service, the corresponding network service data includes application packets and statistical data of the application packets; if the model service corresponding to the machine learning model is the fault tracking and prediction service, the corresponding network service data includes fault alarm information.
  • Model business information refers to the relevant information of the model business, including the identification or type of the model business.
  • Machine learning is the use of algorithms to parse data, learn from it, and then make decisions and predictions about real-world events. Machine learning uses a large amount of data for "training" and learns from the data, through various algorithms, how to complete a certain model business.
  • a machine learning model is a file containing algorithm implementation code and parameters used to complete a model business.
  • the algorithm implementation code is used to describe the model structure of the machine learning model
  • the parameters are used to describe the attributes of each component of the machine learning model.
  • the file is hereinafter referred to as a machine learning model file.
  • sending a machine learning model hereinafter specifically refers to sending a machine learning model file.
  • the machine learning model is a logical functional module that completes a certain model business. For example, the values of input parameters are fed into a machine learning model, and the values of the output parameters of the machine learning model are obtained.
  • Machine learning models include artificial intelligence (AI) models such as neural network models.
  • the machine learning model trained based on the federated learning system may also be referred to as a federated learning model.
  • the machine learning model package contains the machine learning model itself (ie, the machine learning model file) and the description file of the machine learning model.
  • the description file of the machine learning model may include: description information of the machine learning model and a running script of the machine learning model, and the like.
  • the description information of the machine learning model refers to information used to describe the machine learning model.
  • the description information of the machine learning model may include at least one of model business information corresponding to the machine learning model and training requirements of the machine learning model.
  • the model service information may include a model service type or a model service identifier.
  • For example, the model business type corresponding to the machine learning model may be the application identification type or the fault prediction type.
  • machine learning model 1 corresponds to application identification service 1 for identifying applications in application set A
  • machine learning model 2 corresponds to application identification service 2 for identifying applications in application set B.
  • the application set A and the application set B are different.
  • the training requirements may include at least one of the following: training environment, algorithm type, network structure, training framework, aggregation algorithm or security mode, etc.
  • The training environment is the type of device on which the machine learning model is trained.
  • the training environment may include: an external PCEF support node (external PCEF support node, EPSN), a universal customer premises equipment (universal customer premise equipment, uCPE), or IP multimedia Subsystem (IP multimedia subsystem, IMS), etc.
  • PCEF is the English abbreviation of policy and charging enforcement function (policy and charging enforcement function).
  • IP is the English abbreviation of internet protocol.
  • The algorithm type is the type of algorithm used to train the machine learning model, for example, neural network, linear regression, etc.
  • the types of neural networks may include: convolutional neural networks (CNN), long short-term memory (LSTM), or recurrent neural networks (RNN), etc.
  • the network structure is the network structure corresponding to the machine learning model.
  • the network structure may include: the features of the input layer (such as dimensions, etc.), the features of the output layer (such as dimensions, etc.), and the features of the hidden layers (such as dimension, etc.) etc.
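  • For example, a requested network structure (input layer, hidden layer and output layer dimensions) could be realised under the Tensorflow training framework as in the following sketch; the concrete dimensions are arbitrary and Tensorflow is assumed to be installed.

```python
import tensorflow as tf

def build_model(input_dim=64, hidden_dims=(128, 64), output_dim=10):
    """Build a network matching a requested network structure: the features of
    the input layer, the hidden layers, and the output layer."""
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(input_dim,)))
    for width in hidden_dims:
        model.add(tf.keras.layers.Dense(width, activation="relu"))
    model.add(tf.keras.layers.Dense(output_dim, activation="softmax"))
    return model

model = build_model()
model.summary()
```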
  • A training framework, also known as a machine learning platform, is the framework used to train a machine learning model; specifically, it is a system or methodology that integrates everything involved in machine learning, including the machine learning algorithms, the methods for representing and processing data, the methods for building machine learning models, and the methods for evaluating and using the modeling results.
  • the training framework may include: convolutional architecture for fast feature embedding (Caffe) training framework, Tensorflow training framework or Pytorch training framework, etc.
  • The aggregation algorithm is an algorithm used in training the machine learning model; specifically, during model training by the federated learning system, it is the algorithm the federated learning server uses to aggregate the multiple intermediate machine learning models.
  • the aggregation algorithm may include a weighted average algorithm or a federated stochastic variance reduced gradient (federated stochastic variance reduced gradient, FSVRG) algorithm, and the like.
  • Security mode is the security means (such as encryption algorithm, etc.) used in the process of machine learning model transmission.
  • the security mode requirement may include whether to use the security mode.
  • the security mode may include: secure multi-party computation (MPC), or secure hash algorithm (secure hash algorithm, SHA) 256 and the like.
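  • As a concrete illustration of the SHA-256 option, the sender and receiver of a machine learning model file could verify its integrity as sketched below (Python standard library only; the usage pattern is an assumption, not a procedure defined by this application).

```python
import hashlib

def sha256_of_file(path, chunk_size=8192):
    """Compute the SHA-256 digest of a machine learning model file."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_model_file(path, expected_digest):
    """Check that a received model file was not tampered with in transit."""
    return sha256_of_file(path) == expected_digest
```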
  • the description file of the machine learning model may further include: the access rights of the machine learning model, and/or the billing policy of the machine learning model, etc. Here, "access rights" may also be referred to as "sharing rights".
  • the access rights of the machine learning model may include whether the machine learning model can be shared, that is, whether it can be used by other federated learning servers. Further optionally, the access rights of the machine learning model may also include: if the machine learning model can be shared, which federated learning server(s) the machine learning model can be used by, and/or which federated learning server(s) it cannot be used by.
  • the billing policy of the machine learning model refers to the payment policy that needs to be followed for using the machine learning model.
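  • Putting the above together, the description file inside a machine learning model package could carry content along the following lines; the structure, field names and values are illustrative only.

```python
# Hypothetical content of a machine learning model description file.
model_description = {
    "model_business": {"type": "application-identification", "id": "app-ident-001"},
    "training_requirements": {
        "training_environment": "EPSN",
        "algorithm_type": "LSTM",
        "training_framework": "Pytorch",
        "aggregation_algorithm": "FSVRG",
        "security_mode": "MPC",
    },
    "access_rights": {"shareable": True, "allowed_servers": ["fls-domain-2"]},
    "billing_policy": {"mode": "per-download", "price": 0},
    "run_script": "run_model.sh",
}
```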
  • samples include training samples and test samples.
  • the training samples are the samples used to train the machine learning model.
  • a test sample is a sample used to test the measurement error (or accuracy) of a machine learning model.
  • words such as “exemplary” or “for example” are used to represent examples, illustrations or illustrations. Any embodiments or designs described in the embodiments of the present application as “exemplary” or “such as” should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as “exemplary” or “such as” is intended to present the related concepts in a specific manner.
  • first and second are only used for description purposes, and cannot be understood as indicating or implying relative importance or implying the number of indicated technical features.
  • a feature defined as “first” or “second” may expressly or implicitly include one or more of that feature.
  • plural means two or more.
  • the meaning of the term “at least one” refers to one or more, and the meaning of the term “plurality” in this application refers to two or more.
  • a plurality of second messages refers to two or more second messages.
  • the sequence numbers of the processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
  • determining B according to A does not mean that B is only determined according to A, and B may also be determined according to A and/or other information.
  • the term “if” may be interpreted to mean “when” or “upon” or “in response to determining” or “in response to detecting.”
  • the phrases “if it is determined" or “if a [statement or event] is detected” can be interpreted to mean “when determining" or “in response to determining... ” or “on detection of [recited condition or event]” or “in response to detection of [recited condition or event]”.
  • references throughout the specification to "one embodiment," "an embodiment," and "one possible implementation" mean that a particular feature, structure, or characteristic related to the embodiment or implementation is included in at least one embodiment of the present application.
  • appearances of "in one embodiment" or "in an embodiment" or "one possible implementation" in various places throughout this specification are not necessarily referring to the same embodiment.
  • the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments.
  • a connection mentioned in the embodiments of the present application may be a direct connection, an indirect connection, a wired connection, or a wireless connection; that is, the embodiments of the present application do not limit the connection method.
  • FIG. 2 it is a schematic structural diagram of a network system 30 according to an embodiment of the present application.
  • the network system 30 shown in FIG. 2 may include a cloud platform (providing computing resource services, which can be used to deploy network applications, not shown in the figure) and at least two management domains connected to the cloud platform, such as management domain 1 and management domain 2.
  • the cloud platform may be the public cloud 301, other types of cloud platforms, or other network platforms that can deploy network applications; for the convenience of description, the public cloud 301 is used as an example for description below.
  • the management domain may be a telecom operator network, a virtual operator network, an enterprise network (such as a network system in industries such as banks, governments, and large enterprises), a campus network, and the like.
  • management domains are isolated in terms of security, and do not share "network service data" and "intermediate machine learning models” with each other.
  • intermediate machine learning model reference may be made to the relevant description of S102 below.
  • Different management domains may be different telecom operator networks, different virtual operator networks, different enterprise networks, and the like.
  • a management domain includes one or more management and control systems, such as management and control system 302-1, management and control system 302-2, and management and control system 302-3 in FIG. 2, and one or more network devices connected to each management and control system. For example, in FIG. 2 the management and control system 302-1 is connected to the network devices 303-11 and 303-12, the management and control system 302-2 is connected to the network devices 303-21 and 303-22, and the management and control system 302-3 is connected to the network devices 303-31 and 303-32.
  • the management and control system and the network device can be directly or indirectly connected.
  • the public cloud 301 can be constructed and operated by network equipment manufacturers, or by other third-party manufacturers, and communicates with various management domains in the form of cloud services.
  • the management and control system is responsible for managing and maintaining the entire life cycle of a single management domain.
  • the management and control system may be a core network management and control system, an access network management and control system, a transmission network management and control system, and the like.
  • the management and control system may be a network management system (NMS), a network element management system (EMS), or an operation support system (operation support system, OSS).
  • multiple management and control systems may exist in the same management domain to manage different sub-management domains. The part of a management domain located in the same area (e.g., the same district of a city) may be regarded as a sub-management domain; for example, the part of a telecom operator network in the same district of a city is regarded as a sub telecom operator network. Different sub-management domains are geographically isolated.
  • the management domain 2 in FIG. 2 includes a sub-management domain 1 and a sub-management domain 2, etc., each sub-management domain includes a management and control system, and each management and control system is connected to a plurality of network devices.
  • A network device is responsible for reporting the network service data of the management domain or sub-management domain where it is located to the management and control system, such as alarm data, performance indicators, operation logs, and traffic statistics of the network device, and for executing the management and control instructions issued by the management and control system.
  • the network device may be a router, a switch, an optical line terminal (OLT), a base station, a core network device, and the like.
  • the network device may have computing resources and algorithm environment for machine learning model training, as well as data storage and processing capabilities (such as the ability to execute machine learning model training, etc.).
  • FIG. 1 is a schematic structural diagram of a federated learning system applicable to an embodiment of the present application.
  • the federated learning system includes a federated learning server, and multiple federated learning clients directly or indirectly connected to the federated learning server.
  • the federated learning server also undertakes the management of the federated learning system, such as the determination of federated learning clients participating in training, training instance generation, communication security, privacy protection, and training system reliability assurance.
  • the above-mentioned federated learning server and federated learning client may specifically be logical function modules.
  • FIG. 3 it is a schematic structural diagram of a machine learning model management system 40 according to an embodiment of the present application.
  • the machine learning model management system 40 shown in FIG. 3 includes a machine learning model management center 401 and multiple federated learning systems connected to the machine learning model management center 401, such as a federated learning system 402-1 and a federated learning system 402-2, where each federated learning system may include a federated learning server 403 and federated learning clients 1 to k.
  • the machine learning model management center 401 is used for managing and providing machine learning models to the federated learning system.
  • the machine learning model managed by the machine learning model management center 401 can be used by multiple federated learning systems.
  • Managing machine learning models may include: generating a machine learning model package according to the machine learning model specification of the federated learning system, generating a signature file for the machine learning model package, and so on, and storing the machine learning model package in the machine learning model marketplace so that it can be downloaded and used by other federated learning servers.
  • Each federated learning system is used to perform federated learning based on the machine learning model issued by the machine learning model management center 401 (i.e., the first machine learning model), and to report the result of federated learning (i.e., the second machine learning model) to the machine learning model management center 401.
  • the machine learning model management center 401 and the federated learning server in the federated learning system may communicate through a representational state transfer (REST) protocol.
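  • The REST interaction between a federated learning server and the machine learning model management center could look like the following sketch; the endpoint paths and payloads are assumptions, not interfaces defined by this application, and the `requests` library is used purely for illustration.

```python
import requests  # assumes the requests library is available

CENTER_URL = "https://model-center.example.com"  # hypothetical address

def download_first_model(requirement_info):
    """Federated learning server asks the management center for a first model."""
    resp = requests.post(f"{CENTER_URL}/models/query", json=requirement_info, timeout=30)
    resp.raise_for_status()
    return resp.content  # the machine learning model file (bytes)

def upload_second_model(model_bytes, description):
    """Federated learning server reports the second model after federated learning."""
    resp = requests.post(
        f"{CENTER_URL}/models/upload",
        files={"model": ("model.bin", model_bytes)},
        data={"description": str(description)},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```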
  • the above-mentioned machine learning model management center 401 may specifically be a logic function module.
  • FIG. 4 it is a schematic structural diagram of a system for applying a machine learning model management system 40 to a network system 30 according to an embodiment of the present application.
  • the machine learning model management center 401 is deployed on the public cloud 301 to provide machine learning model management services to different management domains.
  • the public cloud 301 may further include:
  • Machine learning model training platform 301A used to provide computing resources, machine learning algorithm frameworks, training algorithms, and machine learning model debugging tools required for machine learning model training. In addition, it provides functions such as data governance, feature engineering, machine learning algorithm selection, machine learning model parameter optimization, and machine learning model evaluation and testing required for machine learning model training. For example, the management domain may complete the training of a machine learning model (eg, a converged machine learning model) corresponding to a certain model business on the machine learning model training platform 301A.
  • Secure communication module 301B used to provide secure communication capability between the public cloud 301 and the management and control system.
  • the secure communication module 301B is used to encrypt the information transmitted between the public cloud 301 and the management and control system.
  • the machine learning model management center 401 can reuse the resources of the public cloud 301, such as computing resources and/or communication resources.
  • the machine learning model management center 401 can complete the training of a machine learning model corresponding to a model business on the machine learning model training platform 301A, and then provide the machine learning model to other management domains.
  • the secure communication module 301B is used to provide the capability of secure communication between the machine learning model management center 401 in the public cloud 301 and the machine learning server in the management and control system.
  • the secure communication module 301B is used to encrypt the machine learning model (or machine learning model package) transmitted between the machine learning model management center 401 and the machine learning server.
  • the federated learning server is deployed on the management and control system.
  • the federated learning server 403-1 is deployed on the management and control system 302-2
  • the federated learning server 403-2 is deployed on the management and control system 302-3.
  • the management and control system (such as the management and control system 302-2) may further include:
  • Management and control basic platform 302A used to provide computing resources, communication resources, and external management and control interfaces to the federated learning client. Also, it is used to provide other software system capabilities.
  • Management and control northbound interface 302B used for communication between the management and control system and the public cloud 301 .
  • Management and control southbound interface 302C used for communication between the management and control system and network devices.
  • the management and control southbound interface 302C may include: a Google Remote Procedure Call Protocol (gRPC) interface, a representational state transfer (REST) interface, and the like.
  • Secure communication module 302D used to provide secure communication capability between the management and control system and the public cloud 301, and secure communication capability between the management and control system and network devices.
  • the secure communication module 302D is used to encrypt the information transmitted between the management and control system and the public cloud 301, etc., and to encrypt the information transmitted between the management and control system and the network device, and the like.
  • the secure communication module 302D may include a first sub-module and a second sub-module.
  • the first sub-module is used to provide the secure communication capability between the management and control system and the public cloud 301; the function of the first sub-module corresponds to the function of the above-mentioned secure communication module 301B.
  • the second sub-module is used to provide the security communication capability between the management and control system and the network device, and the function of the second sub-module corresponds to the function of the security communication module 303C described below.
  • the federated learning server can reuse the resources of the management and control system, such as computing resources and/or communication resources.
  • the basic management and control platform 302A is used to provide the federated learning server with computing resources and communication resources required for operation, and at the same time provide the federated learning server with an external management and control interface. Users can manage and configure the federated learning system on the control interface.
  • the management and control basic platform 302A is also used to provide the federated learning server with other required software system capabilities, such as user authentication, security certificates, rights management, and the like. For another example, the federated learning server can communicate with the machine learning model management center 401 in the public cloud 301 through the management and control northbound interface 302B.
  • the federated learning server can communicate with the machine learning client in the network device through the management and control southbound interface 302C.
  • the federated learning client is deployed on the network device.
  • the federated learning client 404-1 is deployed on the network device 303-21
  • the federated learning client 404-2 is deployed on the network device 303-22
  • the federated learning client 404-3 is deployed on the network device 303-31
  • the federated learning client 404-4 is deployed on the network device 303-32.
  • network devices (e.g., network device 303-21) may also include:
  • Local training module 303A: provides the computing capability for local training, local data processing capability, and a training algorithm framework such as Tensorflow or Caffe.
  • Network service module 303B used to execute the network service processing flow of the network device.
  • the control information in the network service processing flow may come from the inference result of the machine learning model (ie, the output information of the machine learning model), such as performing functions such as packet forwarding according to the inference result.
  • the network service module 303B also needs to send the network service data generated during the operation of the network service, such as performance indicators and alarm data, to the local training module 303A, so that the local training module 303A can update and optimize the model.
  • Secure communication module 303C used to provide secure communication capability between the management and control system and network devices.
  • the secure communication module 303C is used to encrypt the information transmitted between the management and control system and the network device.
  • the federated learning client can reuse the resources of the network device, such as computing resources and/or communication resources.
  • the federated learning client undertakes the communication interface and the management interface between the local model training module 303A and the federated learning server. Specifically: building secure communication with the federated learning server, downloading aggregated machine learning models, uploading parameter update information between the intermediate machine learning model and the initial machine learning model, applying training strategies to coordinate control of local algorithms, etc.
  • the federated learning client undertakes local management functions, including system access, security authentication, and startup loading of the federated learning local node. At the same time, it also undertakes the functions of local node security and privacy protection of federated learning, including data encryption, privacy protection, multi-party computing, etc.
  • the management and control system may further include a local training module 303A.
  • the network device may not include the local training module 303A.
  • FIG. 8 is another schematic structural diagram of applying the machine learning model management system 40 to the network system 30 provided by the embodiment of the present application.
  • the machine learning model management center 401 is deployed on the public cloud 301 .
  • Both the federated learning server and the federated learning client are deployed on the management and control system.
  • the federated learning server 403-1 and the federated learning client 404-1 are deployed on the management and control system 302-2, and the federated learning client 404-2 is deployed on the management and control system 302-3.
  • the system shown in FIG. 8 is applicable to a scenario in which one administrative domain (such as the administrative domain 2 in FIG. 2 ) includes multiple sub-administrative domains.
  • a machine learning server is deployed on one of the management and control systems in a management domain, and a machine learning client connected to the machine learning server is deployed on other management and control systems.
  • a machine learning client connected to the machine learning server may or may not be deployed on the management and control system where the machine learning server is deployed.
  • the management and control system is responsible for information management at the management domain level (or network level) in the sub-management domain, such as management domain-level fault analysis, management-domain-level optimization strategies, and management-domain-level capacity management.
  • the management and control system can perform local training based on training samples with sub-administrative domain-level granularity.
  • Typical applications are the abnormality detection of indicators at the management domain level.
  • the management and control system performs unified training on the performance indicators reported by each network device under management, and generates a machine learning model for abnormality detection of indicators at the management domain level.
  • In the system shown in FIG. 8, the federated learning server and the federated learning client can be deployed on the same management and control system, and model training based on network device granularity is extended to model training based on management-domain-level granularity. This expands the scope of application of federated learning and is suitable for training and optimization of machine learning models at the management domain level. It also helps to solve the problems, in model training based on network device granularity, of insufficient accuracy, insufficient generalization ability, and long data collection time of the obtained machine learning model caused by the small number of training samples.
  • FIG. 4 and FIG. 8 can also be used in combination, for example, model training is performed based on network device granularity in some management domains, and model training is performed based on management domain-level granularity in some management domains, thereby forming a new embodiment.
  • the training process of the machine learning model is completed within the management domain, and the network service data is not sent outside the management domain. Improve the security of network service data in the management domain.
  • The sharing of machine learning models between management domains is realized, so that machine learning models do not need to be repeatedly trained in different management domains, which saves computing resources from the perspective of society as a whole and, for each management domain (such as a telecom operator network), reduces the construction cost and maintenance cost of the management domain.
  • The federated learning system can share the resources (such as computing resources and communication resources) of existing devices, without requiring system changes, and additional communication security management measures such as firewalls and springboard machines (jump servers) do not need to be added.
  • For the functions of each device/functional module in any of the federated learning systems and network architectures provided above, reference can be made to the machine learning model management method provided below (for example, the machine learning model management method shown in FIG. 10), and details are not repeated here.
  • an administrative domain includes one or more network devices.
  • the one or more network devices are controlled by the public cloud 301 .
  • a machine learning model management center can be deployed on the public cloud 301, and a federated learning client can be deployed on a network device.
  • the machine learning model management center is used to directly control the federated learning client without going through the federated learning server.
  • Either the public cloud 301 or the management and control system can be implemented by one device, or implemented by multiple devices collaboratively. This embodiment of the present application does not limit this.
  • FIG. 9 is a schematic diagram of a hardware structure of a computer device 70 according to an embodiment of the present application.
  • the computer device 70 can be used to implement the functions of the device deployed with the machine learning model management center 401 , the machine learning server, or the machine learning client.
  • the computer device 70 can be used to implement part or all of the functions of the above-mentioned public cloud 301 or the management and control system, and can also be used to implement the functions of the above-mentioned network equipment.
  • the computer device 70 shown in FIG. 9 may include a processor 701 , a memory 702 , a communication interface 703 and a bus 704 .
  • the processor 701 , the memory 702 and the communication interface 703 may be connected through a bus 704 .
  • The processor 701 is the control center of the computer device 70, and can be a general-purpose central processing unit (central processing unit, CPU), or can be another general-purpose processor or the like.
  • the general-purpose processor may be a microprocessor or any conventional processor or the like.
  • processor 701 may include one or more CPUs, such as CPU 0 and CPU 1 shown in FIG. 9 .
  • The memory 702 may be a read-only memory (read-only memory, ROM) or another type of static storage device that can store static information and instructions, a random access memory (random access memory, RAM) or another type of dynamic storage device that can store information and instructions, or may be an electrically erasable programmable read-only memory (electrically erasable programmable read-only memory, EEPROM), a magnetic disk storage medium or another magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 702 may exist independently of the processor 701 .
  • the memory 702 may be connected to the processor 701 through a bus 704 for storing data, instructions or program codes.
  • the processor 701 calls and executes the instructions or program codes stored in the memory 702, it can implement the machine learning model management method provided by the embodiments of the present application, for example, the machine learning model management method shown in FIG. 10 .
  • the memory 702 may also be integrated with the processor 701 .
  • the communication interface 703 is used to connect the computer device 70 with other devices through a communication network, and the communication network can be an Ethernet, a radio access network (RAN), a wireless local area network (wireless local area networks, WLAN) and the like.
  • the communication interface 703 may include a receiving unit for receiving data, and a transmitting unit for transmitting data.
  • the bus 704 may be an industry standard architecture (industry standard architecture, ISA) bus, a peripheral component interconnect (peripheral component interconnect, PCI) bus, or an extended industry standard architecture (extended industry standard architecture, EISA) bus or the like.
  • the bus can be divided into address bus, data bus, control bus and so on. For ease of presentation, only one thick line is used in FIG. 9, but it does not mean that there is only one bus or one type of bus.
  • The structure shown in FIG. 9 does not constitute a limitation on the computer device 70.
  • The computer device 70 may include more or fewer components than those shown, or a combination of certain components, or a different arrangement of components.
  • FIG. 10 is an interactive schematic diagram of a machine learning model management method provided by an embodiment of the present application.
  • the method shown in FIG. 10 may be applied to the machine learning model management system 40 shown in FIG. 3 , and the machine learning model management system 40 may be deployed in the network system 30 shown in FIG. 4 or FIG. 8 .
  • the method shown in FIG. 10 may include the following steps S101-S105:
  • S101: The machine learning model management center sends the first machine learning model to the first federated learning server.
  • the first federated learning server may be any federated learning server connected to the machine learning model management center.
  • the first machine learning model is a machine learning model corresponding to a certain model business stored in the machine learning model management center.
  • the machine learning model corresponding to the model business can be updated.
  • the first machine learning model may be the first machine learning model corresponding to the model business stored in the machine learning model management center, or may be a non-first machine learning model corresponding to the model business stored in the machine learning model management center. For specific examples thereof, reference may be made to the embodiment shown in FIG. 11 .
  • S101 may include: the machine learning model management center sends a machine learning model package to the first federated learning server, where the machine learning model package includes the first machine learning model (ie, a model file).
  • the machine learning model management center sends the machine learning model package to the first federated learning server based on the REST protocol.
  • the machine learning model package may further include a description file of the first machine learning model.
  • The machine learning model package sent by the machine learning model management center to the first federated learning server may be a machine learning model package on which security processing operations such as encryption and/or scrambling have been performed, so as to reduce the risk of the machine learning model package being stolen or modified during transmission, thereby improving the security of the machine learning model package.
  • Correspondingly, the first federated learning server can decrypt the encrypted machine learning model package.
  • The first federated learning server can also descramble the scrambled machine learning model package.
  • a secure line can be used to reduce the risk of the machine learning model package being stolen and modified during the transmission process. This improves the security of the machine learning model package.
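  • As a minimal illustration of the security processing described above, the model package could be symmetrically encrypted before transmission and decrypted by the receiving federated learning server. The use of the Fernet scheme and a pre-shared key below is an assumption of this sketch, not something mandated by this embodiment; encryption, scrambling or a secure line can be combined as needed.

```python
# Sketch only: symmetric encryption of a model package before transmission.
# Fernet and the pre-shared key are illustrative assumptions; the embodiment
# only requires that the package be encrypted and/or scrambled in transit.
from cryptography.fernet import Fernet


def encrypt_model_package(package_bytes: bytes, key: bytes) -> bytes:
    """Encrypt the serialized model package at the management-center side."""
    return Fernet(key).encrypt(package_bytes)


def decrypt_model_package(cipher_bytes: bytes, key: bytes) -> bytes:
    """Decrypt the received package at the federated-learning-server side."""
    return Fernet(key).decrypt(cipher_bytes)


if __name__ == "__main__":
    key = Fernet.generate_key()  # assumed to be pre-shared between both ends
    package = b"<model file bytes + description file bytes>"
    sent = encrypt_model_package(package, key)
    assert decrypt_model_package(sent, key) == package
```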
  • This embodiment of the present application does not limit the triggering condition of S101, and two implementation manners are listed below.
  • Mode 1: The machine learning model management center sends the first machine learning model to the first federated learning server at the request of the first federated learning server.
  • the first federated learning server sends the machine learning model requirement information to the machine learning model management center.
  • the first federated learning server sends the machine learning model requirement information to the machine learning model management center based on the REST protocol.
  • the machine learning model management center determines the first machine learning model according to the demand information of the machine learning model.
  • In Mode 1, the first federated learning server obtains the first machine learning model as needed (that is, the first federated learning server obtains the first machine learning model from the machine learning model management center only when it needs the first machine learning model), which helps to save the storage space of the device where the first federated learning server is located.
  • the machine learning model requirement information includes model business information corresponding to the machine learning model and/or machine learning model training requirements.
  • the machine learning model training requirements include at least one of the following: training environment, algorithm type, network structure, training framework, convergence algorithm or security model.
  • the machine learning model management center can maintain the correspondence between the machine learning model identifier and the description information of the machine learning model.
  • the specific embodiment of the corresponding relationship is not limited in this embodiment of the present application.
  • the machine learning model management center may represent the corresponding relationship in the form of a table or the like.
  • For example, the machine learning model management center may, based on the machine learning model requirement information sent by the first federated learning server, search the "correspondence between the identifier of the machine learning model and the description information of the machine learning model" (for example, by looking up Table 2) to determine the identifier of the first machine learning model; then obtain the model package of the first machine learning model by searching the "correspondence between the identifier of the machine learning model and the machine learning model package" (for example, by looking up Table 1); and then send the model package of the first machine learning model to the first federated learning server.
  • Table 2 can be essentially considered as a part of Table 1.
  • Table 3 is a specific example of the correspondence, shown in Table 2, between the identifier of the machine learning model and the description information of the machine learning model.
  • For example, assume that the first federated learning server determines that the machine learning model training requirements are as follows: the model business information corresponding to the machine learning model is the application identification type, the training environment of the machine learning model is EPSN, the algorithm type used for training the machine learning model is CNN, the structure of the CNN is "input layer: 100, output layer: 300, hidden layer: 5", the training framework of the CNN is Tensorflow, and the security mode requirement of the machine learning model is MPC. Then, based on Table 3, the first machine learning model determined by the machine learning model management center is the machine learning model indicated by "SA 001"; a matching sketch is given below.
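  • The following sketch illustrates how such a lookup against the maintained correspondence might be implemented. The field names and the single table entry are hypothetical placeholders standing in for Tables 2 and 3; they are not defined by this embodiment.

```python
# Sketch: matching machine learning model requirement information against the
# maintained correspondence between model identifiers and description
# information. Field names and table contents are hypothetical placeholders.
from typing import Optional

description_table = {
    "SA 001": {
        "model_business": "application identification",
        "training_environment": "EPSN",
        "algorithm_type": "CNN",
        "network_structure": "input layer: 100, output layer: 300, hidden layer: 5",
        "training_framework": "Tensorflow",
        "security_mode": "MPC",
    },
}


def find_model_id(requirements: dict) -> Optional[str]:
    """Return the identifier of the first model whose description matches
    every requirement field supplied by the federated learning server."""
    for model_id, description in description_table.items():
        if all(description.get(k) == v for k, v in requirements.items()):
            return model_id
    return None


# With the requirements listed above, the lookup yields "SA 001".
print(find_model_id({"algorithm_type": "CNN", "security_mode": "MPC"}))
```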
  • Mode 2: The machine learning model management center actively pushes the first machine learning model to the first federated learning server.
  • In Mode 2, when the first federated learning server needs to use the first machine learning model, it can obtain it directly from its local storage without requesting it from the machine learning model management center, which helps to save the time for obtaining the first machine learning model.
  • This embodiment of the present application does not limit the triggering conditions for the machine learning model management center to actively push the first machine learning model to the first federated learning server. For example, after replacing the first machine learning model with a new machine learning model, the machine learning model management center may actively push the replaced first machine learning model to the first federated learning server. For another example, the machine learning model management center may actively push the first machine learning model to the first federated learning server when the first machine learning model is created for the first time.
  • mode 1 and mode 2 can also be used in combination to form a new embodiment.
  • the first federated learning server performs federated learning with multiple federated learning clients in the first management domain based on the first machine learning model and local network service data of the first management domain to obtain a second machine learning model.
  • the multiple federated learning clients are some or all of the federated learning clients connected to the first federated learning server.
  • the first federated learning server belongs to the first management domain.
  • the local network service data refers to the network service data corresponding to the first machine learning model in the first management domain obtained by the first federated learning server.
  • the local network service data is related to model service information corresponding to the first machine learning model.
  • For example, if the model service corresponding to the first machine learning model is an application identification service, the local network service data may be application packets and/or packet statistical data (such as the packet loss rate of packets).
  • If the model service corresponding to the first machine learning model is a fault tracking and prediction service, the local network service data may be fault alarm information.
  • S102 may include: the first federated learning server performs one or more rounds of federated learning with multiple federated learning clients in the first management domain based on the first machine learning model and the local network service data of the first management domain, to obtain the second machine learning model.
  • One round of federated learning is the process that starts when the federated learning server sends the initial machine learning model to multiple federated learning clients and ends when the federated learning server obtains the intermediate machine learning models trained by the multiple federated learning clients and aggregates the obtained intermediate machine learning models to obtain an aggregated machine learning model.
  • the model from which the federated learning client starts model training is called the initial machine learning model.
  • In each round of federated learning, the federated learning client performs one or more local trainings based on the initial machine learning model and the training samples constructed from the network service data it has obtained, and obtains a new machine learning model after each local training. If the new machine learning model satisfies the first preset condition, the new machine learning model is called an intermediate machine learning model; otherwise, the federated learning client continues local training until an intermediate machine learning model is obtained.
  • For example, if the federated learning client tests the new machine learning model using test samples constructed from the network service data it has obtained and the accuracy of the new machine learning model is greater than or equal to the first preset threshold, the federated learning client determines that the new machine learning model satisfies the first preset condition.
  • Alternatively, if the difference between the accuracy of the new machine learning model and the accuracy of the machine learning model obtained by the previous (or previous several) local training is less than or equal to the second preset threshold, the federated learning client determines that the new machine learning model satisfies the first preset condition.
  • Alternatively, if the number of local trainings reaches the third preset threshold, the federated learning client determines that the new machine learning model satisfies the first preset condition. A minimal sketch of this stopping rule is given below.
  • the embodiments of the present application do not limit the values of the first preset threshold, the second preset threshold, and the third preset threshold.
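  • A minimal sketch of the client-side stopping rule described above follows. The train_one_round and evaluate callables, as well as the threshold values, are placeholders supplied by the federated learning client; they are assumptions of the sketch, not part of this embodiment.

```python
# Sketch of the "first preset condition" check on the federated learning client.
# train_one_round(model) -> model and evaluate(model) -> accuracy are supplied
# by the client; the threshold values are illustrative only.
def local_training(initial_model, train_one_round, evaluate,
                   first_threshold=0.95,    # target accuracy
                   second_threshold=0.002,  # max accuracy change between rounds
                   third_threshold=50):     # max number of local trainings
    model, prev_acc = initial_model, None
    for _ in range(third_threshold):
        model = train_one_round(model)       # one local training pass
        acc = evaluate(model)                # test on locally built test samples
        if acc >= first_threshold:           # condition 1: accuracy high enough
            return model                     # -> intermediate machine learning model
        if prev_acc is not None and abs(acc - prev_acc) <= second_threshold:
            return model                     # condition 2: accuracy has stabilized
        prev_acc = acc
    return model                             # condition 3: training count reached
```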
  • In each round of federated learning, the federated learning server obtains an intermediate machine learning model from each of the multiple federated learning clients, and the model obtained after model aggregation is called an aggregated (converged) machine learning model. If the aggregated machine learning model satisfies the second preset condition, the aggregated machine learning model is used as the second machine learning model; otherwise, the federated learning server sends the aggregated machine learning model to the multiple federated learning clients as the initial machine learning model for the next round of federated learning.
  • For example, if the federated learning server tests the aggregated machine learning model using test samples constructed from the network service data it has obtained and the accuracy of the aggregated machine learning model is greater than or equal to the fourth preset threshold, the federated learning server determines that the aggregated machine learning model satisfies the second preset condition.
  • Alternatively, if the difference between the accuracy of the aggregated machine learning model and the accuracy of the aggregated machine learning model obtained in the previous (or previous several) round(s) is less than or equal to the fifth preset threshold, the federated learning server determines that the aggregated machine learning model satisfies the second preset condition.
  • Alternatively, if the number of rounds of federated learning reaches the sixth preset threshold, the federated learning server determines that the aggregated machine learning model satisfies the second preset condition.
  • the embodiments of the present application do not limit the values of the fourth preset threshold, the fifth preset threshold, and the sixth preset threshold.
  • As shown in FIG. 11, S102 may include the following steps S102A-S102G.
  • The following description uses an example in which the multiple federated learning clients in the first management domain include a first federated learning client and a second federated learning client.
  • In this way, the relationship between the first machine learning model, the initial machine learning model, the intermediate machine learning model, the aggregated (converged) machine learning model and the second machine learning model can be explained more clearly.
  • the first federated learning server sends the first machine learning model to the first federated learning client and the second federated learning client respectively.
  • the first federated learning server sends a model package of the first machine learning model to the first federated learning client, where the model package includes a model file of the first machine learning model.
  • The model package may also include a description file of the first machine learning model.
  • the first federated learning server sends a model package of the first machine learning model to the second federated learning client, where the model package contains the model file of the first machine learning model.
  • the first federated learning client uses the first machine learning model as an initial machine learning model, and performs local training based on the initial machine learning model and network service data obtained by itself to obtain a first intermediate machine learning model.
  • the second federated learning client uses the first machine learning model as an initial machine learning model, and performs local training based on the initial machine learning model and the network business data obtained by itself to obtain a second intermediate machine learning model.
  • If the federated learning client is deployed on a network device, the network service data obtained by the federated learning client specifically refers to the network service data generated by that network device.
  • If the federated learning client is deployed on a management and control system, the network service data obtained by the federated learning client specifically refers to the network service data generated and reported by the one or more network devices managed by the management and control system.
  • the first federated learning client is deployed on the network device 303-21
  • the second federated learning client is deployed on the network device 303-22
  • Then the first federated learning client uses the first machine learning model as the initial machine learning model of the current round of federated learning, and uses the local computing resources of the network device 303-21 and the network service data generated by the network device 303-21 to locally train the initial machine learning model to obtain the first intermediate machine learning model; similarly, the second federated learning client performs local training based on the first machine learning model and the network service data generated by the network device 303-22 to obtain the second intermediate machine learning model.
  • the first federated learning client sends parameter update information of the first intermediate machine learning model relative to the initial machine learning model to the first federated learning server.
  • the second federated learning client sends parameter update information of the second intermediate machine learning model relative to the initial machine learning model to the first federated learning server.
  • For example, the first federated learning client packages the parameter update information of the first intermediate machine learning model relative to the initial machine learning model into a first parameter update file, and sends it to the first federated learning server.
  • For example, assume the first machine learning model includes parameter A and parameter B, whose values are a1 and b1 respectively, and that in the first intermediate machine learning model obtained after the first federated learning client performs local training, the values of parameter A and parameter B are a2 and b2 respectively. Then the parameter update information of the first intermediate machine learning model relative to the initial machine learning model includes: the update information a2-a1 of parameter A, and the update information b2-b1 of parameter B.
  • Similarly, the second federated learning client packages the parameter update information of the second intermediate machine learning model relative to the first machine learning model into a second parameter update file, and sends it to the first federated learning server. A minimal sketch of this step is given below.
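  • The parameter update computation and packaging could look like the sketch below. The JSON layout of the parameter update file is an assumption made for illustration; the embodiment does not prescribe a file format.

```python
# Sketch: computing the parameter update information of an intermediate model
# relative to the initial model and packaging it as a parameter update file.
# The JSON file layout is an assumption of this sketch.
import json


def parameter_update_info(initial_params: dict, intermediate_params: dict) -> dict:
    """For the example in the text this returns {'A': a2 - a1, 'B': b2 - b1}."""
    return {name: intermediate_params[name] - initial_params[name]
            for name in initial_params}


def pack_update_file(update_info: dict, path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(update_info, f)


if __name__ == "__main__":
    initial = {"A": 0.40, "B": 1.20}        # a1, b1
    intermediate = {"A": 0.55, "B": 1.05}   # a2, b2 after local training
    pack_update_file(parameter_update_info(initial, intermediate),
                     "first_parameter_update_file.json")
```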
  • the first federated learning server obtains a first intermediate machine learning model based on the initial machine learning model and parameter update information sent by the first federated learning client.
  • the first federated learning server obtains the second intermediate machine learning model based on the initial machine learning model and the parameter update information sent by the second federated learning client.
  • the first federated learning server uses an aggregation algorithm to perform model aggregation on the first intermediate machine learning model and the second intermediate machine learning model to obtain an aggregated machine learning model.
  • For example, the first federated learning server obtains the value a2 of parameter A of the first intermediate machine learning model based on the update information a2-a1 of parameter A and the value a1 of parameter A of the initial machine learning model, and obtains the value b2 of parameter B of the first intermediate machine learning model based on the update information b2-b1 of parameter B and the value b1 of parameter B of the initial machine learning model; then, the values of parameter A and parameter B of the initial machine learning model are set to a2 and b2 respectively to obtain the first intermediate machine learning model.
  • the first federated learning server may obtain the second intermediate machine learning model based on the initial machine learning model and the parameter update information sent by the second federated learning client.
  • the aggregation algorithm may be a weighted average algorithm.
  • For example, the first federated learning server assigns weights to the parameter update information reported by the first federated learning client and the second federated learning client according to the completeness of the network service data obtained by the first federated learning client and of the network service data obtained by the second federated learning client, performs a weighted summation of the parameter update information for the same parameter reported by the two federated learning clients, and then averages the result to obtain the aggregated parameter update information of that parameter, as sketched below.
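  • A weighted-average aggregation over the per-client parameter update information could be sketched as follows; the completeness-based weights and the numeric values are illustrative assumptions.

```python
# Sketch of weighted-average aggregation of per-client parameter update
# information. Weights (e.g. derived from data completeness) are assumptions.
def aggregate_updates(client_updates: dict, weights: dict) -> dict:
    """client_updates: client id -> {parameter name: update value};
    weights: client id -> weight assigned by the federated learning server."""
    total_weight = sum(weights.values())
    parameter_names = next(iter(client_updates.values())).keys()
    return {p: sum(weights[c] * client_updates[c][p] for c in client_updates)
               / total_weight
            for p in parameter_names}


updates = {"client1": {"A": 0.15, "B": -0.15},
           "client2": {"A": 0.05, "B": -0.25}}
weights = {"client1": 0.6, "client2": 0.4}        # from data completeness
print(aggregate_updates(updates, weights))        # ≈ {'A': 0.11, 'B': -0.19}
```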
  • The federated learning server can also formulate, according to the parameter update information, the training strategy of the federated learning clients in the next round of federated learning.
  • S102C-S102D is an implementation manner for the federated learning server to obtain multiple intermediate machine learning models obtained by multiple federated learning clients.
  • the specific implementation is not limited to this.
  • a federated learning client can directly send an intermediate machine learning model to the federated learning server.
  • the first federated learning server determines whether the converged machine learning model satisfies the second preset condition.
  • S102F: If the converged machine learning model does not satisfy the second preset condition, the first federated learning server determines the converged machine learning model as the new first machine learning model. After S102F is executed, the process returns to S102A.
  • S102G: If the converged machine learning model satisfies the second preset condition, the first federated learning server determines the converged machine learning model as the second machine learning model.
  • the first federated learning server sends the second machine learning model (ie, the model file of the second machine learning model) to the first federated learning client and the second federated learning client, respectively.
  • the first federated learning server sends a model package of the second machine learning model to the first federated learning client, where the model package includes a model file of the second machine learning model.
  • The model package may also include a description file of the second machine learning model.
  • the first federated learning server sends a model package of the second machine learning model to the second federated learning client, where the model package includes the model file of the second machine learning model.
  • The model package may also include a description file of the second machine learning model.
  • the multiple federated learning clients may execute model services corresponding to the second machine learning model based on the second machine learning model. For example, if the second machine learning model is an application identification model, the plurality of federated learning clients may identify the application based on the second machine learning model.
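  • As a hedged illustration of how a federated learning client might execute the application identification service with the received model, assuming the model is stored as a Keras/TensorFlow model file and that packet features have already been extracted (both assumptions of this sketch):

```python
# Sketch: executing the application identification service with the received
# second machine learning model. The Keras loading call and the pre-extracted
# packet feature vector are assumptions made purely for illustration.
import numpy as np
import tensorflow as tf


def identify_application(model_path: str, packet_features: np.ndarray) -> int:
    model = tf.keras.models.load_model(model_path)            # received model
    probabilities = model.predict(packet_features[np.newaxis, :], verbose=0)
    return int(np.argmax(probabilities))   # index of the identified application
```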
  • the first federated learning server sends the second machine learning model to the machine learning model management center, so that the second machine learning model is used by devices in the second management domain.
  • the second machine learning model can be used by devices in other management domains.
  • the device in the second management domain may be a device on which the second federated learning server is deployed.
  • the second federated learning server belongs to the second management domain. That is, the second machine learning model can be used by the second federated learning server in the second management domain.
  • the second federated learning server may be any federated learning server that has the authority to use the second machine learning model.
  • Which federated learning server or servers have the permission to use the second machine learning model may be predefined, or may be determined by the federated learning server that generates the second machine learning model (that is, the first federated learning server).
  • the first federated learning server may also send the access permission information of the second machine learning model to the machine learning model management center. Subsequently, the machine learning model management center may generate a model package of the second machine learning model based on the access permission information.
  • the access permission information refers to information used to represent the federated learning server that is allowed to use the second machine learning model.
  • the specific implementation manner of the access permission information is not limited in this embodiment of the present application.
  • the access permission information may be an identifier of a federated learning server that is allowed to use the second machine learning model.
  • the access permission information may be predefined information "indicating that the second machine learning model can be used by all other federated learning servers".
  • the first federated learning server may not send the access permission information of the second machine learning model to the machine learning model management center.
  • the second machine learning model can also continue to be used by the first federated learning server.
  • the device in the second management domain may be a device deployed with a federated learning client.
  • the federated learning client belongs to the second management domain.
  • the federated learning client may be any federated learning client that has permission to use the second machine learning model. That is, the second machine learning model can be used by federated learning clients in the second administrative domain.
  • the device in the second management domain may also be a model service execution device (that is, a device capable of using a machine learning model to execute a corresponding model service).
  • For example, even if an operator that executes a model service (for example, a service identification service) does not have a federated learning server or a federated learning client, it can still obtain, from the machine learning model management center, the machine learning model provided by the federated learning servers of other operators.
  • S104 may include the following S104A-S104C:
  • the first federated learning server obtains the application effect of the second machine learning model.
  • the application effect of the second machine learning model can be understood as: the trial effect of the second machine learning model.
  • the first federated learning server tries the second machine learning model based on the network service data in the management domain to which the first federated learning server belongs, so as to obtain the trial effect (ie the application effect) of the second machine learning model.
  • Specifically, the first federated learning server sends the second machine learning model to the multiple federated learning clients connected to it; each of the multiple federated learning clients executes the model service corresponding to the second machine learning model based on the second machine learning model and the network service data it has obtained, obtains an execution result, and sends the execution result to the first federated learning server.
  • the first federated learning server aggregates multiple execution results sent by the multiple federated learning clients to obtain a trial effect (ie, an application effect) of the second machine learning model.
  • For example, if the model service corresponding to the second machine learning model is a recognition-type service (such as an application recognition service), the above execution result may be the recognition rate of the second machine learning model, that is, the proportion of the objects participating in recognition that the second machine learning model can recognize.
  • the application identification service is specifically: identifying which application (such as a video playing application) the packet belongs to.
  • For example, assume the model service corresponding to the second machine learning model is an application identification service, and the first federated learning server obtains by aggregation that, within a preset time period, a packets are input to the second machine learning model and the second machine learning model identifies which application each of b of these packets belongs to, where a > b and both a and b are integers; then the recognition rate of the second machine learning model is b/a, as sketched below.
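  • A small sketch of how the first federated learning server might summarize the per-client execution results into the recognition rate b/a (the result format is an assumption):

```python
# Sketch: summarizing per-client execution results into an overall recognition
# rate b / a. The {'input': a_i, 'identified': b_i} result format is assumed.
def recognition_rate(results: list) -> float:
    total_input = sum(r["input"] for r in results)            # a
    total_identified = sum(r["identified"] for r in results)  # b
    return total_identified / total_input if total_input else 0.0


# e.g. two federated learning clients report counts for the preset time period
print(recognition_rate([{"input": 800, "identified": 720},
                        {"input": 200, "identified": 150}]))   # 0.87
```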
  • Alternatively, the above execution result may be the number of packets that are not identified by the second machine learning model within a period of time, or the proportion of the objects participating in recognition that the second machine learning model cannot recognize, and so on.
  • The above recognition-type service can be replaced with a prediction-type service (such as a fault tracking and prediction service); in this case, the above execution result may be the prediction rate of the second machine learning model, that is, the proportion of the objects participating in prediction that the second machine learning model can predict.
  • the above identification type service may be replaced with a detection type service (such as a KPI anomaly detection service, etc.), and the above execution result may be the detection rate of the second machine learning model.
  • the above identification type service can also be replaced with other types of services. In this case, the specific implementation manner of the above execution result can be obtained by reasoning from the example in S104A.
  • the application effect of the second machine learning model satisfies the preset condition, which can be understood as: the application effect of the second machine learning model reaches the preset target.
  • For different model services, the preset targets are different.
  • For example, if the above execution result is the recognition rate of the second machine learning model, the application effect of the second machine learning model reaching the preset target may be: the recognition rate of the second machine learning model is greater than or equal to a preset recognition rate; or, the recognition rate of the second machine learning model is greater than or equal to a historical recognition rate, where the historical recognition rate may be, for example, the recognition rate of the first machine learning model.
  • If the application effect of the second machine learning model does not satisfy the preset condition, the first federated learning server performs a new round of federated learning with the multiple federated learning clients based on the second machine learning model and new local network service data, to obtain a new second machine learning model.
  • The "new local network service data" here is new relative to the local network service data used in the process of training to obtain the second machine learning model.
  • Subsequently, the first federated learning server can determine whether the application effect of the new second machine learning model satisfies the preset condition, and so on, until the application effect of a new second machine learning model obtained at some point satisfies the preset condition; the first federated learning server then sends that new second machine learning model to the machine learning model management center. A sketch of this loop is given below.
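  • The S104B/S104C loop could be sketched as follows; run_federated_learning, measure_effect and upload_to_center are placeholders for the procedures described above, and the target value is illustrative.

```python
# Sketch of the S104B/S104C loop: keep running federated learning on new local
# network service data until the application effect reaches the preset target.
def train_until_effective(model, run_federated_learning, measure_effect,
                          upload_to_center, target=0.9, max_iterations=10):
    for _ in range(max_iterations):
        if measure_effect(model) >= target:      # preset condition satisfied
            upload_to_center(model)              # send to the model management center
            return model
        model = run_federated_learning(model)    # new round on new local data
    return model
```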
  • the machine learning model management center replaces the first machine learning model with the second machine learning model.
  • the machine learning model management center replaces the model package of the first machine learning model with the model package of the second machine learning model. More specifically, the machine learning model management center replaces the model file of the first machine learning model with the model file of the second machine learning model.
  • the machine learning model management center replaces the machine learning model file 1 with the model file of the second machine learning model.
  • Subsequently, when the machine learning model management center needs to send a machine learning model to a device in the first management domain or a device in another management domain, such as the first federated learning server or another federated learning server (such as the second federated learning server), it can send the model file of the second machine learning model.
  • the machine learning model management center sends the second machine learning model to the devices in the second management domain (including sending under request and actively pushing) , which may include: the machine learning model management center sends the second machine learning model to the device in the second management domain when it is determined that the second management domain has the right to use the second machine learning model.
  • the machine learning model management center sends the second machine learning model to the second federated learning server (including sending under request and actively pushing), which may include: The machine learning model management center sends the second machine learning model to the second federated learning server when it is determined that the second federated learning server has the right to use the second machine learning model.
  • Optionally, the method further includes: the machine learning model management center performs operations such as virus scanning and sensitive word scanning on the received second machine learning model, so as to determine that the second machine learning model has not been modified during transmission, thereby determining the security of the second machine learning model.
  • the machine learning model management center may also determine whether the second machine learning model is safe based on the network security assessment report made by the third-party software.
  • the machine learning model management center may replace the first machine learning model with the second machine learning model under the condition that the second machine learning model is determined to be safe.
  • the machine learning model management center may also perform model format verification on the second machine learning model to determine that the second machine learning model is from a trusted authentication network, thereby determining the security of the second machine learning model.
  • For example, the machine learning model management center can maintain the identifiers of certified networks and, based on the maintained identifiers, determine whether the second machine learning model comes from a certified network. If it comes from a certified network, the second machine learning model is considered safe; otherwise, the second machine learning model is not safe.
  • the process of using the second machine learning model by the second federated learning server in the second management domain may include:
  • the machine learning model management center sends the second machine learning model to the second federated learning server.
  • the machine learning model management center sends the second machine learning model to the second federated learning server at the request of the second federated learning server.
  • the specific implementation manner may be based on the above description of Mode 1, which will not be repeated here.
  • the machine learning model management center actively pushes the second machine learning model to the second federated learning server.
  • the specific implementation manner may be based on the above description of Mode 2, which will not be repeated here.
  • the second federated learning server performs federated learning with multiple federated learning clients in the second management domain based on the second machine learning model and the local network data of the second management domain to obtain a third machine learning model.
  • the specific implementation manner can be obtained based on the above description of FIG. 11 , and details are not repeated here.
  • Optionally, the second federated learning server can send the third machine learning model to the machine learning model management center; the machine learning model management center can replace the second machine learning model with the third machine learning model, so that the third machine learning model can be used by devices in a third management domain.
  • the third management domain is different from the second management domain.
  • the third management domain and the first management domain may be the same or different.
  • the third machine learning model can also be used by devices in the second administrative domain.
  • the second federated learning server can send the third machine learning model to its connected federated learning client, and the federated learning client can execute the model business corresponding to the third machine learning model based on the third machine learning model.
  • this optional implementation manner can also be considered as an example in which the second federated learning server in the second management domain and the federated learning client in the second management domain jointly use the second machine learning model.
  • the process of using the second machine learning model by the federated learning client in the second management domain may include:
  • the machine learning model management center sends the second machine learning model to the federated learning client in the second management domain.
  • the machine learning model management center may send the second machine learning model to the federated learning client in the case of receiving the request sent by the federated learning client in the second management domain.
  • the machine learning model management center may actively push the second machine learning model to the federated learning client in the second management domain.
  • the federated learning client in the second management domain can execute the model service corresponding to the second machine learning model based on the second machine learning model.
  • This optional implementation can be applied to a scenario where the machine learning model management center directly controls the federated learning client without going through the federated learning server.
  • the method may further include the following steps S106-S107:
  • the machine learning model management center generates a model package of the second machine learning model based on the second machine learning model.
  • the model package contains a model file of the second machine learning model and a description file of the second machine learning model.
  • the machine learning model management center generates a description file for the second machine learning model, where the description file may include access rights of the second machine learning model, description information of the second machine learning model, and running scripts of the second machine learning model. Then, the machine learning model management center generates a model package of the second machine learning model from the model file of the second machine learning model and the description file of the second machine learning model according to the packaging specification of the machine learning model package.
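  • A minimal sketch of generating such a model package is shown below. The zip layout and the description fields are assumptions of the sketch; the actual packaging specification of the machine learning model package is not restricted by this example.

```python
# Sketch: generating a model package from a model file plus a description file.
# The zip layout and the description fields shown here are assumptions.
import json
import zipfile


def build_model_package(model_file: str, package_path: str,
                        access_rights: list, description: str,
                        run_script: str) -> None:
    description_file = {
        "access_rights": access_rights,   # federated learning servers allowed to use it
        "description": description,
        "run_script": run_script,
    }
    with zipfile.ZipFile(package_path, "w") as package:
        package.write(model_file, arcname="model/model_file.bin")
        package.writestr("description.json", json.dumps(description_file))
```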
  • the machine learning model management center signs the model package of the second machine learning model, and obtains a signature file of the model package of the second machine learning model.
  • The purpose of S107 is to perform integrity protection on the model package of the second machine learning model, to indicate that the model package of the second machine learning model comes from the machine learning model management center rather than from another device/system.
  • In this way, when the machine learning model management center sends the model package of the second machine learning model to any federated learning server, it can also send the signature file of the model package to that federated learning server, so that the federated learning server can determine, based on the signature file, whether the model package comes from the machine learning model management center.
  • Similarly, when the above-mentioned machine learning model management center sends the model package of the first machine learning model to the first federated learning server, it can also send the signature file of that model package to the first federated learning server. A minimal signing/verification sketch is given below.
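  • The signing and verification could, for example, be realized with an asymmetric signature as sketched below. The choice of Ed25519 is an assumption; the embodiment only requires a signature file that protects the integrity and origin of the model package.

```python
# Sketch: signing the model package at the management center and verifying the
# signature at a federated learning server. Ed25519 is an assumed choice.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


def sign_package(private_key: Ed25519PrivateKey, package_bytes: bytes) -> bytes:
    return private_key.sign(package_bytes)       # contents of the signature file


def verify_package(public_key, package_bytes: bytes, signature: bytes) -> bool:
    try:
        public_key.verify(signature, package_bytes)
        return True                               # package comes from the center
    except InvalidSignature:
        return False


if __name__ == "__main__":
    private_key = Ed25519PrivateKey.generate()
    package = b"<model package bytes>"
    signature = sign_package(private_key, package)
    assert verify_package(private_key.public_key(), package, signature)
```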
  • the second machine learning model is a machine learning model based on the first training framework.
  • the method may further include the following step 1:
  • Step 1 The machine learning model management center converts the second machine learning model into a third machine learning model.
  • the third machine learning model is a machine learning model based on the second training framework, and the third machine learning model and the second machine learning model are machine learning models corresponding to the same model business information.
  • For example, the machine learning model management center uses a model conversion tool to translate the algorithm implementation code and parameters, based on the first training framework, in the model file of the second machine learning model into the corresponding algorithm implementation code and parameters based on the second training framework.
  • the model conversion tool may be implemented by software and/or hardware.
  • the machine learning model management center converts the machine learning model supported by the first training framework into the machine learning model supported by the second training framework.
  • For example, if the first training framework is the Tensorflow training framework and the second training framework is the Pytorch training framework, the information of the relu activation layer (operation unit) can be translated into torch.nn.ReLU(), as sketched below.
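  • The operator-level translation could be sketched as below. The layer descriptor format is an assumption of this sketch; a real model conversion tool (for example, one based on an exchange format such as ONNX) handles far more operators, parameters and weight layouts.

```python
# Sketch of the operator-level translation performed by a model conversion
# tool, e.g. mapping a Tensorflow "relu" activation to torch.nn.ReLU().
# The layer descriptor format below is an assumption for illustration only.
import torch.nn as nn

TF_TO_TORCH = {
    "relu":  lambda spec: nn.ReLU(),
    "dense": lambda spec: nn.Linear(spec["in"], spec["out"]),
}


def convert_layers(tf_layers: list) -> nn.Sequential:
    """tf_layers example: [{'op': 'dense', 'in': 100, 'out': 300}, {'op': 'relu'}]"""
    return nn.Sequential(*(TF_TO_TORCH[layer["op"]](layer) for layer in tf_layers))


model = convert_layers([{"op": "dense", "in": 100, "out": 300}, {"op": "relu"}])
```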
  • Optionally, the machine learning model management center also stores a fifth machine learning model, where the fifth machine learning model is a machine learning model based on the second training framework, and the fifth machine learning model and the first machine learning model are machine learning models corresponding to the same model business information. In this case, the method may further include the following step 2:
  • Step 2 The machine learning model management center replaces the fifth machine learning model with the third machine learning model.
  • the machine learning models corresponding to the same model business information may correspond to different training frameworks.
  • After the machine learning model management center replaces the first machine learning model with the second machine learning model, the second machine learning model is converted into a machine learning model under another training framework, so as to provide machine learning models for other federated learning servers that support the second training framework.
  • If the machine learning model management center determines that it maintains a machine learning model based on another training framework (that is, the fifth machine learning model) that corresponds to the same model business information as the first machine learning model, the machine learning model management center can also replace the fifth machine learning model with the third machine learning model, thereby ensuring that the machine learning models under other training frameworks for the same model business information are the latest machine learning models.
  • machine learning model 1 and machine learning model 2 in Table 2 are machine learning models corresponding to the same model business information, and the difference between the two is that they are applied to different training frameworks.
  • Assume the first machine learning model is machine learning model 1 and the fifth machine learning model is machine learning model 2. Then, in Table 1, in the case where the model file of the first machine learning model (that is, machine learning model file 1) is replaced with the model file of the second machine learning model, the model file of the fifth machine learning model (that is, machine learning model file 2) is replaced with the model file of the third machine learning model.
  • For example, if the machine learning model for the application recognition business can run on three AI model training frameworks, then after the machine learning model based on one of the training frameworks is replaced, the machine learning model management center can synchronously update the machine learning models corresponding to the other two training frameworks.
  • the machine learning model obtained by the federated learning server executing the federated learning can be used by other federated learning servers. In this way, there is no need to repeatedly train machine learning models between different federated learning servers, which saves computing resources from the perspective of the entire society.
  • The federated learning server obtains the initial machine learning model from the machine learning model management center, which helps the federated learning server determine, as the initial machine learning model, a machine learning model that is closest to its machine learning model training requirements, thereby helping to reduce the number of rounds of federated learning and speed up the convergence of the machine learning model.
  • The machine learning model on the machine learning model management center can integrate the strengths of the network service data of multiple management domains (that is, the machine learning model is indirectly obtained through federated learning based on the network service data of multiple management domains), so its adaptability can be greatly improved compared with a machine learning model obtained based only on the network service data of a single management domain. For each management domain, better results can also be achieved when model services are subsequently executed on more novel and complex network service data.
  • Each management domain independently trains the machine learning model, so if the federated learning server in one management domain fails, the other management domains can still continue to perform federated learning, so that the machine learning model management center continues to update the machine learning models.
  • Since the federated learning server obtains the initial machine learning model from the machine learning model management center, even if the federated learning server in a management domain fails, after the failure is recovered the federated learning server can still obtain, from the machine learning model management center, the latest shared machine learning model (that is, the machine learning model updated by the machine learning model management center during the failure by combining the results of the other connected federated learning servers) as the initial machine learning model, which helps to reduce the number of rounds of federated learning and speed up the convergence of the machine learning model.
  • the embodiments corresponding to FIG. 10 and FIG. 11 are further explained by taking the example that the machine learning model corresponds to the application identification service, that is, the machine learning model is an application identification (SA) machine learning model.
  • the management domain is specifically a telecom operator network
  • a first federated learning server is deployed on EMS1 of the telecom operator network A.
  • EMS1 is used to manage the first network device and the second network device.
  • A first federated learning client is deployed on the first network device, and a second federated learning client is deployed on the second network device.
  • the first federated learning client and the second federated learning client are respectively connected to the first federated learning server.
  • a second federated learning server is deployed on EMS2 of the telecom operator network B
  • a federated learning client connected to the second federated learning server is deployed on a network device managed by EMS2.
  • FIG. 12 is an interactive schematic diagram of still another machine learning model management method provided by an embodiment of the present application. Wherein, the method shown in FIG. 12 may include the following steps S201-S212:
  • S201: The machine learning model management center sends the SA machine learning model 001 to the EMS1 of the telecom operator network A.
  • S202 The EMS1 sends the SA machine learning model 001 to the first network device and the second network device respectively.
  • The first network device uses the SA machine learning model 001 as the initial machine learning model, and performs local training based on the initial machine learning model and the application packets or the statistical data of the application packets of the first network device, to obtain a first intermediate machine learning model (denoted as SA machine learning model 002).
  • The second network device uses the SA machine learning model 001 as the initial machine learning model, and performs local training based on the initial machine learning model and the application packets or the statistical data of the application packets of the second network device, to obtain a second intermediate machine learning model (denoted as SA machine learning model 003).
  • S204: the first network device sends the parameter update information of the SA machine learning model 002 relative to the SA machine learning model 001 to EMS1, and the second network device sends the parameter update information of the SA machine learning model 003 relative to the SA machine learning model 001 to EMS1.
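  • As a minimal illustration of the parameter update information exchanged in S204, and of how EMS1 can rebuild the intermediate models from it in S205, the Python sketch below computes and applies per-parameter differences. The two-parameter model and its values are hypothetical and simply mirror the a2-a1 / b2-b1 style of example used elsewhere in this application.

```python
import numpy as np

def compute_param_update(initial_params, trained_params):
    """Per-parameter difference between the locally trained model
    (e.g. SA machine learning model 002) and the initial model
    (SA machine learning model 001)."""
    return {name: trained_params[name] - initial_params[name]
            for name in initial_params}

def apply_param_update(initial_params, update):
    """Server-side reconstruction of the intermediate model from the
    initial model and the received parameter update information."""
    return {name: initial_params[name] + update[name]
            for name in initial_params}

# Hypothetical two-parameter model.
model_001 = {"A": np.array(1.0), "B": np.array(2.0)}      # initial model
model_002 = {"A": np.array(1.3), "B": np.array(1.8)}      # after local training
update_002 = compute_param_update(model_001, model_002)   # {'A': 0.3, 'B': -0.2}
rebuilt_002 = apply_param_update(model_001, update_002)
assert np.allclose(rebuilt_002["A"], model_002["A"])
```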
  • S205: EMS1 obtains the SA machine learning model 002 based on the SA machine learning model 001 and the parameter update information sent by the first network device, and obtains the SA machine learning model 003 based on the SA machine learning model 001 and the parameter update information sent by the second network device. EMS1 then uses a convergence algorithm to perform model convergence on the SA machine learning model 002 and the SA machine learning model 003, to obtain a converged machine learning model (labeled SA machine learning model 004).
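  • The sketch below shows what the convergence step of S205 could look like when the convergence algorithm is a weighted average, one of the convergence algorithms this application mentions; the weights are hypothetical and could, for example, reflect the completeness of each network device's service data.

```python
import numpy as np

def aggregate(models, weights):
    """Weighted-average convergence of intermediate models."""
    total = sum(weights)
    return {name: sum(w * m[name] for w, m in zip(weights, models)) / total
            for name in models[0]}

# Intermediate models reconstructed from the two devices' parameter updates.
model_002 = {"A": np.array(1.3), "B": np.array(1.8)}
model_003 = {"A": np.array(1.1), "B": np.array(2.2)}
model_004 = aggregate([model_002, model_003], weights=[0.6, 0.4])  # converged model
```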
  • S206: EMS1 determines whether the SA machine learning model 004 satisfies the second preset condition. For the description of the second preset condition, refer to the description under S102, which is not repeated here. If the condition is not satisfied, S207 is performed; if it is satisfied, the SA machine learning model 004 is the second machine learning model and S208 is performed.
  • S207: EMS1 uses the SA machine learning model 004 as the new SA machine learning model 001. After S207, the process returns to S201.
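  • A hedged sketch of the decision made in S206: the second preset condition is treated here as "accuracy high enough, no longer improving, or round budget exhausted", with all thresholds being illustrative assumptions rather than values fixed by this application.

```python
def meets_second_preset_condition(accuracy, prev_accuracy, round_idx,
                                  acc_threshold=0.95, delta_threshold=0.001,
                                  max_rounds=50):
    """Accept the converged model when its accuracy on the server's test
    samples is high enough, when it has stopped improving, or when the
    federated learning round budget is spent (all values hypothetical)."""
    return (accuracy >= acc_threshold
            or abs(accuracy - prev_accuracy) <= delta_threshold
            or round_idx >= max_rounds)

accepted = meets_second_preset_condition(accuracy=0.93, prev_accuracy=0.929, round_idx=7)
# accepted     -> SA model 004 is the second machine learning model (continue with S208/S209)
# not accepted -> SA model 004 becomes the new SA model 001 and another round starts (S207)
```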
  • S208: EMS1 sends the SA machine learning model 004 to the first network device and the second network device respectively.
  • S209: EMS1 sends the SA machine learning model 004 to the machine learning model management center, so that the SA machine learning model 004 is used by EMS2 in telecom operator network B.
  • for example, the machine learning model management center sends the SA machine learning model 004 to EMS2. Based on the SA machine learning model 004 and the application packets, or the statistical data of the application packets, in telecom operator network B, EMS2 performs federated learning with the multiple network devices managed by EMS2 to obtain the SA machine learning model 005, and sends the SA machine learning model 005 to these network devices.
  • subsequently, the multiple network devices may perform application identification based on the SA machine learning model 005. For the specific implementation process, refer to the above S202-S208.
  • S210: the machine learning model management center replaces the SA machine learning model 001 with the SA machine learning model 004.
  • S211: the machine learning model management center generates a model package of the SA machine learning model 004 based on the SA machine learning model 004.
  • S212: the machine learning model management center signs the model package of the SA machine learning model 004 to obtain a signature file of the model package.
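  • The sketch below illustrates S211 and S212 under stated assumptions: the model package is a plain zip bundling the model file and its description file, and the signature file carries an HMAC-SHA256 tag. The application lists SHA-256 among the possible security modes but does not prescribe this exact packaging or signing scheme.

```python
import hashlib
import hmac
import json
import zipfile

def build_model_package(model_file, description, out_path):
    """Bundle the model file and its description file into a model package (S211)."""
    with zipfile.ZipFile(out_path, "w") as pkg:
        pkg.write(model_file, arcname="model.bin")
        pkg.writestr("description.json", json.dumps(description))

def sign_model_package(pkg_path, key):
    """Produce a signature file for the model package (S212); the HMAC-SHA256
    tag stands in for whatever signing scheme a deployment actually uses."""
    with open(pkg_path, "rb") as f:
        digest = hmac.new(key, f.read(), hashlib.sha256).hexdigest()
    sig_path = pkg_path + ".sig"
    with open(sig_path, "w") as f:
        f.write(digest)
    return sig_path
```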
  • because EMS1, the first network device and the second network device in telecom operator network A perform federated learning inside telecom operator network A, the application packets or the statistical data of the application packets in telecom operator network A (including those of the first network device and those of the second network device), as well as the intermediate machine learning models of telecom operator network A (such as the SA machine learning model 002 and the SA machine learning model 003), do not need to be transmitted to a third party, which improves data privacy and security.
  • meanwhile, as in the example under S209, the network devices in telecom operator network B may perform application identification based on the SA machine learning model 005.
  • the SA machine learning model 005 integrates the application packets or the statistical data of the application packets in telecom operator network B with those in telecom operator network A. Therefore, performing application identification based on the SA machine learning model 005 helps to improve the accuracy of application identification in telecom operator network B.
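  • Purely as an illustration of how a network device in telecom operator network B might use the shared SA machine learning model 005 for application identification, the sketch below assumes an sklearn-style classifier interface and hypothetical per-flow statistical features; neither is specified by this application.

```python
def identify_application(sa_model, packet_stats):
    """Label the application of one flow from its packet statistics using the
    shared SA model; feature names are illustrative only."""
    features = [packet_stats.get(name, 0.0)
                for name in ("pkt_len_mean", "pkt_len_var", "inter_arrival_mean")]
    return sa_model.predict([features])[0]

# e.g. identify_application(sa_model_005, {"pkt_len_mean": 812.0, "pkt_len_var": 95.5,
#                                          "inter_arrival_mean": 0.02})
```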
  • in the embodiments of this application, a machine learning model management apparatus (such as the machine learning model management center or a federated learning server) may be divided into functional modules according to the above method examples.
  • for example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module.
  • the integrated module can be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division of modules in the embodiments of this application is schematic and is only a logical function division; there may be other division manners in actual implementation.
  • FIG. 13 is a schematic structural diagram of a machine learning model management center provided by an embodiment of the present application.
  • the machine learning model management center 100 shown in FIG. 13 can be used to implement the functions of the machine learning model management center in the above method embodiments, and thus can also achieve the beneficial effects of the above method embodiments.
  • the machine learning model management center may be the machine learning model management center 401 shown in FIG. 3 .
  • the machine learning model management center 100 is connected to the first federated learning server, and the first federated learning server belongs to the first management domain.
  • the machine learning model management center 100 includes a sending unit 1001 , a receiving unit 1002 and a processing unit 1003 .
  • the sending unit 1001 is configured to send the first machine learning model to the first federated learning server. The receiving unit 1002 is configured to receive a second machine learning model from the first federated learning server, where the second machine learning model is obtained by the first federated learning server through federated learning with multiple federated learning clients in the first management domain, based on the first machine learning model and local network service data of the first management domain.
  • the processing unit 1003 is configured to replace the first machine learning model with the second machine learning model, so that the second machine learning model is used by the device in the second management domain.
  • the sending unit 1001 may be configured to perform S101
  • the receiving unit 1002 may be configured to perform the receiving step corresponding to S104
  • the processing unit 1003 may be configured to perform S105.
  • the receiving unit 1002 is further configured to receive the machine learning model requirement information sent by the first federated learning server.
  • the processing unit 1003 is further configured to determine the first machine learning model according to the machine learning model requirement information.
  • the machine learning model requirement information includes model business information corresponding to the machine learning model and/or machine learning model training requirements.
  • the machine learning model training requirements include at least one of the following: training environment, algorithm type, network structure, training framework, convergence algorithm or security model.
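  • As an illustration of how the processing unit 1003 might determine the first machine learning model from the requirement information, the sketch below matches the requirement fields against a small in-memory catalog in the spirit of Table 2 and Table 3; the catalog entries and field names are hypothetical.

```python
# Hypothetical catalog of machine learning model packages kept by the management center.
CATALOG = [
    {"model_id": "SA 001", "service": "application identification",
     "environment": "EPSN", "algorithm": "CNN", "framework": "Tensorflow",
     "security": "MPC", "package": "sa_001.zip"},
    {"model_id": "SA 002", "service": "application identification",
     "environment": "uCPE", "algorithm": "LSTM", "framework": "Pytorch",
     "security": "SHA256", "package": "sa_002.zip"},
]

def select_model(requirements):
    """Return the first catalog entry whose description matches every field
    of the machine learning model requirement information."""
    for entry in CATALOG:
        if all(entry.get(key) == value for key, value in requirements.items()):
            return entry
    return None

first_model = select_model({"service": "application identification",
                            "environment": "EPSN", "framework": "Tensorflow"})
```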
  • optionally, the second machine learning model is a machine learning model based on a first training framework.
  • the processing unit 1003 is further configured to convert the second machine learning model into a third machine learning model, where the third machine learning model is a machine learning model based on a second training framework, and the third machine learning model and the second machine learning model are machine learning models corresponding to the same model business information.
  • the receiving unit 1002 is further configured to receive the access permission information of the second machine learning model sent by the first federated learning server.
  • the sending unit 1001 is further configured to send the second machine learning model to the second federated learning server, and the second federated learning server belongs to the second management domain.
  • the receiving unit 1002 is further configured to receive a fourth machine learning model from the second federated learning server; the fourth machine learning model is obtained by the second federated learning server through federated learning with multiple federated learning clients in the second management domain, based on the second machine learning model and the local network service data of the second management domain.
  • the processing unit 1003 is further configured to replace the second machine learning model with the fourth machine learning model.
  • as an example, with reference to FIG. 9, the functions of the above-mentioned sending unit 1001 and receiving unit 1002 may be implemented through the communication interface 703.
  • the functions of the above processing unit 1003 can be implemented by the processor 701 invoking the program code in the memory 702.
  • FIG. 14 is a schematic structural diagram of a federated learning server provided by an embodiment of the present application.
  • the federated learning server 110 shown in FIG. 14 can be used to implement the functions of the federated learning server in the above method embodiments, and thus can also achieve the beneficial effects of the above method embodiments.
  • the federated learning server 110 may be the federated learning server as shown in FIG. 3 .
  • the federated learning server 110 belongs to the first management domain and is connected to the machine learning model management center.
  • the federated learning server 110 includes a transceiver unit 1101 and a processing unit 1102 .
  • the transceiver unit 1101 is configured to acquire the first machine learning model from the machine learning model management center.
  • the processing unit 1102 is configured to perform federated learning with multiple federated learning clients in the first management domain based on the first machine learning model and local network service data of the first management domain to obtain a second machine learning model.
  • the transceiver unit 1101 is further configured to send the second machine learning model to the machine learning model management center, so that the second machine learning model is used by devices in the second management domain.
  • the transceiver unit 1101 may be configured to perform the receiving step corresponding to S101, and to perform S104.
  • the processing unit 1102 may be configured to perform the steps performed by the federated learning server in S102.
  • the transceiver unit 1101 is specifically configured to: send the machine learning model requirement information to the machine learning model management center; and receive the first machine learning model determined by the machine learning model management center according to the machine learning model requirement information.
  • the machine learning model requirement information includes model business information corresponding to the machine learning model and/or machine learning model training requirements.
  • the machine learning model training requirements include at least one of the following: training environment, algorithm type, network structure, training framework, convergence algorithm or security model.
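  • Since the federated learning server and the machine learning model management center communicate over REST in this application, a minimal client-side sketch of "send requirement information, receive the selected model" could look as follows; the endpoint URL, paths and JSON field names are assumptions rather than part of this application.

```python
import requests

MGMT_CENTER = "https://mlm-center.example.com/api/v1"   # hypothetical endpoint

def fetch_first_model(requirements):
    """Send the machine learning model requirement information and download
    the model package the management center selects for it."""
    resp = requests.post(f"{MGMT_CENTER}/models/query", json=requirements, timeout=30)
    resp.raise_for_status()
    package_url = resp.json()["package_url"]      # assumed response field
    pkg = requests.get(package_url, timeout=60)
    pkg.raise_for_status()
    return pkg.content                            # model package bytes
```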
  • the transceiver unit 1101 is further configured to send the access permission information of the second machine learning model to the machine learning model management center.
  • the transceiver unit 1101 is further configured to send the second machine learning model to multiple federated learning clients.
  • the transceiver unit 1101 is specifically configured to, if the application effect of the second machine learning model satisfies a preset condition, send the second machine learning model to the machine learning model management center.
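  • A small sketch of this "application effect" gate, assuming the effect is the recognition rate (packets the model identifies divided by packets fed to it, the b/a ratio described for the application identification service) and a hypothetical preset target of 0.9.

```python
def application_effect(identified, total):
    """Recognition rate of the trial model on the clients' aggregated results."""
    return identified / total if total else 0.0

def maybe_upload(model, identified, total, upload, target_rate=0.9):
    """Send the second machine learning model to the management center only
    when its trial effect reaches the preset target (value hypothetical)."""
    if application_effect(identified, total) >= target_rate:
        upload(model)
        return True
    return False
```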
  • the transceiver unit 1101 is further configured to send the first machine learning model to the multiple federated learning clients in the first management domain, so that the multiple federated learning clients each perform federated learning based on the first machine learning model and the network service data obtained by each client, to obtain their respective intermediate machine learning models.
  • the processing unit 1102 is specifically configured to acquire multiple intermediate machine learning models obtained by multiple federated learning clients, and aggregate the multiple intermediate machine learning models to obtain a second machine learning model.
  • as an example, with reference to FIG. 9, the functions of the above-mentioned transceiver unit 1101 may be implemented through the communication interface 703.
  • the functions of the above processing unit 1102 can be implemented by the processor 701 invoking the program code in the memory 702.
  • another embodiment of this application further provides a machine learning model management apparatus, including a processor and a memory, where the memory is used to store computer programs and instructions, and the processor is used to invoke the computer programs and instructions to execute the corresponding steps performed by the machine learning model management center in the method processes shown in the above method embodiments.
  • another embodiment of this application further provides a machine learning model management apparatus, including a processor and a memory, where the memory is used to store computer programs and instructions, and the processor is used to invoke the computer programs and instructions to execute the corresponding steps performed by the federated learning server in the method processes shown in the above method embodiments.
  • another embodiment of this application further provides a machine learning model management apparatus, including a processor and a memory, where the memory is used to store computer programs and instructions, and the processor is used to invoke the computer programs and instructions to execute the corresponding steps performed by the federated learning client in the method processes shown in the above method embodiments.
  • Another embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium.
  • when the instructions are run on a terminal, the terminal performs the corresponding steps performed by the machine learning model management center, the first federated learning server, or the federated learning client in the method processes shown in the above method embodiments.
  • the disclosed methods may be implemented as computer program instructions encoded in a machine-readable format on a computer-readable storage medium or on other non-transitory media or articles of manufacture.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof.
  • when a software program is used, they may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • when the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (e.g. coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g. infrared, radio, microwave) manner.
  • a computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, integrating one or more available media.
  • the available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid state disks (SSDs)), and the like.

Abstract

A machine learning model management method, apparatus and system, relating to the technical field of machine learning. The method is applied to a federated learning server, and the federated learning server belongs to a first management domain. The method comprises: obtaining a first machine learning model from a machine learning model management center; performing federated learning with multiple federated learning clients in the first management domain based on the first machine learning model and local network service data of the first management domain, to obtain a second machine learning model; and sending the second machine learning model to the machine learning model management center, so that the second machine learning model is used by a device in a second management domain, which helps to save computing resources and improve the adaptability of the machine learning model.

Description

机器学习模型管理方法、装置和系统
本申请要求于2020年11月3日提交中国国家知识产权局、申请号为202011212838.9、申请名称为“机器学习模型管理方法、装置和系统”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及机器学习技术领域,尤其涉及机器学习模型管理方法、装置和系统。
背景技术
电信运营商网络作为信息通信的基础设施,是一个需要高度智能化、自动化的自治系统。机器学习模型能够提供强大分析、判断、预测等能力,因此,将机器学习模型应用于电信运营商网络的规划、建设、维护、运行和优化等工作中,成为业界的研究热点。
联邦学习(federated learning)是一种分布式机器学习技术。如图1所示,每个联邦学习客户端(federated learning client,FLC),如联邦学习客户端1、2、3……k,利用本地计算资源和本地网络业务数据进行模型训练,并将本地训练过程中产生的模型参数更新信息Δω,如Δω 1、Δω 2、Δω 3……Δω k,发送给联邦学习服务端(federated learning server,FLS)。联邦学习服务端基于模型更新参数采用汇聚算法进行模型汇聚,得到汇聚机器学习模型。汇聚机器学习模型作为联邦学习客户端下一次执行模型训练的初始模型。联邦学习客户端和联邦学习服务端多次执行上述模型训练过程,直到得到的汇聚机器学习模型满足预设条件时,停止训练。
由此可以看出,参与联邦学习的各方需要将各自的中间机器学习模型或者模型参数更新信息集中到一起,才能享受联邦学习的好处。然而,电信领域中,根据地区政策/法令或用户需求,电信运营商网络的网络业务数据(包括设备数据、设备所支撑的网络业务数据及相关用户数据等)需要隐私保护,不能泄露给第三方,由于基于中间机器学习模型可以经反推得到电信运营商网络的网络业务数据的特征,因此中间机器学习模型也不能泄露给第三方,由此,各电信运营商网络只能各自训练各自的联邦学习模型,这不仅因重复训练而浪费了计算资源,还因网络业务数据的局限性而降低了各个电信运营商网络的联邦学习模型的适应性。
例如,应用识别(service awareness)业务是网络的基础增值业务,电信运营商网络通过对应用报文或应用报文的统计数据进行识别,可以得到该应用报文属于何种应用类别(如属于A应用还是B应用等),后续,可以针对不同的应用进行不同的处理如计费、限流、带宽保障。以机器学习模型对应应用识别业务,即机器学习模型是应用识别机器学习模型为例,某电信运营商网络的网络设备中部署有联邦学习客户端,该联邦学习客户端基于该网络设备的应用报文或应用报文的统计数据,进行本地训练,得到中间机器学习模型,如果该中间机器学习模型或模型参数更新信息泄露给了第三方,第三方可以基于该中间机器学习模型或模型参数更新信息反推得到该网络设备的应用报文或应用报文的统计数据,而应用报文或应用报文的统计数据属于敏感数据,由此给该电信运营商网络带来了安全隐患。
发明内容
本申请提供了一种机器学习模型管理方法、装置和系统,从整体上节省计算资源,并且 有助于提高联邦学习模型的适应性。
第一方面,提供一种机器学习模型管理方法,应用于联邦学习服务端,联邦学习服务端归属于第一管理域,且与机器学习模型管理中心连接。该方法包括:首先,从机器学习模型管理中心获取第一机器学习模型。其次,使与多个联邦学习客户端各自基于第一机器学习模型和第一管理域的本地网络业务数据,与第一管理域中的多个联邦学习客户端进行联邦学习,得到第二机器学习模型。接着,向机器学习模型管理中心发送第二机器学习模型,以使第二机器学习模型被第二管理域中的设备使用。
该技术方案中,一个管理域中获得的机器学习模型可以被其他管理域中的设备使用。这样,不同管理域之间不需要重复训练机器学习模型,从整个社会角度看节省了计算资源。
另外,随着时间的推移,机器学习模型管理中心上的机器学习模型,能够集多个管理域中网络业务数据之长(即机器学习模型是间接地基于多个管理域中的网络业务数据经联邦学习得到),其适应性(adaptivity)相对于仅基于单个管理域的网络业务数据得到的机器学习模型可以有较大的提高,对于每一个管理域而言,后续输入更新颖、更复杂的网络业务数据执行的模型业务,也能获得较好的效果。
此外,每个管理域独立进行机器学习模型的训练,且联邦学习服务端从机器学习模型管理中心获取初始机器学习模型,因此,即使一个管理域中的联邦学习服务端发生故障,当故障恢复时,该联邦学习服务端依然可以从机器学习模型管理中心获取来自当前最新的用于共享的机器学习模型(即故障期间,机器学习模型管理中心结合其他联邦学习服务端获得的更新后的机器学习模型)作为初始机器学习模型,从而有助于减少联邦学习次数,加快机器学习模型的收敛速度。相比传统技术,本技术方案在联邦学习服务端故障恢复后,机器学习模型的收敛速度较快,恢复能力更强,也就是说,鲁棒性更好。
在一种可能的设计中,从机器学习模型管理中心获取第一机器学习模型,包括:向机器学习模型管理中心发送机器学习模型需求信息;接收机器学习模型管理中心根据机器学习模型需求信息确定的第一机器学习模型。
也就是说,联邦学习服务端对第一机器学习模型“随用随取”,这样,有助于节省联邦学习服务端所在设备的存储空间。
在一种可能的设计中,机器学习模型需求信息包括机器学习模型对应的模型业务信息和/或机器学习模型训练需求。
在一种可能的设计中,训练需求包括以下至少一项:训练环境、算法类型、网络结构、训练框架、汇聚算法或安全模型。
在一种可能的设计中,该方法还包括:向机器学习模型管理中心发送第二机器学习模型的访问权限信息。
也就是说,联邦学习服务端可以自主确定自身训练得到的机器学习模型的访问权限,即该机器学习模型可以被哪些联邦学习服务端使用。后续,机器学习模型管理中心可以对具有访问权限的联邦学习服务端提供第二机器学习模型。
在一种可能的设计中,该方法还包括:向多个联邦学习客户端发送第二机器学习模型。后续,该多个联邦学习客户端可以基于第二机器学习模型执行第二机器学习模型对应的模型业务。该可能的设计提供了第二机器学习模型的应用的示例。
例如,假设第二机器学习模型对应的模型业务是应用识别业务,则该多个联邦学习客户端可以基于第二机器学习模型进行应用识别。
在一种可能的设计中,向机器学习模型管理中心发送第二机器学习模型,包括:如果第二机器学习模型的应用效果满足预设条件,则向机器学习模型管理中心发送第二机器学习模型。
可选的,第二机器学习模型的应用效果满足预设条件,可以理解为:第二机器学习模型的应用效果达到预设目标。
这样,有助于提高联邦学习服务端发往机器学习模型管理中心的机器学习模型的精确度/准确率,从而进一步缩短其他联邦学习客户端使用该机器学习模型进行联邦学习时,机器学习模型的收敛时间。
在一种可能的设计中,基于第一机器学习模型和第一管理域的本地网络业务数据,与第一管理域中的多个联邦学习客户端进行联邦学习,得到第二机器学习模型,包括:向第一管理域中的多个联邦学习客户端发送第一机器学习模型,以使该多个联邦学习客户端分别基于第一机器学习模型和各自获取的网络业务数据进行联邦学习,得到各自的中间机器学习模型;获取该多个联邦学习客户端得到的多个中间机器学习模型,并基于该多个中间机器学习模型汇聚得到第二机器学习模型。
第二方面,提供一种机器学习模型管理方法,应用于机器学习模型管理中心,机器学习模型管理中心与第一联邦学习服务端连接,第一联邦学习服务端归属于第一管理域。该方法包括:首先,向第一联邦学习服务端发送第一机器学习模型。其次,从第一联邦学习服务端接收第二机器学习模型;其中,第二机器学习模型为第一联邦学习服务端基于第一机器学习模型和第一管理域的本地网络业务数据,与第一管理域中的多个联邦学习客户端进行联邦学习得到的。具体的,第二机器学习模型为第一联邦学习服务端以第一机器学习模型为初始机器学习模型,并基于第一管理域的本地网络业务数据,与第一管理域中的多个联邦学习客户端进行联邦学习得到的。接着,用第二机器学习模型替换第一机器学习模型,以使第二机器学习模型被第二管理域中的设备使用。
在一种可能的设计中,在向第一联邦学习服务端发送第一机器学习模型之前,该方法还包括:接收第一联邦学习服务端发送的机器学习模型需求信息;根据机器学习模型需求信息,确定第一机器学习模型。
在一种可能的设计中,机器学习模型需求信息包括机器学习模型对应的模型业务信息和/或机器学习模型训练需求。
在一种可能的设计中,训练需求包括以下至少一项:训练环境、算法类型、网络结构、训练框架、汇聚算法或安全模型。
在一种可能的设计中,第二机器学习模型是基于第一训练框架的机器学习模型。该方法还包括:将第二机器学习模型转换为第三机器学习模型;其中,第三机器学习模型是基于第二训练框架的机器学习模型,且第三机器学习模型和第二机器学习模型是同一模型业务信息对应的机器学习模型。
在一种可能的设计中,机器学习模型管理中心还存储有第五机器学习模型,第五机器学习模型是基于第二训练框架的机器学习模型,第五机器学习模型和第一机器学习模型是同一模型业务信息对应的机器学习模型。该方法还可以包括:用第三机器学习模型替换第五机器学习模型。这样,有助于使得针对同一模型业务信息的其他训练框架下的机器学习模型是最新的机器学习模型。
在一种可能的设计中,该方法还包括:接收第一联邦学习服务端发送的第二机器学习模 型的访问权限信息。
在一种可能的设计中,该方法还包括:向第二联邦学习服务端发送第二机器学习模型;其中,第二联邦学习服务端归属于第二管理域。从第二联邦学习服务端接收第四机器学习模型;其中,第四机器学习模型为第二联邦学习服务端基于第二机器学习模型和本地网络业务数据,与多个联邦学习客户端进行联邦学习得到的;用第四机器学习模型替换第二机器学习模型。该可能的设计提供了一种第二管理域中的设备使用第二机器学习模型的具体实现方式。
可以理解的是,第二方面提供的相应方法可以与第一方面提供的相应方法对应,因此,其所能达到的有益效果可参考对应的方法中的有益效果,此处不再赘述。
第三方面,提供一种联邦学习系统,包括:联邦学习服务端和多个联邦学习客户端。联邦学习服务端和该多个联邦学习客户端归属于第一管理域,且联邦学习服务端与机器学习模型管理中心连接。联邦学习服务端,用于从机器学习模型管理中心获取第一机器学习模型,并向该多个联邦学习客户端发送第一机器学习模型。该多个联邦学习客户端中的每个联邦学习客户端,用于基于第一机器学习模型和各自获取的网络业务数据,进行联邦学习,得到各自的中间机器学习模型。联邦学习服务端,还用于获取该多个联邦学习客户端得到的多个中间机器学习模型,并基于该多个中间机器学习模型汇聚得到第二机器学习模型,向机器学习模型管理中心发送第二机器学习模型,以使第二机器学习模型被第二管理域中的设备使用。
在一种可能的设计中,联邦学习服务端,还用于向多个联邦学习客户端发送第二机器学习模型。多个联邦学习客户端,还用于基于第二机器学习模型执行第二机器学习模型对应的模型业务。
第四方面,提供一种网络系统,包括:机器学习模型管理中心、联邦学习服务端和多个联邦学习客户端。联邦学习服务端和该多个联邦学习客户端归属于第一管理域,且联邦学习服务端与机器学习模型管理中心连接。机器学习模型管理中心,用于向联邦学习服务端发送第一机器学习模型。联邦学习服务端,用于向该多个联邦学习客户端发送第一机器学习模型。该多个联邦学习客户端中的每个联邦学习客户端,用于基于第一机器学习模型和各自获取的网络业务数据,进行联邦学习,得到各自的中间机器学习模型。联邦学习服务端,还用于获取该多个联邦学习客户端得到的多个中间机器学习模型,并基于该多个中间机器学习模型汇聚得到第二机器学习模型,向机器学习模型管理中心发送第二机器学习模型,以使第二机器学习模型被第二管理域中的设备使用。机器学习模型管理中心,还用于用第二机器学习模型替换第一机器学习模型。
在一种可能的设计中,联邦学习服务端还用于,向机器学习模型管理中心发送机器学习模型需求信息。机器学习模型管理中心还用于,根据机器学习模型需求信息,向联邦学习服务端发送第一机器学习模型。
在一种可能的设计中,机器学习模型需求信息包括机器学习模型对应的模型业务信息和/或机器学习模型训练需求。
在一种可能的设计中,训练需求包括以下至少一项:训练环境、算法类型、网络结构、训练框架、汇聚算法或安全模型。
在一种可能的设计中,第二机器学习模型是基于第一训练框架的机器学习模型。机器学习模型管理中心还用于,将第二机器学习模型转换为第三机器学习模型;其中,第三机器学习模型是基于第二训练框架的机器学习模型,且第三机器学习模型和第二机器学习模型是同一模型业务信息对应的机器学习模型。
在一种可能的设计中,机器学习模型管理中心还存储有第五机器学习模型,第五机器学习模型是基于第二训练框架的机器学习模型,第五机器学习模型和第一机器学习模型是同一模型业务信息对应的机器学习模型。机器学习模型管理中心还用于,用第三机器学习模型替换第五机器学习模型。
在一种可能的设计中,联邦学习服务端,还用于向多个联邦学习客户端发送第二机器学习模型。多个联邦学习客户端,还用于基于第二机器学习模型执行第二机器学习模型对应的模型业务。
第五方面,本申请提供了一种机器学习模型管理装置。该机器学习模型管理装置用于执行上述第一方面提供的任一种方法。该情况下,该机器学习模型管理装置具体可以是联邦学习服务端。
在一种可能的设计方式中,本申请可以根据上述第一方面提供的任一种方法,对该机器学习模型管理装置进行功能模块的划分。例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理模块中。
示例性的,本申请可以按照功能将该机器学习模型管理装置划分为收发单元和处理单元等。上述划分的各个功能模块执行的可能的技术方案和有益效果的描述均可以参考上述第一方面或其相应的可能的设计提供的技术方案,此处不再赘述。
在另一种可能的设计中,该机器学习模型管理装置包括:存储器和处理器,存储器和处理器耦合。存储器用于存储计算机指令,处理器用于调用该计算机指令,以执行如第一方面及其任一种可能的设计方式提供的任一种方法。
第六方面,本申请提供了一种机器学习模型管理装置。该机器学习模型管理装置用于执行上述第二方面提供的任一种方法。该情况下,该机器学习模型管理装置具体可以是机器学习模型管理中心。
在一种可能的设计方式中,本申请可以根据上述第二方面提供的任一种方法,对该机器学习模型管理装置进行功能模块的划分。例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理模块中。
示例性的,本申请可以按照功能将该机器学习模型管理装置划分为接收单元、发送单元和处理单元等。上述划分的各个功能模块执行的可能的技术方案和有益效果的描述均可以参考上述第二方面或其相应的可能的设计提供的技术方案,此处不再赘述。
在另一种可能的设计中,该机器学习模型管理装置包括:存储器和处理器,存储器和处理器耦合。存储器用于存储计算机指令,处理器用于调用该计算机指令,以执行如第二方面及其任一种可能的设计方式提供的任一种方法。
第七方面,本申请提供了一种计算机可读存储介质,如计算机非瞬态的可读存储介质。其上储存有计算机程序(或指令),当该计算机程序(或指令)在计算机设备上运行时,使得该计算机设备执行上述第一方面或第二方面中的任一种可能的实现方式提供的任一种方法。
第八方面,本申请提供了一种计算机程序产品,当其在计算机设备上运行时,使得第一方面或第二方面中的任一种可能的实现方式提供的任一种方法被执行。
第九方面,本申请提供了一种芯片系统,包括:处理器,处理器用于从存储器中调用并运行该存储器中存储的计算机程序,执行第一方面中或第二方面的实现方式提供的任一种方法。
可以理解的是,在上述第五方面的另一种可能的设计、第六方面的另一种可能的设计、 第七至第九方面提供的任何一种技术方案中:上述第一方面或第二方面中的发送动作,具体可以替换为在处理器的控制下发送;上述第一方面或第二方面中的接收动作,具体可以替换为在处理器的控制下接收。
可以理解的是,上述提供的任一种系统、装置、计算机存储介质、计算机程序产品或芯片系统等均可以应用于第一方面或第二方面提供的对应的方法,因此,其所能达到的有益效果可参考对应的方法中的有益效果,此处不再赘述。
在本申请中,上述任一种装置的名字对设备或功能模块本身不构成限定,在实际实现中,这些设备或功能模块可以以其他名称出现。只要各个设备或功能模块的功能和本申请类似,属于本申请权利要求及其等同技术的范围之内。
本申请的这些方面或其他方面在以下的描述中会更加简明易懂。
附图说明
图1为适用于本申请实施例的一种联邦学习系统的结构示意图;
图2为本申请实施例提供的一种网络系统的结构示意图;
图3为本申请实施例提供的一种机器学习模型管理系统的结构示意图;
图4为本申请实施例提供的一种将机器学习模型管理系统应用到网络系统中的系统结构示意图;
图5为本申请实施例提供的一种公有云的逻辑结构示意图;
图6为本申请实施例提供的一种管控系统的逻辑结构示意图;
图7为本申请实施例提供的一种网络设备的逻辑结构示意图;
图8为本申请实施例提供的另一种将机器学习模型管理系统应用到网络系统中的系统结构示意图;
图9为本申请实施例提供的一种计算机设备的硬件结构示意图;
图10为本申请实施例提供的一种机器学习模型管理方法的交互示意图;
图11为本申请实施例提供的一种联邦学习过程的流程示意图;
图12为本申请实施例提供的另一种机器学习模型管理方法的交互示意图;
图13为本申请实施例提供的一种机器学习模型管理中心的结构示意图;
图14为本申请实施例提供的一种联邦学习服务端的结构示意图。
具体实施方式
以下,说明本申请实施例中所涉及的一些术语和技术:
1)、网络业务、网络业务数据、模型业务、模型业务信息
网络业务,是指基于网络或网络设备所能提供的通信服务。例如,宽带业务、网络切片业务、虚拟网络业务等。
网络业务数据,是指网络业务在运行中产生的数据或所产生数据的相关数据。例如,应用报文本身、应用报文的统计数据(如报文的丢包率等)、故障告警信息等。
模型业务,是指基于机器学习模型和网络业务数据所能提供的业务,如应用识别业务、故障跟踪预测业务、关键绩效指标(key performance indicator,KPI)异常检测(abnormal detection)业务等。如果机器学习模型对应的模型业务是应用识别业务,则相应的网络业务数据包括应用报文、应用报文的统计数据。如果机器学习模型对应的模型业务是故障跟踪预测业务,则相应的网络业务数据包括故障告警信息。
模型业务信息,是指模型业务的有关信息,包括模型业务的标识或类型等。
2)、机器学习、机器学习模型、机器学习模型文件
机器学习,是使用算法来解析数据、从中学习,然后对真实世界中的事件做出决策和预测。机器学习是用大量的数据来“训练”,通过各种算法从数据中学习如何完成某模型业务。
在一些示例中,机器学习模型是包含用于完成某模型业务所采用的算法实现代码和参数的文件。其中,算法实现代码用于描述机器学习模型的模型结构,参数用于描述机器学习模型各构成部分的属性。为了方便描述,下文中将该文件称为机器学习模型文件。例如,下文中发送机器学习模型具体是指发送机器学习模型文件。
在另一些示例中,机器学习模型是完成某模型业务的逻辑功能模块。例如,将输入参数的值输入到机器学习模型,得到该机器学习模型的输出参数的值。
机器学习模型包括人工智能(artificial intelligence,AI)模型如神经网络模型等。
需要说明的是,在本申请实施例中,基于联邦学习系统训练得到的机器学习模型,也可以被称为联邦学习模型。
3)、机器学习模型包
机器学习模型包,包含机器学习模型本身(即机器学习模型文件)和机器学习模型的说明文件。其中,机器学习模型的说明文件可以包括:机器学习模型的描述信息和机器学习模型的运行脚本等。
其中,机器学习模型的描述信息,是指用于描述机器学习模型的信息。
可选的,机器学习模型的描述信息可以包括:机器学习模型对应的模型业务信息和机器学习模型训练需求中的至少一种。
模型业务信息可以包括模型业务类型或者模型业务标识。
例如,如果一个机器学习模型用于进行应用(application)识别,则该机器学习模型对应的模型业务类型,是应用识别类型。又如,如果一个机器学习模型用于进行故障预测,则该机器学习模型对应的模型业务类型是故障预测类型。
属于同一模型业务类型的不同模型业务,具有不同的模型业务标识。例如,机器学习模型1对应于应用识别业务1,用于识别应用集合A中的应用,机器学习模型2对应于应用识别业务2,用于识别应用集合B中的应用。其中,应用集合A和应用集合B不同。
训练需求可以包括以下至少一种:训练环境、算法类型、网络结构、训练框架、汇聚算法或安全模型等。
训练环境,是训练机器学习模型的设备的类型。以本申请实施例提供的技术方案应用于核心网为例,训练环境可以包括:外部PCEF支持节点(external PCEF support node,EPSN)、通用客户端设备(universal customer premise equipment,uCPE),或IP多媒体子系统(IP multimedia subsystem,IMS)等。其中,PCEF是策略及计费执行功能(policy and charging enforcement function)的英文缩写。IP是网络之间互连的协议(internet protocol)的英文缩写。
算法类型,是训练机器学习模型所使用的算法的类型,例如,神经网络、线性回归等。进一步的,神经网络的类型可以包括:卷积神经网络(convolutional neural networks,CNN)、长短期记忆网络(long short-term memory,LSTM)、或循环神经网络(recurrent neural network,RNN)等。
网络结构,是机器学习模型对应的网络结构。以算法类型是神经网络为例,网络结构可以包括:输入层(input)的特征(如维度等)、输出层(output)的特征(如维度等)和隐 藏层(hiden layers)的特征(如维度等)等。
训练框架,也可以被称为机器学习框架(machine learning platform),是训练机器学习模型所使用的训练框架,具体是整合包括机器学习算法在内的所有机器学习的系统或方法,包括数据表示与数据处理的方法、数据表示和建立机器学习模型的方法、评价和使用建模结果的方法等。以算法类型是神经网络为例,训练框架可以包括:用于快速特征嵌入的卷积结构(convolutional architecture for fast feature embedding,Caffe)训练框架、Tensorflow训练框架或Pytorch训练框架等。
汇聚算法,是训练机器学习模型所使用的算法,具体是联邦学习系统进行模型训练的过程中,由联邦学习服务端对多个中间机器学习模型进行模型汇聚的过程中,所使用的算法。例如,汇聚算法可以包括:加权平均算法或联邦随机方差折减梯度(federated stochastic variance reduced gradient,FSVRG)算法等。
安全模式,是机器学习模型传输的过程中所使用的安全手段(如加密算法等)。可选的,安全模式需求可以包括是否使用安全模式。进一步可选的,如果使用安全模式,则具体使用哪一种安全模式。例如,该安全模式可以包括:多方安全计算(secure multi-party computation,MPC),或安全哈希算法(secure hash algorithm,SHA)256等。
可选的,机器学习模型的说明文件还可以包括:机器学习模型的访问权限,和/或机器学习模型的计费策略等。其中,访问权限可以替换为共享权限。
可选的,机器学习模型的访问权限可以包括机器学习模型是否能被共享,即是否可以被其他联邦学习服务端使用。进一步可选的,机器学习模型的访问权限还可以包括:如果机器学习模型可以被共享,则机器学习模型具体可以被哪个或哪些联邦学习服务端使用,和/或不能被哪个或哪些联邦学习服务端使用。
机器学习模型的计费策略,是指使用机器学习模型需要遵循的付费策略。
4)、训练样本和测试样本
在机器学习中,样本包括训练样本和测试样本。其中,训练样本是用来训练机器学习模型所使用的样本。测试样本是用来测试机器学习模型的测量误差(或准确率)所使用的样本。
5)、其他术语
在本申请实施例中,“示例性的”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性的”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用“示例性的”或者“例如”等词旨在以具体方式呈现相关概念。
在本申请的实施例中,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。在本申请的描述中,除非另有说明,“多个”的含义是两个或两个以上。
本申请中术语“至少一个”的含义是指一个或多个,本申请中术语“多个”的含义是指两个或两个以上,例如,多个第二报文是指两个或两个以上的第二报文。
应理解,在本文中对各种所述示例的描述中所使用的术语只是为了描述特定示例,而并非旨在进行限制。如在对各种所述示例的描述和所附权利要求书中所使用的那样,单数形式“一个(“a”,“an”)”和“该”旨在也包括复数形式,除非上下文另外明确地指示。
还应理解,本文中所使用的术语“和/或”是指并且涵盖相关联的所列出的项目中的一个 或多个项目的任何和全部可能的组合。术语“和/或”,是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本申请中的字符“/”,一般表示前后关联对象是一种“或”的关系。
还应理解,在本申请的各个实施例中,各个过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
应理解,根据A确定B并不意味着仅仅根据A确定B,还可以根据A和/或其它信息确定B。
还应理解,术语“包括”(也称“includes”、“including”、“comprises”和/或“comprising”)当在本说明书中使用时指定存在所陈述的特征、整数、步骤、操作、元素、和/或部件,但是并不排除存在或添加一个或多个其他特征、整数、步骤、操作、元素、部件、和/或其分组。
还应理解,术语“如果”可被解释为意指“当...时”(“when”或“upon”)或“响应于确定”或“响应于检测到”。类似地,根据上下文,短语“如果确定...”或“如果检测到[所陈述的条件或事件]”可被解释为意指“在确定...时”或“响应于确定...”或“在检测到[所陈述的条件或事件]时”或“响应于检测到[所陈述的条件或事件]”。
应理解,说明书通篇中提到的“一个实施例”、“一实施例”、“一种可能的实现方式”意味着与实施例或实现方式有关的特定特征、结构或特性包括在本申请的至少一个实施例中。因此,在整个说明书各处出现的“在一个实施例中”或“在一实施例中”、“一种可能的实现方式”未必一定指相同的实施例。此外,这些特定的特征、结构或特性可以任意适合的方式结合在一个或多个实施例中。
还应理解,本申请实施例中提到的“连接”,可以是直接连接,也可以是间接连接,可以是有线连接,也可以是无线连接,也就是说,本申请实施例对设备之间的连接方式不作限定。
以下,结合附图对本申请实施例提供的技术方案进行说明。
如图2所示,为本申请实施例提供的一种网络系统30的结构示意图。图2所示的网络系统30可以包括云平台(提供计算资源服务,可用于部署网络应用,图中未示出)以及与云平台连接的至少两个管理域,如管理域1、管理域2等。其中,云平台可以是公有云301,也可以是其他类型的云平台,还可以是其他可以部署网络应用的网络平台;为叙述方便,下文中均以公有云301为例进行说明。
示例的,管理域可以是电信运营商网络、虚拟运营商网络、企业网络(如银行、政府、大型企业等行业的网络系统)和园区网络等。
可选地,不同管理域之间在安全上是隔离的,相互之间不共享“网络业务数据”和“中间机器学习模型”。关于中间机器学习模型的定义可以参考下文S102的相关描述。不同管理域可以是不同的电信运营商网络、不同的虚拟运营商网络、不同的企业网络等。
在一些实施例中,一个管理域包括一个或多个管控系统,如图2中的管控系统302-1、管控系统302-2和管控系统302-3,以及与每个管控系统连接的一个或多个网络设备,如图2中的管控系统302-1连接网络设备303-11和网络设备303-12,管控系统302-2连接网络设备303-21和网络设备303-22,管控系统302-3连接网络设备303-31和网络设备303-32。
其中,管控系统和网络设备之间可以直接或间接连接。
公有云301,可以由网络设备制造商建设运营,也可以由其他第三方厂商建设运营,以云 服务方式与各管理域通信。
管控系统,负责实现对单个管理域的全生命周期进行管理和维护。例如,管控系统可以是核心网管控系统、接入网管控系统、传送网管控系统等。例如,当管理域具体是电信运营商网络时,管控系统可以是网络管理系统(network management system,NMS)、网元管理系统(network element management system,EMS)、或者运营支撑系统(operation support system,OSS)。
可选的,同一管理域中可能会存在多个管控系统,以管理不同的子管理域。同一区域(如城市的同一个区等)的管理域被称为一个子管理域,例如,城市的同一个区的电信运营商网络作为一个子电信运营商网络。不同子管理域在地域上是隔离的。
例如,图2中的管理域2中包含子管理域1和子管理域2等,每个子管理域包含一个管控系统,每个管控系统连接多个网络设备。
网络设备,负责向管控系统上报其所在管理域或子管理域中的网络业务数据,如网络设备的告警数据、网络设备的性能指标、网络设备的运行日志、网络设备的话统(traffic statistics)等,并执行管控系统下发的管理控制指令。例如,网络设备可以是路有器、交换机、光线路终端(optical line terminal,OLT)、基站、核心网设备等。网络设备可以具备进行机器学习模型训练的计算资源和算法环境,具备数据存储和处理能力(如执行机器学习模型的训练能力等)。
图1为适用于本申请实施例的一种联邦学习系统的结构示意图。联邦学习系统包括联邦学习服务端,以及与联邦学习服务端直接或间接连接的多个联邦学习客户端。
联邦学习服务端与多个联邦学习客户端配合以进行联邦学习的过程,可以参考图11对应的方法流程。另外,联邦学习服务端还承担联邦学习系统的管理工作,例如参与训练的联邦学习客户端的确定、训练实例生成、通信安全、隐私保护、训练系统可靠性保障等。
上述联邦学习服务端和联邦学习客户端具体可以是逻辑功能模块。
如图3所示,为本申请实施例提供的一种机器学习模型管理系统40的结构示意图。
图3所示的机器学习模型管理系统40包括机器学习模型管理中心401,以及与机器学习模型管理中心401连接的多个联邦学习系统,如联邦学习系统402-1和联邦学习系统402-2,其中,每一个联邦学习系统可以包括联邦学习服务端403和联邦学习客户端1-k。
机器学习模型管理中心401,用于管理并向联邦学习系统提供机器学习模型。其中,机器学习模型管理中心401所管理的机器学习模型可以被多个联邦学习系统使用。管理机器学习模型,可以包括:按照联邦学习系统中的机器学习模型规范生成机器学习模型包,并为机器学习模型包生成签名文件等,以及,将机器学习模型包存放在机器学习模型市场,以供其他联邦学习服务端下载使用。
如表1所示,为本申请实施例提供的一种机器学习模型市场中的机器学习模型包的示例。
表1
每个联邦学习系统,用于基于机器学习模型管理中心401下发的机器学习模型(即第一机器学习模型),进行联邦学习,并向机器学习模型管理中心401上报联邦学习结果(即第二机器学习模型)。
在一个示例中,机器学习模型管理中心401与联邦学习系统中的联邦学习服务端之间可以通过表述性状态传递(representational state transfer,REST)协议通信。
上述机器学习模型管理中心401具体可以是逻辑功能模块。
如图4所示,为本申请实施例提供的一种将机器学习模型管理系统40应用到网络系统30中的系统结构示意图。
图4中,机器学习模型管理中心401部署在公有云301上,以向不同的管理域提供机器学习模型管理服务。此外,如图5所示,公有云301还可以包括:
机器学习模型训练平台301A:用于提供机器学习模型训练所需要的计算资源、机器学习算法框架、训练算法和机器学习模型调试工具等。并且,提供机器学习模型训练所需要的数据治理、特征工程、机器学习算法选择,机器学习模型参数优化、机器学习模型评估和测试等功能。例如,管理域可以在机器学习模型训练平台301A上完成某一模型业务对应的机器学习模型(如汇聚机器学习模型)的训练。
安全通信模块301B:用于提供公有云301与管控系统之间的安全通信能力。例如,安全通信模块301B用于对公有云301与管控系统之间传输的信息进行加密等。
机器学习模型管理中心401部署在公有云301上后,机器学习模型管理中心401可以复用公有云301的资源,如计算资源和/或通信资源等。例如,结合图5,机器学习模型管理中心401可以在机器学习模型训练平台301A上完成某一模型业务对应的机器学习模型的训练,然后,将该机器学习模型提供给其他管理域。又如,结合图5,安全通信模块301B用于提供公有云301中的机器学习模型管理中心401与管控系统中的机器学习服务端之间进行安全通信能力,例如,安全通信模块301B用于对机器学习模型管理中心401与机器学习服务端之间传输的机器学习模型(或机器学习模型包)进行加密等。
图4中,联邦学习服务端部署在管控系统上,如联邦学习服务端403-1部署在管控系统302-2上,联邦学习服务端403-2部署在管控系统302-3上。此外,如图6所示,管控系统(如管控系统302-2)还可以包括:
管控基础平台302A:用于向联邦学习客户端提供计算资源、通信资源,以及对外管理控制界面。并且,还用于提供其他软件系统能力。
管控北向接口302B:用于管控系统与公有云301通信。
管控南向接口302C:用于管控系统与网络设备通信。管控南向接口302C可以包括:谷歌远程过程调用协议RPC(google remote procedure call protocol,gRPC)接口、表现层状态转移(representational state transfer,REST)接口等。
安全通信模块302D:用于提供管控系统与公有云301之间的安全通信能力,以及管控系统与网络设备之间的安全通信能力。例如,安全通信模块302D用于对管控系统与公有云301之间传输的信息进行加密等,以及,对管控系统与网络设备之间传输的信息进行加密等。
需要说明的是,安全通信模块302D可以包括第一子模块和第二子模块,第一子模块用于 提供管控系统与公有云301之间的安全通信能力,第一子模块的功能与上述安全通信模块301B的功能相对应。第二子模块用于提供管控系统与网络设备之间的安全通信能力,第二子模块的功能与下述安全通信模块303C的功能相对应。
联邦学习服务端部署在管控系统上后,联邦学习服务端可以复用管控系统的资源,如计算资源和/或通信资源等。例如,结合图6,管控基础平台302A用于向联邦学习服务端提供运行所需要的计算资源、通信资源,同时向联邦学习服务端提供对外管理控制界面。用户可以在该控制界面上实现对联邦学习系统的管理和配置。管控基础平台302A还用于向联邦学习服务端提供所需要的其他软件系统能力,如用户认证、安全证书、权限管理等。再如,结合图6,联邦学习服务端可以通过管控北向接口302B与公有云301中的机器学习模型管理中心401通信。又如,结合图6,联邦学习服务端可以通过管控南向接口302C与网络设备中的机器学习客户端通信。
图4中,联邦学习客户端部署在网络设备上,如联邦学习客户端404-1部署在网络设备303-21上,联邦学习客户端404-2部署在网络设备303-22上,联邦学习客户端404-3部署在网络设备303-31上,联邦学习客户端404-4部署在网络设备303-32上。此外,如图7所示,网络设备(如网络设备303-21)还可以包括:
本地训练模块303A:具备本地训练的计算能力、本地数据处理能力,以及训练算法框架如Tensorflow、Caffe等。
网络业务模块303B:用于执行网络设备的网络业务处理流程。其中,网络业务处理流程中的控制信息可以来自机器学习模型的推理结果(即机器学习模型的输出信息),如按照推理结果执行报文转发等功能。网络业务模块303B还需要将网络业务运行过程中产生的网络业务数据如性能指标、告警数据等发送给本地训练模块303A,以使得本地训练模块303A进行模型的更新和优化。
安全通信模块303C:用于提供管控系统与网络设备之间的安全通信能力。例如,安全通信模块303C用于对管控系统与网络设备之间传输的信息进行加密等。
联邦学习客户端部署在网络设备上后,联邦学习客户端可以复用网络设备401的资源,如计算资源和/或通信资源等。例如,结合图7,联邦学习客户端承担本地模型训练模块303A与联邦学习服务端之间的通信接口和管理接口。具体的:构建与联邦学习服务端之间的安全通信、下载汇聚机器学习模型、上传中间机器学习模型与初始机器学习模型之间的参数更新信息、应用训练策略协同控制本地算法等。作为联邦学习系统在本地的代理,联邦学习客户端承担本地管理功能,包括联邦学习本地节点的系统接入、安全认证、启动加载等功能。同时还承担联邦学习本地节点安全隐私保护的功能,包括数据加密、隐私保护、多方计算等。
可选的,管控系统还可以包括本地训练模块303A,该情况下,网络设备中可以不包含本地训练模块303A。
如图8所示,为本申请实施例提供的另一种将机器学习模型管理系统40应用到网络系统30中的系统结构示意图。在图8中,机器学习模型管理中心401部署在公有云301上。联邦学习服务端和联邦学习客户端均部署在管控系统上。如联邦学习服务端403-1和联邦学习客户端404-1部署在管控系统302-2上,联邦学习客户端404-2部署在管控系统302-3上。
图8所示的系统适用于一个管理域(如图2中的管理域2)包含多个子管理域的场景中。该场景中,可选的,一个管理域中的其中一个管控系统上部署机器学习服务端,其他管控系统上部署与该机器学习服务端连接的机器学习客户端。进一步可选的,部署有机器学习服务端 的管控系统上可以部署也可以不部署,与该机器学习服务端连接的机器学习客户端。
图8所示的网络架构中,管控系统负责子管理域中管理域级(或网络级)的信息管理,如管理域级故障分析、管理域级的优化策略、管理域级的容量管理等。管控系统可以基于子管理域级粒度的训练样本进行本地训练。典型的应用如管理域级的指标异常检测,具体的,管控系统对所管理各网络设备上报的性能指标进行统一训练,生成管理域级的指标异常检测机器学习模型。
相比图4所示的网络架构,图8所示的网络架构中,联邦中心服务端和联邦中心客户端可以部署在同一管控系统上,并且,由基于网络设备粒度的模型训练扩展到了基于管理域级粒度的模型训练。扩展了联邦学习的适用范围,适用于针对管理域级的机器学习模型的训练和优化。并且有助于解决基于网络设备粒度的模型训练,因训练样本量小而导致的训练所得到的机器学习模型的精确度不足、泛化能力不足、数据收集时间长人工投入大等问题。
需要说明的是,图4和图8也可以结合使用,例如,部分管理域中基于网络设备粒度进行模型训练,部分管理域中基于管理域级粒度进行模型训练,从而构成新的实施例。
图4和图8所示的网络架构中,与传统技术相比,一方面,机器学习模型的训练过程均在管理域内部完成,网络业务数据不发送到管理域之外,因此,有助于提高管理域中的网络业务数据的安全性。另一方面,实现了管理域之间的机器学习模型共享,这样,不同管理域之间不需要重复训练机器学习模型,从整个社会角度看节省了计算资源,且对每个管理域(如电信运营商网络)来讲,降低了管理域的建设成本和维护成本。
另外,在互相共享的情况下,不同管理域可以基于最新的机器学习模型开始执行联邦学习,这样,节省了数据样本采集、数据治理、特征工程、模型训练、模型测试等处理过程,从而大大缩短了模型更新周期。
此外,图4和图8所示的网络架构中,联邦学习系统可以共用现有网络系统中的资源(如计算资源,通信资源等),这有助于降低因引入联邦学习系统而导致的网络系统的变更,并且,不需要附加通信安全管理措施如不需要增加防火墙、跳板机等。
需要说明的是,上文提供的任意一种联邦学习系统和网络架构中的各设备/功能模块的相关功能可以参考下文提供的机器学习模型管理方法(如图10所示的机器学习模型管理方法)中的相关步骤,此处不再赘述。
在一些实施例中,一个管理域包括一个或多个网络设备。由公有云301对该一个或多个网络设备进行控制。基于该实施例,公有云301上可以部署机器学习模型管理中心,网络设备上可以部署联邦学习客户端。机器学习模型管理中心用于直接对联邦学习客户端进行控制,而不需要经过联邦学习服务端。
在硬件实现上,公有云301或管控系统均可以由一个设备实现,或者由多个设备协同实现。本申请实施例对此不进行限定。
如图9所示,为本申请实施例提供的一种计算机设备70的硬件结构示意图。该计算机设备70可以用于实现部署有机器学习模型管理中心401、机器学习服务端、或机器学习客户端的设备的功能。例如,该计算机设备70可以用于实现上述公有云301或管控系统的部分或全部功能,也可以用于实现上述网络设备的功能。
图9所示的计算机设备70可以包括:处理器701、存储器702、通信接口703以及总线704。处理器701、存储器702以及通信接口703之间可以通过总线704连接。
处理器701是生成计算机设备70的控制中心,可以是一个通用中央处理单元(central  processing unit,CPU),也可以是其他通用处理器等。其中,通用处理器可以是微处理器或者是任何常规的处理器等。
作为一个示例,处理器701可以包括一个或多个CPU,例如图9中所示的CPU 0和CPU 1。
存储器702可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其他类型的动态存储设备,也可以是电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。
一种可能的实现方式中,存储器702可以独立于处理器701存在。存储器702可以通过总线704与处理器701相连接,用于存储数据、指令或者程序代码。处理器701调用并执行存储器702中存储的指令或程序代码时,能够实现本申请实施例提供的机器学习模型管理方法,例如,图10所示的机器学习模型管理方法。
另一种可能的实现方式中,存储器702也可以和处理器701集成在一起。
通信接口703,用于计算机设备70与其他设备通过通信网络连接,所述通信网络可以是以太网,无线接入网(radio access network,RAN),无线局域网(wireless local area networks,WLAN)等。通信接口703可以包括用于接收数据的接收单元,以及用于发送数据的发送单元。
总线704,可以是工业标准体系结构(industry standard architecture,ISA)总线、外部设备互连(peripheral component interconnect,PCI)总线或扩展工业标准体系结构(extended industry standard architecture,EISA)总线等。该总线可以分为地址总线、数据总线、控制总线等。为便于表示,图9中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
需要指出的是,图9中示出的结构并不构成对计算机设备70的限定,除图9所示部件之外,计算机设备70可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
如图10所示,为本申请实施例提供的一种机器学习模型管理方法的交互示意图。
图10所示的方法可以应用于如图3所示的机器学习模型管理系统40中,该机器学习模型管理系统40可以部署在如图4或图8所示的网络系统30中。
图10所示的方法可以包括以下步骤S101-S105:
S101:机器学习模型管理中心向第一联邦学习服务端发送第一机器学习模型。其中,第一联邦学习服务端可以是与机器学习模型管理中心连接的任意一个联邦学习服务端。
第一机器学习模型,是机器学习模型管理中心存储的某一模型业务对应的机器学习模型。该模型业务对应的机器学习模型是可以更新的。第一机器学习模型可以是机器学习模型管理中心存储的该模型业务对应的首个机器学习模型,也可以是机器学习模型管理中心存储的该模型业务对应的非首个机器学习模型。其具体示例可以参考图11所示的实施例。
可选的,S101可以包括:机器学习模型管理中心向第一联邦学习服务端发送机器学习模型包,该机器学习模型包包含第一机器学习模型(即模型文件)。
进一步可选的,机器学习模型管理中心基于REST协议向第一联邦学习服务端发送该机器学习模型包。另外,该机器学习模型包还可以包括第一机器学习模型的说明文件。
可选的,机器学习模型管理中心向第一联邦学习服务端发送的机器学习模型包可以是经加密和/或加扰等安全处理操作后的机器学习模型包,以降低机器学习模型包在传输过程中被窃取和被修改的风险,从而提高机器学习模型包的安全性。
基于此,第一联邦学习服务端在接收到经加密的机器学习模型包之后,可以对该经加密的机器学习模型包进行解码。第一联邦学习服务端在接收到经加扰的机器学习模型包之后,可以对该经加扰的机器学习模型包进行解扰。
可选的,在由机器学习模型管理中心向第一联邦学习服务端传输机器学习模型包的过程中可以采用安全专线等,以降低机器学习模型包在传输过程中被窃取和被修改的风险,从而提高机器学习模型包的安全性。
本申请实施例对S101的触发条件不进行限定,以下列举两种实现方式。
方式1:机器学习模型管理中心在第一联邦学习服务端的请求下,向第一联邦学习服务端发送第一机器学习模型。
具体的,第一联邦学习服务端向机器学习模型管理中心发送机器学习模型需求信息,例如,第一联邦学习服务端基于REST协议向机器学习模型管理中心发送机器学习模型需求信息。然后,机器学习模型管理中心根据该机器学习模型需求信息,确定第一机器学习模型。
也就是说,第一联邦学习服务端对第一机器学习模型“随用随取(即第一联邦学习服务器需要第一机器学习模型时,才向第一模型管理中心获取第一机器学习模型)”,这样,有助于节省第一联邦学习服务端所在设备的存储空间。
可选的,机器学习模型需求信息包括机器学习模型对应的模型业务信息和/或机器学习模型训练需求。
可选的,机器学习模型训练需求包括以下至少一项:训练环境、算法类型、网络结构、训练框架、汇聚算法或安全模型。
机器学习模型管理中心可以维护机器学习模型的标识与机器学习模型的描述信息之间的对应关系。本申请实施例对该对应关系的具体体现形式不进行限定,例如,机器学习模型管理中心可以以表格等形式表示该对应关系。
如表2所示,为本申请实施例提供的一种机器学习模型的标识与机器学习模型的描述信息之间的对应关系的示例。
表2
示例的,机器学习模型管理中心可以基于第一联邦学习服务端发送的机器学习模型需求,通过查找“机器学习模型的标识与机器学习模型的描述信息之间的对应关系”,如查表2,确定第一机器学习模型的标识;再通过查找“机器学习模型的标识与机器学习模型包之间的对应关系”,如查表1,得到第一机器学习模型的模型包。然后,向第一联邦学习服务端发送第一机器学习模型的模型包。
需要说明的是,由于机器学习模型的说明文件包含机器学习模型的描述信息,因此,在一个示例中,表2本质上可以认为是表1的一部分。
如表3所示,为表2所示的机器学习模型的标识与机器学习模型的描述信息之间的对应关系的具体示例。
表3
其中,“输入层:100,输出层:300,隐藏层:5”表示神经网络的输入层、输出层和隐藏层的维度分别为100、300和5。其他网络结构的解释与此类似,此处不再赘述。
在一个示例中,假设第一联邦学习服务端确定机器学习模型训练需求为:机器学习模型对应的模型业务信息是应用识别类型,机器学习模型的训练环境是EPSN,训练机器学习模型所使用的算法类型是CNN,CNN的结构为“输入层:100,输出层:300,隐藏层:5”,CNN的训练框架是Tensorflow,机器学习模型的安全模式需求是MPC;那么,基于表3可知,机器学习模型管理中心所确定的第一机器学习模型是“SA 001”所指示的机器学习模型。
方式2:机器学习模型管理中心主动向第一联邦学习服务端推送第一机器学习模型。
这样,第一联邦学习服务端在需要使用第一机器学习模型时,可以直接从本地获取,而不需要再从机器学习模型管理中心请求,因此,有助于节省获取第一机器学习模型的时间。
本申请实施例对机器学习模型管理中心主动向第一联邦学习服务端推送第一机器学习模型的触发条件不进行限定。例如,机器学习模型管理中心可以在使用新的机器学习模型替换了第一机器学习模型之后,主动向第一联邦学习服务端推送替换后的第一机器学习模型。又 如,机器学习模型管理中心可以在首次创建第一机器学习模型时,主动向第一联邦学习服务端推送第一机器学习模型。
需要说明的是,上述方式1和方式2也可以结合使用,从而构成新的实施例。
S102:第一联邦学习服务端基于第一机器学习模型和第一管理域的本地网络业务数据,与第一管理域中的多个联邦学习客户端进行联邦学习,得到第二机器学习模型。其中,该多个联邦学习客户端是与第一联邦学习服务端连接的部分或全部联邦学习客户端。第一联邦学习服务端归属于第一管理域。
本地网络业务数据,是指第一联邦学习服务端获取的第一管理域中的第一机器学习模型对应的网络业务数据。本地网络业务数据与第一机器学习模型对应的模型业务信息相关。例如,如果第一机器学习模型对应的模型业务是应用识别业务,则本地网络业务数据可以是应用报文和/或报文的统计数据(如报文的丢包率等)。又如,如果第一机器学习模型对应的模型业务是故障跟踪预测业务,则本地网络业务数据可以是故障告警信息。
S102可以包括:第一联邦学习服务端基于第一机器学习模型和第一管理域的本地网络业务数据,与第一管理域中的多个联邦学习客户端进行一次或多次联邦学习,得到第二机器学习模型。
一次联邦学习,是指由联邦学习服务端向多个联邦学习客户端发送初始机器学习模型开始,至该联邦学习服务端获取到该多个联邦学习客户端各自得到的中间机器学习模型,并对所获取的中间机器学习模型进行汇聚,得到汇聚机器学习模型的过程。
在一次联邦学习过程中,联邦学习客户端开始进行模型训练时所基于的模型被称为初始机器学习模型。
在一次联邦学习过程中,联邦学习客户端基于初始机器学习模型和自身获得的网络业务数据构建的训练样本,进行一次或多次本地训练,每次本地训练结束后得到一个新的机器学习模型。如果该新的机器学习模型满足第一预设条件,则将该新的机器学习模型称为中间机器学习模型。否则,联邦学习客户端继续进行本地训练,直到得到中间机器学习模型。
在一个示例中,如果该联邦学习客户端使用自身获得的网络业务数据构建的测试样本,对该新的机器学习模型进行测试时,该新的机器学习模型的准确率大于等于第一预设阈值,则该联邦学习客户端确定该新的机器学习模型满足第一预设条件。
在另一个示例中,如果该联邦学习客户端使用自身获得的网络业务数据构建的测试样本,对该新的机器学习模型进行测试时,该新的机器学习模型的准确率,与上一次(或上多次)本地训练得到的机器学习模型的准确率之差小于等于第二预设阈值,则该联邦学习客户端确定该新的机器学习模型满足第一预设条件。
在另一个示例中,如果本地训练的训练次数达到第三预设阈值,则该联邦学习客户端确定该新的机器学习模型满足第一预设条件。
本申请实施例对第一预设阈值、第二预设阈值和第三预设阈值的取值均不进行限定。
在一次联邦学习过程中,联邦学习服务端对多个联邦学习客户端中的每个联邦学习客户端获得的中间机器学习模型,进行模型汇聚后得到的模型被称为汇聚机器学习模型。如果该汇聚机器学习模型满足第二预设条件,则将该汇聚机器学习模型作为第二机器学习模型。否则,联邦学习服务端将该汇聚机器学习模型作为下一次联邦学习的初始机器学习模型,下发给该多个联邦学习客户端。
在一个示例中,如果该联邦学习服务端使用自身获得的网络业务数据构建的测试样本, 对该汇聚机器学习模型进行测试时,该汇聚机器学习模型的准确率大于等于第四预设阈值,则该联邦学习服务端确定该汇聚机器学习模型满足第二预设条件。
在另一个示例中,如果该联邦学习服务端使用自身获得的网络业务数据构建的测试样本,对该汇聚机器学习模型进行测试,该汇聚机器学习模型的准确率,与上一次(或上多次)得到的汇聚机器学习模型的准确率之差小于等于第五预设阈值,则该联邦学习服务端确定该汇聚机器学习模型满足第二预设条件。
在另一个示例中,如果联邦学习次数达到第六预设阈值,则该联邦学习服务端确定该汇聚机器学习模型满足第二预设条件。
本申请实施例对第四预设阈值、第五预设阈值和第六预设阈值的取值均不进行限定。
可选的,如图11所示,S102可以包括以下步骤S102A-S102G。其中,图11中是以第一管理域中的多个联邦学习客户端包括第一联邦学习客户端和第二联邦学习客户端为例进行说明的。基于图11,可以更清楚地说明第一机器学习模型、初始机器学习模型、中间机器学习模型、汇聚机器学习模型和第二机器学习模型之间的关系。
S102A:第一联邦学习服务端向第一联邦学习客户端和第二联邦学习客户端分别发送第一机器学习模型。
例如,第一联邦学习服务端向第一联邦学习客户端发送第一机器学习模型的模型包,该模型包包含第一机器学习模型的模型文件,可选的,该模型包还可以包含第一机器学习模型的说明文件。类似地,第一联邦学习服务端向第二联邦学习客户端发送第一机器学习模型的模型包,该模型包包含第一机器学习模型的模型文件,可选的,该模型包还可以包含第一机器学习模型的说明文件。
S102B:第一联邦学习客户端将第一机器学习模型作为初始机器学习模型,并基于初始机器学习模型和自身获取的网络业务数据,进行本地训练,得到第一中间机器学习模型。第二联邦学习客户端将第一机器学习模型作为初始机器学习模型,并基于初始机器学习模型和自身获取的网络业务数据,进行本地训练,得到第二中间机器学习模型。
具体地,如果联邦学习客户端部署在网络设备上(如图4所示),则该联邦学习客户端获取的网络业务数据具体是指该网络设备产生的网络业务数据。又例如,如果联邦学习客户端部署在管控系统上(如图8所示),则该联邦学习客户端获取的网络业务数据具体是指该管控系统获取到的其所管理的一个或多个网络设备产生并上报的网络业务数据。
具体地,假设第一联邦学习客户端部署于网络设备303-21,第二联邦学习客户端部署于网络设备303-22,第一联邦学习客户端将第一机器学习模型作为本次联邦学习过程的初始机器学习模型,并利用网络设备303-21的本地计算资源和网络设备303-21产生的网络业务数据对初始机器学习模型进行本地训练,得到第一中间机器学习模型;类似地,第二联邦学习客户端基于第一机器学习模型、网络设备303-22产生的网络业务数据进行本地训练,得到第二中间学习模型。
S102C:第一联邦学习客户端将第一中间机器学习模型相对于初始机器学习模型的参数更新信息,发送给第一联邦学习服务端。第二联邦学习客户端将第二中间机器学习模型相对于初始机器学习模型的参数更新信息,发送给第一联邦学习服务端。
例如,续S102B中的例子,第一联邦学习客户端将第一中间学习模型相对于初始机器学习模型的参数更新信息打包成第一参数更新文件,发送给第一联邦学习服务端。例如,假设第一机器学习模型包含参数A和参数B,且参数A和参数B的取值分别为a1和b1,第一联邦 学习客户端进行本地训练后得到的第一中间学习模型所包含的参数A和参数B的取值分别为a2和b2,则第一中间学习模型相对于初始机器学习模型的参数更新信息包含:参数A的更新信息a2-a1,以及参数B的更新信息b2-b1。类似地,第二联邦学习客户端将第二中间学习模型相对于第一机器学习模型的参数更新信息打包成第二参数更新文件,发送给第一联邦学习服务端。
S102D:第一联邦学习服务端基于初始机器学习模型和第一联邦学习客户端发送的参数更新信息,获取第一中间机器学习模型。第一联邦学习服务端基于初始机器学习模型和第二联邦学习客户端发送的参数更新信息,获取第二中间机器学习模型。然后,第一联邦学习服务端采用汇聚算法,对第一中间机器学习模型和第二中间机器学习模型进行模型汇聚,得到汇聚机器学习模型。
例如,续S102C中的例子,第一联邦学习服务端基于参数A的更新信息a2-a1,和初始机器学习模型的参数A的取值a1,得到第一中间机器学习模型的参数A的取值a2;基于参数B的更新信息b2-b1,和初始机器学习模型的参数B的取值b1,得到第一中间机器学习模型的参数B的取值b2;进而,将初始机器学习模型的参数A和参数B分别赋值成a2和b2,得到第一中间机器学习模型。类似地,第一联邦学习服务端可以基于初始机器学习模型和第二联邦学习客户端发送的参数更新信息,获取第二中间机器学习模型。
可选的,汇聚算法可以是加权平均算法,如第一联邦学习服务端根据第一联邦学习客户端获取的网络业务数据的完备度和第二联邦学习客户端获取的网络业务数据,指定第一联邦学习客户端和第二联邦学习客户端上报的参数更新信息的权重,从而对第一联邦学习客户端和第二联邦学习客户端上报的针对同一参数的参数更新信息进行加权求和,再做平均,得到该参数的参数更新信息。
需要说明的是,训练不同的机器学习模型时,可能会采用不同的汇聚算法,以满足不同的联邦学习训练目标,如降低循环迭代的次数等。在模型汇聚计算的同时,联邦学习服务端还可以根据参数更新信息,制定下一次联邦学习过程中联邦学习客户端的训练策略。
另外需要说明的是,上述S102C-S102D为联邦学习服务端获取多个联邦学习客户端得到的多个中间机器学习模型的一种实现方式,当然具体实现时不限于此。例如,联邦学习客户端可以直接向该联邦学习服务端发送中间机器学习模型。
S102E:第一联邦学习服务端判断该汇聚机器学习模型是否满足第二预设条件。
若否,则执行S102F。若是,则执行S102G。
S102F:第一联邦学习服务端将该汇聚机器学习模型确定为第一机器学习模型。执行S102F之后,返回执行S102A。
S102G:第一联邦学习服务端将该汇聚机器学习模型确定为第二机器学习模型。
S103:第一联邦学习服务端向第一联邦学习客户端和第二联邦学习客户端分别发送第二机器学习模型(即第二机器学习模型的模型文件)。
例如,第一联邦学习服务端向第一联邦学习客户端发送第二机器学习模型的模型包,该模型包包含第二机器学习模型的模型文件,可选的,该模型包还可以包含第二机器学习模型的说明文件。类似地,第一联邦学习服务端向第二联邦学习客户端发送第二机器学习模型的模型包,该模型包包含第二机器学习模型的模型文件,可选的,该模型包还可以包含第二机器学习模型的说明文件。
后续,该多个联邦学习客户端可以基于第二机器学习模型执行第二机器学习模型对应的 模型业务。例如,如果第二机器学习模型是应用识别模型,则该多个联邦学习客户端可以基于第二机器学习模型对应用进行识别。
S104:第一联邦学习服务端向机器学习模型管理中心发送第二机器学习模型,以使第二机器学习模型被第二管理域中的设备使用。
也就是说,本申请实施例提供的技术方案应用于网络系统(如图4或图8所示的网络系统)中时,第二机器学习模型可以被其他管理域中的设备使用。
在一种实现方式中,第二管理域中的设备可以是部署有第二联邦学习服务端的设备。其中,第二联邦学习服务端归属于第二管理域。也就是说,第二机器学习模型可以被第二管理域中的第二联邦学习服务端使用。
第二联邦学习服务端可以是具有使用第二机器学习模型的权限的任意一个联邦学习服务端。具体实现时,哪个或那些联邦学习服务端具有访问第二机器学习模型的权限,可以是预定义的,也可以是由生成第二机器学习模型的联邦学习客户端(即第一联邦学习服务端)确定的。可选的,第一联邦学习服务器还可以向机器学习模型管理中心发送第二机器学习模型的访问权限信息。后续,机器学习模型管理中心可以基于该访问权限信息生成第二机器学习模型的模型包。
其中,该访问权限信息是指用于表征允许使用第二机器学习模型的联邦学习服务端的信息。本申请实施例对该访问权限信息的具体实现方式不进行限定,例如,该访问权限信息可以是允许使用第二机器学习模型的联邦学习服务端的标识。又如,如果第二机器学习模型可以被其他所有联邦学习服务端使用,则该访问权限信息可以是预定义的“表示第二机器学习模型可以被其他所有联邦学习服务端使用”的信息。
可选的,如果第二机器学习模型可以被其他所有联邦学习服务端使用,则第一联邦学习服务器可以不向机器学习模型管理中心发送第二机器学习模型的访问权限信息。
当然,第二机器学习模型也可以继续被第一联邦学习服务端使用。
在另一种实现方式中,第二管理域中的设备可以是部署有联邦学习客户端的设备。其中,该联邦学习客户端归属于第二管理域。该联邦学习客户端可以是具有使用第二机器学习模型的权限的任意一个联邦学习客户端。也就是说,第二机器学习模型可以被第二管理域中的联邦学习客户端使用。
在另一种实现方式中,第二管理域中的设备还可以是模型业务执行设备(即具备使用机器学习模型执行相应模型业务的能力的设备),当其从机器学习模型管理中心获取第二机器学习模型后,可以基于第二机器学习模型和第二管理域中的网络数据执行相应的模型业务(如业务识别业务)。这就意味着,某个运营商(如虚拟运营商)虽然不具有联邦学习服务端或者客户端,也仍然可以从机器学习模型中心获取其他运营商的联邦学习服务端提供的机器学习模型。
可选的,S104可以包括以下S104A-S104C:
S104A:第一联邦学习服务端获取第二机器学习模型的应用效果。
第二机器学习模型的应用效果,可以理解为:第二机器学习模型的试用效果。例如,第一联邦学习服务端基于在第一联邦学习服务端所归属的管理域中的网络业务数据,试用第二机器学习模型,从而得到第二机器学习模型的试用效果(即应用效果)。
具体的:第一联邦学习服务端向其所连接的多个联邦学习客户端发送第二机器学习模型,该多个联邦学习客户端中的每个联邦学习客户端基于第二机器学习模型和各自获取的网络业 务数据,执行第二机器学习模型对应的模型业务,得到执行结果,并该将执行结果发送给第一联邦学习服务端。第一联邦学习服务端汇总该多个联邦学习客户端发送的多个执行结果,得到第二机器学习模型的试用效果(即应用效果)。
其中,本申请实施例不限定执行汇总所使用的规则。
其中,第二机器学习模型对应的模型业务不同时,上述执行结果不同。
在一个示例中,第二机器学习模型对应的模型业务是识别类业务(如应用识别业务等)时,上述执行结果可以是第二机器学习模型的识别率,即第二机器学习模型能够识别的对象占参与识别的对象的比例。示例的,应用识别业务具体为:识别报文属于哪个应用(如视频播放应用)。当第二机器学习模型对应的模型业务是应用识别业务时,第一联邦学习服务端汇总得到:预设时间段内,向第二机器学习模型输入了a个报文,而第二机器学习模型识别出其中的b个报文中的每个报文分别属于哪个应用,a>b,a和b均是整数,则第二机器学习模型的识别率为
b/a。
在另一个示例中,第二机器学习模型对应的模型业务是识别类业务(如应用识别业务等)时,上述执行结果可以是第二机器学习模型在一个时间段内未识别的报文的数量;或者是第二机器学习模型不能识别的对象占参与识别的对象的比例等。
需要说明的是,上述识别类业务可以替换为预测类业务(如故障跟踪预测业务等),此时,上述执行结果可以是第二机器学习模型的预测率,即第二机器学习模型能够预测的对象占参与预测的对象的比例。或者,上述识别类业务可以替换为检测类业务(如KPI异常检测业务等),则上述执行结果可以是第二机器学习模型的检测率。当然,上述识别类业务还可以替换为其他类型的业务,此时,上述执行结果的具体实现方式可以由S104A中的示例推理得到。
S104B:如果确定该应用效果满足预设条件,则第一联邦学习服务端向机器学习模型管理中心发送第二机器学习模型。
第二机器学习模型的应用效果满足预设条件,可以理解为:第二机器学习模型的应用效果达到预设目标。其中,第二机器学习模型对应的模型业务不同时,预设目标不同。
例如,如果第二机器学习模型对应的模型业务是应用识别业务,上述执行结果可以是第二机器学习模型的识别率,则第二机器学习模型的应用效果达到预设目标,可以是:第二机器学习模型的识别率大于等于预设识别率,或者,可以是:第二机器学习模型的识别率大于等于历史识别率,该历史识别率可以是第一机器学习模型的识别率等。
S104C:如果确定该应用效果不满足预设条件,则第一联邦学习服务端基于该第二机器学习模型和新的本地网络业务数据,与该多个联邦学习客户端,进行新一轮的联邦学习,得到新的第二机器学习模型。其中,这里的“新的本地网络业务数据”是相比训练得到第二机器学习模型的过程中所使用的本地网络业务数据而言的。
后续,第一联邦学习服务端可以确定该新的第二机器学习模型的应用效果是否满足预设条件,以此类推,直到某一次得到的新的第二机器学习模型的应用效果满足预设条件时,第一联邦学习服务端向机器学习模型管理中心发送该满足预设条件的新的第二机器学习模型。
这样,有助于提高第一联邦学习服务端发往机器学习模型管理中心的机器学习模型的精确度/准确率,从而进一步缩短其他联邦学习客户端使用该机器学习模型进行联邦学习时,机器学习模型的收敛时间。
S105:机器学习模型管理中心用第二机器学习模型替换第一机器学习模型。
具体的,机器学习模型管理中心用第二机器学习模型的模型包替换第一机器学习模型的 模型包。更具体的,机器学习模型管理中心用第二机器学习模型的模型文件替换第一机器学习模型的模型文件。
例如,结合表1,假设第一机器学习模型是机器学习模型1,则机器学习模型管理中心使用第二机器学习模型的模型文件替换机器学习模型文件1。这样,后续机器学习模型管理中心在需要向第一管理域中的设备或其他管理域中的设备,如第一联邦学习服务端或者其他联邦学习服务端(如第二联邦学习服务端)发送机器学习模型1时,可以发送第二机器学习模型的模型文件。
可选的,如果第二机器学习模型不能被所有管理域中的设备使用,则机器学习模型管理中心向第二管理域中的设备发送第二机器学习模型(包括在请求下发送和主动推送),可以包括:机器学习模型管理中心在确定第二管理域具有对第二机器学习模型的使用权限的情况下,向第二管理域中的设备发送第二机器学习模型。
例如,如果第二机器学习模型不能被所有联邦学习服务端使用,则机器学习模型管理中心向第二联邦学习服务端发送第二机器学习模型(包括在请求下发送和主动推送),可以包括:机器学习模型管理中心在确定第二联邦学习服务端具有对第二机器学习模型的使用权限的情况下,向第第二联邦学习服务端发送第二机器学习模型。
可选的,在S105之前,该方法还包括:机器学习模型管理中心对接收到的第二机器学习模型进行病毒扫描、敏感词扫描等操作,以确定第二机器学习模型在传输过程中没有被修改,从而确定第二机器学习模型的安全性。机器学习模型管理中心还可以基于第三方软件做出的网络安全评估报告,确定第二机器学习模型是否安全。在S105中,机器学习模型管理中心可以在确定第二机器学习模型安全的情况下,用第二机器学习模型替换第一机器学习模型。
另外,机器学习模型管理中心还可以对第二机器学习模型进行模型格式校验,以确定第二机器学习模型来自可信的认证网络,从而确定第二机器学习模型的安全性。例如,机器学习模型管理中心可以维护已认证网络的标识,并基于所维护的已认证网络的标识,确定第二机器学习模型是否来自已认证网络,如果来自已认证网络,则说明第二机器学习模型安全,否则说明第二机器学习模型不安全。
可选的,第二管理域中的第二联邦学习服务端使用第二机器学习模型的过程,可以包括:
首先,机器学习模型管理中心向第二联邦学习服务端发送第二机器学习模型。
例如,机器学习模型管理中心在第二联邦学习服务端的请求下,向第二联邦学习服务端发送第二机器学习模型。其具体实现方式可以基于上述对方式1的描述,此处不再赘述。或者,机器学习模型管理中心主动向第二联邦学习服务端推送第二机器学习模型。其具体实现方式可以基于上述对方式2的描述,此处不再赘述。
其次,第二联邦学习服务端基于第二机器学习模型和第二管理域的本地网络数据,与第二管理域中的多个联邦学习客户端进行联邦学习,得到第三机器学习模型。其具体实现方式可以基于上述对图11的描述得到,此处不再赘述。
后续:
一方面,第二联邦学习服务端可以将第三机器学习模型发送给机器学习模型管理中心;机器学习模型管理中心可以用第三机器学习模型替换第二机器学习模型,以使第三机器学习模型被第三管理域中的设备使用。其中,第三管理域与第二管理域不同。第三管理域与第一管理域可以相同,也可以不同。当然,第三机器学习模型也可以被第二管理域中的设备使用。
另一方面,第二联邦学习服务端可以向其所连接的联邦学习客户端发送第三机器学习模 型,该联邦学习客户端可以基于第三机器学习模型执行第三机器学习模型对应的模型业务。
可以理解的是,该可选的实现方式也可以认为是:第二管理域中的第二联邦学习服务端与第二管理域中的联邦学习客户端共同使用第二机器学习模型的示例。
可选的,第二管理域中的联邦学习客户端使用第二机器学习模型的过程,可以包括:
首先,机器学习模型管理中心向第二管理域中的联邦学习客户端发送第二机器学习模型。
例如,机器学习模型管理中心可以在接收到第二管理域中的联邦学习客户端发送的请求的情况下,向该联邦学习客户端发送第二机器学习模型。又如,机器学习模型管理中心可以主动向第二管理域中的联邦学习客户端推送第二机器学习模型。
其次,第二管理域中的联邦学习客户端可以基于第二机器学习模型执行第二机器学习模型对应的模型业务。
该可选的实现方式可以适用于机器学习模型管理中心直接对联邦学习客户端进行控制,而不需要经过联邦学习服务端的场景中。
可选的,该方法还可以包括以下步骤S106-S107:
S106:机器学习模型管理中心基于第二机器学习模型,生成第二机器学习模型的模型包。该模型包包含第二机器学习模型的模型文件和第二机器学习模型的说明文件。
例如,机器学习模型管理中心为第二机器学习模型生成说明文件,该说明文件可以包含第二机器学习模型的访问权限、第二机器学习模型的描述信息和第二机器学习模型的运行脚本等。然后,机器学习模型管理中心按照机器学习模型包的打包规范,将第二机器学习模型的模型文件和第二机器学习模型的说明文件,生成第二机器学习模型的模型包。
S107:机器学习模型管理中心对第二机器学习模型的模型包进行签名,得到第二机器学习模型的模型包的签名文件。
S107是为了对第二机器学习模型的模型包进行完整性保护,以说明第二机器学习模型的模型包是来自机器学习模型管理中心,而非来自其他设备/系统。
基于此,在机器学习模型中心向任一联邦学习服务端发送第二机器学习模型的模型包的情况下,还可以向该联邦学习服务端发送该模型包的签名文件,以使得该联邦学习服务端可以基于该签名文件,确定该模型包是否来自机器学习模型管理中心。当然,上述机器学习模型中心向第一联邦学习服务端发送第一机器学习模型的模型包的情况下,还可以向第一联邦学习服务端发送该数据包的签名文件。
可选的,第二机器学习模型是基于第一训练框架的机器学习模型。基于此,该方法还可以包括以下步骤1:
步骤1:机器学习模型管理中心将第二机器学习模型转换为第三机器学习模型。其中,第三机器学习模型是基于第二训练框架的机器学习模型,且第三机器学习模型和第二机器学习模型是同一模型业务信息对应的机器学习模型。
可选的,机器学习模型管理中心使用模型转换工具,将基于第一训练框架的第二机器学习模型的模型文件中的算法实现代码和参数等翻译成,基于第二训练框架的相应算法实现代码和参数。其中,该模型转换工具可以通过软件和/或硬件实现。
也就是说,机器学习模型管理中心将第一训练框架支持的机器学习模型转换为第二训练框架支持的机器学习模型。例如,当第一训练框架是Tensorflow训练框架,第二训练框架是Pytorch训练框架时,relu激活层(运算单元)这一信息可翻译为torch.nn.ReLu()。
进一步可选的,如果机器学习模型管理中心还存储有第五机器学习模型,第五机器学习 模型是基于第二训练框架的机器学习模型,第五机器学习模型和第一机器学习模型是同一模型业务信息对应的机器学习模型,则该方法还可以包括以下步骤2:
步骤2:机器学习模型管理中心用第三机器学习模型替换第五机器学习模型。
可以理解的是,同一模型业务信息对应的机器学习模型可能对应不同的训练框架,基于此,机器学习模型管理中心在用第二机器学习模型替换第一机器学习模型的情况下,将第二机器学习模型转换为其他训练框架下的机器学习模型,以为其他支持第二训练框架的联邦学习服务端提供机器学习模型。
进一步地,如果机器学习模型管理中心确定自身维护了与第一机器学习模型对应了同一模型业务信息的基于其他训练框架的机器学习模型(即第五机器学习模型),则机器学习模型管理中心还可以用第三机器学习模型替换第五机器学习模型,从而保证针对同一模型业务信息的其他训练框架下的机器学习模型是最新的机器学习模型。
例如,表2中的机器学习模型1与机器学习模型2是同一模型业务信息对应的机器学习模型,二者的区别在于应用于不同的训练框架。如果第一机器学习模型是机器学习模型1,第五机器学习模型是机器学习模型2,则在表1中,使用第二机器学习模型的模型文件替换第一机器学习模型的模型文件(即机器学习模型文件1)的情况下,使用第三机器学习模型的模型文件替换第五机器学习模型的模型文件(即机器学习模型文件2)。
示例的,针对应用识别业务的机器学习模型可以运行在三种AI模型训练框架上,因此,在基于其中一种训练模型的机器学习模型被替换之后,机器学习模型管理中心可以同步更新其他两种训练框架对应的机器学习模型。
本申请实施例提供的机器学习模型管理方法,联邦学习服务端执行联邦学习得到的机器学习模型可以被其他联邦学习服务端使用。这样,不同联邦学习服务端之间不需要重复训练机器学习模型,从整个社会角度看节省了计算资源。
而且,联邦学习服务端从机器学习模型管理中心获取初始机器学习模型,有助于联邦学习服务端确定一个最接近机器学习模型训练需求的机器学习模型作为初始机器学习模型,从而有助于减少联邦学习次数,加快机器学习模型的收敛速度。
另外,随着时间的推移,机器学习模型管理中心上的机器学习模型,能够集多个管理域中网络业务数据之长(即机器学习模型是间接地基于多个管理域中的网络业务数据经联邦学习得到),其适应性相对于仅基于单个管理域的网络业务数据得到的机器学习模型可以有较大的提高,对于每一个管理域而言,后续输入更新颖、更复杂的网络业务数据执行的模型业务,也能获得较好的效果。
此外,本技术方案中,一方面,每个管理域独立进行机器学习模型的训练,因此,如果一个管理域中的联邦学习服务端发生故障,则其他管理域依然可以继续执行联邦学习,并使得机器学习模型管理中心继续更新机器学习模型。
另一方面,由于联邦学习服务端从机器学习模型管理中心获取初始机器学习模型,因此,即使一个管理域中的联邦学习服务端发生故障,当故障恢复时,该联邦学习服务端依然可以从机器学习模型管理中心获取来自当前最新的用于共享的机器学习模型(即故障期间,机器学习模型管理中心结合其他联学习服务端获得的更新后的机器学习模型)作为初始机器学习模型,从而有助于减少联邦学习次数,加快机器学习模型的收敛速度。
而传统技术中,不同管理域之间独立进行机器学习模型的训练,且不同管理域之间不能共享机器学习模型,因此,如果一个管理域中的联邦学习服务端发生故障,当故障恢复时, 该联邦学习服务端只能从预定义的初始机器学习模型开始训练,因此,联邦学习次数多,机器学习模型的收敛速度慢。
由此可知,相比传统技术,本技术方案在联邦学习服务端故障恢复后,机器学习模型的收敛速度较快,因此,恢复能力更强,也就是说,鲁棒性更好。
另外,本申请实施例提供的机器学习模型管理方法应用于如图4或图8所示的网络架构中时,公有云与管理域之间只有一次机器学习模型的双向传递过程,这有助于避免由于模型参数更新信息多次传递带来的网络业务数据被盗取的风险,提高了管理域中的网络业务数据的安全性。同时,本技术方案中,多个管理域可以共享联邦学习模型,因此,对每个管理域(如电信运营商网络)来讲,降低了管理域的建设成本和维护成本。
下面,以机器学习模型对应应用识别业务,即机器学习模型是应用识别(SA)机器学习模型为例,进一步解释图10和图11对应的实施方式。
本实施例中,管理域具体是电信运营商网络,且电信运营商网络A的EMS1上部署有第一联邦学习服务端,EMS1用于管理第一网络设备和第二网络设备,第一网络设备上部署有第一联邦学习客户端,第二网络设备上部署有第二联邦学习客户端,第一联邦学习客户端与第二联邦学习客户端分别与第一联邦学习服务端连接。电信运营商网络B的EMS2上部署有第二联邦学习服务端,EMS2管理的网络设备上部署有与第二联邦学习服务端连接的联邦学习客户端。
图12为本申请实施例提供的又一种机器学习模型管理方法的交互示意图。其中,图12所示的方法可以包括以下步骤S201-S212:
S201:机器学习模型管理中心向电信运营商网络A的EMS1发送SA机器学习模型001。
S202:EMS1向第一网络设备和第二网络设备分别发送SA机器学习模型001。
S203:第一网络设备将SA机器学习模型001作为初始机器学习模型,并基于该初始机器学习模型,以及第一网络设备的应用报文或应用报文的统计数据,进行本地训练,得到第一中间机器学习模型(标记为SA机器学习模型002)。第二网络设备将SA机器学习模型001作为初始机器学习模型,并基于该初始机器学习模型,以及第二网络设备的应用报文或应用报文的统计数据,进行本地训练,得到第二中间机器学习模型(标记为SA机器学习模型003)。
S204:第一网络设备将SA机器学习模型002相对于SA机器学习模型001的参数更新信息,发送给EMS1。第二网络设备将SA机器学习模型003相对于SA机器学习模型001的参数更新信息,发送给EMS1。
S205:EMS1基于SA机器学习模型001和第一网络设备发送的参数更新信息,获取SA机器学习模型002。EMS1基于SA机器学习模型001和第二网络设备发送的参数更新信息,获取SA机器学习模型003。然后,EMS1采用汇聚算法,对SA机器学习模型002和SA机器学习模型003进行模型汇聚,得到汇聚机器学习模型(标记为SA机器学习模型004)。
S206:EMS1判断SA机器学习模型004是否满足第二预设条件。其中,关于第二预设条件的相关描述可以参考上述S102下的相关说明,此处不再赘述。
若否,则执行S207。若是,则SA机器学习模型004是第二机器学习模型,执行S208。
S207:EMS1将SA机器学习模型004作为新的SA机器学习模型001。
执行S207之后,返回执行S201。
S208:EMS1向第一网络设备和第二网络设备分别发送SA机器学习模型004。
S209:EMS1向机器学习模型管理中心发送SA机器学习模型004,以使SA机器学习模型004被电信运营商网络B中的EMS2使用。
例如,机器学习模型管理中心向EMS2发送SA机器学习模型004。EMS2基于SA机器学习模型004和电信运营商网络B中的应用报文或应用报文的统计数据,与EMS2管理的多个网络设备进行联邦学习,得到SA机器学习模型005;并向该多个网络设备发送SA机器学习模型005。后续,该多个网络设备可以基于SA机器学习模型005进行应用识别。其具体实现过程可以由上述S202-S208得到。
S210:机器学习模型管理中心用SA机器学习模型004替换SA机器学习模型001。
S211:机器学习模型管理中心基于SA机器学习模型004,生成SA机器学习模型004的模型包。
S212:机器学习模型管理中心对SA机器学习模型004的模型包进行签名,得到SA机器学习模型004的模型包的签名文件。
由此可见,由于电信运营商网络A中的EMS1、第一网络设备和第二网络设备在电信运营商网络A内进行联邦学习,电信运营商网络A内的应用报文或应用报文的统计数据(包括:第一网络设备的应用报文或应用报文的统计数据,以及,第二网络设备的应用报文或应用报文的统计数据),以及电信运营商网络A的中间机器学习模型(如SA机器学习模型002和SA机器学习模型003)不需要传给第三方,因此,提高了数据隐私安全。
同时,如S209中的示例,电信运营商网络B中的网络设备,可以基于SA机器学习模型005进行应用识别。而SA机器学习模型005的获得融合了电信运营商网络B中的应用报文或应用报文的统计数据,以及电信运营商网络A中的应用报文或应用报文的统计数据。因此,电信运营商网络B中的网络设备基于SA机器学习模型005进行应用识别,有助于提高应用识别的准确率。
上述主要从方法的角度对本申请实施例提供的方案进行了介绍。为了实现上述功能,其包含了执行各个功能相应的硬件结构和/或软件模块。本领域技术人员应该很容易意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,本申请能够以硬件或硬件和计算机软件的结合形式来实现。某个功能究竟以硬件还是计算机软件驱动硬件的方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
本申请实施例可以根据上述方法示例对机器学习模型管理装置(如机器学习模型管理中心或联邦学习服务端)进行功能模块的划分,例如,可以对应各个功能划分各个功能模块,也可以将两个或两个以上的功能集成在一个处理模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。需要说明的是,本申请实施例中对模块的划分是示意性的,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式。
图13为本申请实施例提供的一种机器学习模型管理中心的结构示意图。图13所示的机器学习模型管理中心100可以用于实现上述方法实施例中机器学习模型管理中心的功能,因此也能实现上述方法实施例所具备的有益效果。在本申请的实施例中,该机器学习模型管理中心可以是如图3所示的机器学习模型管理中心401。
机器学习模型管理中心100连接第一联邦学习服务端,第一联邦学习服务端归属于第一管理域。
如图13所示,机器学习模型管理中心100包括发送单元1001、接收单元1002和处理单元1003。
发送单元1001,用于向第一联邦学习服务端发送第一机器学习模型。接收单元1002,用 于从第一联邦学习服务端接收第二机器学习模型;其中,第二机器学习模型为第一联邦学习服务端基于第一机器学习模型和第一管理域的本地网络业务数据,与第一管理域中的多个联邦学习客户端进行联邦学习得到的。处理单元1003,用于用第二机器学习模型替换第一机器学习模型,以使第二机器学习模型被第二管理域中的设备使用。
例如,结合图10,发送单元1001可以用于执行S101,接收单元1002可以用于执行S104对应的接收步骤。处理单元1003可以用于执行S105。
可选的,接收单元1002还用于,接收第一联邦学习服务端发送的机器学习模型需求信息。处理单元1003还用于,根据机器学习模型需求信息,确定第一机器学习模型。
可选的,机器学习模型需求信息包括机器学习模型对应的模型业务信息和/或机器学习模型训练需求。
可选的,机器学习模型训练需求包括以下至少一项:训练环境、算法类型、网络结构、训练框架、汇聚算法或安全模型。
可选的,第二机器学习模型是基于第一训练框架的机器学习模型。处理单元1003还用于:将第二机器学习模型转换为第三机器学习模型;其中,第三机器学习模型是基于第二训练框架的机器学习模型,且第三机器学习模型和第二机器学习模型是同一模型业务信息对应的机器学习模型。
可选的,接收单元1002还用于,接收第一联邦学习服务端发送的第二机器学习模型的访问权限信息。
可选的,发送单元1001还用于,向第二联邦学习服务端发送第二机器学习模型,第二联邦学习服务端归属于第二管理域。接收单元1002还用于,从第二联邦学习服务端接收第四机器学习模型;其中,第四机器学习模型为第二联邦学习服务端基于第二机器学习模型和第二管理域的本地网络业务数据,与第二管理域中的多个联邦学习客户端进行联邦学习得到的。该情况下,处理单元1003还用于,用第四机器学习模型替换第二机器学习模型。
关于上述可选方式的具体描述可以参见前述的方法实施例,此处不再赘述。此外,上述提供的任一种机器学习模型管理中心100的解释以及有益效果的描述均可参考上述对应的方法实施例,不再赘述。
作为示例,结合图9,上述发送单元1001和接收单元1002的功能可以通过通信接口703实现。上述处理单元1003的功能,可以通过处理器701调用存储器702中的程度代码实现。
如图14所示,为本申请实施例提供的一种联邦学习服务端的结构示意图。图14所示的联邦学习服务端110可以用于实现上述方法实施例中联邦学习服务端的功能,因此也能实现上述方法实施例所具备的有益效果。在本申请的实施例中,该联邦学习服务端110可以是如图3所示的联邦学习服务端。
联邦学习服务端110归属于第一管理域,且与机器学习模型管理中心连接。
如图14所示,联邦学习服务端110包括收发单元1101和处理单元1102。
收发单元1101,用于从机器学习模型管理中心获取第一机器学习模型。处理单元1102,用于基于第一机器学习模型和第一管理域的本地网络业务数据,与第一管理域中的多个联邦学习客户端进行联邦学习,得到第二机器学习模型。收发单元1101还用于,向机器学习模型管理中心发送第二机器学习模型,以使第二机器学习模型被第二管理域中的设备使用。
例如,结合图10,收发单元1101可以用于执行S101对应的接收步骤,以及S104。处理单元1102可以用于执行S102中联邦学习服务端执行的步骤。
可选的,收发单元1101具体用于:向机器学习模型管理中心发送机器学习模型需求信息;接收机器学习模型管理中心根据机器学习模型需求信息确定的第一机器学习模型。
可选的,机器学习模型需求信息包括机器学习模型对应的模型业务信息和/或机器学习模型训练需求。
可选的,机器学习模型训练需求包括以下至少一项:训练环境、算法类型、网络结构、训练框架、汇聚算法或安全模型。
可选的,收发单元1101还用于,向机器学习模型管理中心发送第二机器学习模型的访问权限信息。
可选的,收发单元1101还用于,向多个联邦学习客户端发送第二机器学习模型。
可选的,收发单元1101具体用于,如果第二机器学习模型的应用效果满足预设条件,则向机器学习模型管理中心发送第二机器学习模型。
可选的,收发单元1101还用于,向第一管理域中的多个联邦学习客户端发送第一机器学习模型,以使该多个联邦学习客户端分别基于第一机器学习模型和各自获取的网络业务数据进行联邦学习,得到各自的中间机器学习模型。处理单元1102具体用于,获取多个联邦学习客户端得到的多个中间机器学习模型,并基于多个中间机器学习模型汇聚得到第二机器学习模型。
关于上述可选方式的具体描述可以参见前述的方法实施例,此处不再赘述。此外,上述提供的任一种联邦学习服务端110的解释以及有益效果的描述均可参考上述对应的方法实施例,不再赘述。
作为示例,结合图9,上述收发单元1101的功能可以通过通信接口703实现。上述处理单元1102的功能,可以通过处理器701调用存储器702中的程度代码实现。
本申请另一实施例还提供一种机器学习模型管理装置,该装置包括:处理器和存储器,该存储器用于存储计算机程序和指令,该处理器用于调用计算机程序和指令,以执行上述方法实施例所示的方法流程中机器学习模型管理中心所执行的相应步骤。
本申请另一实施例还提供一种机器学习模型管理装置,该装置包括:处理器和存储器,该存储器用于存储计算机程序和指令,该处理器用于调用计算机程序和指令,以执行上述方法实施例所示的方法流程中联邦学习服务端所执行的相应步骤。
本申请另一实施例还提供一种机器学习模型管理装置,该装置包括:处理器和存储器,该存储器用于存储计算机程序和指令,该处理器用于调用计算机程序和指令,以执行上述方法实施例所示的方法流程中联邦学习客户端所执行的相应步骤。
本申请另一实施例还提供一种计算机可读存储介质,该计算机可读存储介质中存储有指令,当指令在终端执行上述方法实施例所示的方法流程中机器学习模型管理中心或第一联邦学习服务端或联邦学习客户端所执行的相应步骤。
在一些实施例中,所公开的方法可以实施为以机器可读格式被编码在计算机可读存储介质上的或者被编码在其它非瞬时性介质或者制品上的计算机程序指令。
应该理解,这里描述的布置仅仅是用于示例的目的。因而,本领域技术人员将理解,其它布置和其它元素(例如,机器、接口、功能、顺序、和功能组等等)能够被取而代之地使用,并且一些元素可以根据所期望的结果而一并省略。
另外,所描述的元素中的许多是可以被实现为离散的或者分布式的组件的、或者以任何适当的组合和位置来结合其它组件实施的功能实体。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件程序实现时,可以全部或部分地以计算机程序产品的形式来实现。该计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行计算机执行指令时,全部或部分地产生按照本申请实施例的流程或功能。计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,计算机指令可以从一个网站站点、计算机、服务器或者数据中心通过有线(例如同轴电缆、光纤、数字用户线(digital subscriber line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可以用介质集成的服务器、数据中心等数据存储设备。可用介质可以是磁性介质(例如,软盘、硬盘、磁带),光介质(例如,DVD)、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (40)

  1. 一种机器学习模型管理方法,其特征在于,应用于联邦学习服务端,所述联邦学习服务端归属于第一管理域,且与机器学习模型管理中心连接;所述方法包括:
    从所述机器学习模型管理中心获取第一机器学习模型;
    基于所述第一机器学习模型和所述第一管理域的本地网络业务数据,与所述第一管理域中的多个联邦学习客户端进行联邦学习,得到第二机器学习模型;
    向所述机器学习模型管理中心发送所述第二机器学习模型,以使所述第二机器学习模型被第二管理域中的设备使用。
  2. 根据权利要求1所述的方法,其特征在于,所述从所述机器学习模型管理中心获取第一机器学习模型,包括:
    向所述机器学习模型管理中心发送机器学习模型需求信息;
    接收所述机器学习模型管理中心根据所述机器学习模型需求信息确定的所述第一机器学习模型。
  3. 根据权利要求2所述的方法,其特征在于,所述机器学习模型需求信息包括机器学习模型对应的模型业务信息和/或机器学习模型训练需求。
  4. 根据权利要求3所述的方法,其特征在于,所述训练需求包括以下至少一项:训练环境、算法类型、网络结构、训练框架、汇聚算法或安全模型。
  5. 根据权利要求1至4任一项所述的方法,其特征在于,所述方法还包括:
    向所述机器学习模型管理中心发送所述第二机器学习模型的访问权限信息。
  6. 根据权利要求1至5任一项所述的方法,其特征在于,所述方法还包括:
    向所述多个联邦学习客户端发送所述第二机器学习模型。
  7. 根据权利要求1至6任一项所述的方法,其特征在于,所述向所述机器学习模型管理中心发送所述第二机器学习模型,包括:
    如果所述第二机器学习模型的应用效果满足预设条件,则向所述机器学习模型管理中心发送所述第二机器学习模型。
  8. 根据权利要求1至7任一项所述的方法,其特征在于,所述基于所述第一机器学习模型和所述第一管理域的本地网络业务数据,与所述第一管理域中的多个联邦学习客户端进行联邦学习,得到第二机器学习模型,包括:
    向所述第一管理域中的多个联邦学习客户端发送所述第一机器学习模型,以使所述多个联邦学习客户端分别基于所述第一机器学习模型和各自获取的网络业务数据进行联邦学习,得到各自的中间机器学习模型;
    获取所述多个联邦学习客户端得到的多个中间机器学习模型,并基于所述多个中间机器学习模型汇聚得到所述第二机器学习模型。
  9. 一种机器学习模型管理方法,其特征在于,应用于机器学习模型管理中心,所述机器学习模型管理中心与第一联邦学习服务端连接,所述第一联邦学习服务端归属于第一管理域;所述方法包括:
    向所述第一联邦学习服务端发送第一机器学习模型;
    从所述第一联邦学习服务端接收第二机器学习模型;其中,所述第二机器学习模型为所述第一联邦学习服务端基于所述第一机器学习模型和所述第一管理域的本地网络业务数据, 与所述第一管理域中的多个联邦学习客户端进行联邦学习得到的;
    用所述第二机器学习模型替换所述第一机器学习模型,以使所述第二机器学习模型被第二管理域中的设备使用。
  10. 根据权利要求9所述的方法,其特征在于,在所述向联邦学习服务端发送第一机器学习模型之前,所述方法还包括:
    接收所述第一联邦学习服务端发送的机器学习模型需求信息;
    根据所述机器学习模型需求信息,确定所述第一机器学习模型。
  11. 根据权利要求10所述的方法,其特征在于,所述机器学习模型需求信息包括机器学习模型对应的模型业务信息和/或机器学习模型训练需求。
  12. 根据权利要求11所述的方法,其特征在于,所述训练需求包括以下至少一项:训练环境、算法类型、网络结构、训练框架、汇聚算法或安全模型。
  13. 根据权利要求9至12任一项所述的方法,其特征在于,所述第二机器学习模型是基于第一训练框架的机器学习模型;所述方法还包括:
    将所述第二机器学习模型转换为第三机器学习模型;其中,所述第三机器学习模型是基于第二训练框架的机器学习模型,且所述第三机器学习模型和所述第二机器学习模型是同一模型业务信息对应的机器学习模型。
  14. The method according to any one of claims 9 to 13, wherein the method further comprises:
    receiving access permission information of the second machine learning model sent by the first federated learning server.
  15. The method according to any one of claims 9 to 14, wherein the method further comprises:
    sending the second machine learning model to a second federated learning server, wherein the second federated learning server belongs to the second management domain;
    receiving a fourth machine learning model from the second federated learning server, wherein the fourth machine learning model is obtained by the second federated learning server by performing federated learning with a plurality of federated learning clients in the second management domain based on the second machine learning model and local network service data of the second management domain; and
    replacing the second machine learning model with the fourth machine learning model.
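To make the replace-and-redistribute behaviour of claims 9 and 15 concrete, the following is a minimal sketch of an in-memory model store a management center might keep per model service, replacing the stored model whenever a federated learning server in some management domain returns a newer one. The class, the method names, and the example model service are assumptions for illustration.

```python
from typing import Any, Dict

class ModelRegistry:
    """Minimal model store kept by the machine learning model management center.

    For each model service, the latest machine learning model is stored; when a
    federated learning server returns a model trained in its management domain,
    the stored model is replaced so that other domains obtain the newer one."""

    def __init__(self) -> None:
        self._models: Dict[str, Any] = {}  # model service info -> current model

    def get(self, service: str) -> Any:
        return self._models[service]

    def replace(self, service: str, new_model: Any) -> None:
        self._models[service] = new_model

# Illustrative cross-domain flow (model service name is assumed):
registry = ModelRegistry()
registry.replace("kpi-anomaly-detection", {"version": 1})  # initial (first) model
m1 = registry.get("kpi-anomaly-detection")                 # sent to domain 1's FL server
registry.replace("kpi-anomaly-detection", {"version": 2})  # second model returned by domain 1
m2 = registry.get("kpi-anomaly-detection")                 # sent to domain 2's FL server
registry.replace("kpi-anomaly-detection", {"version": 4})  # fourth model returned by domain 2
```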
  16. A federated learning system, comprising a federated learning server and a plurality of federated learning clients, wherein the federated learning server and the plurality of federated learning clients belong to a first management domain, and the federated learning server is connected to a machine learning model management center;
    the federated learning server is configured to obtain a first machine learning model from the machine learning model management center and send the first machine learning model to the plurality of federated learning clients;
    each of the plurality of federated learning clients is configured to perform federated learning based on the first machine learning model and network service data obtained by the client, to obtain a respective intermediate machine learning model; and
    the federated learning server is further configured to obtain the plurality of intermediate machine learning models obtained by the plurality of federated learning clients, aggregate the plurality of intermediate machine learning models to obtain a second machine learning model, and send the second machine learning model to the machine learning model management center, so that the second machine learning model is used by a device in a second management domain.
  17. The federated learning system according to claim 16, wherein
    the federated learning server is further configured to send the second machine learning model to the plurality of federated learning clients; and
    each of the plurality of federated learning clients is further configured to perform, based on the second machine learning model, a model service corresponding to the second machine learning model.
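For claims 16 and 17, the sketch below illustrates, under simplifying assumptions, the two roles of a federated learning client: producing an intermediate machine learning model from the first model and its locally obtained network service data, and later executing the model service with the second model it receives back. The linear model and plain gradient-descent update are assumptions chosen only to keep the example short.

```python
from typing import Dict, List, Tuple

def local_train(
    model: Dict[str, List[float]],
    samples: List[Tuple[List[float], float]],  # local network service data: (features, label)
    lr: float = 0.01,
    epochs: int = 1,
) -> Dict[str, List[float]]:
    """One client's federated-learning step: start from the model received from the
    federated learning server, fit it to locally obtained network service data, and
    return this client's intermediate machine learning model."""
    w = list(model["w"])
    for _ in range(epochs):
        for x, y in samples:
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return {"w": w}

def run_model_service(model: Dict[str, List[float]], x: List[float]) -> float:
    """After the second machine learning model is delivered back to the client,
    it is used to execute the corresponding model service (here: a prediction)."""
    return sum(wi * xi for wi, xi in zip(model["w"], x))
```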
  18. A network system, comprising a machine learning model management center, a federated learning server, and a plurality of federated learning clients, wherein the federated learning server and the plurality of federated learning clients belong to a first management domain, and the federated learning server is connected to the machine learning model management center;
    the machine learning model management center is configured to send a first machine learning model to the federated learning server;
    the federated learning server is configured to send the first machine learning model to the plurality of federated learning clients;
    each of the plurality of federated learning clients is configured to perform federated learning based on the first machine learning model and network service data obtained by the client, to obtain a respective intermediate machine learning model;
    the federated learning server is further configured to obtain the plurality of intermediate machine learning models obtained by the plurality of federated learning clients, aggregate the plurality of intermediate machine learning models to obtain a second machine learning model, and send the second machine learning model to the machine learning model management center, so that the second machine learning model is used by a device in a second management domain; and
    the machine learning model management center is further configured to replace the first machine learning model with the second machine learning model.
  19. The network system according to claim 18, wherein
    the federated learning server is further configured to send machine learning model requirement information to the machine learning model management center; and
    the machine learning model management center is further configured to send the first machine learning model to the federated learning server based on the machine learning model requirement information.
  20. The network system according to claim 19, wherein the machine learning model requirement information comprises model service information corresponding to the machine learning model and/or a machine learning model training requirement.
  21. The network system according to claim 20, wherein the training requirement comprises at least one of the following: a training environment, an algorithm type, a network structure, a training framework, an aggregation algorithm, or a security model.
  22. The network system according to any one of claims 18 to 21, wherein the second machine learning model is a machine learning model based on a first training framework; and
    the machine learning model management center is further configured to convert the second machine learning model into a third machine learning model, wherein the third machine learning model is a machine learning model based on a second training framework, and the third machine learning model and the second machine learning model are machine learning models corresponding to the same model service information.
  23. The network system according to any one of claims 18 to 22, wherein
    the federated learning server is further configured to send the second machine learning model to the plurality of federated learning clients; and
    each of the plurality of federated learning clients is further configured to perform, based on the second machine learning model, a model service corresponding to the second machine learning model.
  24. A federated learning server, wherein the federated learning server belongs to a first management domain and is connected to a machine learning model management center; and the federated learning server comprises:
    a transceiver unit, configured to obtain a first machine learning model from the machine learning model management center; and
    a processing unit, configured to perform federated learning with a plurality of federated learning clients in the first management domain based on the first machine learning model and local network service data of the first management domain, to obtain a second machine learning model, wherein
    the transceiver unit is further configured to send the second machine learning model to the machine learning model management center, so that the second machine learning model is used by a device in a second management domain.
  25. The federated learning server according to claim 24, wherein the transceiver unit is specifically configured to:
    send machine learning model requirement information to the machine learning model management center; and
    receive the first machine learning model determined by the machine learning model management center based on the machine learning model requirement information.
  26. The federated learning server according to claim 25, wherein the machine learning model requirement information comprises model service information corresponding to the machine learning model and/or a machine learning model training requirement.
  27. The federated learning server according to claim 26, wherein the training requirement comprises at least one of the following: a training environment, an algorithm type, a network structure, a training framework, an aggregation algorithm, or a security model.
  28. The federated learning server according to any one of claims 24 to 27, wherein
    the transceiver unit is further configured to send access permission information of the second machine learning model to the machine learning model management center.
  29. The federated learning server according to any one of claims 24 to 28, wherein
    the transceiver unit is further configured to send the second machine learning model to the plurality of federated learning clients.
  30. The federated learning server according to any one of claims 24 to 29, wherein
    the transceiver unit is specifically configured to send the second machine learning model to the machine learning model management center if an application effect of the second machine learning model meets a preset condition.
  31. The federated learning server according to any one of claims 24 to 30, wherein
    the transceiver unit is further configured to send the first machine learning model to the plurality of federated learning clients, so that the plurality of federated learning clients each perform federated learning based on the first machine learning model and network service data obtained by each of the federated learning clients, to obtain respective intermediate machine learning models; and
    the processing unit is specifically configured to obtain the plurality of intermediate machine learning models obtained by the plurality of federated learning clients, and aggregate the plurality of intermediate machine learning models to obtain the second machine learning model.
  32. A machine learning model management center, wherein the machine learning model management center is connected to a first federated learning server, and the first federated learning server belongs to a first management domain; and the machine learning model management center comprises:
    a sending unit, configured to send a first machine learning model to the first federated learning server;
    a receiving unit, configured to receive a second machine learning model from the first federated learning server, wherein the second machine learning model is obtained by the first federated learning server by performing federated learning with a plurality of federated learning clients in the first management domain based on the first machine learning model and local network service data of the first management domain; and
    a processing unit, configured to replace the first machine learning model with the second machine learning model, so that the second machine learning model is used by a device in a second management domain.
  33. The machine learning model management center according to claim 32, wherein
    the receiving unit is further configured to receive machine learning model requirement information sent by the first federated learning server; and
    the processing unit is further configured to determine the first machine learning model based on the machine learning model requirement information.
  34. The machine learning model management center according to claim 33, wherein the machine learning model requirement information comprises model service information corresponding to the machine learning model and/or a machine learning model training requirement.
  35. The machine learning model management center according to claim 34, wherein the training requirement comprises at least one of the following: a training environment, an algorithm type, a network structure, a training framework, an aggregation algorithm, or a security model.
  36. The machine learning model management center according to any one of claims 32 to 35, wherein the second machine learning model is a machine learning model based on a first training framework; and the processing unit is further configured to:
    convert the second machine learning model into a third machine learning model, wherein the third machine learning model is a machine learning model based on a second training framework, and the third machine learning model and the second machine learning model are machine learning models corresponding to the same model service information.
  37. The machine learning model management center according to any one of claims 32 to 36, wherein
    the receiving unit is further configured to receive access permission information of the second machine learning model sent by the first federated learning server.
  38. The machine learning model management center according to any one of claims 32 to 37, wherein
    the sending unit is further configured to send the second machine learning model to a second federated learning server, wherein the second federated learning server belongs to the second management domain;
    the receiving unit is further configured to receive a fourth machine learning model from the second federated learning server, wherein the fourth machine learning model is obtained by the second federated learning server by performing federated learning with a plurality of federated learning clients in the second management domain based on the second machine learning model and local network service data of the second management domain; and
    the processing unit is further configured to replace the second machine learning model with the fourth machine learning model.
  39. A machine learning model management apparatus, comprising a memory and a processor, wherein the memory is configured to store computer instructions, and the processor is configured to invoke the computer instructions to perform the method according to any one of claims 1 to 15.
  40. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is run on a computer, the computer is enabled to perform the method according to any one of claims 1 to 15.
PCT/CN2021/110111 2020-11-03 2021-08-02 机器学习模型管理方法、装置和系统 WO2022095523A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP21888222.3A EP4224369A4 (en) 2020-11-03 2021-08-02 METHOD, APPARATUS AND SYSTEM FOR MANAGING A MACHINE LEARNING MODEL
JP2023526866A JP7574438B2 (ja) 2020-11-03 2021-08-02 機械学習モデル管理方法及び装置とシステム
US18/309,583 US20230267326A1 (en) 2020-11-03 2023-04-28 Machine Learning Model Management Method and Apparatus, and System

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011212838.9A CN114529005A (zh) 2020-11-03 2020-11-03 机器学习模型管理方法、装置和系统
CN202011212838.9 2020-11-03

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/309,583 Continuation US20230267326A1 (en) 2020-11-03 2023-04-28 Machine Learning Model Management Method and Apparatus, and System

Publications (1)

Publication Number Publication Date
WO2022095523A1 true WO2022095523A1 (zh) 2022-05-12

Family

ID=81457472

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/110111 WO2022095523A1 (zh) 2020-11-03 2021-08-02 机器学习模型管理方法、装置和系统

Country Status (5)

Country Link
US (1) US20230267326A1 (zh)
EP (1) EP4224369A4 (zh)
JP (1) JP7574438B2 (zh)
CN (1) CN114529005A (zh)
WO (1) WO2022095523A1 (zh)

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190032433A (ko) 2016-07-18 2019-03-27 난토믹스, 엘엘씨 분산 머신 학습 시스템들, 장치, 및 방법들
JP6986597B2 (ja) 2017-03-21 2021-12-22 株式会社Preferred Networks サーバ装置、学習済モデル提供プログラム、学習済モデル提供方法及び学習済モデル提供システム
JP6877393B2 (ja) 2017-12-18 2021-05-26 株式会社東芝 システム、プログラム及び方法
CN111309486B (zh) 2018-08-10 2024-01-12 中科寒武纪科技股份有限公司 转换方法、装置、计算机设备和存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830100A (zh) * 2018-05-30 2018-11-16 山东大学 基于多任务学习的用户隐私泄漏检测方法、服务器及系统
CN111325619A (zh) * 2018-12-15 2020-06-23 深圳先进技术研究院 一种基于联合学习的信用卡欺诈检测模型更新方法及装置
US20200293887A1 (en) * 2019-03-11 2020-09-17 doc.ai, Inc. System and Method with Federated Learning Model for Medical Research Applications
CN110490738A (zh) * 2019-08-06 2019-11-22 深圳前海微众银行股份有限公司 一种混合联邦学习方法及架构
CN111243698A (zh) * 2020-01-14 2020-06-05 暨南大学 一种数据安全共享方法、存储介质和计算设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4224369A4

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220182802A1 (en) * 2020-12-03 2022-06-09 Qualcomm Incorporated Wireless signaling in federated learning for machine learning components
WO2024088572A1 (en) * 2023-01-05 2024-05-02 Lenovo (Singapore) Pte. Ltd. Registering and discovering external federated learning clients in a wireless communication system
EP4435655A1 (en) * 2023-03-23 2024-09-25 Hitachi, Ltd. Data management system and data management method

Also Published As

Publication number Publication date
JP2023548530A (ja) 2023-11-17
EP4224369A1 (en) 2023-08-09
CN114529005A (zh) 2022-05-24
US20230267326A1 (en) 2023-08-24
EP4224369A4 (en) 2024-04-17
JP7574438B2 (ja) 2024-10-28

Similar Documents

Publication Publication Date Title
Abdulqadder et al. Multi-layered intrusion detection and prevention in the SDN/NFV enabled cloud of 5G networks using AI-based defense mechanisms
WO2022095523A1 (zh) 机器学习模型管理方法、装置和系统
CN113765713B (zh) 一种基于物联网设备采集的数据交互方法
Srirama A decade of research in fog computing: relevance, challenges, and future directions
US11968537B2 (en) Methods and apparatuses for managing compromised communication devices in a communication network
US11528231B2 (en) Active labeling of unknown devices in a network
Lin Artificial intelligence in 3gpp 5g-advanced: A survey
WO2021151335A1 (zh) 一种网络事件处理方法、装置及可读存储介质
CN105071989A (zh) 视频内容分发质量监控系统及其监控方法
JP2016511451A (ja) ネットワーク機能を開くためのシステムおよび方法、ならびに関連するネットワーク要素
Elmangoush et al. Application-derived communication protocol selection in M2M platforms for smart cities
US10536397B2 (en) Packet count-based object locking protocol
Fragkos et al. NEFSim: An open experimentation framework utilizing 3GPP’s exposure services
Sahay et al. A holistic framework for prediction of routing attacks in IoT-LLNs
US20230262098A1 (en) Packet flow descriptor provisioning
Bahutair et al. Multi-use trust in crowdsourced iot services
CN115460617A (zh) 基于联邦学习的网络负载预测方法、装置、电子设备及介质
US20220322089A1 (en) Network device identification
US20230336432A1 (en) Traffic classification rules based on analytics
KR20220001797A (ko) 무선 통신 네트워크에서 네트워크 분석 정보 제공 방법 및 장치
Fernández et al. Application of multi-pronged monitoring and intent-based networking to verticals in self-organising networks
Gomba et al. Architecture and security considerations for Internet of Things
CN114567678A (zh) 一种云安全服务的资源调用方法、装置及电子设备
Sicari et al. Performance Comparison of Reputation Assessment Techniques Based on Self‐Organizing Maps in Wireless Sensor Networks
US12058098B2 (en) Validation of alignment of wireless and wireline network function configuration with domain name system records

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21888222

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023526866

Country of ref document: JP

ENP Entry into the national phase

Ref document number: 2021888222

Country of ref document: EP

Effective date: 20230505

NENP Non-entry into the national phase

Ref country code: DE