CN111091180A — Model training method and related device
- Publication number: CN111091180A (application number CN201911251338.3A)
- Authority: CN (China)
- Prior art keywords: model, training, parameter, processing, processing node
- Legal status: Granted
Classifications
- G06N3/044 — Computing arrangements based on biological models; neural networks; architecture: recurrent networks, e.g. Hopfield networks
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture: combinations of networks
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
- G10L15/063 — Speech recognition; creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/16 — Speech recognition; speech classification or search using artificial neural networks
Abstract
The embodiments of this application disclose a model training method and a related apparatus. When the i-th training iteration ends, a processing device determines a model parameter mean value according to the model parameters of the network model trained by a plurality of processing nodes. Then, for a target processing node among the plurality of processing nodes, the processing device determines first parameter change information corresponding to the target processing node, where the first parameter change information identifies the change that the model parameters of the network model trained by the target processing node underwent in the i-th training iteration. Finally, the processing device determines, according to the model parameter mean value and the first parameter change information, the initial model parameters of the network model trained by the target processing node at the start of the (i+1)-th training iteration. By adding, on top of the model parameter mean value that reflects the overall training characteristics, parameter change information that reflects each processing node's own training characteristics, the performance loss of the network model when training is finally completed is reduced.
Description
Technical Field
The present application relates to the field of data processing, and in particular, to a model training method and related apparatus.
Background
With the development of artificial intelligence technology, services such as speech recognition, image recognition, and search can be provided to users through neural network models. A high-quality neural network model can only be obtained after training on a large amount of training data; when the training data is of a large magnitude, the time required to complete training is considerable, making it difficult to meet ever-growing service demands.
To address the high time cost of training, some related technologies provide a solution of parallel training with multiple processing nodes. For a data set containing massive training data, multiple processing nodes train the same initial model in parallel. The training process comprises multiple training iterations; when a training iteration ends, the model parameters of the models trained by all processing nodes are averaged, and the averaged model parameters are used as the initial parameters of the model trained by each processing node in the next training iteration.
After the training data is exhausted, the models of the processing nodes are fused to obtain the network model corresponding to the data set. Because the processing nodes draw training data from the data set in parallel during training, the training data is consumed faster and the training time is shortened.
However, with the model parameter averaging adopted in the related art, the resulting network model suffers a certain performance loss relative to a network model trained by a single processing node, and the magnitude of this loss is linearly related to the number of processing nodes used for parallel training. As a result, both training efficiency and training quality are unsatisfactory.
Disclosure of Invention
To solve the above technical problem, this application provides a model training method and a related apparatus. On the basis of the original model average, first parameter change information specific to each processing node is further added as personalized compensation, so that the diversity of the model parameters adopted by different processing nodes is highlighted while the homogeneity of the overall training is still emphasized, reducing the performance loss of the network model when training is finally completed.
The embodiment of the application discloses the following technical scheme:
in a first aspect, an embodiment of the present application provides a model training method, which includes k training iterations in a process of performing parallel training on a network model by multiple processing nodes, and the method includes:
when the ith training iteration is finished, the processing equipment determines a model parameter mean value according to model parameters of the network model trained by the multiple processing nodes, wherein k is more than or equal to 2, and i is less than or equal to k-1;
for a target processing node in the plurality of processing nodes, the processing device determines first parameter change information corresponding to the target processing node, wherein the first parameter change information is used for identifying a change of a model parameter of the network model trained by the target processing node based on an ith training iteration;
and the processing equipment determines the initial model parameters of the network model trained by the target processing node at the beginning of the (i + 1) th training iteration according to the model parameter mean value and the first parameter change information.
In a second aspect, an embodiment of the present application provides a processing apparatus for model training, including k training iterations in a process of performing parallel training on a network model by a plurality of processing nodes, where the processing apparatus includes a first determining unit, a second determining unit, and a third determining unit:
the first determining unit is used for determining a model parameter mean value according to model parameters of the network model trained by the processing nodes when the ith training iteration is finished, wherein k is more than or equal to 2, and i is less than or equal to k-1;
the second determining unit is configured to determine, for a target processing node in the plurality of processing nodes, first parameter change information corresponding to the target processing node, where the first parameter change information is used to identify a change, generated based on an ith training iteration, of a model parameter of the network model trained by the target processing node;
the third determining unit is configured to determine, according to the model parameter mean and the first parameter change information, an initial model parameter of the network model trained by the target processing node at the start of an (i + 1) th training iteration.
In a third aspect, an embodiment of the present application provides an apparatus for model training, where the apparatus includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the model training method according to the first aspect according to instructions in the program code.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium for storing program code for executing the model training method according to the first aspect.
According to the above technical solution, the process of training the network model in parallel through the plurality of processing nodes comprises k training iterations. When one of the training iterations, for example the i-th training iteration, ends, a model parameter mean value can be determined according to the model parameters of the network model trained by the plurality of processing nodes. When determining the initial model parameters of the network model trained by each processing node at the start of the next training iteration, for example the (i+1)-th training iteration, any one of the processing nodes may be taken as a target processing node, and first parameter change information identifying the change of the model parameters of the network model trained by that target processing node in the i-th training iteration is determined. The model parameter mean value reflects the overall training characteristics after the i-th training iteration of the parallel training, i.e. it sets the same training starting point for the plurality of processing nodes; the first parameter change information reflects how the model parameters of the network model trained by the target processing node itself changed under its own training samples during the i-th training iteration, i.e. it guides the target processing node toward its own training direction. The initial model parameters determined on this basis therefore reflect both the overall characteristics of the parallel training and the training characteristics of the individual processing node itself. The diversity of the model parameters adopted by different processing nodes is highlighted while the homogeneity of the overall training is still emphasized, which reduces the performance loss of the network model when training is finally completed and guarantees model quality while improving training efficiency.
Drawings
To illustrate the embodiments of this application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of this application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic scene diagram of a model training method in the prior art according to an embodiment of the present application;
fig. 2 is a schematic view of an application scenario of a model training method provided in an embodiment of the present application;
FIG. 3 is a flow chart of a model training method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a speech recognition system according to an embodiment of the present application;
fig. 5 is a flowchart of a model training method in an application scenario according to an embodiment of the present application;
FIG. 6a is a block diagram of a processing device for model training according to an embodiment of the present disclosure;
FIG. 6b is a block diagram of a processing device for model training according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of an apparatus for model training according to an embodiment of the present disclosure;
fig. 8 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
To increase the training speed of a complex network model, the related art usually trains it in parallel with multiple processing nodes. During parallel training, because the characteristics of the training data drawn by different processing nodes may differ, the network models trained by different processing nodes may diverge considerably. To reduce the training differences between processing nodes, the model parameters of the network models trained by all processing nodes are often averaged at the end of one or more training iterations, and the averaged model parameters are used as the initial model parameters of each processing node at the start of the next training iteration.
As shown in fig. 1, fig. 1 includes a parameter server and a plurality of processing nodes (e.g., 5 processing nodes in the figure). The parameter server acts as the central processing node: when a training iteration ends, it obtains from all processing nodes the model parameters of the network models trained up to the end of that iteration, averages them, and returns the model parameter mean to each processing node as its initial model parameters at the start of the next training iteration.
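For reference, the following is a minimal sketch (not taken from the patent) of the prior-art parameter-server averaging described above; the function name and array shapes are illustrative assumptions.

```python
# Hypothetical sketch of the prior-art averaging in fig. 1: every node receives
# the same averaged parameters back, which is exactly the homogenization the
# patent later tries to avoid.
import numpy as np

def average_and_broadcast(node_params):
    """node_params: list of parameter vectors, one per processing node."""
    mean_params = np.mean(node_params, axis=0)          # model parameter mean
    return [mean_params.copy() for _ in node_params]    # identical start point for every node
```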
In this approach, the model parameters of all processing nodes are averaged, and the averaged value serves as the initial model parameters of each processing node's next training iteration, so every processing node receives identical initial model parameters. In reality, because the training data of different processing nodes differ to some extent, the network models trained by different processing nodes should match the characteristics of their own training data and therefore differ somewhat, so that the finally trained network model better matches the actual situation. When every processing node receives the same initial model parameters, the network models they train lack diversity and tend excessively toward homogeneity, which easily causes severe performance loss.
To solve this technical problem, this application provides a model training method that can be applied to scenarios in which multiple processing nodes train the same network model in parallel. Any of the processing nodes may be a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or the like; the multiple processing nodes may be configured in the same processing device or in different processing devices, and a processing device configured with processing nodes may be a server, a terminal, or the like.
The training process requires k training iterations; the value of k is related to the number of training samples used to train the network model, the number of samples each processing node uses per training pass, and so on, and in general k is an integer greater than or equal to 2. The i-th training iteration mentioned later in the embodiments of this application may be any one of the k training iterations; because the parallel training of the network model is completed after the last training iteration, i ≤ k − 1.
In this embodiment of the present application, after the ith training iteration is performed on the multiple processing nodes, initial model parameters of the network model trained by each processing node at the beginning of the (i + 1) th training iteration may be respectively determined for the multiple processing nodes.
The initial model parameters in the embodiments of this application identify the model parameters of the network model at the start of the (i+1)-th training iteration, that is, the model parameters from which the network model trained by the target processing node starts the (i+1)-th training iteration.
For any processing node, for example a target processing node, at the end of the i-th training iteration a model parameter mean value may be determined according to the model parameters of the network models trained by all processing nodes; first parameter change information may be determined based on the change of the model parameters of the network model trained by the target processing node during the i-th training iteration; and the initial model parameters of the network model trained by the target processing node at the start of the (i+1)-th training iteration may then be determined from the model parameter mean value and the first parameter change information. Since the first parameter change information may differ between processing nodes, the initial model parameters determined for different processing nodes may also differ.
In the embodiments of this application, the model parameter mean value determined at the end of the i-th training iteration reflects central-tendency information of the model parameters of the network models trained by the plurality of processing nodes after the i-th training iteration; this information reflects the overall characteristics of the model parameters of the plurality of processing nodes. The model parameter mean value may be obtained, for example, by computing an arithmetic mean, or by other data calculation or numerical-distribution analysis methods.
The first parameter change information determined for the target processing node at the end of the i-th training iteration identifies how the model parameters of the network model trained by the target processing node changed in the i-th training iteration. It therefore reflects the training characteristics produced by training the network model on the training samples that the target processing node drew during the i-th training iteration. Because different processing nodes draw different training samples in the i-th training iteration and may start from different initial model parameters, the first parameter change information determined for different processing nodes may differ.
On the basis of using the model parameter mean value to reflect the overall training characteristics of all processing nodes, first parameter change information is determined separately for each target processing node. The first parameter change information determined when different processing nodes serve as the target processing node may differ and is therefore diverse, so the initial model parameters determined from the model parameter mean value and the first parameter change information reflect both the overall characteristics of the parallel training and the training characteristics of each individual processing node. This improves the diversity of the model parameters, reduces the performance loss of the network model when training is finally completed, and guarantees model quality while improving training efficiency.
The technical scheme provided by the embodiment of the application can be applied to data processing equipment with model parameter processing and model parameter configuration capabilities, such as a server, a terminal and the like. The data processing device may be configured with part or all of the plurality of processing nodes, or may be a parameter server independent of the plurality of processing nodes.
To facilitate understanding of the technical solution of this application, the model training method provided by the embodiments of this application is introduced below in combination with a practical application scenario. In the possible application scenario shown in fig. 2, there are 6 processing nodes, which perform parallel training on the same network model and may be distributed across one or more servers; the 6 processing nodes are identified by the numerals 10 to 60.
Initial model parameters at the start of the (i+1)-th training iteration can be calculated separately for the network models trained by the 6 processing nodes. When calculating the initial model parameters for any one of the 6 processing nodes, that node serves as the target processing node; for example, processing node 10 may be taken as the target processing node. The parameter server 201, on which none of the above 6 processing nodes is configured, can serve as the aforementioned data processing device and calculate the initial model parameters adopted by the 6 processing nodes at the start of the (i+1)-th training iteration.
In the scenario shown in fig. 2, the process of parallel training the network model includes k training iterations, and at the end of the ith training iteration, the parameter server 201 determines a model parameter average value according to the model parameters of the network model trained by the 6 processing nodes, for example, by averaging the model parameters of the processing nodes 10 to 60.
After determining the mean value of the model parameters, the parameter server 201 determines, for the processing node 10, first parameter variation information corresponding to the processing node 10, where the first parameter variation information can represent a variation of the model parameters of the network model trained by the processing node 10 based on the ith training iteration.
After determining the model parameter mean value and the first parameter variation information, the parameter server 201 determines the initial model parameters of the network model trained by the processing node 10 at the start of the (i + 1) th training iteration according to the model parameter mean value and the first parameter variation information, and returns the initial model parameters to the processing node 10 for performing the (i + 1) th training iteration.
When determining the initial model parameters of processing node 10, the model parameters of all processing nodes are first averaged; the model parameter mean value reflects the overall training characteristics after the i-th training iteration of the parallel training. The first parameter change information of processing node 10 is then determined; it reflects the change trend of processing node 10's model parameters through the i-th training iteration and can guide the training direction of processing node 10 in the (i+1)-th training iteration. The initial model parameters determined from the model parameter mean value and the first parameter change information therefore reflect both the overall characteristics of the i-th training iteration and the model training characteristics of processing node 10 itself.
Meanwhile, when different processing nodes are taken as the target processing node, for example processing node 20 and processing node 30 in fig. 2, their training data and training directions may differ, so the changes of the model parameters of their trained network models in the i-th training iteration may differ, and the first parameter change information determined for them may differ. This embodies the diversity of the model training characteristics of different processing nodes, reduces the homogenization between the network models trained by different processing nodes, reduces the performance loss of the network model when training is finally completed, and guarantees model quality while improving training efficiency.
Next, a model training method provided by the embodiments of the present application will be described with reference to the drawings.
Referring to fig. 3, fig. 3 shows a flowchart of a model training method applied in the process of training a network model in parallel through a plurality of processing nodes, where the parallel training process includes k training iterations. The method comprises the following steps:
s301: and when the ith training iteration is finished, the processing equipment determines the mean value of the model parameters according to the model parameters of the network model trained by the multiple processing nodes.
The network model may be any of various models, such as an acoustic model or a language model used in speech recognition, or an image model used in face recognition.
When a plurality of processing nodes train the network model in parallel, the training parameters obtained by different processing nodes and their initial training models may differ, so the network models trained by different processing nodes may differ to some extent. To prevent the finally trained network model from performing poorly because the differences between the network models trained by different processing nodes are too large, in the embodiments of this application the network models trained by different processing nodes are model-averaged at the end of each training iteration; fig. 3 corresponds to the model averaging performed at the end of the i-th training iteration.
To reduce the differences between the network models trained by the multiple processing nodes and highlight the overall characteristics of the parallel training, the data processing device obtains the model parameters of the network models trained by the multiple processing nodes at the end of the i-th training iteration and determines a model parameter mean value from them. The model parameter mean value can be determined in various ways, for example by directly computing the mean of the model parameters, or by other data analysis methods such as analyzing the distribution of the model parameters.
It can be understood that different determination methods may produce different averaging effects. When the model parameter mean value is determined by computing an arithmetic mean, it takes into account all the training directions of the network models trained by the plurality of processing nodes and allows global correction; when it is determined by analyzing the distribution of the model parameters, it tends toward the training direction followed by the larger share of processing nodes, and can thus most strongly highlight the most prominent model characteristic among the network models being trained in parallel. Practitioners may select different model averaging modes according to actual training requirements.
It can be understood that, although their effects may differ, all of the above manners determine the model parameter mean value from the model parameters of the plurality of processing nodes and can, to a certain extent, reflect the overall training characteristics of the network models trained by the plurality of processing nodes.
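As a small illustration of the two averaging modes just mentioned, the sketch below contrasts a plain arithmetic mean with a distribution-oriented alternative; the coordinate-wise median is an illustrative assumption, since the patent does not fix a particular distribution-analysis method.

```python
import numpy as np

def model_parameter_mean(node_params, mode="mean"):
    """Two illustrative ways to derive the 'model parameter mean' over nodes."""
    stacked = np.stack([np.asarray(w) for w in node_params])
    if mode == "mean":
        return stacked.mean(axis=0)       # global correction over all training directions
    return np.median(stacked, axis=0)     # leans toward the direction most nodes follow
```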
In addition, different model averaging modes correspond to different parallel training methods. For example, when the model parameter mean value is determined by computing the mean of the model parameters, the network model can be trained in parallel with the Model Average (MA) algorithm, the Blockwise Model-Update Filtering (BMUF) algorithm, and the like. The technical improvement provided by the embodiments of this application is described below based on the model-averaging scenario of the MA algorithm. In the embodiments of this application, the model parameters of the network model trained by the j-th processing node, obtained by the data processing device at the end of the i-th training iteration, may be denoted W_j(t), where t is the number of mini-batch training samples the network model has passed through when trained by a conventional stochastic gradient descent procedure. When the plurality of processing nodes is N processing nodes, j = 1, 2, …, N, and the determined model parameter mean value may be denoted W̄(t), determined by the following formula:

W̄(t) = (1/N) · Σ_{j=1}^{N} W_j(t)
s302: for a target processing node in the plurality of processing nodes, the processing device determines first parameter change information corresponding to the target processing node.
As mentioned above, the network models trained by different processing nodes may differ to some extent, and when the differences are too large the finally trained network model may perform poorly. In the MA algorithm, however, only model averaging is performed, and the resulting model parameter mean is used directly as the initial model parameters of the network model trained by each processing node at the start of the (i+1)-th training iteration. Although this reduces the differences between the network models, it makes the network models trained by different processing nodes overly homogeneous, loses the characteristics of each node's own training, and causes a large performance loss in the finally trained network model. To emphasize the training characteristics of the network model trained by each processing node, in the embodiments of this application, on the basis of determining the model parameter mean value over the processing nodes as a whole, corresponding first parameter change information is determined for each processing node taken as the target processing node. The first parameter change information identifies the change that the model parameters of the network model trained by the target processing node underwent in the i-th training iteration.
For example, the model parameters of the network model trained by the target processing node at the end of the i-th training iteration and its model parameters at the start of the i-th training iteration may be obtained, their difference may be calculated, and the difference may be used as the first parameter change information. Let the target processing node be processing node n, with n ≤ N, and let G_n be an intermediate variable used to store a parameter value during the calculation; here G_n represents the first parameter change information, given by the following formula:

G_n = W_n(t) − W_n(t − τ)

where τ is the number of mini-batches over which the network model trained by the target processing node is updated in one training iteration, and t is the number of mini-batches the network model trained by the target processing node has passed through at the end of the i-th iteration; t − τ therefore corresponds to the start of the i-th training iteration, and W_n(t − τ) represents the model parameters of the target processing node at the start of the i-th training iteration. Because the change of the model parameters during the i-th training iteration depends to a certain extent on the training direction of the network model trained by the target processing node, the first parameter change information reflects the training characteristics of the network model trained by the target processing node.
Meanwhile, the training direction depends on the training parameters and the initial training model on which the target processing node's model training is based. When different processing nodes are taken as the target processing node, their training parameters and initial training models may differ, so their training directions during the i-th training iteration may differ and the first parameter change information determined for them may differ, thereby reflecting the different training characteristics of the network models trained by different processing nodes.
It can be understood that a certain training error may exist during model training. When determining the first parameter change information, the difference produced by the model parameters of the network model trained by the target processing node during the i-th training iteration may therefore also be combined with a preset error parameter. For example, when the error parameter is a, the formula for determining the first parameter change information becomes:

G_n = a · (W_n(t) − W_n(t − τ))
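As a concrete illustration, the following sketch computes the first parameter change information as described above; the function name, array types, and the default error parameter are illustrative assumptions rather than values given by the patent.

```python
import numpy as np

def first_param_change(w_end, w_start, error_param=1.0):
    """First parameter change information G_n for one target processing node.

    w_end:       model parameters W_n(t), at the end of the i-th training iteration
    w_start:     model parameters W_n(t - tau), at the start of the i-th iteration
    error_param: preset error parameter a (a = 1.0 recovers the plain difference)
    """
    return error_param * (np.asarray(w_end) - np.asarray(w_start))
```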
s303: and the processing equipment determines the initial model parameters of the network model trained by the target processing node at the beginning of the (i + 1) th training iteration according to the model parameter mean value and the first parameter change information.
As mentioned above, the model parameter mean value determined by the data processing device reflects the training characteristics of the processing nodes as a whole, while the first parameter change information reflects the training direction and training characteristics of the network model trained by the target processing node itself. If the initial model parameters were determined from the model parameter mean value alone, the finally trained network models might be overly homogeneous; if they were determined from the first parameter change information alone, the network-model training characteristics of the target processing node would be overemphasized, the differences between the network models trained by different processing nodes might become too large, and the finally trained network model would perform poorly. Therefore, to highlight the training characteristics of the network model trained by the target processing node while still reflecting the training characteristics of the processing nodes as a whole, the data processing device determines the initial model parameters of the network model trained by the target processing node at the start of the (i+1)-th training iteration from both the model parameter mean value and the first parameter change information.
For example, the first parameter change information may be used as an increment applied to the model parameter mean value, and the result of this incremental processing may be used as the initial model parameters of the network model trained by the target processing node at the start of the (i+1)-th training iteration. Denoting these initial model parameters W_n(t), the determination can be written as:

W_n(t) = W̄(t) + G_n

Because the first parameter change information reflects the training direction of the network model trained by the target processing node in the i-th training iteration, adding it to the model parameter mean value is equivalent to steering the model training toward the target processing node's own training direction on top of the training characteristics of the processing nodes as a whole, thereby preserving the training characteristics of the target processing node.
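Putting the two previous formulas together, the sketch below derives, under the same illustrative naming assumptions as above, the per-node initial model parameters for the (i+1)-th iteration.

```python
import numpy as np

def next_initial_params(node_params_end, node_params_start, error_param=1.0):
    """Per-node initial parameters for the (i+1)-th iteration: W_n(t) = mean + G_n.

    node_params_end:   list of W_n(t) arrays, one per processing node, end of iteration i
    node_params_start: list of W_n(t - tau) arrays, start of iteration i
    """
    ends = [np.asarray(w) for w in node_params_end]
    starts = [np.asarray(w) for w in node_params_start]
    mean_params = np.mean(ends, axis=0)                        # model parameter mean over all nodes
    return [mean_params + error_param * (w_end - w_start)      # mean plus node-specific change G_n
            for w_end, w_start in zip(ends, starts)]
```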
It can be understood that, when the target processing node performs the k training iterations, factors such as the quality of the training parameters or errors arising during a training iteration may cause an anomaly in one or more of the k training iterations. The model parameters of the network model trained by the target processing node may then change abnormally, and the first parameter change information determined from that change may itself be anomalous, making it difficult to correctly reflect the training characteristics of the network model trained by the target processing node.
To reduce, to a certain extent, the influence of one or several anomalous training iterations on the network model trained by the target processing node, in a possible implementation the model parameters produced by training iterations preceding the anomalous one can be used to determine historical parameter change information and correct the current training iteration.
For example, when an anomaly occurs in a training iteration, the preceding training iteration is very likely free of anomalies. Therefore, at the end of the i-th training iteration, the data processing device may also determine, for the target processing node, corresponding second parameter change information that identifies the change the model parameters of the network model trained by the target processing node underwent in the (i−1)-th training iteration. After the second parameter change information is determined, the initial model parameters of the network model trained by the target processing node at the start of the (i+1)-th training iteration may be determined from the model parameter mean value, the first parameter change information, and the second parameter change information. Moreover, because the model-parameter changes represented by the first and second parameter change information come from two adjacent training iterations, correcting the i-th training iteration with the second parameter change information also captures the continuity of model training, giving the finally trained network model higher quality.
It is understood that the method of determining the second parameter change information is similar to that of the first, and many methods are possible; in this embodiment the difference method may be used. The second parameter change information may be the difference between the model parameters of the target processing node at the end of the (i−1)-th training iteration and its model parameters at the start of the (i−1)-th training iteration:

G_n = W_n(t − τ) − W_n(t − 2τ)
the specific calculation formula for determining the first parameter variation information is as follows:
the significance of the symbol is that the calculation result on the right side of the symbol is assigned to a variable on the left side of the symbol, η >0 is used as a block learning rate (block learning rate), the block learning rate is an important parameter for supervising a certain processing node to carry out model training, whether and when an objective function in a trained network model can converge to a local minimum value are determined, and the weight of the m historical gradients (block momentum rate) is used for reflecting the influence of the historical gradients of the target processing node.
The initial model parameters at the start of the (i+1)-th training iteration are then determined by the following formula:

W_n(t) ← W̄(t) + G_n

The first parameter change information determined in this way therefore not only reflects the training characteristics of the target processing node but is also corrected by the historical parameters, reducing the influence of anomalous training iterations on model training. Meanwhile, adding the block learning rate and the weight of the historical gradient as calculation factors further captures the history of model training, so the determined initial model parameters stay closer to the self-training characteristics of the network model trained by the target processing node, interference from other factors is reduced, and a certain robustness is achieved.
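A compact sketch of this momentum-corrected variant, under the same illustrative naming assumptions, might look as follows; the default values of the block learning rate and momentum weight are placeholders, not values given by the patent.

```python
import numpy as np

def corrected_initial_params(w_end, w_start, mean_params, prev_change,
                             block_lr=1.0, momentum=0.9):
    """Momentum-corrected update for one target processing node.

    w_end, w_start: W_n(t) and W_n(t - tau) for the i-th iteration
    mean_params:    model parameter mean over the processing nodes
    prev_change:    second parameter change information G_n from iteration i-1
    Returns (new_initial_params, new_change); new_change is reused in the next iteration.
    """
    g_n = (momentum * np.asarray(prev_change)
           + block_lr * (np.asarray(w_end) - np.asarray(w_start)))
    return np.asarray(mean_params) + g_n, g_n
```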
It can be understood that, when some network models are trained in parallel, to further reduce the performance loss of the finally trained network model, the model parameter mean value may be determined over a subset of the processing nodes: some processing nodes, fewer than the plurality of processing nodes, are selected as nodes to be calculated, and the model parameter mean value is then determined from the model parameters of the network models trained by these nodes to be calculated. Because the nodes to be calculated are fewer than all the processing nodes, the model parameter mean value determined from them reflects the training characteristics of a local group of processing nodes.
Meanwhile, different (or at least not identical) processing nodes can be chosen as the nodes to be calculated at the end of different training iterations. For example, in the scenario shown in fig. 2, processing node 10, processing node 20, and processing node 30 may be selected as the nodes to be calculated at the end of the i-th training iteration, while processing node 20, processing node 30, and processing node 40 may be selected at the end of the (i+1)-th training iteration.
The training characteristics of the local processing nodes reflected by the model parameter mean value determined from the nodes to be calculated may therefore differ between iterations, providing a certain diversity and further reducing the performance loss of the finally trained network model. In addition, at the end of each training iteration, fewer model parameters need to be processed to determine the model parameter mean value than if all processing nodes were used, which speeds up the parallel training of the network model to a certain extent.
It can be understood that, to further increase the diversity of model training and reduce the performance loss of the trained network model, different nodes to be calculated can be selected not only for each training iteration but also for different target processing nodes within the same training iteration, with the model parameter mean value determined from those nodes serving as the model parameter mean value corresponding to that target processing node. For example, in the scenario shown in fig. 2, when processing node 10 is the target processing node, processing node 20 and processing node 30 may be selected as the nodes to be calculated; when processing node 20 is the target processing node, processing node 10 and processing node 30 may be selected. Because the nodes to be calculated selected for each target processing node may differ, the determined model parameter mean values may differ, and the initial model parameters determined from these mean values and the first parameter change information may also differ. At the start of each training iteration, the initial training parameters of the network models trained by the plurality of processing nodes are then not all identical, each reflecting the training characteristics of a different subset of processing nodes, which avoids excessive homogenization. The training diversity of subsets of processing nodes is thus highlighted without harming the integrity of the parallel training. After the k training iterations, the performance loss of the finally trained network model can be effectively reduced, and model quality is guaranteed while training efficiency is improved.
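The sketch below illustrates one way (an assumption, since the patent does not fix a selection rule) to draw a different subset of nodes to be calculated for each target processing node and compute its per-target model parameter mean.

```python
import random
import numpy as np

def per_target_means(node_params, subset_size, include_target=True, seed=None):
    """For each target node, average the parameters of a subset of 'nodes to be calculated'.

    node_params:    list of parameter arrays, one per processing node
    subset_size:    number of nodes to be calculated per target (< len(node_params))
    include_target: if True, the target node is always one of the nodes to be calculated
    """
    rng = random.Random(seed)
    means = []
    for target in range(len(node_params)):
        others = [j for j in range(len(node_params)) if j != target]
        picked = rng.sample(others, subset_size - 1 if include_target else subset_size)
        if include_target:
            picked.append(target)
        means.append(np.mean([np.asarray(node_params[j]) for j in picked], axis=0))
    return means
```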
It can be understood that in some circumstances the network models trained by the nodes to be calculated may differ considerably from the network model trained by the target processing node. If the initial model parameters of the target processing node were determined only from the model parameters of the nodes to be calculated, they would reflect the training characteristics of that local group of processing nodes but might differ greatly from the training characteristics of the network model trained by the target processing node itself; the target processing node would then lean too far toward the local training characteristics when training the network model, and its training effect would actually be reduced. To highlight the characteristics of the network model trained by the target processing node itself while still applying an appropriate correction to its model training, in a possible implementation the target processing node may be one of the nodes to be calculated.
According to the above technical solution, the process of training the network model in parallel through the plurality of processing nodes comprises k training iterations. When one of the training iterations, for example the i-th training iteration, ends, a model parameter mean value can be determined according to the model parameters of the network model trained by the plurality of processing nodes. When determining the initial model parameters of the network model trained by each processing node at the start of the next training iteration, for example the (i+1)-th training iteration, any one of the processing nodes may be taken as a target processing node, and first parameter change information identifying the change of the model parameters of the network model trained by that target processing node in the i-th training iteration is determined. The model parameter mean value reflects the overall training characteristics after the i-th training iteration of the parallel training, i.e. it sets the same training starting point for the plurality of processing nodes; the first parameter change information reflects how the model parameters of the network model trained by the target processing node itself changed under its own training samples during the i-th training iteration, i.e. it guides the target processing node toward its own training direction. The initial model parameters determined on this basis therefore reflect both the overall characteristics of the parallel training and the training characteristics of the individual processing node itself. The diversity of the model parameters adopted by different processing nodes is highlighted while the homogeneity of the overall training is still emphasized, which reduces the performance loss of the network model when training is finally completed and guarantees model quality while improving training efficiency.
Next, the model training method provided by the embodiments of this application is described in combination with a practical application scenario. The application scenario is speech recognition; the speech recognition system includes a preprocessing module 401, a word boundary detection module 402, a Mel-frequency cepstral coefficient feature module 403, an acoustic model and language model module 404, and an authentication module 405. The model training approach provided by the embodiments of this application can be applied to training the acoustic model and the language model in this scenario, achieving high-quality and efficient training.
The operation of the above modules in this scenario is briefly described as follows:
a preprocessing module 401, configured to receive an input voice signal and perform preprocessing;
a word boundary detection module 402, configured to perform word boundary detection on the preprocessed voice signal, and determine whether the voice signal is human voice audio;
a mel-frequency cepstrum coefficient feature module 403, configured to extract mel-frequency cepstrum coefficient features from the audio data after determining that the audio is human audio;
an acoustic model and language model module 404, configured to recognize the audio data through the acoustic model and the language model;
and an authentication module 405 for authenticating and outputting the identification result.
The acoustic model and language model module uses n processing nodes to train a long short-term memory (LSTM) acoustic model in parallel; the parallel training method is the MA algorithm optimized with the technical solution of this application. A flowchart of the model training method for the speech recognition system is shown in fig. 5. The method comprises:
s501: the data to be trained is divided into n parts and sent to n processing nodes.
Before parallel training is started, data to be trained needs to be sent to each processing node.
S502: and each processing node reads data to perform model training.
S503: and when the ith training iteration is finished, obtaining model parameters of the network model trained by all the processing nodes.
S504: and calculating the mean value of the model parameters according to the obtained model parameters.
After the data processing device obtains the model parameters of the network models trained by all processing nodes, it computes their average and uses it as the model parameter mean value.
S505: and determining a target processing node, and acquiring model parameters of the network model trained by the target processing node when the ith training iteration is finished.
S506: and obtaining model parameters of the network model trained by the target processing node at the beginning of the ith training iteration.
The data processing device can obtain model parameters of the network model trained by the target processing node at the beginning of the ith training iteration from historical calculation data stored in the data processing device.
S507: and calculating first parameter change information corresponding to the target processing node according to the acquired model parameters.
The data processing device takes the difference between the two sets of model parameters to obtain the first parameter change information.
S508: and determining initial model parameters of the network model trained by the target processing node at the beginning of the (i + 1) th training iteration according to the model parameter mean value and the first parameter change information.
The data processing device uses the first parameter change information as an increment, adds it to the model parameter mean value, and obtains the initial model parameters of the network model trained by the target processing node at the start of the (i+1)-th training iteration.
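To tie steps S501 through S508 together, the following is a hypothetical end-to-end sketch of one iteration of the optimized MA procedure described above; train_one_iteration stands in for each node's local training pass and, like the other names here, is an illustrative assumption rather than part of the patent.

```python
import numpy as np

def parallel_training_round(node_params, node_data, train_one_iteration):
    """One iteration of the optimized MA procedure (steps S503 to S508).

    node_params: list of parameter arrays at the start of iteration i
    node_data:   list of data shards, one per processing node (step S501)
    train_one_iteration(params, shard) -> updated params (each node's local training, S502)
    """
    params_start = [np.asarray(w) for w in node_params]
    params_end = [train_one_iteration(w, d) for w, d in zip(params_start, node_data)]  # S502/S503
    mean_params = np.mean(params_end, axis=0)                                           # S504
    # S505-S508: per-node first parameter change information added on top of the mean
    return [mean_params + (w_end - w_start)
            for w_end, w_start in zip(params_end, params_start)]
```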
Based on the model training method provided by the foregoing embodiment, this embodiment provides a processing apparatus 600 for model training, referring to fig. 6a, the processing apparatus 600 includes a first determining unit 601, a second determining unit 602, and a third determining unit 603:
a first determining unit 601, configured to determine a model parameter mean value according to model parameters of a network model trained by multiple processing nodes when an ith training iteration is finished, where k is greater than or equal to 2, and i is less than or equal to k-1;
a second determining unit 602, configured to determine, for a target processing node in the multiple processing nodes, first parameter change information corresponding to the target processing node, where the first parameter change information is used to identify a change, generated based on an i-th training iteration, of a model parameter of a network model trained by the target processing node;
a third determining unit 603, configured to determine, according to the model parameter mean and the first parameter variation information, an initial model parameter of the network model trained by the target processing node at the start of the (i + 1) th training iteration.
In a possible implementation manner, the third determining unit 603 is specifically configured to:
performing incremental processing on the mean value of the model parameters by taking the first parameter change information as an increment;
and taking the result of the incremental processing as the initial model parameters of the network model trained by the target processing node at the beginning of the (i + 1) th training iteration.
In one possible implementation, referring to fig. 6b, the processing device 600 further comprises a fourth determining unit 604:
a fourth determining unit 604, configured to determine second parameter change information corresponding to the target processing node, where the second parameter change information is used to identify a change, generated based on the i-1 st training iteration, of a model parameter of the network model trained by the target processing node;
the third determining unit 603 is specifically configured to:
and determining initial model parameters of the network model trained by the target processing node at the beginning of the (i + 1) th training iteration according to the model parameter mean value, the first parameter change information and the second parameter change information.
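As a rough illustration of this variant, the sketch below combines the model parameter mean with both increments. The exact combination rule is not spelled out in this passage, so a simple weighted sum with a hypothetical weight beta is assumed here purely for illustration.

```python
import numpy as np

# delta_i: first parameter change information (change produced by the ith iteration)
# delta_prev: second parameter change information (change produced by the (i-1)th iteration)
# beta: hypothetical weight; the actual combination rule is not fixed by this passage
def next_initial_params_with_history(mean, delta_i, delta_prev, beta=0.5):
    return mean + delta_i + beta * delta_prev

mean = np.zeros(3)
delta_i = np.array([0.1, -0.2, 0.05])
delta_prev = np.array([0.08, -0.15, 0.0])
print(next_initial_params_with_history(mean, delta_i, delta_prev))
```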
In a possible implementation manner, the first determining unit 601 is specifically configured to:
determining, among the plurality of processing nodes, the processing nodes serving as nodes to be calculated, wherein the number of nodes to be calculated is less than the number of the plurality of processing nodes;
and determining the mean value of the model parameters according to the model parameters of the network model trained by the nodes to be calculated.
In one possible implementation, the target processing node is one of the nodes to be computed.
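A possible reading of this implementation is sketched below: only a subset of the processing nodes contributes to the mean. The selection criterion is not fixed by this passage, so excluding the node whose parameters deviate most from the overall mean is assumed here purely as an example.

```python
import numpy as np

def subset_mean(params_end, n_drop=1):
    # params_end: per-node parameter vectors at the end of the iteration
    stacked = np.stack(params_end)                                    # shape: (nodes, n_params)
    distance = np.linalg.norm(stacked - stacked.mean(axis=0), axis=1)  # deviation of each node
    keep = np.argsort(distance)[: len(params_end) - n_drop]            # nodes to be calculated
    return keep, stacked[keep].mean(axis=0)

params = [np.random.randn(3) for _ in range(4)]
keep, mean = subset_mean(params)
print("nodes used:", keep, "mean:", mean)
```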
The embodiments of the present application further provide an apparatus for model training, which is described below with reference to the accompanying drawings. Referring to fig. 7, an embodiment of the present application provides an apparatus 700 for model training. The apparatus 700 may also be a terminal device, and the terminal device may be any intelligent terminal including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a Point of Sales (POS) terminal, or a vehicle-mounted computer. The following description takes the terminal device being a mobile phone as an example:
fig. 7 is a block diagram illustrating a partial structure of a mobile phone related to a terminal device provided in an embodiment of the present application. Referring to fig. 7, the handset includes: a Radio Frequency (RF) circuit 710, a memory 720, an input unit 730, a display unit 740, a sensor 750, an audio circuit 760, a wireless fidelity (WiFi) module 770, a processor 780, and a power supply 790. Those skilled in the art will appreciate that the handset configuration shown in fig. 7 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 7:
The RF circuit 710 may be used for receiving and transmitting signals during information transmission and reception or during a call; in particular, it delivers downlink information received from a base station to the processor 780 for processing, and transmits uplink data to the base station. In general, the RF circuit 710 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 710 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Message Service (SMS), and the like.
The memory 720 may be used to store software programs and modules, and the processor 780 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 720. The memory 720 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the mobile phone, and the like. Further, the memory 720 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
The input unit 730 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the cellular phone. Specifically, the input unit 730 may include a touch panel 731 and other input devices 732. The touch panel 731, also referred to as a touch screen, can collect touch operations of a user (e.g. operations of the user on or near the touch panel 731 by using any suitable object or accessory such as a finger, a stylus, etc.) and drive the corresponding connection device according to a preset program. Alternatively, the touch panel 731 may include two portions of a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts it to touch point coordinates, and sends the touch point coordinates to the processor 780, and can receive and execute commands from the processor 780. In addition, the touch panel 731 may be implemented by various types, such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The input unit 730 may include other input devices 732 in addition to the touch panel 731. In particular, other input devices 732 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 740 may be used to display information input by the user or information provided to the user and various menus of the mobile phone. The display unit 740 may include a display panel 741, and optionally, the display panel 741 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch panel 731 can cover the display panel 741, and when the touch panel 731 detects a touch operation on or near the touch panel 731, the touch operation is transmitted to the processor 780 to determine the type of the touch event, and then the processor 780 provides a corresponding visual output on the display panel 741 according to the type of the touch event. Although in fig. 7, the touch panel 731 and the display panel 741 are two independent components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 731 and the display panel 741 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 750, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display panel 741 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 741 and/or a backlight when the mobile phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor may be further configured on the mobile phone, which are not described herein again.
WiFi belongs to short-distance wireless transmission technology, and the mobile phone can help a user to receive e-mails, browse webpages, access streaming media and the like through the WiFi module 770, and provides wireless broadband internet access for the user. Although fig. 7 shows the WiFi module 770, it is understood that it does not belong to the essential constitution of the handset, and can be omitted entirely as needed within the scope not changing the essence of the invention.
The processor 780 is a control center of the mobile phone, connects various parts of the entire mobile phone using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 720 and calling data stored in the memory 720, thereby integrally monitoring the mobile phone. Optionally, processor 780 may include one or more processing units; preferably, the processor 780 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 780.
The handset also includes a power supply 790 (e.g., a battery) for powering the various components, which may preferably be logically coupled to the processor 780 via a power management system, such that the power management system may be used to manage charging, discharging, and power consumption.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In this embodiment, the processor 780 included in the terminal device further has the following functions:
when the ith training iteration is finished, determining a model parameter mean value according to model parameters of a network model trained by a plurality of processing nodes, wherein k is more than or equal to 2, and i is less than or equal to k-1;
determining first parameter change information corresponding to a target processing node aiming at the target processing node in the plurality of processing nodes, wherein the first parameter change information is used for identifying the change of model parameters of a network model trained by the target processing node based on the ith training iteration;
and determining initial model parameters of the network model trained by the target processing node at the beginning of the (i + 1) th training iteration according to the model parameter mean value and the first parameter change information.
Referring to fig. 8, fig. 8 is a block diagram of a server 800 provided in this embodiment. The server 800 may vary considerably in configuration or performance, and may include one or more Central Processing Units (CPUs) 822 (e.g., one or more processors), a memory 832, and one or more storage media 830 (e.g., one or more mass storage devices) storing an application 842 or data 844. The memory 832 and the storage medium 830 may be transient or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown), and each module may include a series of instruction operations for the server. Still further, the central processor 822 may be configured to communicate with the storage medium 830 and execute, on the server 800, the series of instruction operations in the storage medium 830.
The server 800 may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, one or more input-output interfaces 858, and/or one or more operating systems 841, such as Windows Server, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, and so forth.
With respect to the steps performed by the server in the foregoing embodiments, an embodiment of the present application further provides a computer-readable storage medium for storing program code, where the program code is used to perform any one implementation of the model training method described in the foregoing embodiments.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments may be completed by hardware related to program instructions. The program may be stored in a computer-readable storage medium, and when executed, the program performs the steps of the above method embodiments; the aforementioned storage medium may be at least one of the following media capable of storing program code: a read-only memory (ROM), a RAM, a magnetic disk, or an optical disk.
It should be noted that, in the present specification, all the embodiments are described in a progressive manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus and system embodiments, since they are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described embodiments of the apparatus and system are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (13)
1. A method of model training comprising k training iterations in a process of parallel training a network model by a plurality of processing nodes, the method comprising:
when the ith training iteration is finished, the processing equipment determines a model parameter mean value according to model parameters of the network model trained by the multiple processing nodes, wherein k is more than or equal to 2, and i is less than or equal to k-1;
for a target processing node in the plurality of processing nodes, the processing device determines first parameter change information corresponding to the target processing node, where the first parameter change information is used to identify a change, generated based on an i-th training iteration, of a model parameter of the network model trained by the target processing node;
and the processing equipment determines initial model parameters of the network model trained by the target processing node at the beginning of the (i + 1) th training iteration according to the model parameter mean value and the first parameter change information.
2. The method of claim 1, wherein the processing device determines initial model parameters of the network model trained by the target processing node at the beginning of an i +1 th training iteration according to the model parameter mean and the first parameter variation information, and comprises:
the processing equipment takes the first parameter change information as an increment to carry out increment processing on the model parameter mean value;
the processing device takes the result of the incremental processing as initial model parameters of the network model trained by the target processing node at the start of the (i + 1) th training iteration.
3. The method according to claim 1 or 2, wherein when i >1, the method further comprises:
the processing equipment determines second parameter change information corresponding to the target processing node, wherein the second parameter change information is used for identifying the change of the model parameter of the network model trained by the target processing node based on the (i-1) th training iteration;
the processing device determines initial model parameters of the network model trained by the target processing node at the beginning of the (i + 1) th training iteration according to the model parameter mean value and the first parameter variation information, and includes:
and the processing equipment determines initial model parameters of the network model trained by the target processing node at the beginning of the (i + 1) th training iteration according to the model parameter mean value, the first parameter change information and the second parameter change information.
4. The method of claim 1 or 2, wherein at the end of the ith training iteration, the processing device determining a model parameter mean from model parameters of the network model trained by the plurality of processing nodes, comprises:
the processing device determines processing nodes which are nodes to be calculated in the plurality of processing nodes, wherein the number of the nodes to be calculated is less than that of the plurality of processing nodes;
and the processing equipment determines the model parameter mean value according to the model parameters of the network model trained by the node to be calculated.
5. The method of claim 4, wherein the target processing node is one of the nodes to be computed.
6. The method according to claim 1 or 2, wherein the network model comprises an acoustic model or a speech model used in speech recognition.
7. The method of claim 1 or 2, wherein the processing device is configured with some or all of the plurality of processing nodes; alternatively, the processing device is a parameter server independent of the plurality of processing nodes.
8. A processing apparatus for model training, comprising k training iterations in a process of parallel training a network model by a plurality of processing nodes, the processing apparatus comprising a first determining unit, a second determining unit and a third determining unit:
the first determining unit is used for determining a model parameter mean value according to model parameters of the network model trained by the processing nodes when the ith training iteration is finished, wherein k is more than or equal to 2, and i is less than or equal to k-1;
the second determining unit is configured to determine, for a target processing node in the plurality of processing nodes, first parameter change information corresponding to the target processing node, where the first parameter change information is used to identify a change, generated based on an i-th training iteration, of a model parameter of the network model trained by the target processing node;
the third determining unit is configured to determine, according to the model parameter mean and the first parameter variation information, an initial model parameter of the network model trained by the target processing node at the start of the (i + 1) th training iteration.
9. The processing device according to claim 8, wherein the third determining unit is specifically configured to:
performing increment processing on the model parameter mean value by taking the first parameter change information as an increment;
and taking the result of the incremental processing as the initial model parameters of the network model trained by the target processing node at the beginning of the (i + 1) th training iteration.
10. The processing apparatus according to claim 8 or 9, wherein when i >1, the processing apparatus further comprises a fourth determination unit:
the fourth determining unit is configured to determine second parameter change information corresponding to the target processing node, where the second parameter change information is used to identify a change, generated based on an i-1 st training iteration, of a model parameter of the network model trained by the target processing node;
the third determining unit is specifically configured to:
and determining initial model parameters of the network model trained by the target processing node at the beginning of the (i + 1) th training iteration according to the model parameter mean value, the first parameter change information and the second parameter change information.
11. The processing device according to claim 8 or 9, wherein the first determining unit is specifically configured to:
determining a processing node as a node to be calculated among the plurality of processing nodes, the number of the node to be calculated being less than the number of the plurality of processing nodes;
and determining the mean value of the model parameters according to the model parameters of the network model trained by the nodes to be calculated.
12. An apparatus for model training, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the model training method of any one of claims 1-7 according to instructions in the program code.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a program code for performing the model training method of any one of claims 1-7.