CN114661475A - Distributed resource scheduling method and device for machine learning
- Publication number: CN114661475A
- Application number: CN202210337294.1A
- Authority: CN (China)
- Prior art keywords: resources, subtasks, subtask, resource, processing
- Prior art date: 2022-03-31
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU] › G06F9/5005—to service a request
  - G06F9/5016—the resource being the memory
  - G06F9/5027—the resource being a machine, e.g. CPUs, Servers, Terminals
  - G06F9/5038—considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
- G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
  - G06N20/00—Machine learning
- G06F2209/50—Indexing scheme relating to G06F9/50
  - G06F2209/5018—Thread allocation
  - G06F2209/5021—Priority
Abstract
The application provides a distributed resource scheduling method and device for machine learning, relating to the technical field of resource scheduling. For a plurality of subtasks divided from a machine learning task for parallel processing, the pre-used resources required by each subtask are determined; used resources no less than its pre-used resources are allocated to each subtask from the available resources of a resource pool to process each subtask; the state information of the subtasks processed with the used resources is monitored in real time; and, according to the monitored time pattern of processing the subtasks with the used resources, the remaining resources of low-utilization used resources are scheduled in advance to high-utilization used resources. Unified monitoring and scheduling of machine learning resources is thereby achieved, resource utilization is improved, the user's usage habits can be predicted, and system resources are switched transparently to the user.
Description
Technical Field
The present application relates to the technical field of resource scheduling, and in particular, to a distributed resource scheduling method and apparatus for machine learning.
Background
At present, resource scheduling for machine learning is based on coarse-grained container scheduling, and models are developed in an exclusive-occupation mode. In particular, while a data scientist is writing code, machine resources are barely consumed; computing resources (CPU, GPU, memory and video memory) are only needed once model training starts. Resource utilization is therefore low, and in a SaaS multi-user scenario the resource performance of the whole system is especially poor.
Disclosure of Invention
In view of this, an object of the present application is to provide a distributed resource scheduling method and apparatus for machine learning that can uniformly monitor and schedule the resources used by each subtask during machine learning and improve resource utilization.
An embodiment of the present application provides a distributed resource scheduling method for machine learning, comprising the following steps:
for a plurality of subtasks divided from a machine learning task for parallel processing, determining the pre-used resources required by each subtask;
allocating to each subtask, from the available resources of a resource pool, used resources no less than its pre-used resources, based on the pre-used resources required by each subtask, to process each subtask;
monitoring in real time the state information of the subtasks processed with the used resources, the state information comprising each subtask's utilization rate of its used resources and the time pattern of processing the subtask with the used resources;
and scheduling in advance, according to the monitored time pattern of processing the subtasks with the used resources, the remaining resources of low-utilization used resources to high-utilization used resources.
In some embodiments, allocating to each subtask, from the available resources of a resource pool, used resources no less than its pre-used resources, based on the pre-used resources required by each subtask, includes:
user-defining the priority of each subtask;
and allocating to each subtask, from the available resources of the resource pool, used resources no less than its pre-used resources, in order of subtask priority, wherein the available resources of the resource pool are updated after used resources are allocated to one subtask, and used resources are allocated to the next subtask from the updated available resources of the resource pool.
In some embodiments, each subtask is processed by:
dividing the subtask into at least one thread for parallel processing;
and allocating consumed resources to each thread from the used resources, and processing each thread of the subtask with its consumed resources.
In some embodiments, monitoring in real time the state information of the subtasks processed with the used resources includes:
counting the consumed resources of each thread of the subtask, and calculating the remaining resources within the subtask's used resources and the utilization rate of the used resources;
and obtaining the time pattern of processing the subtask based on the utilization rate of the used resources, including:
determining time periods in which the utilization rate of the used resources is high as periods in which the subtask is being processed, and time periods in which the utilization rate is low as periods in which it is not;
and determining the subtask's time pattern from the periods in which the subtask is and is not being processed.
In some embodiments, scheduling in advance the remaining resources of low-utilization used resources to high-utilization used resources according to the monitored time pattern of processing the subtasks with the used resources includes:
determining a subtask to be processed according to the monitored time pattern of processing the subtasks with the used resources;
determining a subtask to be scheduled whose currently used resources have a low utilization rate;
and scheduling the remaining resources of the subtask to be scheduled to the used resources of the subtask to be processed.
In some embodiments, the subtask to be processed is processed by:
adding the remaining resources of the subtask to be scheduled to the used resources of the subtask to be processed, to obtain the to-be-used resources of the subtask to be processed;
and reallocating consumed resources to each thread of the subtask to be processed based on the to-be-used resources, and processing each thread of the subtask to be processed with the reallocated consumed resources.
In some embodiments, the priority of a subtask is determined based on the type of the subtask, wherein the subtasks include one or more of data preparation, model development, model training, model management and model deployment.
In some embodiments, there is also provided a distributed resource scheduling apparatus for machine learning, the apparatus comprising:
a determining module, configured to determine the plurality of subtasks divided from a machine learning task for parallel processing and the pre-used resources required by each subtask;
an allocation module, configured to allocate to each subtask, from the available resources of a resource pool, used resources no less than its pre-used resources, based on the pre-used resources required by each subtask, to process each subtask;
a monitoring module, configured to monitor in real time the state information of the subtasks processed with the used resources, the state information comprising each subtask's utilization rate of its used resources and the time pattern of processing the subtask with the used resources;
and a scheduling module, configured to schedule in advance, according to the monitored time pattern of processing the subtasks with the used resources, the remaining resources of low-utilization used resources to high-utilization used resources.
In some embodiments, there is also provided an electronic device comprising a processor, a memory and a bus. The memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate over the bus, and the machine-readable instructions, when executed by the processor, perform the steps of any of the above distributed resource scheduling methods for machine learning.
In some embodiments, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the above-described methods for distributed resource scheduling for machine learning.
According to the fine-grained resource allocation method, apparatus, electronic device and medium described above: for a plurality of subtasks divided from a machine learning task for parallel processing, the pre-used resources required by each subtask are determined; used resources no less than its pre-used resources are allocated to each subtask from the available resources of a resource pool to process each subtask; the state information of the subtasks processed with the used resources is monitored in real time; and, according to the monitored time pattern of processing the subtasks with the used resources, the remaining resources of low-utilization used resources are scheduled in advance to high-utilization used resources. Unified monitoring and scheduling of machine learning resources is thereby achieved, resource utilization is improved, the user's usage habits can be predicted, and system resources are switched transparently to the user.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and should therefore not be regarded as limiting the scope; for those skilled in the art, other related drawings can be derived from these drawings without inventive effort.
FIG. 1 is a flow chart illustrating a method for distributed resource scheduling for machine learning according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating allocating usage resources for respective subtasks according to an embodiment of the present application;
FIG. 3 is a flow diagram illustrating the processing of subtasks in accordance with an embodiment of the present application;
FIG. 4 is a flow diagram illustrating a process for monitoring status information of a subtask processed using a resource in real time according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating scheduling, in advance, the remaining resources of low-utilization used resources to high-utilization used resources according to an embodiment of the present application;
FIG. 6 is a flow chart illustrating the processing of pending subtasks according to an embodiment of the present application;
fig. 7 is a schematic structural diagram illustrating a distributed resource scheduling apparatus for machine learning according to an embodiment of the present application;
fig. 8 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. It should be understood that the drawings in the present application serve illustrative and descriptive purposes only and are not used to limit the scope of protection of the present application, and that the schematic drawings are not necessarily drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of the present application. It should be understood that the operations of the flowcharts may be performed out of order, and that steps without a logical dependency may be performed in reverse order or simultaneously. Moreover, under the guidance of this application, one skilled in the art may add one or more other operations to a flowchart or remove one or more operations from it.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as presented in the figures, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that in the embodiments of the present application, the term "comprising" is used to indicate the presence of the features stated hereinafter, but does not exclude the addition of further features.
At present, resource scheduling for machine learning is based on coarse-grained container scheduling, and models are developed in an exclusive-occupation mode. In particular, while a data scientist is writing code, machine resources are barely consumed; computing resources (CPU, GPU, memory and video memory) are only needed once model training starts, so resource utilization is low. On this basis, the present application provides a fine-grained resource allocation method and apparatus, an electronic device and a medium, which uniformly monitor and schedule the used resources of each subtask during machine learning, thereby improving resource utilization.
Referring to FIG. 1, in an embodiment, the present application provides a distributed resource scheduling method for machine learning, comprising the following steps:
S1, for a plurality of subtasks divided from a machine learning task for parallel processing, determining the pre-used resources required by each subtask;
S2, allocating to each subtask, from the available resources of a resource pool, used resources no less than its pre-used resources, based on the pre-used resources required by each subtask, to process each subtask;
S3, monitoring in real time the state information of the subtasks processed with the used resources, the state information comprising each subtask's utilization rate of its used resources and the time pattern of processing the subtask with the used resources;
and S4, scheduling in advance, according to the monitored time pattern of processing the subtasks with the used resources, the remaining resources of low-utilization used resources to high-utilization used resources.
In the embodiments of the present application, the distributed resource scheduling method for machine learning may be executed on a server. When executed on a server, the method may be implemented on a cloud interactive system, which includes at least the server and at least one client device (i.e., terminal device).
Specifically, in step S1, the subtasks into which the machine learning task is divided may include one or more of data preparation, model development, model training, model management and model deployment: data preparation collects and stores a large amount of sample data; model development builds the function that defines the model; model training optimizes the parameters of the model's function against the sample data; model management records model information; and model deployment applies the trained model to new data. Processing these subtasks in parallel can greatly improve the efficiency of machine learning.
Processing the subtasks in parallel requires resources, including CPU, GPU, memory, video memory and the like. Before used resources are allocated to each subtask for processing, the pre-used resources required by each subtask are usually estimated according to the type of the subtask.
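As an illustrative aid (not part of the claimed method), the following Python sketch shows one way step S1 could look. The `Resources` type, the `PRESET_REQUIREMENTS` table and all resource figures are hypothetical; a real system might derive the estimates from historical profiling data rather than a fixed table.

```python
from dataclasses import dataclass

@dataclass
class Resources:
    gpu: float      # GPU share (number of GPUs, fractional allowed)
    memory_mb: int  # main memory in MB

# Hypothetical per-type estimates of pre-used resources.
PRESET_REQUIREMENTS = {
    "data_preparation":  Resources(gpu=0.0, memory_mb=2048),
    "model_development": Resources(gpu=0.1, memory_mb=1024),
    "model_training":    Resources(gpu=1.0, memory_mb=8192),
    "model_management":  Resources(gpu=0.0, memory_mb=512),
    "model_deployment":  Resources(gpu=0.5, memory_mb=4096),
}

def estimate_pre_used(subtask_type: str) -> Resources:
    """Step S1: look up the pre-used resources a subtask of this type needs."""
    return PRESET_REQUIREMENTS[subtask_type]
```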
In step S2, referring to FIG. 2, allocating to each subtask, from the available resources of the resource pool, used resources no less than its pre-used resources, based on the pre-used resources required by each subtask, includes:
S201, user-defining the priority of each subtask;
S202, allocating to each subtask, from the available resources of the resource pool, used resources no less than its pre-used resources, in order of subtask priority; the available resources of the resource pool are updated after used resources are allocated to one subtask, and used resources are allocated to the next subtask from the updated available resources of the resource pool.
In one embodiment, suppose the priority of the data preparation subtask is defined as high, that of the model development subtask as medium, and that of the model training subtask as low. Used resources are then first allocated to the data preparation subtask from the available resources of the resource pool; after that allocation, the pool's available resources are counted again, and used resources are allocated to the model development subtask from the updated pool; after that allocation completes, the pool's available resources are counted once more, and used resources are allocated to the model training subtask from the updated pool.
The used resources allocated to each subtask are no less than its pre-used resources. This may be done, for example, by setting the allocated used resources to 1.1 times the subtask's pre-used resources. This over-allocation principle ensures that each subtask can be processed reliably with its allocated used resources, avoiding the interruption or even failure of subtask processing that insufficient used resources would cause.
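Continuing the sketch above, the fragment below illustrates steps S201 and S202 under stated assumptions: subtasks are served in user-defined priority order, each is granted 1.1 times its pre-used resources (the over-allocation factor from this embodiment), and the pool's available resources are updated after every grant. The `Subtask` type and `allocate` function are hypothetical names; `Resources` and `estimate_pre_used` are reused from the sketch above.

```python
from dataclasses import dataclass
from typing import Optional

OVER_ALLOCATION = 1.1  # over-allocation factor from the embodiment above

@dataclass
class Subtask:
    name: str
    subtask_type: str
    priority: int                     # lower value = higher (user-defined) priority
    used: Optional[Resources] = None  # used resources, set by allocate()

def allocate(pool: Resources, subtasks: list[Subtask]) -> None:
    """Steps S201-S202: serve subtasks in priority order, granting each
    1.1x its pre-used resources and updating the pool after each grant."""
    for task in sorted(subtasks, key=lambda t: t.priority):
        need = estimate_pre_used(task.subtask_type)
        grant = Resources(gpu=need.gpu * OVER_ALLOCATION,
                          memory_mb=int(need.memory_mb * OVER_ALLOCATION))
        if grant.gpu > pool.gpu or grant.memory_mb > pool.memory_mb:
            raise RuntimeError(f"resource pool exhausted before {task.name}")
        task.used = grant
        pool.gpu -= grant.gpu             # update the pool's available resources
        pool.memory_mb -= grant.memory_mb # before serving the next subtask
```

For instance, calling `allocate(Resources(gpu=4.0, memory_mb=16384), tasks)` would serve the highest-priority task first and fail loudly if the pool runs dry.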
Further, referring to FIG. 3, each subtask is processed by:
S203, dividing the subtask into at least one thread for parallel processing;
S204, allocating consumed resources to each thread from the used resources, and processing each thread of the subtask with its consumed resources.
Each subtask is thus further divided into threads that can be processed in parallel; consumed resources are allocated to each thread out of the subtask's used resources, and the threads are processed in parallel with those consumed resources. This improves the efficiency of processing each subtask and allows resources to be allocated and scheduled at a finer granularity.
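A possible sketch of steps S203-S204, again reusing the types above: each subtask is split into threads, and each thread is assigned consumed resources out of the subtask's used resources, with whatever is left over forming the surplus. The `WorkerThread` type and the per-thread specifications are hypothetical.

```python
@dataclass
class WorkerThread:
    name: str
    consumed: Resources  # consumed resources taken from the subtask's used resources
    active: bool = True  # False while the thread is idle (e.g. keeping the subtask dormant)

def attach_threads(task: Subtask,
                   specs: list[tuple[str, Resources]]) -> list[WorkerThread]:
    """Step S204: allocate consumed resources to each thread from the
    subtask's used resources; whatever is left over is the surplus."""
    assert task.used is not None, "allocate used resources first"
    threads = [WorkerThread(name, res) for name, res in specs]
    assert sum(t.consumed.gpu for t in threads) <= task.used.gpu
    assert sum(t.consumed.memory_mb for t in threads) <= task.used.memory_mb
    return threads
```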
In step S3, referring to FIG. 4, the state information of the subtasks processed with the used resources is monitored in real time by:
S301, counting the consumed resources of each thread processing the subtask, and calculating the remaining resources within the subtask's used resources and the utilization rate of the used resources;
S302, obtaining the time pattern of processing the subtask based on the utilization rate of the used resources, including:
determining time periods in which the utilization rate of the used resources is high as periods in which the subtask is being processed, and time periods in which the utilization rate is low as periods in which it is not;
and determining the subtask's time pattern from the periods in which the subtask is and is not being processed.
In step S301, the remaining resources specifically include the surplus initially over-allocated to the subtask plus the consumed resources of any thread that is not currently processing. Suppose subtask A includes threads a, b and c processed in parallel, and the used resources allocated to subtask A are (1.1 GPU, 1100M memory), with thread a consuming (0.2 GPU, 200M memory), thread b consuming (0.3 GPU, 300M memory) and thread c consuming (0.5 GPU, 500M memory); the surplus resources are then (0.1 GPU, 100M memory). If every thread of subtask A is processing at this moment, the remaining resources equal the surplus (0.1 GPU, 100M memory); if thread a of subtask A is not processing, the remaining resources are the sum of the surplus (0.1 GPU, 100M memory) and thread a's consumed resources (0.2 GPU, 200M memory), i.e. (0.3 GPU, 300M memory).
Next, the utilization rate of the used resources can be calculated from each thread's consumed resources and the used resources allocated to the subtask. If the utilization rate is high during time period I, for instance because every thread is running, it is judged that the user is processing the subtask during period I; if only one thread is kept alive to hold the subtask in a dormant state during time period II, it is judged that the user is not processing the subtask during period II. The time pattern with which the user processes the subtask is thereby obtained.
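The monitoring of steps S301-S302 might be sketched as follows, continuing the example above. Utilization is simplified here to the GPU fraction consumed by active threads, and the 0.5 threshold separating "high" from "low" utilization is a hypothetical choice; the application itself only distinguishes high-utilization from low-utilization periods.

```python
def remaining(task: Subtask, threads: list[WorkerThread]) -> Resources:
    """Step S301: surplus plus the consumed resources of idle threads."""
    gpu = task.used.gpu - sum(t.consumed.gpu for t in threads if t.active)
    mem = task.used.memory_mb - sum(t.consumed.memory_mb for t in threads if t.active)
    return Resources(gpu=gpu, memory_mb=mem)

def utilization(task: Subtask, threads: list[WorkerThread]) -> float:
    """GPU fraction of the subtask's used resources consumed by active threads."""
    return sum(t.consumed.gpu for t in threads if t.active) / task.used.gpu

THRESHOLD = 0.5  # hypothetical split between "high" and "low" utilization

def time_pattern(samples: list[tuple[int, float]]) -> dict[int, bool]:
    """Step S302: map each monitored hour to True when the subtask was being
    processed (high utilization) and False when it was not (low utilization)."""
    return {hour: util >= THRESHOLD for hour, util in samples}
```

For example, `time_pattern([(9, 0.9), (3, 0.05)])` would mark hour 9 as a processing period and hour 3 as an idle one.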
Referring to FIG. 5, scheduling in advance the remaining resources of low-utilization used resources to high-utilization used resources according to the monitored time pattern of processing the subtasks with the used resources includes the following steps:
S401, determining a subtask to be processed according to the monitored time pattern of processing the subtasks with the used resources;
S402, determining a subtask to be scheduled whose currently used resources have a low utilization rate;
and S403, scheduling the remaining resources of the subtask to be scheduled to the used resources of the subtask to be processed.
For example, monitoring shows that subtask A's utilization rate of its used resources is high while it is being processed. Shortly before the time at which subtask A is next expected to be processed, the remaining resources of other subtasks are transferred in advance into subtask A's used resources; that is, subtask A's used resources are expanded, which lowers the utilization rate of subtask A's used resources and improves the efficiency with which subtask A is processed.
It should be noted that, when scheduling the remaining resources of other subtasks, a scheduling ratio below 100% of the remaining resources, e.g. 90%, may be set, so as to avoid driving the utilization rate of those subtasks' used resources too high.
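Steps S401-S403 could then be sketched as below, continuing the example: 90% (the illustrative below-100% ratio from the note above) of the to-be-scheduled subtask's remaining resources are moved into the used resources of the subtask about to be processed.

```python
SCHEDULING_RATIO = 0.9  # the below-100% scheduling ratio from the note above

def reschedule(donor: Subtask, donor_threads: list[WorkerThread],
               pending: Subtask) -> None:
    """Steps S401-S403: move a share of the low-utilization donor's remaining
    resources into the used resources of the subtask about to be processed."""
    spare = remaining(donor, donor_threads)
    moved_gpu = spare.gpu * SCHEDULING_RATIO
    moved_mem = int(spare.memory_mb * SCHEDULING_RATIO)
    # Shrink the donor's used resources and expand the pending subtask's.
    donor.used.gpu -= moved_gpu
    donor.used.memory_mb -= moved_mem
    pending.used.gpu += moved_gpu
    pending.used.memory_mb += moved_mem
```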
Referring to FIG. 6, the subtask to be processed is processed as follows:
S404, adding the remaining resources of the subtask to be scheduled to the used resources of the subtask to be processed, to obtain the to-be-used resources of the subtask to be processed;
S405, reallocating consumed resources to each thread of the subtask to be processed based on the to-be-used resources, and processing each thread of the subtask to be processed with the reallocated consumed resources.
In this way the subtask to be processed redistributes consumed resources among its parallel threads according to the newly obtained resources, achieving fine-grained allocation of resources, improving the processing efficiency of each thread and, in turn, of the subtask to be processed.
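Finally, a sketch of steps S404-S405, continuing the example: the to-be-used resources of the pending subtask are redistributed across its threads. The even split is an assumed policy; the application leaves the redistribution policy open.

```python
def redistribute(pending: Subtask, threads: list[WorkerThread]) -> None:
    """Steps S404-S405: redistribute the expanded used resources evenly
    across the pending subtask's threads and wake them all up."""
    n = len(threads)
    for t in threads:
        t.consumed = Resources(gpu=pending.used.gpu / n,
                               memory_mb=pending.used.memory_mb // n)
        t.active = True
```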
By monitoring in real time the state of the used resources of the plurality of parallel subtasks divided from a machine learning task, the fine-grained resource allocation method described above predicts the user's usage habits and schedules resources to the subtask to be processed in advance, achieving transparent resource switching, lowering the utilization rate of the used resources and improving the efficiency of subtask processing.
Based on the same inventive concept, an embodiment of the present application further provides a distributed resource scheduling apparatus for machine learning. Since the principle by which the apparatus solves the problem is similar to that of the above distributed resource scheduling method for machine learning, the implementation of the apparatus may refer to that of the method, and repeated parts are not described again.
An embodiment of the present application further provides a distributed resource scheduling apparatus for machine learning; as shown in FIG. 7, the apparatus includes:
a determining module 701, configured to determine the plurality of subtasks divided from a machine learning task for parallel processing and the pre-used resources required by each subtask;
an allocation module 702, configured to allocate to each subtask, from the available resources of a resource pool, used resources no less than its pre-used resources, based on the pre-used resources required by each subtask, to process each subtask;
a monitoring module 703, configured to monitor in real time the state information of the subtasks processed with the used resources, the state information comprising each subtask's utilization rate of its used resources and the time pattern of processing the subtask with the used resources;
and a scheduling module 704, configured to schedule in advance, according to the monitored time pattern of processing the subtasks with the used resources, the remaining resources of low-utilization used resources to high-utilization used resources.
In some embodiments, when allocating to each subtask, from the available resources of the resource pool, used resources no less than its pre-used resources, based on the pre-used resources required by each subtask, the allocation module 702 is further configured to:
user-define the priority of each subtask;
and allocate to each subtask, from the available resources of the resource pool, used resources no less than its pre-used resources, in order of subtask priority, wherein the available resources of the resource pool are updated after used resources are allocated to one subtask, and used resources are allocated to the next subtask from the updated available resources of the resource pool.
In some embodiments, when processing the subtasks, the allocation module 702 is further configured to:
divide the subtask into at least one thread for parallel processing;
and allocate consumed resources to each thread from the used resources, and process each thread of the subtask with its consumed resources.
In some embodiments, when monitoring in real time the state information of the subtasks processed with the used resources, the monitoring module 703 is further configured to:
count the consumed resources of each thread of the subtask, and calculate the remaining resources within the subtask's used resources and the utilization rate of the used resources;
and obtain the time pattern of processing the subtask based on the utilization rate of the used resources, including:
determining time periods in which the utilization rate of the used resources is high as periods in which the subtask is being processed, and time periods in which the utilization rate is low as periods in which it is not;
and determining the subtask's time pattern from the periods in which the subtask is and is not being processed.
In some embodiments, when scheduling in advance the remaining resources of low-utilization used resources to high-utilization used resources according to the monitored time pattern of processing the subtasks with the used resources, the scheduling module 704 is further configured to:
determine a subtask to be processed according to the monitored time pattern of processing the subtasks with the used resources;
determine a subtask to be scheduled whose currently used resources have a low utilization rate;
and schedule the remaining resources of the subtask to be scheduled to the used resources of the subtask to be processed.
In some embodiments, when processing the subtask to be processed, the scheduling module 704 is further configured to:
add the remaining resources of the subtask to be scheduled to the used resources of the subtask to be processed, to obtain the to-be-used resources of the subtask to be processed;
and reallocate consumed resources to each thread of the subtask to be processed based on the to-be-used resources, and process each thread of the subtask to be processed with the reallocated consumed resources.
The fine-grained resource allocation apparatus of the present application achieves unified monitoring and scheduling of machine learning resources, improves resource utilization, predicts the user's usage habits, and switches system resources transparently to the user.
Based on the same inventive concept, as shown in FIG. 8, an embodiment of the present application provides an electronic device 800, comprising: at least one processor 801, at least one network interface 804 or other user interface 803, a memory 805, and at least one communication bus 802. The communication bus 802 is used to enable connection and communication between these components. The electronic device 800 optionally contains a user interface 803 including a display (e.g., touchscreen, LCD, CRT, holographic or projection display) and a keyboard or pointing device (e.g., mouse, trackball, touch pad or touchscreen).
In some embodiments, memory 805 stores the following elements, executable modules or data structures, or a subset thereof, or an expanded set thereof:
an operating system 8051, which contains various system programs for implementing various basic services and for handling hardware-based tasks;
the application module 8052, which contains various applications, such as a desktop (launcher), a media player and a browser, for implementing various application services.
In this embodiment of the present application, the processor 801 performs the steps of the above distributed resource scheduling method for machine learning by calling programs or instructions stored in the memory 805.
The present application also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the steps of the above distributed resource scheduling method for machine learning.
Specifically, the storage medium can be a general storage medium, such as a removable disk or a hard disk. When the computer program on the storage medium is run, the above distributed resource scheduling method for machine learning can be executed, achieving unified monitoring and scheduling of machine learning resources and improving resource utilization.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is only a logical division, and other divisions may be realized in practice, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that the above examples are only specific embodiments of the present application, used to illustrate rather than limit its technical solutions, and the scope of protection of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing examples, those skilled in the art should understand that any person skilled in the art can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions of some technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present application and are intended to be covered by the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.
Claims (10)
1. A distributed resource scheduling method for machine learning, the method comprising:
for a plurality of subtasks divided from a machine learning task for parallel processing, determining the pre-used resources required by each subtask;
allocating to each subtask, from the available resources of a resource pool, used resources no less than its pre-used resources, based on the pre-used resources required by each subtask, to process each subtask;
monitoring in real time the state information of the subtasks processed with the used resources, the state information comprising each subtask's utilization rate of its used resources and the time pattern of processing the subtask with the used resources;
and scheduling in advance, according to the monitored time pattern of processing the subtasks with the used resources, the remaining resources of low-utilization used resources to high-utilization used resources.
2. The method according to claim 1, wherein allocating to each subtask, from the available resources of a resource pool, used resources no less than its pre-used resources, based on the pre-used resources required by each subtask, comprises:
user-defining the priority of each subtask;
and allocating to each subtask, from the available resources of the resource pool, used resources no less than its pre-used resources, in order of subtask priority, wherein the available resources of the resource pool are updated after used resources are allocated to one subtask, and used resources are allocated to the next subtask from the updated available resources of the resource pool.
3. The method according to claim 2, wherein each subtask is processed by:
dividing the subtask into at least one thread for parallel processing;
and allocating consumed resources to each thread from the used resources, and processing each thread of the subtask with its consumed resources.
4. The method according to claim 3, wherein monitoring in real time the state information of the subtasks processed with the used resources comprises:
counting the consumed resources of each thread processing the subtask, and calculating the remaining resources within the subtask's used resources and the utilization rate of the used resources;
and obtaining the time pattern of processing the subtask based on the utilization rate of the used resources, comprising:
determining time periods in which the utilization rate of the used resources is high as periods in which the subtask is being processed, and time periods in which the utilization rate is low as periods in which it is not;
and determining the subtask's time pattern from the periods in which the subtask is and is not being processed.
5. The method according to claim 4, wherein scheduling in advance the remaining resources of low-utilization used resources to high-utilization used resources according to the monitored time pattern of processing the subtasks with the used resources comprises:
determining a subtask to be processed according to the monitored time pattern of processing the subtasks with the used resources;
determining a subtask to be scheduled whose currently used resources have a low utilization rate;
and scheduling the remaining resources of the subtask to be scheduled to the used resources of the subtask to be processed.
6. The method according to claim 5, wherein the subtask to be processed is processed by:
adding the remaining resources of the subtask to be scheduled to the used resources of the subtask to be processed, to obtain the to-be-used resources of the subtask to be processed;
and reallocating consumed resources to each thread of the subtask to be processed based on the to-be-used resources, and processing each thread of the subtask to be processed with the reallocated consumed resources.
7. The method according to claim 1, wherein the priority of a subtask is determined based on the type of the subtask, and the subtasks comprise one or more of data preparation, model development, model training, model management and model deployment.
8. A distributed resource scheduling apparatus for machine learning, the apparatus comprising:
a determining module, configured to determine the plurality of subtasks divided from a machine learning task for parallel processing and the pre-used resources required by each subtask;
an allocation module, configured to allocate to each subtask, from the available resources of a resource pool, used resources no less than its pre-used resources, based on the pre-used resources required by each subtask, to process each subtask;
a monitoring module, configured to monitor in real time the state information of the subtasks processed with the used resources, the state information comprising each subtask's utilization rate of its used resources and the time pattern of processing the subtask with the used resources;
and a scheduling module, configured to schedule in advance, according to the monitored time pattern of processing the subtasks with the used resources, the remaining resources of low-utilization used resources to high-utilization used resources.
9. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate over the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the distributed resource scheduling method for machine learning according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, performs the steps of the distributed resource scheduling method for machine learning according to any one of claims 1 to 7.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202210337294.1A | 2022-03-31 | 2022-03-31 | Distributed resource scheduling method and device for machine learning |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202210337294.1A | 2022-03-31 | 2022-03-31 | Distributed resource scheduling method and device for machine learning |
Publications (1)

| Publication Number | Publication Date |
| --- | --- |
| CN114661475A | 2022-06-24 |
Family
ID=82033026

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date | Status |
| --- | --- | --- | --- | --- |
| CN202210337294.1A | Distributed resource scheduling method and device for machine learning | 2022-03-31 | 2022-03-31 | Pending |

Country Status (1)

| Country | Link |
| --- | --- |
| CN (1) | CN114661475A (en) |
Cited By (1)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| WO2024109312A1 (en) | 2022-11-22 | 2024-05-30 | 北京地平线信息技术有限公司 (Beijing Horizon Information Technology Co., Ltd.) | Task scheduling execution method, and generation method and apparatus for task scheduling execution instruction |
Legal Events

| Date | Code | Title | Description |
| --- | --- | --- | --- |
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |