- research-article, August 2024
NetLLM: Adapting Large Language Models for Networking
ACM SIGCOMM '24: Proceedings of the ACM SIGCOMM 2024 Conference, Pages 661–678, https://doi.org/10.1145/3651890.3672268
Many networking tasks now employ deep learning (DL) to solve complex prediction and optimization problems. However, current design philosophy of DL-based algorithms entails intensive engineering overhead due to the manual design of deep neural networks (...
- poster, July 2024
Orchestrating a DNN training job using an iScheduler Framework: a use case
PEARC '24: Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, Article No.: 108, Pages 1–3, https://doi.org/10.1145/3626203.3670632
Orchestrating DNN training jobs efficiently on HPC centers such as Ohio Supercomputer Center (OSC), Texas Advanced Computing Center (TACC), and San Diego Supercomputer Center (SDSC) is crucial due to the prevalence of AI-driven workloads. However, ...
- short-paper, July 2024
Reference Implementation of Smart Scheduler: A CI-Aware, AI-Driven Scheduling Framework for HPC Workloads
PEARC '24: Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, Article No.: 75, Pages 1–4, https://doi.org/10.1145/3626203.3670555
Many modern scientific workloads in HPC centers rely heavily on AI-driven tasks, particularly deep neural network (DNN) training workloads. Efficiently managing and scheduling these workloads via SLURM interfaces requires users to comprehensively ...
- research-article, July 2024
Genetic-based Constraint Programming for Resource Constrained Job Scheduling
GECCO '24: Proceedings of the Genetic and Evolutionary Computation Conference, Pages 942–951, https://doi.org/10.1145/3638529.3654046
Resource constrained job scheduling is a hard combinatorial optimisation problem that originates in the mining industry. Off-the-shelf solvers cannot solve this problem satisfactorily in reasonable time-frames, while other solution methods such as ...
- research-article, August 2024
HPC Jobs Classification and Resource Prediction to Minimize Job Failures
CompSysTech '24: Proceedings of the International Conference on Computer Systems and Technologies 2024, Pages 95–101, https://doi.org/10.1145/3674912.3674914
In this work, we focus on HPC job classification and resource prediction. HPC users at Concordia University come from different departments, and not all of them have an IT background. One of the most challenging issues for users is how to properly ...
- keynote, August 2024
Demistifying HPC-Quantum integration: it's all about scheduling
HPQCI '24: Proceedings of the 2024 Workshop on High Performance and Quantum Computing Integration, Pages 1–3, https://doi.org/10.1145/3659996.3673223
Recent research on the integration between HPC and quantum computer was mostly focused on the software stack and quantum circuit compilation aspects, neglecting critical issues like HPC resource allocation and job scheduling given the scarcity of QPUs, ...
- research-article, August 2024
A Design Framework for the Simulation of Distributed Quantum Computing
HPQCI '24: Proceedings of the 2024 Workshop on High Performance and Quantum Computing Integration, Pages 4–10, https://doi.org/10.1145/3659996.3660035
The growing demand for large-scale quantum computers is pushing research on Distributed Quantum Computing (DQC). Recent experimental efforts have demonstrated some of the building blocks for such a design. DQC systems are clusters of quantum processing ...
- research-article, May 2024
FuncMem: Reducing Cold Start Latency in Serverless Computing Through Memory Prediction and Adaptive Task Execution
SAC '24: Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, Pages 131–138, https://doi.org/10.1145/3605098.3636033
Because serverless computing can scale automatically and affordably, it has become a popular choice for cloud-based services. However, despite these advantages, a serverless architecture is not suitable for applications requiring instantaneous executions ...
DPS: Adaptive Power Management for Overprovisioned Systems
SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 27, Pages 1–14, https://doi.org/10.1145/3581784.3607091
Maximizing performance under a power budget is essential for HPC systems and has inspired the development of many power management frameworks. These can be broadly characterized into two groups: model-based and stateless. Model-based frameworks use ...
- research-article, August 2023
A Dual-Agent Scheduler for Distributed Deep Learning Jobs on Public Cloud via Reinforcement Learning
KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Pages 2776–2788, https://doi.org/10.1145/3580305.3599241
Public cloud GPU clusters are becoming emerging platforms for training distributed deep learning jobs. Under this training paradigm, the job scheduler is a crucial component to improve user experiences, i.e., reducing training fees and job completion ...
- research-article, July 2023
Orchestration of materials science workflows for heterogeneous resources at large scale
- Naweiluo Zhou,
- Giorgio Scorzelli,
- Jakob Luettgau,
- Rahul R Kancharla,
- Joshua J Kane,
- Robert Wheeler,
- Brendan P Croom,
- Pania Newell,
- Valerio Pascucci,
- Michela Taufer,
- Jack Dongarra,
- Bernard Tourancheau
International Journal of High Performance Computing Applications (SAGE-HPCA), Volume 37, Issue 3-4, Pages 260–271, https://doi.org/10.1177/10943420231167800
In the era of big data, materials science workflows need to handle large-scale data distribution, storage, and computation. Any of these areas can become a performance bottleneck. We present a framework for analyzing internal material structures (e.g., ...
- research-article, January 2023
Research on overall energy consumption optimization method for data center based on deep reinforcement learning
Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology (JIFS), Volume 44, Issue 5, Pages 7333–7349, https://doi.org/10.3233/JIFS-223769
With the rapid development of cloud computing, there are more and more large-scale data centers, which makes the energy management of data centers more complex. In order to achieve better energy-saving effect, it is necessary to solve the problems of ...
- research-article, December 2022
Towards Data Minimization and Access Control for Immunization Data Sharing
MEDES '22: Proceedings of the 14th International Conference on Management of Digital EcoSystems, Pages 16–23, https://doi.org/10.1145/3508397.3564837
The global interest in taking preventative measures has intensified in response to the coronavirus epidemic. The vaccination certificate is one approach used to control the spread of disease while enabling people to carry out their regular lives and ...
- research-article, July 2022
SMART: Speedup Job Completion Time by Scheduling Reduce Tasks
- Jia-Qing Dong,
- Ze-Hao He,
- Yuan-Yuan Gong,
- Pei-Wen Yu,
- Chen Tian,
- Wan-Chun Dou,
- Gui-Hai Chen,
- Nai Xia,
- Hao-Ran Guan
Journal of Computer Science and Technology (JCST), Volume 37, Issue 4, Pages 763–778, https://doi.org/10.1007/s11390-022-2118-5
Distributed computing systems have been widely used as the amount of data grows exponentially in the era of information explosion. Job completion time (JCT) is a major metric for assessing their effectiveness. How to reduce the JCT for these ...
- research-article, June 2022
A Hyper-Heuristic for the Preemptive Single Machine Scheduling Problem to Minimize the Total Weighted Tardiness
A problem of minimizing the total weighted tardiness in the preemptive single machine scheduling for discrete manufacturing is considered. A hyper-heuristic is presented, which is composed of 24 various heuristics, to find an approximately optimal ...
- research-article, January 2022
Teaching learning-based optimisation for job scheduling in computational grids
International Journal of Advanced Intelligence Paradigms (IJAIP), Volume 21, Issue 1-2, Pages 72–86, https://doi.org/10.1504/ijaip.2022.121030
Grid computing is a framework that enables the sharing, selection and aggregation of geographically distributed resources dynamically to meet the current and growing computational demands. Job scheduling is a key issue of grid computing and its algorithm ...
- research-article, January 2022
MG-FIM: A Multi-GPU Fast Iterative Method Using Adaptive Domain Decomposition
SIAM Journal on Scientific Computing (SISC), Volume 44, Issue 1, Pages C54–C76, https://doi.org/10.1137/21M1414644
Applying the latest parallel computing technology has become a recent trend in Eikonal equation solvers. Many recent studies have focused on parallelization of Eikonal solvers for multithreaded CPUs or single GPU systems. However, multi-GPU Eikonal solvers ...
- research-article, November 2021
Good Things Come to Those Who Wait: Optimizing Job Waiting in the Cloud
SoCC '21: Proceedings of the ACM Symposium on Cloud Computing, Pages 229–242, https://doi.org/10.1145/3472883.3487007
Cloud-enabled schedulers execute jobs on either fixed resources or those acquired on demand from cloud platforms. Thus, these schedulers must define not only a scheduling policy, which selects which jobs run when fixed resources become available, but ...
- poster, August 2021
Cost-effective data analytics across multiple cloud regions
SIGCOMM '21: Proceedings of the SIGCOMM '21 Poster and Demo Sessions, Pages 1–3, https://doi.org/10.1145/3472716.3472842
We propose a cloud-native data analytics engine for processing data stored among geographically distributed cloud regions with reduced cost. A job is split into subtasks and placed across regions based on factors including prices of compute resources and ...
- extended-abstract, July 2021
An Algorithmic Framework for Approximating Maximin Share Allocation of Chores
EC '21: Proceedings of the 22nd ACM Conference on Economics and Computation, Pages 630–631, https://doi.org/10.1145/3465456.3467555
We consider the problem of fairly dividing m indivisible chores among n agents. The fairness measure we consider here is the maximin share. The previous best known result is that there always exists a 4/3-approximation maximin share allocation[3]. With ...