Abstract
Resources in the cloud are prone to failure during task execution, which causes tasks to fail. Recent research mostly uses the Primary-Backup model (PB model) to make tasks fault tolerant, but the choice between the passive scheme and the active scheme is fixed in advance, so the respective advantages of the two schemes are not fully exploited. Based on deep reinforcement learning, this paper proposes an adaptive PB model selection algorithm, Active-Passive Scheme DQN (APSDQN). Fault tolerance for a faulty task is treated as a Markov decision process: the passive scheme and the active scheme form the action space, and the shortest task completion time and the highest resource utilization serve as the reward feedback. Combined with state information from the real environment, the algorithm selects the most suitable fault-tolerant scheme for each faulty task, which saves resources and improves the robustness of the cloud system. The experimental results show that APSDQN has certain advantages in the total task finish time of task allocation and significantly improves resource utilization and the task success rate in the cloud.
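The decision loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the state encoding, the reward formula, or the network architecture, so the linear reward, its weights, and every name below (`Outcome`, `reward`, `select_action`, `td_target`) are assumptions made for illustration. The sketch only shows the three pieces any DQN-style selector needs: a two-action space, a reward that favours short completion time and high utilization, and the bootstrapped Q-learning target.

```python
import random
from dataclasses import dataclass

# Action space from the abstract: the two Primary-Backup schemes.
ACTIONS = ("passive", "active")


@dataclass(frozen=True)
class Outcome:
    """Hypothetical result of running one faulty task under a chosen scheme."""
    completion_time: float  # task finish time, arbitrary time units
    utilization: float      # resource utilization as a fraction in [0, 1]


def reward(outcome, time_weight=1.0, util_weight=1.0):
    """Assumed linear reward: shorter time and higher utilization score higher."""
    return -time_weight * outcome.completion_time + util_weight * outcome.utilization


def select_action(q_values, epsilon, rng):
    """Epsilon-greedy choice over the Q-values of the two schemes."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: random scheme
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit


def td_target(r, next_q_values, gamma=0.9, done=False):
    """Standard Q-learning target: r + gamma * max_a' Q(s', a')."""
    return r if done else r + gamma * max(next_q_values)


if __name__ == "__main__":
    rng = random.Random(0)
    q = [0.1, 0.7]  # made-up Q-values for (passive, active) in some state
    action = select_action(q, epsilon=0.0, rng=rng)
    print("chosen scheme:", ACTIONS[action])
    print("reward:", reward(Outcome(completion_time=2.0, utilization=0.5)))
```

In a full DQN the list of Q-values would come from a neural network evaluated on the current environment state, and `td_target` would supply the regression target for training that network.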
© 2022 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Tang, H., Tang, Z., Dong, T., Hai, Q., Xue, F. (2022). Fault-Tolerant Scheme of Cloud Task Allocation Based on Deep Reinforcement Learning. In: Pan, L., Cui, Z., Cai, J., Li, L. (eds) Bio-Inspired Computing: Theories and Applications. BIC-TA 2021. Communications in Computer and Information Science, vol 1566. Springer, Singapore. https://doi.org/10.1007/978-981-19-1253-5_5
DOI: https://doi.org/10.1007/978-981-19-1253-5_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-1252-8
Online ISBN: 978-981-19-1253-5
eBook Packages: Computer Science, Computer Science (R0)