Abstract
In mission-critical systems, the primary–backup scheme is a well-established approach for improving reliability and fault tolerance, and it can help ensure a high mission success rate despite unexpected errors. However, it requires maintaining consistency between the primary and the backup whenever the primary encounters unexpected errors. We address this issue by introducing a platform that uses container-based lightweight virtualization and an automatic build system to isolate an application so that it can be deployed on different devices without manual intervention. We believe that such an advanced deployment procedure can maintain the consistency of primary–backup systems with low implementation complexity. Integrated with a cloud application, the platform can also manage mission-critical systems effectively, communicate with the redundant systems, and detect unexpected errors by using sophisticated fault-detection technologies. Through a realistic experiment using a model electronic vehicle, we demonstrate that the platform can improve the reliability of mission-critical systems and reduce hardware dependencies.
Acknowledgements
This work was supported by the research fund of Hanyang University (HY-2014-N). W.-J. Lee is the co-corresponding author of this paper.
About this article
Cite this article
Lee, J., Jeong, H., Lee, WJ. et al. Advanced Primary–Backup Platform with Container-Based Automatic Deployment for Fault-Tolerant Systems. Wireless Pers Commun 98, 3177–3194 (2018). https://doi.org/10.1007/s11277-017-4282-4
DOI: https://doi.org/10.1007/s11277-017-4282-4