- research-article, November 2024
Understanding Silent Data Corruption in Processors for Mitigating its Effects
ACM Transactions on Architecture and Code Optimization (TACO), Volume 21, Issue 4, Article No.: 84, Pages 1–27
https://doi.org/10.1145/3690825
Silent Data Corruption (SDC) in processors can lead to various application-level issues, such as incorrect calculations and even data loss. Since traditional techniques are not effective in detecting these errors, it is very hard to address problems ...
- Article, September 2024
Duplication-Based Fault Tolerance for RISC-V Embedded Software
Embedded devices play critical roles in security and safety, demanding robust protection against fault injection attacks. Among the myriad of fault effects, the instruction skip fault model stands out due to its recurrent manifestation in silicon ...
- research-article, September 2024
Achieving Tunable Erasure Coding with Cluster-Aware Redundancy Transitioning
ACM Transactions on Architecture and Code Optimization (TACO), Volume 21, Issue 3, Article No.: 59, Pages 1–24
https://doi.org/10.1145/3672077
Erasure coding has been demonstrated as a storage-efficient means against failures, yet its tunability remains a challenging issue in data centers, which is prone to induce substantial cross-cluster traffic. In this article, we present ClusterRT, a ...
- research-article, September 2024
A Robust and Energy Efficient Hyperdimensional Computing System for Voltage-scaled Circuits
ACM Transactions on Embedded Computing Systems (TECS), Volume 23, Issue 6, Article No.: 91, Pages 1–20
https://doi.org/10.1145/3620671
Voltage scaling is one of the most promising approaches for energy efficiency improvement but also brings challenges to fully guaranteeing stable operation in modern VLSI. To tackle such issues, we further extend the DependableHD to the second version ...
- research-article, July 2024
On Cyber-Physical Fault Resilience in Data Communication: A Case From A LoRaWAN Network Systems Design
ACM Transactions on Cyber-Physical Systems (TCPS), Volume 8, Issue 3, Article No.: 36, Pages 1–25
https://doi.org/10.1145/3639571
Systems offering fault-resilient, energy-efficient, soft real-time data communication have wide applications in Industrial Internet-of-Things (IIoT). While there have been extensive studies for fault resilience in real-time embedded systems, ...
- research-article, July 2024
Experimentation and Implementation of the BFT++ Cyber-Attack Resilience Mechanism for Cyber-Physical Systems
ACM Transactions on Cyber-Physical Systems (TCPS), Volume 8, Issue 3, Article No.: 35, Pages 1–25
https://doi.org/10.1145/3639570
Cyber-physical systems (CPS) are used in various safety-critical domains such as robotics, industrial manufacturing systems, and power systems. Faults and cyber attacks have been shown to cause safety violations, which can damage the system and endanger ...
- research-article, July 2024
A Comprehensive Benchmarking Analysis of Fault Recovery in Stream Processing Frameworks
DEBS '24: Proceedings of the 18th ACM International Conference on Distributed and Event-based Systems, Pages 171–182
https://doi.org/10.1145/3629104.3666040
Nowadays, several software systems rely on stream processing architectures to deliver scalable performance and handle large volumes of data in near real-time. Stream processing frameworks facilitate scalable computing by distributing the application's ...
- research-article, November 2024
HTAG-eNN: Hardening Technique with AND Gates for Embedded Neural Networks
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 270, Pages 1–6
https://doi.org/10.1145/3649329.3657329
Embedded Neural Networks (NNs) face significant challenges due to Single-Event Upsets (SEUs), compromising their reliability. To address this challenge, previous works study SEU layers sensitivity of AI models. Contrary to these techniques, remaining at ...
- research-article, November 2024
MENDNet: Just-in-time Fault Detection and Mitigation in AI Systems with Uncertainty Quantification and Multi-Exit Networks
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 247, Pages 1–6
https://doi.org/10.1145/3649329.3656506
Hardware faults in AI accelerators, particularly in accelerator memory, can alter pre-trained deep neural network parameters, leading to errors that compromise performance. To address this, just-in-time (JIT) fault detection and mitigation are crucial. ...
- research-article, June 2024
Snapshotting Mechanisms for Persistent Memory-Mapped Files
ApPLIED'24: Proceedings of the 2024 Workshop on Advanced Tools, Programming Languages, and PLatforms for Implementing and Evaluating algorithms for Distributed systems, Pages 1–9
https://doi.org/10.1145/3663338.3665832
This study focuses on methods to improve the reliability of persistent memory systems. By utilizing Montage (ICPP'21), we identify areas that have potential for enhancement, particularly in relation to the vulnerability of data loss in certain failure ...
- research-article, June 2024
The Fractional Spending Problem: Executing Payment transactions in parallel with less than f+1 validations
PODC '24: Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing, Pages 295–305
https://doi.org/10.1145/3662158.3662817
We consider the problem of supporting payment transactions in an asynchronous system in which up to f validators are subject to Byzantine failures under the control of an adaptive adversary. It was shown that, in the case of a single owner, this problem ...
- research-article, June 2024
Fault-Tolerant Parallel Integer Multiplication
SPAA '24: Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures, Pages 207–218
https://doi.org/10.1145/3626183.3659961
Exascale machines have a small mean time between failures, necessitating fault tolerance. Out-of-the-box fault-tolerant solutions, such as checkpoint-restart and replication, apply to any algorithm but incur significant overhead costs. Long integer ...
- research-article, May 2024
Fault Tolerance Placement in the Internet of Things
Proceedings of the ACM on Management of Data (PACMMOD), Volume 2, Issue 3, Article No.: 138, Pages 1–29
https://doi.org/10.1145/3654941
Today's IoT applications exploit the capabilities of three different computation environments: sensors, edge, and cloud. Ensuring fault tolerance at the edge level presents unique challenges due to complex network hierarchies and the presence of resource-...
- research-article, May 2024
Increasing the Fault Tolerance in Microservice Architecture
Cybernetics and Systems Analysis (KLU-CASA), Volume 60, Issue 3, Pages 480–488
https://doi.org/10.1007/s10559-024-00689-0
Microservice architecture is widely used in the development of distributed applications. Many problems present at the beginning of using this approach have already been solved. However, one of the fundamental problems that seriously affects the ...
- research-article, August 2024
A Non-centralized Federated Learning Architecture to Obtain Accurate Privacy Preserving Results
ICCAI '24: Proceedings of the 2024 10th International Conference on Computing and Artificial Intelligence, Pages 393–398
https://doi.org/10.1145/3669754.3669814
Federated learning was introduced in 2016 by Google researchers to train machine learning models to preserve user privacy. Basically, applying the federated learning concept, the companies’ servers receive partially trained models instead of data from ...
- research-article, April 2024
Energy Management for Fault-tolerant (m,k)-constrained Real-time Systems That Use Standby-Sparing
ACM Transactions on Embedded Computing Systems (TECS), Volume 23, Issue 3, Article No.: 36, Pages 1–36
https://doi.org/10.1145/3648365
Fault tolerance, energy management, and quality of service (QoS) are essential aspects for the design of real-time embedded systems. In this work, we focus on exploring methods that can simultaneously address the above three critical issues under standby-...
- research-article, June 2024
Raft Protocol for Fault Tolerance and Self-Recovery in Federated Learning
SEAMS '24: Proceedings of the 19th International Symposium on Software Engineering for Adaptive and Self-Managing Systems, Pages 110–121
https://doi.org/10.1145/3643915.3644093
Federated Learning (FL) has emerged as a decentralised machine learning paradigm for distributed systems, particularly in edge and IoT environments. However, ensuring fault tolerance and self-recovery in such scenarios remains challenging, because of the ...
- research-article, February 2024
Energy-Constrained Scheduling for Weakly Hard Real-Time Systems Using Standby-Sparing
ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 29, Issue 2, Article No.: 29, Pages 1–35
https://doi.org/10.1145/3631587
For real-time embedded systems, QoS (Quality of Service), fault tolerance, and energy budget constraint are among the primary design concerns. In this research, we investigate the problem of energy constrained standby-sparing for both periodic and ...
- research-article, April 2024
BNN-Flip: Enhancing the Fault Tolerance and Security of Compute-in-Memory Enabled Binary Neural Network Accelerators
ASPDAC '24: Proceedings of the 29th Asia and South Pacific Design Automation Conference, Pages 146–152
https://doi.org/10.1109/ASP-DAC58780.2024.10473947
Compute-in-memory based binary neural networks or CiM-BNNs offer high energy/area efficiency for the design of edge deep neural network (DNN) accelerators, with only a mild accuracy reduction. However, for successful deployment, the design of CiM-BNNs ...
- survey, January 2024
Reaching Consensus in the Byzantine Empire: A Comprehensive Review of BFT Consensus Algorithms
Gengrui Zhang, Fei Pan, Yunhao Mao, Sofia Tijanic, Michael Dang’ana, Shashank Motepalli, Shiquan Zhang, Hans-Arno Jacobsen
ACM Computing Surveys (CSUR), Volume 56, Issue 5, Article No.: 134, Pages 1–41
https://doi.org/10.1145/3636553
Byzantine fault-tolerant (BFT) consensus algorithms are at the core of providing safety and liveness guarantees for distributed systems that must operate in the presence of arbitrary failures. Recently, numerous new BFT algorithms have been proposed, not ...