  • Rojas E, Pérez D and Meneses E. (2024). A characterization of soft-error sensitivity in data-parallel and model-parallel distributed deep learning. Journal of Parallel and Distributed Computing. 190:C. Online publication date: 1-Aug-2024.

    https://doi.org/10.1016/j.jpdc.2024.104879

  • Wu P, Le T, Zhu Z and Zhang Z. Redundant Array of Independent Memory Devices. IEEE Computer Architecture Letters. 10.1109/LCA.2023.3334989. 22:2. (181-184).

    https://ieeexplore.ieee.org/document/10323513/

  • Zhang G, Liu Y, Yang H and Qian D. (2021). Efficient detection of silent data corruption in HPC applications with synchronization-free message verification. The Journal of Supercomputing. 10.1007/s11227-021-03892-4.

    https://link.springer.com/10.1007/s11227-021-03892-4

  • Yavits L, Orosa L, Mahar S, Ferreira J, Erez M, Ginosar R and Mutlu O. (2020). WoLFRaM: Enhancing Wear-Leveling and Fault Tolerance in Resistive Memories using Programmable Address Decoders. 2020 IEEE 38th International Conference on Computer Design (ICCD). 10.1109/ICCD50377.2020.00044. 978-1-7281-9710-4. (187-196).

    https://ieeexplore.ieee.org/document/9283556/

  • Ping L, Tan J and Yan K. SERN: Modeling and Analyzing the Soft Error Reliability of Convolutional Neural Networks. Proceedings of the 2020 on Great Lakes Symposium on VLSI. (445-450).

    https://doi.org/10.1145/3386263.3406938

  • Rojas E, Meneses E, Jones T and Maxwell D. Towards a Model to Estimate the Reliability of Large-Scale Hybrid Supercomputers. Euro-Par 2020: Parallel Processing. (37-51).

    https://doi.org/10.1007/978-3-030-57675-2_3

  • Balasubramonian R. (2019). Innovations in the Memory System. Synthesis Lectures on Computer Architecture. 10.2200/S00933ED1V01Y201906CAC048. 14:2. (1-151). Online publication date: 10-Sep-2019.

    https://www.morganclaypool.com/doi/10.2200/S00933ED1V01Y201906CAC048

  • Fang B, Halawa H, Pattabiraman K, Ripeanu M and Krishnamoorthy S. BonVoision. Proceedings of the ACM International Conference on Supercomputing. (484-496).

    https://doi.org/10.1145/3330345.3330388

  • Nagarajan C, Shafiee A, Balasubramonian R and Tiwari M. ρ. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. (659-671).

    https://doi.org/10.1145/3297858.3304045

  • Malek A, Vasilakis E, Papaefstathiou V, Trancoso P and Sourdis I. Odd-ECC. Proceedings of the International Symposium on Memory Systems. (96-111).

    https://doi.org/10.1145/3132402.3132443

  • Cancès E and Dusson G. (2017). Discretization error cancellation in electronic structure calculation: toward a quantitative study. ESAIM: Mathematical Modelling and Numerical Analysis. 10.1051/m2an/2017035. 51:5. (1617-1636). Online publication date: 1-Sep-2017.

    http://www.esaim-m2an.org/10.1051/m2an/2017035

  • Gottscho M, Shoaib M, Govindan S, Sharma B, Wang D and Gupta P. Measuring the Impact of Memory Errors on Application Performance. IEEE Computer Architecture Letters. 10.1109/LCA.2016.2599513. 16:1. (51-55).

    http://ieeexplore.ieee.org/document/7542148/

  • Chen H, Jeloka S, Arunkumar A, Blaauw D, Wu C, Mudge T and Chakrabarti C. (2016). Using Low Cost Erasure and Error Correction Schemes to Improve Reliability of Commodity DRAM Systems. IEEE Transactions on Computers. 65:12. (3766-3779). Online publication date: 1-Dec-2016.

    https://doi.org/10.1109/TC.2016.2550455

  • Chen S, Irving S and Peng L. Operational Cost Optimization for Cloud Computing Data Centers Using Renewable Energy. IEEE Systems Journal. 10.1109/JSYST.2015.2462714. 10:4. (1447-1458).

    http://ieeexplore.ieee.org/document/7210179/

  • Levy S, Ferreira K and Bridges P. Improving application resilience to memory errors with lightweight compression. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. (1-12).

    /doi/10.5555/3014904.3014942

  • Levy S, Ferreira K and Bridges P. (2016). Improving Application Resilience to Memory Errors with Lightweight Compression. SC16: International Conference for High Performance Computing, Networking, Storage and Analysis. 10.1109/SC.2016.27. 978-1-4673-8815-3. (323-334).

    http://ieeexplore.ieee.org/document/7877106/

  • Nair P, Sridharan V and Qureshi M. (2016). XED. ACM SIGARCH Computer Architecture News. 44:3. (341-353). Online publication date: 12-Oct-2016.

    https://doi.org/10.1145/3007787.3001174

  • Deb A, Faraboschi P, Shafiee A, Muralimanohar N, Balasubramonian R and Schreiber R. (2016). Enabling technologies for memory compression: Metadata, mapping, and prediction. 2016 IEEE 34th International Conference on Computer Design (ICCD). 10.1109/ICCD.2016.7753256. 978-1-5090-5142-7. (17-24).

    http://ieeexplore.ieee.org/document/7753256/

  • Nair P, Sridharan V and Qureshi M. XED. Proceedings of the 43rd International Symposium on Computer Architecture. (341-353).

    https://doi.org/10.1109/ISCA.2016.38

  • Subasi O, Unsal O, Labarta J, Yalcin G and Cristal A. (2016). CRC-Based Memory Reliability for Task-Parallel HPC Applications. 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 10.1109/IPDPS.2016.70. 978-1-5090-2140-6. (1101-1112).

    http://ieeexplore.ieee.org/document/7516107/

  • Jian X, Sridharan V and Kumar R. (2016). Parity Helix: Efficient protection for single-dimensional faults in multi-dimensional memory systems. 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA). 10.1109/HPCA.2016.7446094. 978-1-4673-9211-2. (555-567).

    http://ieeexplore.ieee.org/document/7446094/

  • Nair P, Roberts D and Qureshi M. (2016). Citadel. ACM Transactions on Architecture and Code Optimization. 12:4. (1-24). Online publication date: 7-Jan-2016.

    https://doi.org/10.1145/2840807

  • Nair P, Roberts D and Qureshi M. (2015). FaultSim. ACM Transactions on Architecture and Code Optimization. 12:4. (1-24). Online publication date: 7-Jan-2016.

    https://doi.org/10.1145/2831234

  • Palframan D, Kim N and Lipasti M. (2015). COP. ACM SIGARCH Computer Architecture News. 43:3S. (682-693). Online publication date: 4-Jan-2016.

    https://doi.org/10.1145/2872887.2750377

  • Nikolaou P, Sazeides Y, Ndreu L and Kleanthous M. Modeling the implications of DRAM failures and protection techniques on datacenter TCO. Proceedings of the 48th International Symposium on Microarchitecture. (572-584).

    https://doi.org/10.1145/2830772.2830804

  • Chen H, Arunkumar A, Wu C, Mudge T and Chakrabarti C. E-ECC. Proceedings of the 2015 International Symposium on Memory Systems. (60-70).

    https://doi.org/10.1145/2818950.2818961

  • Palframan D, Kim N and Lipasti M. COP. Proceedings of the 42nd Annual International Symposium on Computer Architecture. (682-693).

    https://doi.org/10.1145/2749469.2750377

  • Nair P, Roberts D and Qureshi M. Citadel. Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. (51-62).

    https://doi.org/10.1109/MICRO.2014.57

  • Yu L, Li D, Mittal S and Vetter J. Quantitatively modeling application resilience with the data vulnerability factor. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. (695-706).

    https://doi.org/10.1109/SC.2014.62

  • Hijaz F and Khan O. (2014). NUCA-L1. ACM Transactions on Architecture and Code Optimization. 11:3. (1-28). Online publication date: 27-Oct-2014.

    https://doi.org/10.1145/2631918

  • Chen L, Chen M, Ruan Y, Huang Y, Cui Z, Lu T and Bao Y. (2014). MIMS: Towards a Message Interface Based Memory System. Journal of Computer Science and Technology. 10.1007/s11390-014-1428-7. 29:2. (255-272). Online publication date: 1-Mar-2014.

    http://link.springer.com/10.1007/s11390-014-1428-7

  • Sazeides Y, Özer E, Kershaw D, Nikolaou P, Kleanthous M and Abella J. Implicit-storing and redundant-encoding-of-attribute information in error-correction-codes. Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. (160-171).

    https://doi.org/10.1145/2540708.2540723

  • Li D, Chen Z, Wu P and Vetter J. Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. (1-12).

    https://doi.org/10.1145/2503210.2503226

  • Giridhar B, Cieslak M, Duggal D, Dreslinski R, Chen H, Patti R, Hold B, Chakrabarti C, Mudge T and Blaauw D. Exploring DRAM organizations for energy-efficient and resilient exascale memories. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. (1-12).

    https://doi.org/10.1145/2503210.2503215

  • Fiala D, Mueller F, Engelmann C, Riesen R, Ferreira K and Brightwell R. Detection and correction of silent data corruption for large-scale high-performance computing. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. (1-12).

    /doi/10.5555/2388996.2389102

  • Li S, Yoon D, Chen K, Zhao J, Ahn J, Brockman J, Xie Y and Jouppi N. MAGE. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. (1-11).

    /doi/10.5555/2388996.2389041

  • Udipi A, Muralimanohar N, Balasubramonian R, Davis A and Jouppi N. (2012). LOT-ECC. ACM SIGARCH Computer Architecture News. 40:3. (285-296). Online publication date: 5-Sep-2012.

    https://doi.org/10.1145/2366231.2337192

  • Ferreira K, Pedretti K, Brightwell R, Bridges P, Fiala D and Mueller F. Evaluating operating system vulnerability to memory errors. Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers. (1-8).

    https://doi.org/10.1145/2318916.2318930

  • Udipi A, Muralimanohar N, Balasubramonian R, Davis A and Jouppi N. LOT-ECC. Proceedings of the 39th Annual International Symposium on Computer Architecture. (285-296).

    /doi/10.5555/2337159.2337192