An online fault-detection method for ReRAM-based computing systems that improves the energy efficiency of artificial intelligence and machine-learning hardware
The current wave of advances in neural networks and related applications has created a need for innovative hardware that can perform matrix operations efficiently. Resistive random-access memory (ReRAM or RRAM) is designed to accelerate matrix-vector computation in machine-learning hardware. However, ReRAM-based computing systems are vulnerable to faults because the fabrication process is still immature. Existing fault-detection methods are time-consuming and not suitable for online use. More efficient methods for online fault detection and error correction in ReRAM-based computing systems are needed.
Researchers at Duke have invented an efficient method for detecting faults in ReRAM-based computing hardware. This technology is intended to provide a more energy-efficient process for artificial intelligence and machine-learning hardware. The method identifies faulty ReRAM crossbars by monitoring their dynamic power consumption and detecting a changepoint in it. Specifically, the inventors compute statistical features of the power consumption before and after the changepoint and train a predictive model using machine-learning techniques to estimate the percentage of faulty cells in a faulty ReRAM crossbar. In this way, the computationally expensive fault-localization and error-recovery steps are carried out only when a high fault rate is estimated. The online fault detection also reduces the overhead of detecting and correcting faults, because regular computation and changepoint-based fault detection run simultaneously. As a result, the system avoids unnecessary fault checks and its overall computational efficiency improves. The inventors demonstrated the effectiveness of the online fault-detection method using three neural-network architectures on two datasets.
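The flow described above can be illustrated with a minimal Python sketch. It is not the inventors' implementation: it assumes that faults appear as a shift in the mean of a sampled power trace, uses a simple least-squares split search as a stand-in for the changepoint detector, and uses a one-feature linear fit as a stand-in for the trained predictive model. All function names, thresholds (`gain_min`, `rate_max`), and calibration values are hypothetical.

```python
import random
import statistics


def detect_changepoint(power, min_seg=5):
    """Return (index, gain) for the single split that most reduces squared error.

    gain is the fraction of total squared error removed by modelling the trace
    as two constant-mean segments instead of one.
    """
    n = len(power)
    total_mean = sum(power) / n
    total_sse = sum((x - total_mean) ** 2 for x in power)
    best_idx, best_sse = None, total_sse
    for k in range(min_seg, n - min_seg + 1):
        left, right = power[:k], power[k:]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((x - ml) ** 2 for x in left) + sum((x - mr) ** 2 for x in right)
        if sse < best_sse:
            best_idx, best_sse = k, sse
    gain = 0.0 if total_sse == 0 else 1.0 - best_sse / total_sse
    return best_idx, gain


def segment_features(power, cp):
    """Statistical features of the trace before and after the changepoint."""
    before, after = power[:cp], power[cp:]
    mean_b, mean_a = statistics.fmean(before), statistics.fmean(after)
    return {
        "mean_before": mean_b,
        "mean_after": mean_a,
        "std_before": statistics.pstdev(before),
        "std_after": statistics.pstdev(after),
        "rel_shift": abs(mean_a - mean_b) / abs(mean_b) if mean_b else 0.0,
    }


def fit_linear(xs, ys):
    """Least-squares line y = slope * x + intercept (stand-in predictive model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def check_crossbar(power, model, gain_min=0.5, rate_max=0.05):
    """Estimate the fault rate; request localization only when it is high."""
    cp, gain = detect_changepoint(power)
    if cp is None or gain < gain_min:
        # No convincing changepoint: keep computing without interruption.
        return {"changepoint": None, "fault_rate": 0.0, "localize": False}
    slope, intercept = model
    rel_shift = segment_features(power, cp)["rel_shift"]
    rate = max(0.0, slope * rel_shift + intercept)
    return {"changepoint": cp, "fault_rate": rate, "localize": rate > rate_max}


# Hypothetical calibration pairs: (relative power shift, fraction of faulty cells).
model = fit_linear([0.05, 0.10, 0.20, 0.30], [0.01, 0.03, 0.08, 0.12])

# Synthetic trace: power drops by 20% after sample 60 (assumed fault signature).
rng = random.Random(0)
trace = [rng.gauss(1.0, 0.02) for _ in range(60)]
trace += [rng.gauss(0.8, 0.02) for _ in range(40)]
print(check_crossbar(trace, model))
```

In this sketch the expensive localization step is requested only when both a clear changepoint is found and the estimated fault rate exceeds `rate_max`, which mirrors the idea of skipping unnecessary checks during normal operation.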
- Greatly reduces time overhead and improves overall computational efficiency by avoiding unnecessary interruptions of regular computation
- Provides more than 94% effective fault coverage
- Detects faults before they significantly affect accuracy
- Evaluation of this technology shows that time overhead is significantly reduced while high classification accuracy is maintained for well-known AI/ML datasets on ReRAM-based computing systems (RCS)