...
Dell PERC 9 controllers (H330, H730, H730P, and H830) introduced a feature called Rapid Rebuild, which shortens the time needed to rebuild a failed drive under certain conditions. The feature is based on T10 Rebuild Assist. Dell has determined that data integrity issues are possible when this feature is used under certain conditions.

Table of contents
1. Feature Operation
2. Problem Statement
3. How can I tell if this has happened?
4. Solution

Feature Operation

Any drive that is capable of Rapid Rebuild registers this capability with the controller. The feature is supported on parity-based RAID virtual disks (VDs): RAID 5, RAID 6, RAID 50, and RAID 60. It requires a server to have capable drives, a parity-based RAID level, and a configured hot spare (either global or dedicated to the exact VD).

Each capable drive in the VD tracks its own failed blocks/sectors. A drive may then fail in such a way that it can still communicate with the PERC and tell the PERC which sectors are still "good." Instead of performing time-consuming RAID recovery XOR algorithms for the entire disk, the PERC copies the good sectors to the hot spare and only has to rebuild the known bad sectors. Without Rapid Rebuild, the PERC has to rebuild all sectors, which can be time-consuming for large-capacity drives.

Problem Statement

When the PERC is rebuilding the data for the "bad" sectors, it incorrectly writes data from the cache to the failed drive instead of the hot spare. As a result, data and associated parity are not written to the hot spare. In write-through mode, parity errors occur. In write-back mode, errors occur in both data and associated parity.

How can I tell if this has happened?

Note: How to extract the PERC Controller log is explained in article 000126308.

If the PERC Controller log contains an entry like the one below, you have encountered the issue.
C0:EVT#395950-08/17/16 13:54:59: 114=State change on PD 0b(e0x20/s11) from OFFLINE(XX) to REBUILDASSIST(12)
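For long controller logs, the search for this entry can be scripted. The following is a minimal sketch, not a Dell tool: it assumes that the transition to REBUILDASSIST is the marker to look for and that other entries follow the same layout as the single sample line above. The function name find_rebuild_assist_events and the regular expression are illustrative.

```python
import re

# Pattern modeled on the one sample entry in this article. Other firmware
# releases may format the event differently (assumption).
REBUILD_ASSIST = re.compile(
    r"State change on PD (?P<pd>\S+?)\((?P<loc>[^)]*)\) "
    r"from (?P<old>\w+)\(\w+\) to REBUILDASSIST\(\w+\)"
)


def find_rebuild_assist_events(lines):
    """Return (line_number, pd_id, location, previous_state) per matching line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = REBUILD_ASSIST.search(line)
        if match:
            hits.append((number, match["pd"], match["loc"], match["old"]))
    return hits


if __name__ == "__main__":
    sample = [
        "C0:EVT#395950-08/17/16 13:54:59: 114=State change on PD "
        "0b(e0x20/s11) from OFFLINE(XX) to REBUILDASSIST(12)",
    ]
    for hit in find_rebuild_assist_events(sample):
        print(hit)
```

To scan an exported log, pass the open file object (or a list of its lines) to find_rebuild_assist_events. An empty result only means that no line matched this pattern, so treat it as a hint rather than proof that the issue did not occur.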
Solution

If your VD was in Write-Through mode, only parity data is at risk, and running a consistency check (CC) restores your parity. This only works if this is the first occurrence of a rebuild requirement. If more than one rebuild occurred for the same VD, restore your data from a backup.

If your VD was in Write-Back mode and you have encountered the issue, restore your data from a backup. Unfortunately, there is no way to recover the lost data.

Note: If you have not encountered this issue, update your PERC H730, H730P, and H830 controller firmware to 25.5.0.0018, and your PERC H330 controller firmware to 25.5.0.0019, or to later firmware, which disables the Rapid Rebuild feature. To download the latest firmware version, go to "Drivers & Downloads" for a 13G server and expand the "SAS RAID" menu. The correct firmware has been implemented in the factory, and new servers are not exposed to this issue.

Note: As part of ongoing business process improvement across all key functions, Dell continually reviews key processes and implements improvements. Dell places a high focus on the development, test, and manufacturing processes for our server and storage systems. These process improvements will help prevent future problems and allow Dell to react more rapidly and more aggressively to potential issues in the field.
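The firmware thresholds given in the Solution section can be checked with a short script when auditing several controllers. This is a minimal sketch under stated assumptions: the installed version is supplied as a dotted string obtained by whatever means you already use, "or later" is taken to apply to both thresholds, and the names MIN_FIXED_FIRMWARE, parse_version, and has_fixed_firmware are illustrative.

```python
# Minimum firmware versions that disable Rapid Rebuild, as stated in this article.
MIN_FIXED_FIRMWARE = {
    "H330": "25.5.0.0019",
    "H730": "25.5.0.0018",
    "H730P": "25.5.0.0018",
    "H830": "25.5.0.0018",
}


def parse_version(text):
    """Turn a dotted version such as '25.5.0.0018' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def has_fixed_firmware(model, installed):
    """Return True if the installed version is at or above the article's minimum.

    Raises KeyError for a model not listed in this article.
    """
    minimum = MIN_FIXED_FIRMWARE[model.upper()]
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    print(has_fixed_firmware("H730P", "25.5.0.0018"))
    print(has_fixed_firmware("H330", "25.5.0.0018"))
```

Comparing integer tuples instead of raw strings avoids errors when version fields differ in length or zero padding.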