Initial device failures on SDS 0000000000000009 (sds-node-002) - Jun 11, 2025 10:03

68516 2025-06-11 10:03:22.486 SDS_DEV_ERROR_REPORT ERROR Device error reported on SDS: sds-node-002, Device: /dev/disk/by-id/scsi-300000000000000000000000000000001. State: NORMAL upDownState: UP processState: DEV_ERR_INPROGRESS
68517 2025-06-11 10:03:22.555 SDS_DEV_ERROR_REPORT ERROR Device error reported on SDS: sds-node-002, Device: /dev/disk/by-id/scsi-300000000000000000000000000000002. State: NORMAL upDownState: UP processState: DEV_ERR_INPROGRESS
68518 2025-06-11 10:03:22.617 SDS_DEV_ERROR_REPORT ERROR Device error reported on SDS: sds-node-002, Device: /dev/disk/by-id/scsi-300000000000000000000000000000003. State: NORMAL upDownState: UP processState: DEV_ERR_INPROGRESS
68519 2025-06-11 10:03:22.668 SDS_DEV_ERROR_REPORT ERROR Device error reported on SDS: sds-node-002, Device: /dev/disk/by-id/scsi-300000000000000000000000000000004. State: NORMAL upDownState: UP processState: DEV_ERR_INPROGRESS
68520 2025-06-11 10:03:22.688 SDS_DEV_ERROR_REPORT ERROR Device error reported on SDS: sds-node-002, Device: /dev/disk/by-id/scsi-300000000000000000000000000000005. State: NORMAL upDownState: UP processState: DEV_ERR_INPROGRESS
68521 2025-06-11 10:03:22.734 SDS_DEV_ERROR_REPORT ERROR Device error reported on SDS: sds-node-002, Device: /dev/disk/by-id/scsi-300000000000000000000000000000006. State: NORMAL upDownState: UP processState: DEV_ERR_INPROGRESS
68522 2025-06-11 10:03:22.804 SDS_DEV_ERROR_REPORT ERROR Device error reported on SDS: sds-node-002, Device: /dev/disk/by-id/scsi-300000000000000000000000000000007. State: NORMAL upDownState: UP processState: DEV_ERR_INPROGRESS
68523 2025-06-11 10:03:23.724 MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.
68534 2025-06-11 10:03:27.960 SDS_DECOUPLED ERROR SDS: sds-node-002 (id: 0000000000000009) decoupled.

SDS reboot, device errors cleared - Jun 11, 2025 19:50-20:01

68909 2025-06-11 19:50:41.641 SDS_RECONNECTED INFO SDS: sds-node-002 (ID 0000000000000009) reconnected.
68940 2025-06-11 20:00:25.231 MDM_CLI_CONF_COMMAND_RECEIVED INFO Command clear_sds_device_error received, User: ': N/A'. [4346116] SDS: Name: sds-node-002. Clear on all devices.

Capacity on device 000000000000000c already critical during reclaim - Jun 11, 2025 20:02

2025/06/11 20:02:50.766972 DevId:000000000000000c usage: PHYSICAL: 1681935 MB

Planned MDM ownership switch while rebalance active - Jun 13, 2025 07:46

70255 2025-06-13 07:46:11.942 MDM_CLI_CONF_COMMAND_RECEIVED Command SWITCH_MDM_OWNERSHIP primary_mdm_id=000000000000000b ...
70256 2025-06-13 07:46:11.942 CLI_COMMAND_SUCCEEDED Command SWITCH_MDM_OWNERSHIP succeeded

One minute later, SDS reports no-space and the cluster enters DU - Jun 13, 2025 07:47

2025/06/13 07:47:06.083097 feIo_CapLimit_ExceededTrace: SPEF-FE: UD exceeded capacity for dev=000000000000000c rc: EXCEED_SYSTEM_CAPACITY_LIMITATIONS (294)
2025/06/13 07:47:06.083109 spefStorageRegion_HandleSpefRc: translated rc = IO_FAULT_NO_SPACE combId:00000000000f

Cluster auto-recovered once the MDM learned the correct chunk sizes - Jun 13, 2025 07:51

Impact

During the four-minute DU window, write I/O to all volumes hosted on the affected pool failed. Client systems saw volumes switch to read-only mode, virtual-machine and database workloads paused, and administrators postponed restarts to avoid additional write attempts. Even after auto-recovery, rebalance throughput slowed as the system throttled background activity to free capacity, prolonging operational risk until utilization dropped.
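The pattern above can be spotted early in an exported MDM event log and the SDS trace logs. The following is a minimal sketch only: the event and error names are taken from the excerpts above, but the export file name (mdm_events.log) and the trace log path are placeholders to substitute for your environment.

    # Device errors, decoupling, and degraded-state events for the affected SDS (placeholder file name)
    grep -E 'SDS_DEV_ERROR_REPORT|SDS_DECOUPLED|MDM_DATA_DEGRADED' mdm_events.log | grep 'sds-node-002'

    # No-space signature that accompanied the DU window (placeholder trace log path)
    grep -E 'EXCEED_SYSTEM_CAPACITY_LIMITATIONS|IO_FAULT_NO_SPACE' <sds_trace_log>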
When a Primary MDM is switched during an active rebuild/rebalance cycle, the new Primary initially lacks accurate chunk-size statistics for each SDS. For roughly 30 seconds it assumes zero usage and therefore authorises additional migrations into SDS 0000000000000009, whose devices were already more than 99% full. The extra writes filled the last free extents, triggering EXCEED_SYSTEM_CAPACITY_LIMITATIONS and IO_FAULT_NO_SPACE. Until the MDM completes chunk-size synchronisation and recalculates free space correctly, the SDS rejects further writes, resulting in temporary data unavailability.
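Before a planned ownership switch while rebuild/rebalance is running, the actual per-device utilisation can be confirmed from the CLI instead of relying on the new Primary's not-yet-synchronised view. This is a sketch only: the SDS name matches this incident, the protection domain and storage pool names are placeholders, and exact option names can vary across PowerFlex releases, so verify with scli --help first.

    # Log in to the MDM, then list the devices on the suspect SDS with their capacity usage
    scli --login --username admin
    scli --query_sds --sds_name sds-node-002

    # Check overall pool utilisation and spare capacity (placeholder names)
    scli --query_storage_pool --protection_domain_name <PD_NAME> --storage_pool_name <SP_NAME>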
1. Initiate Device or SDS Removal. Using the UI or SCLI, remove the affected SDS/devices from the system (see the SCLI sketch after the version information below). This places them in the "Pending Removal" state. Important: this prevents new data from being written to the problematic devices and avoids reclamation issues. Note: reducing the spare capacity may be required; this is safe as long as the Storage Pool has enough free capacity to perform a full rebuild of at least one node, or if the spare capacity is currently set larger than it needs to be.
2. Resolve Underlying Hardware Issues. Identify and fix the root cause of the problem. Example, for an HPE storage controller failure:
   - Power down the host
   - Reseat the storage controller
   - Reboot the host
   - Allow the SDS to reconnect to the system
3. Re-add Devices or SDS to the System. Re-add the removed devices and verify that the device counts match the other SDSs.

Impacted Versions
PowerFlex 3.x

Fixed In Version
PowerFlex 3.6.6
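A hedged SCLI sketch of steps 1 and 3 follows. The device path is taken from the log excerpts above; the protection domain name, storage pool name, and spare percentage are placeholders; exact option names vary across PowerFlex 3.x releases, so confirm with scli --help before running anything.

    # Step 1 - place an affected device in Pending Removal (repeat per device, or remove the whole SDS)
    scli --remove_sds_device --sds_name sds-node-002 --device_path /dev/disk/by-id/scsi-300000000000000000000000000000001

    # Optional - temporarily lower the spare percentage only if the pool can still absorb a full rebuild of at least one node
    scli --modify_spare_policy --protection_domain_name <PD_NAME> --storage_pool_name <SP_NAME> --spare_percentage <PCT>

    # Step 3 - re-add the device after the hardware fix, then confirm device counts match the other SDSs
    scli --add_sds_device --sds_name sds-node-002 --device_path /dev/disk/by-id/scsi-300000000000000000000000000000001 --storage_pool_name <SP_NAME>
    scli --query_all_sds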