...
A ScaleIO cluster experienced a data failure. The cluster is running the ScaleIO 1.33 code version. At one point the cluster experienced multiple disconnects and loss of connectivity, resulting in an MDM_DATA_FAILED state, and the Tie-Breaker repeatedly lost connectivity:

    2017-06-29 10:42:36.262 CLUSTER_TIEBREAKER_LOST_CON WARNING Tie-breaker lost connection
    1053 2017-06-29 10:42:36.312 CLUSTER_TIEBREAKER_CONNECTED INFO Tie-breaker connected
    1054 2017-06-29 10:47:23.759 CLUSTER_TIEBREAKER_LOST_CON WARNING Tie-breaker lost connection

The SDSs and MDMs were disconnecting and continuously attempting to reconnect. The following is one example from the SDS trace logs; all the other SDSs and the MDMs saw the same loss of connectivity:

    29/06 10:47:25.050097 netSocket_CloseIfNotActive:00663: pSock 0x7f65448c45b0 socket(102) ownerType(CON) state(CONNECTED) type(SNDRCV) pollState(0x4):: Socket receive is Inactive, will close. (msgSend 2, msgRecv 0, bHasMsgToSend 0, bMemPending 0, rcvNotActiveCount 1)bOldClusterVer 0, bOldVer 0
    29/06 10:47:25.050116 netSocket_SetState:01896: pSock 0x7f65448c45b0 socket(102) ownerType(CON) state(CONNECTED) type(SNDRCV) pollState(0x4)::Changing state to 4 (debug) 3
    29/06 10:47:25.050119 netSocket_StartClosing:00536: pSock 0x7f65448c45b0 socket(102) ownerType(CON) state(CLOSING) type(SNDRCV) pollState(0x14)::Closing
    29/06 10:47:25.050120 netSocket_CloseIfNotActive:00663: pSock 0x7f65448c2b30 socket(100) ownerType(CON) state(CONNECTED) type(SNDRCV) pollState(0x4)::Socket receive is Inactive, will close. (msgSend 2, msgRecv 0, bHasMsgToSend 0, bMemPending 0, rcvNotActiveCount 1)bOldClusterVer 0, bOldVer 0
    29/06 10:47:25.050122 netSocket_SetState:01896: pSock 0x7f65448c2b30 socket(100) ownerType(CON) state(CONNECTED) type(SNDRCV) pollState(0x4)::Changing state to 4 (debug) 3

The MDM event logs show a gap of roughly four minutes during which neither MDM could become Primary and the cluster was down because of these connectivity issues. When the connectivity issue was resolved, nodeX resumed its Primary (Master) MDM duties and the SDS nodes reconnected, as seen below:

    2017-06-29 10:47:28.321 CLUSTER_TIEBREAKER_NOT_RESPOND WARNING Tie-breaker is not responding
    2017-06-29 10:51:47.278 MDM_CLUSTER_BECOMING_MASTER WARNING This MDM, ID 7f2e9d36643e5a9b, took control of the cluster and is now the Primary MDM.
    2017-06-29 10:51:54.486 MDM_DATA_FAILED CRITICAL The system is now in DATA FAILURE state. Some data is unavailable.
    2017-06-29 10:51:59.377 CLUSTER_TIEBREAKER_CONNECTED INFO Tie-breaker connected
    2017-06-29 10:51:59.863 SDS_RECONNECTED INFO SDS: SDS_[10.239.127.13] (ID b5b0426b00000004) reconnected
    2017-06-29 10:52:01.774 SDS_RECONNECTED INFO SDS: SDS_[10.239.127.14] (ID b5b0697700000002) reconnected
    2017-06-29 10:52:01.818 SDS_RECONNECTED INFO SDS: SDS_[10.239.127.11] (ID b5b0426700000000) reconnected

After the SDS nodes reconnected, the system went into a DEGRADED state and began a rebuild that lasted about four minutes. The clients (SDCs) also reconnected and regained volume access once ScaleIO reached the DEGRADED state:

    1071 2017-06-29 10:52:02.933 SDC_CONNECTED INFO SDC connected. ID: f82955d100000002; IP: 10.241.128.101; GUID: 0846B62C-CE31-46C4-AD59-18A383CD548B
    1072 2017-06-29 10:52:03.454 MDM_DATA_DEGRADED ERROR The system is now in DEGRADED state.
    1073 2017-06-29 10:52:16.960 SDC_CONNECTED INFO SDC connected. ID: f8297cdf00000004; IP: 10.241.128.114; GUID: 2E40035F-BB99-417A-A3FB-D4E5EBA2A892
    1074 2017-06-29 10:52:19.116 SDC_CONNECTED INFO SDC connected. ID: f82955d200000003; IP: 10.241.128.111; GUID: B7385E37-0C88-40A2-91F7-E92CE485A694
    1075 2017-06-29 10:56:25.719 MDM_DATA_NORMAL INFO The system is now in NORMAL state.
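For reference, the roughly four-minute window with no Primary MDM can be read straight off the event timestamps above. The following is a minimal sketch, in Python, of pulling those timestamps out of an excerpt and printing the gap between consecutive events; it assumes the "YYYY-MM-DD HH:MM:SS.mmm EVENT_NAME SEVERITY message" layout shown in this article and is for illustration only, not a ScaleIO tool.

    from datetime import datetime

    def parse_events(lines):
        """Parse event lines in the layout shown above into (timestamp, event name) pairs."""
        events = []
        for line in lines:
            parts = line.split(None, 4)  # date, time, event name, severity, message
            if len(parts) < 4:
                continue
            ts = datetime.strptime(parts[0] + " " + parts[1], "%Y-%m-%d %H:%M:%S.%f")
            events.append((ts, parts[2]))
        return events

    def print_gaps(lines):
        """Print the elapsed seconds between each pair of consecutive events."""
        events = parse_events(lines)
        for (prev_ts, prev_ev), (ts, ev) in zip(events, events[1:]):
            print(f"{(ts - prev_ts).total_seconds():8.1f}s between {prev_ev} and {ev}")

    if __name__ == "__main__":
        print_gaps([
            "2017-06-29 10:47:28.321 CLUSTER_TIEBREAKER_NOT_RESPOND WARNING Tie-breaker is not responding",
            "2017-06-29 10:51:47.278 MDM_CLUSTER_BECOMING_MASTER WARNING This MDM took control of the cluster",
        ])  # prints roughly 259 seconds, the ~4 minute window described above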
The fact that all nodes simultaneously experienced the same inability to communicate indicates a loss of network connectivity between all nodes, rather than a fault in any single node.
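One quick way to support that conclusion is to compare the timestamps of the "Socket receive is Inactive, will close" events across the SDS trace logs from each node: if they all fall within a few seconds of one another, the problem lies in the network rather than in any single node. The sketch below does that comparison in Python; the trace line layout ("DD/MM HH:MM:SS.ffffff function:line: ...") is assumed from the excerpt above, and the year must be supplied because the trace timestamps omit it.

    from datetime import datetime

    def close_event_times(lines, year=2017):
        """Timestamps of 'Socket receive is Inactive' close events in one node's trace excerpt."""
        times = []
        for line in lines:
            if "Socket receive is Inactive" not in line:
                continue
            stamp = " ".join(line.split(None, 2)[:2])  # e.g. "29/06 10:47:25.050097"
            times.append(datetime.strptime(f"{stamp} {year}", "%d/%m %H:%M:%S.%f %Y"))
        return times

    def spread_seconds(per_node_lines):
        """Seconds between the earliest and latest close event seen across all nodes.
        A small spread (a few seconds) points at a cluster-wide network event."""
        all_times = [t for lines in per_node_lines.values() for t in close_event_times(lines)]
        return (max(all_times) - min(all_times)).total_seconds() if all_times else None

Feeding it a dictionary of per-node trace excerpts like the one above would show how tightly the socket closures cluster across the cluster.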
ScaleIO is working as designed. Investigate and resolve any external connectivity issues:

- Verify that all switches and/or routers are configured correctly.
- Review switch logs for connectivity issues.
- Verify that all switch/router cabling is correct.
- Verify that no changes were made to the network that could cause an outage.
- Ensure that all firewall rules are correct.
- Make sure all ports ScaleIO uses are open (one way to spot-check this is sketched after this list).
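For the last item, a minimal reachability check can be scripted. The Python sketch below simply attempts a TCP connection to the ports that ScaleIO components commonly listen on; the port list (MDM 6611 and 9011, SDS 7072, LIA 9099) and the node addresses are assumptions for illustration and should be confirmed against the actual deployment and firewall policy.

    import socket

    # Assumed default ScaleIO TCP ports, for illustration only; verify against your deployment.
    DEFAULT_PORTS = {"MDM": [6611, 9011], "SDS": [7072], "LIA": [9099]}

    def port_open(host, port, timeout=3.0):
        """Return True if a TCP connection to host:port succeeds within the timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def check_node(host, ports=DEFAULT_PORTS):
        for role, port_list in ports.items():
            for port in port_list:
                state = "open" if port_open(host, port) else "unreachable"
                print(f"{host}:{port} ({role}) {state}")

    if __name__ == "__main__":
        # Example addresses taken from the SDS_RECONNECTED events above.
        for node in ("10.239.127.11", "10.239.127.13", "10.239.127.14"):
            check_node(node)

Running it from each node in turn (MDMs, SDSs, and SDC hosts) gives a quick picture of which paths, if any, are still blocked.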