Nodes that use only an Ethernet backend can split if one backend interface flaps and the other backend interface flaps within five minutes of it.

NOTE: This scenario does not affect the InfiniBand architecture. If InfiniBand nodes show matching symptoms, the cause is different and should be investigated.

For example, the messages log shows the link going down and then coming back up:

2021-01-01T16:27:15Z <0.5> ISILON-1(id1) /boot/kernel.amd64/kernel: mlxen0: *** HW EVENT: Link DOWN ***
2021-01-01T16:27:15Z <0.5> ISILON-1(id1) /boot/kernel.amd64/kernel: mlxen0: link state changed to DOWN
2021-01-01T16:27:27Z <0.5> ISILON-1(id1) /boot/kernel.amd64/kernel: mlxen0: *** HW EVENT: Link UP ***
2021-01-01T16:27:27Z <0.5> ISILON-1(id1) /boot/kernel.amd64/kernel: mlxen0: link state changed to UP

When the link is back up, a prep timer begins in the Load Balancing or Failover (LBFO) logs:

2021-01-01T16:27:27Z <26.5> ISILON-1(id1) /boot/kernel.amd64/kernel: [lbfo_kif.c:432](pid 12="intr")(tid=100042) mlxen0: prep state started after link UP

If the other link goes down within five minutes:

2021-01-01T16:29:04Z <0.5> ISILON-1(id1) /boot/kernel.amd64/kernel: mlxen1: *** HW EVENT: Link DOWN ***
2021-01-01T16:29:04Z <0.5> ISILON-1(id1) /boot/kernel.amd64/kernel: mlxen1: link state changed to DOWN

The other path may not be ready yet:

2021-01-01T16:29:04Z <26.4> ISILON-8(id8) /boot/kernel.amd64/kernel: [lbfo_so.c:484](pid 12="intr")(tid=100176) * WARNING * FAILOVER DEFERRED (TX RE-TRANSMIT MONITOR) !!! Alternate path in prep state. Node: 128.221.254.1, CurPath: mlxen1, AltPath: mlxen0, Prep Secs Left: 10

NOTE: The "Prep Secs Left: 10" value in this message is misleading. The counter increments up to 300 rather than counting down, so this output means that the counter is at 10 seconds and 290 seconds remain. There is a fix for this issue.
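The log pattern above can be checked mechanically. The following Python sketch is not a Dell tool. It assumes the input lines all come from one node's messages log and use the "link state changed to" format shown in this article. The names `parse_link_events` and `find_risky_flaps` are hypothetical helpers introduced for illustration. It reports each link-down event that happened while the other interface was still inside its five-minute prep window.

```python
import re
from datetime import datetime, timedelta

# Five-minute prep timer described in this article.
PREP_WINDOW = timedelta(seconds=300)

# Matches the "link state changed to UP/DOWN" kernel lines shown in this article.
LINK_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z .*?kernel: "
    r"(?P<ifname>\w+): link state changed to (?P<state>UP|DOWN)\s*$"
)


def parse_link_events(lines):
    """Return (timestamp, interface, state) tuples for link state changes."""
    events = []
    for line in lines:
        match = LINK_RE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S")
            events.append((ts, match.group("ifname"), match.group("state")))
    return events


def find_risky_flaps(events):
    """Find link-down events that occur while another interface is in prep state.

    Returns (interface_down, interface_in_prep, prep_time_remaining) tuples.
    Assumption: all events belong to the same node.
    """
    last_up = {}
    risky = []
    for ts, ifname, state in sorted(events):
        if state == "UP":
            last_up[ifname] = ts
            continue
        for other, up_ts in last_up.items():
            elapsed = ts - up_ts
            if other != ifname and elapsed < PREP_WINDOW:
                risky.append((ifname, other, PREP_WINDOW - elapsed))
    return risky


sample = [
    "2021-01-01T16:27:15Z <0.5> ISILON-1(id1) /boot/kernel.amd64/kernel: mlxen0: link state changed to DOWN",
    "2021-01-01T16:27:27Z <0.5> ISILON-1(id1) /boot/kernel.amd64/kernel: mlxen0: link state changed to UP",
    "2021-01-01T16:29:04Z <0.5> ISILON-1(id1) /boot/kernel.amd64/kernel: mlxen1: link state changed to DOWN",
]

for down_if, prep_if, remaining in find_risky_flaps(parse_link_events(sample)):
    print(f"{down_if} went down while {prep_if} had {int(remaining.total_seconds())}s of prep left")
```

With the sample lines from this article, mlxen1 goes down 97 seconds after mlxen0 came back up, so the sketch flags it with 203 seconds of prep time remaining.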
This occurs because the LBFO daemon places the link into a "prep state". In prep state, a path whose link went down remains unavailable until a five-minute counter completes after the link comes back up. The path becomes unavailable as soon as the link goes down, but the prep state counter starts only once the link is back up.
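The prep state behavior can be modeled in a few lines. This Python sketch is illustrative only and is not taken from the LBFO source. The function names are hypothetical. Treating exactly 300 seconds as "ready" is an assumption. The second helper applies the note from earlier in this article that the logged "Prep Secs Left" value is really the elapsed count.

```python
PREP_SECONDS = 300  # five-minute prep counter described in this article


def path_state(seconds_since_link_up):
    """Classify a backend path as 'down', 'prep', or 'ready'.

    Pass None while the link is still down. The counter only starts
    once the link has come back up.
    """
    if seconds_since_link_up is None:
        return "down"
    if seconds_since_link_up < PREP_SECONDS:
        return "prep"
    return "ready"  # assumption: the path is usable once 300 seconds have elapsed


def prep_seconds_remaining(logged_counter):
    """Convert the misleading 'Prep Secs Left' log value into time remaining.

    Per the note in this article, the logged value counts up toward 300.
    """
    return max(PREP_SECONDS - logged_counter, 0)


print(path_state(None), path_state(97), path_state(300))
print(prep_seconds_remaining(10))
```

A path is usable for failover only in the "ready" state, which is why a second flap during "down" or "prep" leaves the node with no usable backend path.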
Steps to consider and perform:

1. Do not perform tests simulating backend failures unless this timer is considered.

2. Performing an upgrade to backend switches requires the switch to reload. A reload brings down the interfaces that are connected to the switch, and the node sees them as "no carrier". From the Command Line Interface (CLI), run the following command to check the interface status:

# ifconfig -vvv

3. When the switch finishes reloading, the node interfaces come back up (showing a carrier status of "active"). Run the above command to verify the interface status. The link coming back up begins a five-minute prep state timer, and the interface remains unusable until the timer has completed.

4. During backend switch upgrades, ensure that enough time is allocated when moving from one fabric to the next. For example, once int-a has finished upgrading, wait 10-15 minutes before starting the upgrade for int-b.
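The wait in step 4 can be turned into a simple scheduling check. This Python sketch uses a hypothetical helper name, takes the upper end of the 10-15 minute recommendation as its default, and assumes the wait is measured from the moment the int-a links came back up. It rejects any wait shorter than the five-minute prep timer.

```python
from datetime import datetime, timedelta

PREP_TIMER = timedelta(minutes=5)         # prep state timer from this article
RECOMMENDED_WAIT = timedelta(minutes=15)  # upper end of the 10-15 minute guidance


def earliest_next_fabric_start(links_up_at, wait=RECOMMENDED_WAIT):
    """Return the earliest time to start upgrading the next fabric.

    links_up_at: when the first fabric's node interfaces came back up
    (assumption: this is the reference point for the wait).
    """
    if wait < PREP_TIMER:
        raise ValueError("wait must be at least the five-minute prep timer")
    return links_up_at + wait


int_a_up = datetime(2021, 1, 1, 16, 27, 27)
print("Start int-b upgrade no earlier than", earliest_next_fabric_start(int_a_up))
```

The margin beyond five minutes is deliberate: links on different nodes may not all come up at the same moment, so a wait of exactly five minutes could still catch some nodes in prep state.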