Symptom
03/13/2023 00:01:10 | FABMGR: (0/0/CPU0, XBAR, 0) (0/RSP0/CPU0, XBAR, 0) backplane fabric crossbar link underwent link retraining to recover from transient error
LC/0/0/CPU0:Mar 11 15:27:07.531 IST: fab_xbar[299]: %PLATFORM-CIH-5-ASIC_ERROR_THRESHOLD : sfe[0]: An interface-err error has occurred causing packet drop transient. ibbReg13.ibbExceptionHier.ibbReg13.ibbExceptionLeaf0.intIpcFnc0UcDataErr Threshold has been exceeded
Further Problem Description
Current design: if more than 5 retrains are observed within a 5-minute interval, the FIA slice is brought down.
03/13/2023 00:00:25 | FABMGR: (0/0/CPU0, XBAR, 0) (0/RSP0/CPU0, XBAR, 0) backplane fabric crossbar link underwent link retraining to recover from transient error
03/13/2023 00:01:10 | FABMGR: (0/0/CPU0, XBAR, 0) (0/RSP0/CPU0, XBAR, 0) backplane fabric crossbar link underwent link retraining to recover from transient error
03/13/2023 00:01:55 | FABMGR: (0/0/CPU0, XBAR, 0) (0/RSP0/CPU0, XBAR, 0) backplane fabric crossbar link underwent link retraining to recover from transient error
03/13/2023 00:02:41 | FABMGR: (0/0/CPU0, XBAR, 0) (0/RSP0/CPU0, XBAR, 0) backplane fabric crossbar link underwent link retraining to recover from transient error
03/13/2023 00:03:55 | FABMGR: (0/0/CPU0 XBAR 0) (0/RSP0/CPU0 XBAR 0) fabric link is down
Each retrain takes around 30 seconds and causes a transient traffic impact.
To minimise the traffic impact, this fix provides the following:
To limit the outage duration, an additional check is added: 3 retrain instances within 2 minutes will bring the FIA slice down.
For proactive monitoring, 8 or 10 retrain instances in a day on a specific LC XBAR to RSP XBAR link or LC XBAR to FIA link will raise critical syslogs.
These critical syslogs let the customer take action (LC OIR). If retrain instances repeat after the OIR, should the card be declared for RMA?
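The thresholds described above can be sketched as a sliding-window counter over retrain timestamps. This is only an illustration of the described behaviour, not the fabric manager's actual implementation: the class name, method names, and action strings are hypothetical, the daily threshold is assumed to be 8 (the description says "8 or 10"), and the 2-minute check is assumed to fire on the third retrain inside the window.

```python
from collections import deque

DAY_SECONDS = 86400
DAILY_SYSLOG_THRESHOLD = 8  # assumption: the description says "8 or 10"


class LinkRetrainMonitor:
    """Tracks retrain timestamps (seconds) for one fabric link. Hypothetical sketch."""

    def __init__(self):
        self.events = deque()

    def _count_within(self, now, window):
        # Number of recorded retrains in the last `window` seconds, including `now`.
        return sum(1 for t in self.events if now - t < window)

    def record_retrain(self, now):
        """Record a retrain at time `now` and return the list of actions it triggers."""
        self.events.append(now)
        # No rule looks back further than one day, so older events can be dropped.
        while self.events and now - self.events[0] >= DAY_SECONDS:
            self.events.popleft()

        actions = []
        if self._count_within(now, 300) > 5:
            # Existing check: more than 5 retrains in 5 minutes.
            actions.append("slice_down")
        elif self._count_within(now, 120) >= 3:
            # Additional check from this fix: 3 retrains in 2 minutes.
            actions.append("slice_down")
        if self._count_within(now, DAY_SECONDS) >= DAILY_SYSLOG_THRESHOLD:
            # Proactive monitoring: raise a critical syslog for this link.
            actions.append("critical_syslog")
        return actions
```

Under these assumptions, feeding in the first three retrain times from the log above (25 s, 70 s, and 115 s after midnight) trips the 2-minute check on the third retrain, rather than waiting for further retrains at roughly 30 seconds of traffic impact each.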
If this fix is picked up as a SMU, it must also include CSCwf31914.