Symptom
Multiple DIMM sensors randomly report a failure state.
Log messages like the following are seen. (Note: look for multiple DIE_DIMM numbers.)
0/RP1/ADMIN0:Apr 15 02:20:41.250 : envmon[4104]: %PKT_INFRA-FM-4-FAULT_MINOR : ALARM_MINOR :sensor in a failure state :DECLARE :0/RP1: DIE_DIMM2 has raised an alarm for device error
0/RP1/ADMIN0:Apr 15 02:12:20.966 : envmon[4104]: %PKT_INFRA-FM-4-FAULT_MINOR : ALARM_MINOR :sensor in a failure state :DECLARE :0/RP1: DIE_DIMM3 has raised an alarm for device error
0/RP1/ADMIN0:Apr 15 02:20:41.250 : envmon[4104]: %PKT_INFRA-FM-3-FAULT_MAJOR : ALARM_MAJOR :sensor in a failure state :DECLARE :0/RP1: multiple sensor faults
Conditions
Root cause:
Canbus polls each sensor at a 10-second interval.
canbus_driver sends the same CB_DATA_DIMM_PCH_TEMP_GET message for every sensor and updates the DB for all 5 sensors whenever any one of them is read.
When the reply arrives, the handler loops over the sensors by message ID. The ID is the same for all 5 sensors, so the value is applied to every sensor each time any one sensor is read. Each sensor has a fault counter with a maximum of 10, so a fault is raised within 2 reads instead of 10.
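The arithmetic behind "2 reads instead of 10" can be illustrated with a minimal sketch. It is not the driver code. It assumes every reply reports a device error and that each applied reply increments a sensor's fault counter by one; the function name and the `shared_id_bug` flag are hypothetical.

```python
FAULT_THRESHOLD = 10   # per-sensor fault counter maximum, from the description above
NUM_SENSORS = 5        # sensors that share one message ID

def poll_cycles_until_fault(shared_id_bug: bool) -> int:
    """Count poll cycles of consistently bad replies before any sensor faults."""
    counters = [0] * NUM_SENSORS
    cycles = 0
    while max(counters) < FAULT_THRESHOLD:
        cycles += 1
        # One request is sent per sensor in every poll cycle.
        for polled in range(NUM_SENSORS):
            if shared_id_bug:
                # The reply is matched by message ID only, so it is
                # applied to all sensors, not just the one polled.
                targets = range(NUM_SENSORS)
            else:
                # Intended behavior: the reply updates only the polled sensor.
                targets = [polled]
            for sensor in targets:
                counters[sensor] += 1
    return cycles
```

Under these assumptions each counter advances by 5 per poll cycle when the ID is shared, so the threshold of 10 is reached in 2 cycles; with one update per sensor per cycle it takes 10.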
Fix:
canbus_driver uses the same message ID to read the PCH and DIMM sensors, so there is no need to send a request for each sensor.
The driver now requests only one sensor and reuses the reply for the other sensors.
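The read-once, apply-to-all approach might look like the sketch below. It is an illustration under stated assumptions, not the actual canbus_driver change: `SharedIdPoller` and `send_request` are hypothetical names, and resetting the counter on a good reply is an assumption that the description above does not confirm.

```python
FAULT_THRESHOLD = 10   # per-sensor fault counter maximum
NUM_SENSORS = 5        # sensors that share one message ID

class SharedIdPoller:
    """Sends one request per poll cycle and reuses the reply for every sensor."""

    def __init__(self, send_request):
        # send_request is a hypothetical callable that returns True
        # when the reply reports a device error.
        self.send_request = send_request
        self.counters = [0] * NUM_SENSORS
        self.requests_sent = 0

    def poll_cycle(self):
        """Run one poll cycle and return a per-sensor list of fault flags."""
        self.requests_sent += 1
        error = self.send_request()        # single request for the shared ID
        for i in range(NUM_SENSORS):
            # Each counter changes exactly once per cycle.
            # Assumption: a good reply clears the counter.
            self.counters[i] = self.counters[i] + 1 if error else 0
        return [count >= FAULT_THRESHOLD for count in self.counters]
```

With this structure a sensor's counter can only reach the threshold after 10 consecutive bad poll cycles, and the bus carries one request per cycle instead of five.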