...
On a 28-shard cluster, with each shard being a 3-node replica set running MongoDB 3.2.5 with WiredTiger, I saw a single secondary mongod fail with a read checksum error as shown below. The environment is CentOS Linux release 7.0.1406 and the mongod process writes to local disk. Also attaching the full log file of the secondary that failed.

2016-12-28T07:11:01.763-0500 I INDEX [repl writer worker 13] build index done. scanned 6723 total records. 0 secs
2016-12-28T07:11:07.614-0500 I COMMAND [conn18188] command local.oplog.rs command: getMore { getMore: 14529216539, collection: "oplog.rs", maxTimeMS: 5000, term: 41, lastKnownCommittedOpTime: { ts: Timestamp 1482927067000|217, t: 41 } } cursorid:14529216539 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:3 reslen:105611 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 287484 } }, oplog: { acquireCount: { r: 1 } } } protocol:op_command 287ms
2016-12-28T07:11:19.374-0500 E STORAGE [thread2] WiredTiger (0) [1482927079:374811][17531:0x7faa199e1700], file:trancheinfodb_20161228/collection-392--4692130608470797293.wt, WT_SESSION.checkpoint: read checksum error for 4096B block at offset 339968: block header checksum of 1570021396 doesn't match expected checksum of 111389135
2016-12-28T07:11:19.374-0500 E STORAGE [thread2] WiredTiger (0) [1482927079:374861][17531:0x7faa199e1700], file:trancheinfodb_20161228/collection-392--4692130608470797293.wt, WT_SESSION.checkpoint: trancheinfodb_20161228/collection-392--4692130608470797293.wt: encountered an illegal file format or internal value
2016-12-28T07:11:19.374-0500 E STORAGE [thread2] WiredTiger (-31804) [1482927079:374871][17531:0x7faa199e1700], file:trancheinfodb_20161228/collection-392--4692130608470797293.wt, WT_SESSION.checkpoint: the process must exit and restart: WT_PANIC: WiredTiger library panic
2016-12-28T07:11:19.374-0500 I - [thread2] Fatal Assertion 28558
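For context, the "read checksum error" in the log above means WiredTiger recomputed a checksum over a 4096-byte block it read back from disk and got a different value than the one stored in that block's header, which is why it treats the file as corrupt and panics. The following is only a rough Python sketch of that idea, not WiredTiger's actual implementation (WiredTiger uses CRC32C and its own block header layout; zlib.crc32, the block size, and the function names here are assumptions for illustration):

import zlib

BLOCK_SIZE = 4096  # assumed block size, matching the 4096B block in the log

def verify_block(file_bytes: bytes, offset: int, expected_checksum: int) -> bool:
    """Recompute a CRC over one on-disk block and compare it to the checksum
    recorded for that block; a mismatch indicates the block was damaged after
    it was written (illustrative only, not WiredTiger's real format)."""
    block = file_bytes[offset:offset + BLOCK_SIZE]
    actual_checksum = zlib.crc32(block) & 0xFFFFFFFF
    if actual_checksum != expected_checksum:
        print(f"read checksum error for {BLOCK_SIZE}B block at offset {offset}: "
              f"computed checksum {actual_checksum} doesn't match expected "
              f"checksum {expected_checksum}")
        return False
    return True

Because the stored checksum was correct when the block was written, a mismatch like this generally points at something below the database: disk, controller, filesystem, or memory.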
thomas.schubert commented on Tue, 3 Jan 2017 22:51:18 +0000: Hi darshan.shah@interactivedata.com, Thank you for reporting this issue. This assertion failure generally indicates that some or all of the data files have become corrupt in some way. Unfortunately, in cases like this without a clear reproduction it is challenging to determine the root cause of the corruption. Please ensure the integrity of your disk layer, and let us know if this issue recurs so we can continue to investigate. My recommendation to resolve this issue would be to resync the affected node. Kind regards, Thomas
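For readers landing on this ticket: resyncing a corrupted secondary usually means stopping it, clearing its dbPath, and restarting it so it performs a full initial sync from another replica set member. The sketch below is only an illustration of those steps; the dbPath, service name, and use of systemctl are assumptions for a typical CentOS 7 install, not values from this report, so substitute your own and make sure the rest of the replica set is healthy first:

import shutil
import subprocess
from pathlib import Path

# Hypothetical values for this example; replace with your own deployment's.
DBPATH = Path("/var/lib/mongo")   # the affected secondary's dbPath
SERVICE = "mongod"                # the node's service name

def resync_secondary() -> None:
    """Stop the corrupted secondary, empty its data directory, and restart it
    so it rejoins the replica set via initial sync from a healthy member."""
    subprocess.run(["systemctl", "stop", SERVICE], check=True)
    for entry in DBPATH.iterdir():   # empty the dbPath but keep the directory itself
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    subprocess.run(["systemctl", "start", SERVICE], check=True)

if __name__ == "__main__":
    resync_secondary()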