...
After repairing the db due to disk corruption, the db kept failing due to this error:

2022-05-14T10:19:35.986+0000 E STORAGE [conn1137] WiredTiger error (0) [1652523575:986492][1:0x7f62e8b4d700], file:collection-1792-6427612125299872100.wt, WT_CURSOR.next: __wt_block_read_off, 283: collection-1792-6427612125299872100.wt: read checksum error for 12288B block at offset 14536716288: block header checksum of 0xf0e3571e doesn't match expected checksum of 0xd8f2e206

2022-05-14T10:19:35.986+0000 E STORAGE [conn1137] WiredTiger error (0) [1652523575:986814][1:0x7f62e8b4d700], file:collection-1792-6427612125299872100.wt, WT_CURSOR.next: __wt_bm_corrupt_dump, 135: {14536716288, 12288, 0xd8f2e206}: (chunk 1 of 12): 00 00 00 00 00 00 00 00 71 5a 80 00 00 00 00 00 b7 1a 01 00 06 00 00 00 07 05 00 00 00 40 00 00 1e 57 e3 f0 01 00 00 00 11 e3 3f c9 17 80 e2 1f 05 85 3f 00 00 07 5f 69 64 00 61 c0 dc dd f7 b7 23 3d 00 00 00 00 00 00 f7 b4 04 f0 3c 08 1f 5c 7f 83 65 02 75 73 65 72 6e 61 6d 65 00 0f 00 00 00 32 39 30 31 31 30 37 31 35 30 30 30 39 3.........
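For reference, a checksum failure like this can sometimes be surfaced proactively, before an application query trips over it, by running a full validation of each collection. The following is a minimal sketch, not taken from this ticket: it assumes the legacy 4.2 "mongo" shell, a mongod reachable on the default port, and a hypothetical database name "appDb". Full validation takes an exclusive collection lock in 4.2, so it is typically run on a secondary or during a maintenance window.

    # Sketch: run a full validation over every collection in a (hypothetical)
    # database so structural or checksum problems are reported up front.
    mongo --quiet appDb --eval '
      db.getCollectionNames().forEach(function (name) {
        var res = db.runCommand({ validate: name, full: true });
        print(name + ": valid=" + res.valid +
              (res.errors && res.errors.length ? " errors=" + tojson(res.errors) : ""));
      });
    '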
JIRAUSER1269347 commented on Thu, 19 May 2022 09:25:06 +0000:
Hi Chris, Thanks for your help and insights. I wish you the best. Best regards, Ahmed

JIRAUSER1265262 commented on Wed, 18 May 2022 19:44:44 +0000:
Hi Ahmed, Thanks for providing a good timeline of events on this. Without logs we can't really discern much more information, but there is a chance that the primary node is sustaining some sort of persistent issue that happens irregularly. This could have led to latent corruption on the node that only became a problem when the secondary required an initial sync, forcing the primary to visit pages with corrupt data (and then causing the error you mentioned). Among corruption cases, checksum failures are usually caused by file system or disk issues, so we are somewhat suspicious of glusterfs. There appear to be ways this can lead to data loss in certain situations. We can't speak much to it, but there have been a few examples in the past of corruption affecting users of OpenShift with glusterfs, including a mention in SERVER-40088 of it happening with other database systems. Since we don't have much more information to go on, and you have already resolved the issue, I'm going to go ahead and close this ticket for now. Regards, Christopher

JIRAUSER1269347 commented on Tue, 17 May 2022 20:17:46 +0000:
Hi Chris, Our cluster consists of 3 nodes: 1 primary, 1 secondary and 1 arbiter. The cluster is deployed on OpenShift with glusterfs as the underlying file system. What happened is:
1- The secondary node failed due to a similar error.
2- The whole volume of the secondary was removed and the node was started again with a clean volume.
3- The secondary joined the cluster and started syncing from the primary.
4- During syncing, the primary failed with the corruption error.
5- The secondary then failed as it hadn't synced enough data to be able to act as primary.
6- The primary was restarted automatically and was healthy.
7- Then the cycle from 3 to 6 kept repeating.

JIRAUSER1265262 commented on Tue, 17 May 2022 19:24:35 +0000:
Hi ahmed.nasr@fixedmea.com, It's hard to say without logs, which would include exact information about your setup. Corruption can happen in any number of ways. If you still have logs of this event, please upload them to the support link if you can. We would be especially interested in figuring out why your node's initial sync failed. Did you get the same exact error you reported mid-sync on the fresh node? Or did it happen on the other node at any point? If you could provide a clearer timeline of the events, that would be helpful for future issues. However, we do have some guidelines that should cover some common reasons for this. To avoid a problem like this in the future, it is our strong recommendation to:
- Use a replica set. (Which it sounds like you are.)
- Use the most recent version. (I would recommend switching to a newer version; 4.2 is nearing EOL in April 2023.)
- Keep up-to-date backups of your databases. (Which you did - great.)
- Follow all production notes, especially those for underlying storage systems.
- Schedule and perform regular checks of the integrity of your filesystems and disks.
- Never manipulate the underlying database files in any way.
Regards, Christopher

JIRAUSER1269347 commented on Tue, 17 May 2022 18:33:02 +0000:
Hi Chris, Thanks for your support. Unfortunately it was a major incident and we couldn't afford to wait. We tried repairing, but unfortunately it stated that it had to sync from another node, and that node was down.
Actually it was still resyncing from this node, and the node failed due to that error halfway through. We had to restore a previous backup onto a clean new replica set. But for future reference, what do you think might have caused such corruption, so we can watch for it in our new deployment?

JIRAUSER1265262 commented on Tue, 17 May 2022 18:23:31 +0000:
Hi ahmed.nasr@fixedmea.com, The ideal resolution is to perform a clean resync from an unaffected node. In your case, I'd recommend that next if you are running a replica set. You can also try mongod --repair using the latest patch of your version (in your case, 4.2.20). In the event that running --repair using 4.2.20 is unsuccessful, then please provide the following:
- The logs leading up to the first occurrence of any issue.
- The logs of the repair operation.
- The logs of any attempt to start mongod after the repair operation completed.
Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location? Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Regards, Christopher
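As a rough illustration of the repair and log-collection steps described in the comment above, the sketch below assumes 4.2.20 binaries, a placeholder dbpath of /data/db, and a placeholder log path of /var/log/mongodb/mongod.log; neither path comes from this ticket, and mongod must be stopped before the repair is run.

    # 1. Offline repair with the latest 4.2 patch release (dbpath is a placeholder).
    mongod --repair --dbpath /data/db

    # 2. Archive the mongod logs and the $dbpath/diagnostic.data directory for
    #    upload to the support portal (log path is a placeholder).
    tar czf mongodb-diagnostics.tar.gz \
        /var/log/mongodb/mongod.log* \
        /data/db/diagnostic.data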