...
When running a large number of update operations, mongod processes randomly crash (not always, and not often). This is a development sharded cluster (core sharding): 3 shards of 3 members each (primary, secondary, arbiter). Data, journal, and log are on separate disks, all mounted as LUKS-encrypted volumes; I suspect LUKS to be the source of the problem. The running updates use $addToSet / $elemMatch. I have attached the error information and backtrace; let me know if you need more information.
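To illustrate, the updates have roughly the following shape (the collection and field names below are made up for the example, not our real schema):

    // Illustrative sketch only: "profiles", "events", "tags", etc. are placeholder names.
    // The query uses $elemMatch to select a document by one of its embedded array
    // elements, and $addToSet adds a value to that matched element via the
    // positional operator.
    db.profiles.update(
        { events: { $elemMatch: { type: "click", campaign: "spring2017" } } },
        { $addToSet: { "events.$.tags": "converted" } },
        { multi: true }
    );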
hmducoulombier@marketing1by1.com commented on Tue, 28 Mar 2017 14:18:22 +0000:
A little follow-up on this issue. We tweaked the server and downgraded the RAM frequency (it is 2400MHz DDR3 RAM) because the motherboard considers frequencies above 1800MHz as overclocking, and it has been working fine so far (a complete memtest helped us identify instability in the RAM). Thanks again for your time spent investigating the issue.
Henri-Maxime

hmducoulombier@marketing1by1.com commented on Mon, 6 Mar 2017 21:14:33 +0000:
Hello @Mark Agarunov,
Thanks for investigating. I'm hoping to build a new development server with more hardware (and not 3 volumes per disk) and to build the LUKS-encrypted volumes over xfs rather than ext4 this time (does this issue also happen with unencrypted ext4?). I can't do that right now, as we would need to invest in both hardware and the time to migrate the data, but it should happen in 2017. If the problem still occurs then, I'll post a new bug report. I've been looking around the web for information on LUKS and problems like this, but it seems that not many people build systems this way, and the literature is quite thin. Last but not least, the problem occurred today while only reading data, whereas before it occurred during massive updates/writes, which is why I thought it was important to post the logs this time.
Thx
Henri-Maxime

mark.agarunov commented on Mon, 6 Mar 2017 19:29:19 +0000:
Hello hmducoulombier@marketing1by1.com,
Thank you for providing these logs. Looking over the output, it appears that the corruption originates outside of MongoDB. Errors of this type frequently indicate an underlying hardware/storage-layer issue. The error message indicates that the data changed between when it was written by MongoDB and when it was read:
    read checksum error for 12288B block at offset 691863552: calculated block checksum of 3408525212 doesn't match expected checksum of 3304251198
You are correct that, with journaling enabled, a repair may not be needed to get MongoDB running again; it may be that a previously corrupted portion of the database only produces errors when it is accessed, not on startup. A repair may fix this issue.
Thanks,
Mark

hmducoulombier@marketing1by1.com commented on Mon, 6 Mar 2017 15:08:06 +0000:
Here is a recent backtrace (it just happened now) while doing a "simple" aggregate on the data (same environment, 3 shards reading at the same time from the same disk): jira-err.log
Hope it helps.

hmducoulombier@marketing1by1.com commented on Thu, 23 Feb 2017 19:51:37 +0000:
Another backtrace of another (yet similar) issue.

hmducoulombier@marketing1by1.com commented on Thu, 23 Feb 2017 19:51:12 +0000:
Hello Mark,
Here are the answers to your questions:
Only one of the mongod processes crashes, sometimes two (but not the same ones, nor at the same time). Also, it is not necessarily the primary.
The storage is SSDs directly on hardware: one SSD for the data, one for the journals, and one for the logs. Each disk is partitioned into 12 partitions, and each mongod uses its own 3 partitions (I'm including the config servers here).
We do not need to repair the database; restarting the affected mongod is enough (journaling is enabled, which is probably why).
However, we did have another, more severe crash and had to restore the affected member of the replica set (from the secondary) because it had invalid BSON that was making some requests impossible.
I'm attaching a second log with this other backtrace (repairing the member failed, with no output at all). I'll post the complete log as soon as I can. Thanks for investigating this issue.

mark.agarunov commented on Thu, 23 Feb 2017 17:27:17 +0000:
Hello hmducoulombier@marketing1by1.com,
Thank you for the report. After looking over the output you've provided, I have a few questions and requests so that we can better investigate this:
Do all of the mongod instances crash, or only one specific instance?
What is the layout of the underlying storage? (LVM, virtual block storage, directly on hardware, etc.)
Have you tried running a repair on the affected database(s)?
Additionally, please send the complete logs from any mongod instances affected by this behavior.
Thanks,
Mark
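A rough sketch of the validation and repair steps being discussed, runnable from the mongo shell against the affected member; the database and collection names below are placeholders, not taken from this deployment, and a repair should only be attempted after a backup and with enough free disk space:

    // Placeholder names: "mydb" is the affected database, "events" the suspect collection.
    use mydb

    // Full validation scans the collection and reports structural problems,
    // e.g. invalid BSON documents or blocks failing checksum verification.
    db.events.validate(true)

    // If validation reports corruption, a repair can be attempted from the shell:
    db.repairDatabase()

    // An offline repair of the whole dbpath is also possible with the member's
    // mongod stopped:
    //   mongod --dbpath <data path> --repair
    //
    // For a replica set member, resyncing from a healthy secondary (as was done
    // here for the member with invalid BSON) is often the safer recovery path.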