...
BugZero found this defect 2585 days ago.
Over the past few days we had several issues in our applications, and after further analysis we found that they are all related to our running instances of sharded MongoDB. During this examination, we noticed that our servers were reporting a lot of connection timeouts. We analyzed our network and found no performance issues. The test results are attached to this ticket in the file 'reportNetwork_mongo-20170811.txt'.

We also found reports on the web from people with problems similar to ours:
https://jira.mongodb.org/browse/SERVER-24711
https://stackoverflow.com/questions/38485285/mongodb-replica-heartbeat-request-time-exceeded
https://jira.mongodb.org/browse/SERVER-24058
https://github.com/rofl0r/proxychains-ng/issues/171
Even so, we think we are facing an undocumented problem (possibly related to those).

A description of the problem: Some requests made to our sharded clusters produce a timeout message (found in the database log) with no apparent reason or pattern that we could identify. If we repeat the same request right after the failed one, it may work without any problems, but sometimes it doesn't. We even implemented this retry in our application code to mitigate the problem, but because it takes a long time to time out and issue the request again, our clients experience slowness, so the retry isn't a satisfactory fix for us.

Currently, we have three different clusters running sharded MongoDB. One of them operates at SoftLayer's (IBM) data center, and the other two at OVH. The data in this ticket focuses on the IBM cluster, even though the other two have the same problem, only because we are already providing a lot of data; if needed, we can provide data from the other two clusters as well.
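The application-side retry described above could look roughly like the following minimal sketch. It is not the reporter's actual code: `OperationTimeout`, `call_with_retry`, and `make_flaky` are hypothetical names introduced only for illustration, and the exception class stands in for whatever timeout error the real driver raises.

```python
import time


class OperationTimeout(Exception):
    """Hypothetical stand-in for a driver timeout error (not a real driver class)."""


def call_with_retry(operation, max_attempts=2, backoff_seconds=0.0, sleep=time.sleep):
    """Run operation(), retrying only when it raises OperationTimeout.

    Returns (result, attempts_used). If every attempt times out, the last
    timeout is re-raised so the caller still sees the failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except OperationTimeout:
            if attempt == max_attempts:
                raise  # out of attempts: surface the timeout
            sleep(backoff_seconds * attempt)  # simple linear backoff


def make_flaky(failures, result):
    """Build a test operation that times out `failures` times, then succeeds."""
    remaining = [failures]

    def operation():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise OperationTimeout("simulated timeout")
        return result

    return operation
```

A request that fails once and then succeeds returns `("ok", 2)` from `call_with_retry(make_flaky(1, "ok"))`. In the worst case the caller waits roughly `max_attempts` times the per-attempt timeout, which is the slowness the report describes.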
The cluster is composed of (all logs are relative to this cluster):
- Servers hosted in IBM's data center.
- 3 MongoDB (3.4.4) Servers running in a VM with 32GB RAM and 8 cores.
- 3 MongoDB (3.4.4) Configs running in a VM with 2GB RAM and 2 cores.
- 8 nodes (3.4.4) with a replication factor of 3. Bare metal with 256GB RAM and 40 cores.
- All of them communicate with each other over a 1Gbps network.

We can say these issues aren't related to infrastructure problems at the DC, because all of our clusters have these issues regardless of the infrastructure provider. It is also important to note that cluster size is not a factor, because we have a tiny cluster of only two servers that faces the same problems.

Our DBA also tried to fine-tune some configurations, with no luck. The tuned configurations were:
- WiredTigerConcurrentReadOperations: from default to 256
- WiredTigerConcurrentWriteOperations: from default to 256
- ShardingTaskExecutorPoolHostTimeoutMS: from default to 7200000
- ShardingTaskExecutorPoolMinSize: from default to 10
- taskExecutorPoolSize: from default to 16 (twice the number of cores)

We also tried changing the max connection pool, with no success. Our development team also tried different versions of the Java MongoDB driver (currently using version 3.4.2), also with no positive results. Example exceptions:
com.mongodb.MongoExecutionTimeoutException: Operation timed out, request was RemoteCommand 254187776 -- target:mongo-shard-geral7a.foobar.com:7308 db:admin cmd:{ isMaster: 1 }
com.mongodb.MongoExecutionTimeoutException: Couldn't get a connection within the time limit

We also tried upgrading and downgrading the MongoDB binaries (3.4.6, 3.4.3, 3.4.0); none of them solved the problem. We have several individual replica-set deployments, and none of them has faced this issue in the past three years.
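One way to look for a pattern in the timeouts is to tally them by target host. The sketch below assumes every timeout line follows the format of the single "Operation timed out, request was RemoteCommand ..." message quoted above; real log lines may differ, and the names `TARGET_RE` and `count_timeouts_by_target` are hypothetical.

```python
import re
from collections import Counter

# Matches the "target:<host>:<port>" part of the timeout message quoted in
# this ticket. The pattern is an assumption based on that one sample line.
TARGET_RE = re.compile(
    r"Operation timed out, request was RemoteCommand \d+ -- "
    r"target:(?P<host>[^\s:]+):(?P<port>\d+)"
)


def count_timeouts_by_target(lines):
    """Return a Counter mapping 'host:port' to the number of matching timeout lines."""
    counts = Counter()
    for line in lines:
        match = TARGET_RE.search(line)
        if match:
            counts[f"{match['host']}:{match['port']}"] += 1
    return counts
```

Passing an open log file (or any iterable of lines) and calling `.most_common()` on the result shows whether the timeouts cluster on a few shards or are spread evenly. Lines such as "Couldn't get a connection within the time limit" carry no target and are not counted.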
The data provided with this ticket is:
- Logs and diagnostic.data files from all MongoDB Nodes in the given cluster (past 3 days).
- Logs from all MongoDB Servers in the given cluster (past 3 days).
- Logs from all MongoDB Configs in the given cluster (past 3 days).
- Network tests between a 'mongos', a config server and a primary node.

IPs and DNS names have been hidden or modified for security reasons.
lucasoares commented on Mon, 30 Oct 2017 23:18:27 +0000: Using glibc 2.23 didn't solve the problem.

lucasoares commented on Mon, 30 Oct 2017 16:52:30 +0000: Hello. I'm investigating older issues in the same component and found some interesting things. First of all, I'm using RHEL 7.3 (kernel 3.10 and glibc 2.17) on my mongo servers. SERVER-26723 says, and I quote: "We have completed our investigation and concluded that the issue described in this ticket is caused by a bug in glibc." SERVER-26723 doesn't have the same error pattern as this issue, but I can't ignore it, so I will install 5 new mongo servers on an OS with a v4 kernel and then try to reproduce the problem. I think SERVER-26654 and SERVER-29206 are more likely to be the same problem as mine, but with different mongo versions. One of the reporters also has a different operating system.

lucasoares commented on Wed, 25 Oct 2017 23:31:20 +0000: OK. Please ask for anything you need to debug this. I'm here to help because I really want this problem fixed, or a good explanation that helps me solve it. Thanks.

mark.agarunov commented on Wed, 25 Oct 2017 17:20:57 +0000: Hello lucasoares, Thank you for providing the additional information. Unfortunately we have not yet determined the cause of this behavior and are still investigating. However, the log you provided should give some additional insight into this. Thanks, Mark

lucasoares commented on Wed, 25 Oct 2017 03:12:12 +0000: I uploaded a log file at verbosity level 5 from one mongos showing this same issue. I hope this helps. The file name is 'logging_verbose_5.7z'.

lucasoares commented on Tue, 24 Oct 2017 18:54:02 +0000: Hello mark.agarunov. I'm still having this issue, but now with a brand new cluster in a different datacenter with a different server provider. My whole team believes the problem is in MongoDB, because we could not find a solution or a functional workaround.
We are trying to add a lot of 'mongos' instances, and some of those are being used by one of our applications. This reduced the problem but is not a good solution for us, since we have more than 300 application servers running. We tried using the same NTP server on all our servers, as advised by a MongoDB employee. We tried changing many infrastructure settings on our servers following the MongoDB production notes. We tried a lot of things, and nothing is helping. Maybe the problem is due to an infrastructure configuration, but we even tried different OS versions (Ubuntu 14.04 and 16.04). Unfortunately, I have had no success proving that this isn't a MongoDB issue. I really need help: at least tell me what kind of parameter or configuration I can try changing, or at least give an update on this issue. Thanks!

mark.agarunov commented on Fri, 22 Sep 2017 21:21:27 +0000: Hello lucasoares, Thank you for providing these files. My apologies for the delay in response. Unfortunately we have not yet determined the cause of this behavior, but we are still investigating this issue. We will provide updates on this as they're available. Thanks, Mark

lucasoares commented on Mon, 14 Aug 2017 22:49:07 +0000: Thanks ramon.fernandez. We uploaded the files.

ramon.fernandez commented on Mon, 14 Aug 2017 12:34:33 +0000: lucasoares, I've created an upload portal so you can send logs and diagnostic data; please let us know when the uploads are complete. Thanks, Ramón.

lucasoares commented on Sat, 12 Aug 2017 02:52:11 +0000: Unfortunately there is a limit of 150MB for attachments. We have tons of logs and diagnostic.data files from the past 3 days to upload. Can you guys open a private portal so we can upload these files? Thanks.