Setup: Sharded cluster with replica set shards. MongoDB v3.4.6. WiredTiger with snappy compression. Collection X exists only on one shard (not sharded; probably not relevant).

Problem: When a query fails due to a maxTimeMS timeout (which happens now and again, since we are using a fairly tight limit), mongos marks the node as failed. This is incorrect: the node is NOT failed.

Result: Since the query was against the primary, and the primary is marked as failed, subsequent write operations fail due to unavailability of the primary. This lasts for a second or a few, presumably until the mongos heartbeat monitor detects that the primary is up.

This makes maxTimeMS dangerous to use on queries that target the primary: any timed-out query will temporarily make the shard write-unavailable.

Furthermore, it seems architecturally wrong for mongos to mark the host as failed but then not trigger a failover. The mongos "failed primary" logic is completely disconnected from the actual primary/replica failover/election logic. This means that when mongos reports "no primary found" for a shard, it's not because there's actually no primary in that replica set: there is a primary, and it's healthy.

(I think this problem applies to queries that hit replicas as well, where the replica is marked as failed, but I haven't specifically tested that.)
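The failure window described above can be illustrated with a toy model (purely a sketch of the reported behavior, not MongoDB source; the class and method names are hypothetical): a monitor that treats a user-requested ExceededTimeLimit as host failure makes the primary invisible to writes until the next successful heartbeat.

```python
# Toy model of the reported mongos behavior: the replica-set monitor
# (incorrectly) treats a user maxTimeMS timeout as evidence that the
# primary host is down. Names and structure are illustrative only.

class ReplicaSetMonitor:
    def __init__(self, primary):
        self.primary = primary
        self.failed_hosts = set()

    def on_query_error(self, host, code_name):
        # Pre-fix behavior described in the ticket: a user-facing
        # ExceededTimeLimit error marks the host as failed.
        if code_name == "ExceededTimeLimit":
            self.failed_hosts.add(host)

    def on_heartbeat_success(self, host):
        # A second or a few later, the heartbeat sees the host is up.
        self.failed_hosts.discard(host)

    def find_primary(self):
        # Routing a write: no primary is returned while it is marked failed.
        return None if self.primary in self.failed_hosts else self.primary


monitor = ReplicaSetMonitor("shard3.xyz:27017")
monitor.on_query_error("shard3.xyz:27017", "ExceededTimeLimit")
assert monitor.find_primary() is None          # writes fail: "no primary found"
monitor.on_heartbeat_success("shard3.xyz:27017")
assert monitor.find_primary() == "shard3.xyz:27017"  # shard writable again
```

The point of the model: the host was healthy the whole time; only the monitor's bookkeeping made it unavailable.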
xgen-internal-githook commented on Thu, 20 Dec 2018 15:13:33 +0000:
Author: {'username': 'jsmulrow', 'email': 'jack.mulrow@mongodb.com', 'name': 'Jack Mulrow'}
Message: SERVER-30768 Mark hosts as down on NetworkInterfaceExceededTimeLimit instead of ExceededTimeLimit
(cherry picked from commit 06b3488461d820c9081e5d7fca0fc028470922ea)
Branch: v3.4
https://github.com/mongodb/mongo/commit/9aafd59034c627b4a4cbda492bce3cc6cbef7b9a

ramon.fernandez commented on Thu, 20 Dec 2018 11:35:28 +0000:
paulw@bitheads.com, there's a pull request for the backport that is currently being assessed for quality and safety. We'll post updates to this ticket as they happen. Thanks for your continued patience, Ramón.

paulw@bitheads.com commented on Sat, 8 Dec 2018 18:23:43 +0000:
Question: is there a technical reason why this fix wasn't also applied to the v3.4 branch? We just upgraded from 3.2 to 3.4, and our production servers are now hitting this issue. We're looking at the possibility of applying the same fix to the 3.4 branch and rebuilding, but maybe those more intimate with the code have a reason why it wouldn't work (and thus wasn't done)? Any thoughts would be appreciated. Thanks! Paul.
xgen-internal-githook commented on Mon, 11 Dec 2017 23:52:39 +0000:
Author: {'name': 'Jack Mulrow', 'email': 'jack.mulrow@mongodb.com', 'username': 'jsmulrow'}
Message: SERVER-30768 Mark hosts as down on NetworkInterfaceExceededTimeLimit instead of ExceededTimeLimit
(cherry picked from commit 06b3488461d820c9081e5d7fca0fc028470922ea)
Branch: v3.6
https://github.com/mongodb/mongo/commit/5708ae71e0334874e63951a2868ec30e09287806

xgen-internal-githook commented on Mon, 11 Dec 2017 16:15:43 +0000:
Author: {'name': 'Jack Mulrow', 'email': 'jack.mulrow@mongodb.com', 'username': 'jsmulrow'}
Message: SERVER-30768 Mark hosts as down on NetworkInterfaceExceededTimeLimit instead of ExceededTimeLimit
Branch: master
https://github.com/mongodb/mongo/commit/06b3488461d820c9081e5d7fca0fc028470922ea

kaloian.manassiev commented on Tue, 21 Nov 2017 19:01:08 +0000:
From looking at the code, I do not see a good reason to notify the replica set monitor on ExceededTimeLimit errors, because these should be an indicator that a user-requested time limit was exceeded, not necessarily a problem with the system. As part of this fix we should inspect all the places which return ExceededTimeLimit, and also possibly add NetworkInterfaceExceededTimeLimit to the list of errors which mark shards as failed.

oleg@evergage.com commented on Thu, 24 Aug 2017 17:30:38 +0000:
Thank you Mark! I've been paying attention to the message for marking hosts as failed. I noticed that there's another circumstance that led to a primary being marked as unavailable:

2017-08-23T15:38:27.153+0000 I NETWORK [conn28770] Marking host m17a.xyz:27017 as failed :: caused by :: RecvStaleConfig: can't query replica set node m17a.xyz:27017: stale config in DBClientCursor::dataReceived() ( ns : somedb.some_collection, received : 0|0||000000000000000000000000, wanted : 1|0||597ba0655167407959b03891, recv )

Should this be a separate issue? It feels related to this.
It seems incorrect to flag the entire primary as down just because a stale-config data refresh is happening for one collection (out of many).

mark.agarunov commented on Tue, 22 Aug 2017 19:57:15 +0000:
Hello oleg@evergage.com, thank you for the detailed report. I've managed to reproduce this behavior using the steps you outlined and have set the fixVersion to "Needs Triage" so this issue can be scheduled against our currently planned work. Updates will be posted on this ticket as they happen. Thanks, Mark

oleg@evergage.com commented on Mon, 21 Aug 2017 23:18:27 +0000:
It looks like max_time_ms_sharded.js asserts only that the timeout was correctly sent back, but does not assert that the queried shard server is still reachable afterwards (and that a primary is present).

oleg@evergage.com commented on Mon, 21 Aug 2017 23:14:58 +0000:
I believe the problem lies within shard_remote.cpp, in ShardRemote::updateReplSetMonitor:
https://github.com/mongodb/mongo/blob/r3.4.6/src/mongo/s/client/shard_remote.cpp#L135
It considers ExceededTimeLimit to be an unreachable-host situation. That might have been OK if maxTimeMS were used for internal system queries only. However, it's not OK when an application query specifies maxTimeMS. I recommend removing the clause that marks the host as failed on ExceededTimeLimit. That, or we have to detect whether the time limit was exceeded on a system/internal command or on an application find.
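The commit messages above indicate the direction the fix took: distinguish the user-facing timeout code from the network-level one when deciding whether to mark a host as failed. A minimal Python sketch of that classification (the helper name and the exact set of host-failure codes are illustrative assumptions, not MongoDB source):

```python
# Illustrative model of the post-fix decision: only errors that suggest
# the remote host itself is unreachable should mark it as failed.
# The membership of this set is an assumption for the sketch.
HOST_FAILURE_CODES = {
    "NetworkInterfaceExceededTimeLimit",  # the networking layer timed out
    "HostUnreachable",
    "NetworkTimeout",
}

def should_mark_host_failed(code_name):
    """Return True only for errors indicating the host may actually be down."""
    return code_name in HOST_FAILURE_CODES

# A query exceeding its maxTimeMS means the server answered promptly with
# ExceededTimeLimit -- the host is healthy, so it must not be marked failed.
assert not should_mark_host_failed("ExceededTimeLimit")
assert should_mark_host_failed("NetworkInterfaceExceededTimeLimit")
```

This matches kaloian.manassiev's observation: ExceededTimeLimit signals a user-requested limit was hit, not a problem with the system.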
Steps to reproduce:

1. Set up a sharded cluster with replica set shards.

2. Issue a find(query).maxTimeMS(1) command to cause a timeout, such that the mongo shell prints:

E QUERY [thread1] Error: error: { "ok" : 0, "errmsg" : "operation exceeded time limit", "code" : 50, "codeName" : "ExceededTimeLimit" }

3. At this point, mongos logs that it marked the primary as failed:

2017-08-21T21:11:05.717+0000 I NETWORK [NetworkInterfaceASIO-TaskExecutorPool-1-0] Marking host shard3.xyz:27017 as failed :: caused by :: ExceededTimeLimit: operation exceeded time limit

4. Immediately issue a write command, which will fail with error 133:

Write failed with error code 133 and error message 'could not find host matching read preference { mode: "primary", tags: [ {} ] } for set xyz_shard3