Hello, I have 2 shards, each running with 1 primary and 2 secondaries. I am sharing the config server along with the mongod processes to reduce the cost of the project. All components were running MongoDB 3.4.10. For the upgrade I followed the process below. These are the errors found in the logs:

2018-07-04T22:57:34.431+0800 I REPL [rsBackgroundSync] could not find member to sync from
2018-07-04T22:57:39.429+0800 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to mongo-dba-test01.mylab.com:27017: InvalidSyncSource: Sync source was cleared. Was mongo-dba-test01.mylab.com:27017
2018-07-04T22:57:40.476+0800 I REPL [replexec-7] Starting an election, since we've seen no PRIMARY in the past 10000ms
[NetworkInterfaceASIO-ShardRegistry-0] Connecting to mongo-dba-test01.mylab.com:27017
2018-07-04T22:56:48.659+0800 I ASIO [NetworkInterfaceASIO-ShardRegistry-0] Successfully connected to mongo-dba-test01.mylab.com:27017, took 2ms (1 connections now open to mongo-dba-test01.mylab.com:27017)
2018-07-04T22:56:48.661+0800 I SHARDING [ShardServerCatalogCacheLoader-0] Refresh for collection config.system.sessions took 4 ms and failed :: caused by :: CommandNotFound: no such command: 'forceRoutingTableRefresh', bad cmd: '{ forceRoutingTableRefresh: "config.system.sessions", maxTimeMS: 30000 }'

The configuration files and the error logs are attached.
baabhu@yahoo.com commented on Tue, 10 Jul 2018 04:44:43 +0000:
Are you sure that works? I just executed it on my lower environment. It hung at first and finished running after a few minutes, but I still see the same message. Since this is my UAT environment, let me know where I should upload my log files.

esha.maharishi@10gen.com commented on Tue, 10 Jul 2018 03:19:21 +0000:
baabhu@yahoo.com, have you set the feature compatibility version of the cluster to 3.6? See https://docs.mongodb.com/manual/reference/command/setFeatureCompatibilityVersion/. Until the feature compatibility version has been set to 3.6, the config server will not create (and shard) the config.system.sessions collection (see v3.6.5 code).

baabhu@yahoo.com commented on Tue, 10 Jul 2018 00:53:58 +0000:
How do I reopen the issue? The link you attached is a different issue. I have a working cluster; it is just that the config.system.sessions collection is not sharded.

baabhu@yahoo.com commented on Tue, 10 Jul 2018 00:51:42 +0000:
Thanks for your comments. The issue is that config.system.sessions is not sharded. I followed the standard operating procedure and I am still facing this issue. Any advice on this? I have a cluster upgraded to 3.6.5 with this error popping up every few seconds: "Refresh for collection config.system.sessions took 8 ms and found the collection is not sharded". I am fairly concerned about the system collections not being sharded. If you think "CommandNotFound: no such command: 'forceRoutingTableRefresh'" is not an issue, let's ignore it, continue investigating the root cause of config.system.sessions not being sharded, and help me shard the collection. Also, I would like to have someone in the APAC time zone assigned if possible; in that case the responses to my queries will be much faster.
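The feature compatibility version check and update suggested above can be run from a mongos shell. A minimal sketch (the exact shape of the getParameter output differs between 3.4 and 3.6 binaries):

```javascript
// Check the cluster's current feature compatibility version (run on a mongos).
db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })

// Once every binary in the cluster is 3.6.x, raise the FCV. Until this is
// done, the config server will not create and shard config.system.sessions.
db.adminCommand({ setFeatureCompatibilityVersion: "3.6" })

// Afterwards, the sessions collection should show up in the sharding metadata:
db.getSiblingDB("config").collections.find({ _id: "config.system.sessions" })
```

Note that setFeatureCompatibilityVersion is forwarded to every shard and can take a while on a sharded cluster, which is consistent with the command appearing to hang for a few minutes as described above.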
esha.maharishi@10gen.com commented on Mon, 9 Jul 2018 17:39:46 +0000:
Note, the fact that the secondary tried to refresh for config.system.sessions while in feature compatibility version 3.4 is a separate issue tracked by SERVER-35795.

esha.maharishi@10gen.com commented on Mon, 9 Jul 2018 17:37:22 +0000:
Hi baabhu@yahoo.com, I looked closely at your logs and configuration files and found the following.

This is the cluster configuration:

mongo-dba-test01: 27017 shard01, 27018 shard02, 27019 config
mongo-dba-test02: 27017 shard01, 27018 shard02, 27019 config
mongo-dba-test03: 27017 shard01, 27018 shard02, 27019 config

The "CommandNotFound" errors for the forceRoutingTableRefresh command were only logged by the (secondary) nodes for shard01 and shard02 on mongo-dba-test03:

./03/mongo_shard1.log:2980:2018-07-04T22:56:48.661+0800 I SHARDING [ShardServerCatalogCacheLoader-0] Refresh for collection config.system.sessions took 4 ms and failed :: caused by :: CommandNotFound: no such command: 'forceRoutingTableRefresh', bad cmd: '{ forceRoutingTableRefresh: "config.system.sessions", maxTimeMS: 30000 }'
./03/mongo_shard2.log:3256:2018-07-04T22:56:56.845+0800 I SHARDING [ShardServerCatalogCacheLoader-0] Refresh for collection config.system.sessions took 4 ms and failed :: caused by :: CommandNotFound: no such command: 'forceRoutingTableRefresh', bad cmd: '{ forceRoutingTableRefresh: "config.system.sessions", maxTimeMS: 30000 }'

At the time the error was logged by the shard01 node on mongo-dba-test03 (2018-07-04T22:56:48), the primary for shard01 was mongo-dba-test01/27017, which was running v3.4.15:

01/mongo_shard_01.log:2018-07-04T21:36:18.425+0800 I CONTROL [initandlisten] db version v3.4.15
01/mongo_shard_01.log:2018-07-04T21:44:41.632+0800 I REPL [ReplicationExecutor] transition to PRIMARY
01/mongo_shard_01.log:2018-07-04T21:44:43.557+0800 I REPL [rsSync] transition to primary complete; database writes are now permitted
// from 2018-07-04T21:44:43 to 2018-07-04T22:57:30, this node was shard01's primary, running v3.4.15
01/mongo_shard_01.log:2018-07-04T22:57:30.721+0800 I COMMAND [conn43] Attempting to step down in response to replSetStepDown command
01/mongo_shard_01.log:2018-07-04T22:57:30.722+0800 I REPL [conn43] transition to SECONDARY
01/mongo_shard_01.log:2018-07-04T23:00:32.939+0800 I CONTROL [conn56] shutting down with code:0
01/mongo_shard_01.log:2018-07-04T23:03:32.905+0800 I CONTROL [main] ***** SERVER RESTARTED *****
01/mongo_shard_01.log:2018-07-04T23:03:32.918+0800 I CONTROL [initandlisten] db version v3.6.5

Similarly, at the time the error was logged by the shard02 node on mongo-dba-test03 (2018-07-04T22:56:56), the primary for shard02 was mongo-dba-test01/27018, which was also running v3.4.15:

01/mongo_shard_02.log:2018-07-04T21:36:22.240+0800 I CONTROL [initandlisten] db version v3.4.15
01/mongo_shard_02.log:2018-07-04T22:50:05.569+0800 I REPL [ReplicationExecutor] transition to PRIMARY
01/mongo_shard_02.log:2018-07-04T22:50:07.344+0800 I REPL [rsSync] transition to primary complete; database writes are now permitted
// from 2018-07-04T22:50:07 to 2018-07-04T22:57:52, this node was shard02's primary, running v3.4.15
01/mongo_shard_02.log:2018-07-04T22:57:52.803+0800 I COMMAND [conn39] Attempting to step down in response to replSetStepDown command
01/mongo_shard_02.log:2018-07-04T22:57:52.803+0800 I REPL [conn39] transition to SECONDARY
01/mongo_shard_02.log:2018-07-04T23:00:22.718+0800 I CONTROL [conn52] shutting down with code:0
01/mongo_shard_02.log:2018-07-04T23:03:44.676+0800 I CONTROL [main] ***** SERVER RESTARTED *****
01/mongo_shard_02.log:2018-07-04T23:03:44.689+0800 I CONTROL [initandlisten] db version v3.6.5

Assuming the clocks were relatively synchronized across the machines, it is expected that the v3.6 secondary nodes for the shards on mongo-dba-test03 would log this error while the shards had v3.4 primary nodes. I am closing this as Works as Designed, but please feel free to re-open if you feel this was an incorrect diagnosis.
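When diagnosing a mixed-version window like the one above, each node's binary version and the current primary can be confirmed directly from a shell connected to that node. A sketch (host/port values come from the logs above):

```javascript
// On any member: report the binary version this node is running,
// e.g. "3.4.15" or "3.6.5".
db.adminCommand({ buildInfo: 1 }).version

// On any member of a replica set: list which member is currently primary.
rs.status().members.filter(function (m) { return m.stateStr === "PRIMARY" })
```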
baabhu@yahoo.com commented on Mon, 9 Jul 2018 00:37:12 +0000:
Any updates?

baabhu@yahoo.com commented on Fri, 6 Jul 2018 03:13:40 +0000:
I could start a session with no problem. I confirm there are no primaries or secondaries running on 3.4. The error should have gone away on its own, but I am still seeing the messages even after I upgraded all components to 3.6.5.

esha.maharishi@10gen.com commented on Thu, 5 Jul 2018 15:33:22 +0000:
Hi baabhu@yahoo.com, thank you for your report. You should see this error message logged on any 3.6.x secondary if the primary node in its replica set is 3.4.x. The error message is an artifact of adding safe secondary reads in 3.6. With safe secondary reads, primaries persist their routing table cache so that secondaries within the replica set can replicate the routing table cache via the oplog. If a secondary finds that the persisted routing table cache has not been written yet (the issue you are probably running into, since the routing table refresh is for the "config.system.sessions" namespace) or is stale (because the secondary was contacted by a mongos that had a fresher version of the routing table than what had been replicated to that secondary), the secondary will try to provoke its primary to refresh by sending the "forceRoutingTableRefresh" command to the primary. However, if the primary node is still 3.4.x, it will not have the "forceRoutingTableRefresh" command, so the request will simply fail with CommandNotFound. A user can use the default read concern on secondaries, 'available', when reading from a secondary until the entire replica set has been upgraded to 3.6.

However, I am surprised that this issue is happening for the refresh of "config.system.sessions", which is a system collection (not a user collection). This suggests that a read or direct refresh against "config.system.sessions" is being done on a secondary with a non-default read concern.
baabhu@yahoo.com,
1) In addition to seeing this error message in the logs, are you unable to use the cluster properly? For example, are you unable to start or use a session?
2) If you step up the 3.6 node to primary, are your issues resolved?

misha.tyulenev, do you know where the internal refresh for config.system.sessions comes from?

baabhu@yahoo.com commented on Wed, 4 Jul 2018 08:52:57 +0000:
Logs.rar contains the log files of the entire cluster.
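The workaround mentioned above, reading with the default "available" read concern on secondaries during the mixed-version window, looks roughly like this in the shell. The database and collection names are hypothetical:

```javascript
// Read from a secondary while the replica set still has a 3.4 primary.
// With read concern "available" the secondary serves the read without the
// routing-table freshness check, so it never needs to send
// forceRoutingTableRefresh to the (3.4) primary.
db.getMongo().setReadPref("secondary")
db.getSiblingDB("mydb").runCommand({
  find: "mycollection",                 // hypothetical db/collection names
  readConcern: { level: "available" }
})
```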
1. Run db.adminCommand( { setFeatureCompatibilityVersion: "3.4" } ) from a mongos.
2. Make sure the compatibility version is 3.4 before starting the upgrade.
3. Stop the balancer.
4. Choose a secondary and gracefully shut down its servers.
5. Remove the 3.4 packages.
6. Install the 3.6.5 packages.
7. Start the mongod and mongo config server.
8. Wait for the servers to join the cluster.
9. Repeat steps 4 to 8 for the other secondary.
10. Step down the mongo shard primary and mongo config primary.
11. Repeat steps 4 to 8 on the last secondary.
12. After that, upgrade the mongos component.
13. Start the balancer.
14. Monitor the error logs.
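Steps 4 to 8 above for a single node can be sketched as shell commands. The package names assume a yum-managed install and the config file paths are assumptions; adjust for your platform and layout:

```shell
# Step 4: gracefully shut down the mongod processes on the chosen secondary
# (repeat per port: 27017, 27018, 27019 in this cluster).
mongo --port 27017 --eval 'db.adminCommand({ shutdown: 1 })'

# Steps 5-6: swap the binaries.
sudo yum remove -y mongodb-org-server mongodb-org-shell mongodb-org-tools
sudo yum install -y mongodb-org-server-3.6.5 mongodb-org-shell-3.6.5 mongodb-org-tools-3.6.5

# Step 7: restart with the same config files so the node rejoins its sets.
mongod --config /etc/mongod_shard01.conf   # config file path is an assumption
mongod --config /etc/mongod_config.conf    # config file path is an assumption

# Step 8: wait until the node reports SECONDARY again (myState 2).
mongo --port 27017 --eval 'rs.status().myState'
```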