...
removeShard on a config shard succeeds despite the existence of the config database on that shard. The config database is still accessible afterwards, and getShardMap still lists the removed shard even though its entry has been removed from the (still accessible) config.shards collection.

Update: The behavior seen here may be correct, but it is definitely confusing/misleading for users. We should ban use of removeShard against a config shard and point users to the transitionToDedicatedConfigServer command, which is meant for this purpose.

Cluster configuration (note that the config db is on the "config" shard):

mongos> sh.status()
--- Sharding Status ---
  sharding version: { "_id" : 1, "clusterId" : ObjectId("6408d34d1b1386f4db260a16") }
  shards:
        { "_id" : "config", "host" : "configRepl/localhost:27020", "state" : 1, "topologyTime" : Timestamp(1678299982, 3), "draining" : true }
        { "_id" : "jamesRepl", "host" : "jamesRepl/localhost:27030", "state" : 1, "topologyTime" : Timestamp(1678299983, 2) }
  active mongoses:
        "7.0.0-alpha-538-g7cec1b7" : 1
  autosplit:
        Currently enabled: yes
  automerge:
        Currently enabled: yes
  balancer:
        Currently enabled: yes
        Currently running: yes
  databases:
        { "_id" : "config", "primary" : "config", "partitioned" : true }
                config.system.sessions
                        shard key: { "_id" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                config      696
                                jamesRepl   328
                        too many chunks to print, use verbose if you want to force print

movePrimary of the config database to jamesRepl is disallowed:

mongos> db.adminCommand({movePrimary: "config", to: "jamesRepl"})
{
    "ok" : 0,
    "errmsg" : "Can't move primary for config database",
    "code" : 72,
    "codeName" : "InvalidOptions",
    "$clusterTime" : { "clusterTime" : Timestamp(1678300947, 29), "signature" : { "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="), "keyId" : NumberLong(0) } },
    "operationTime" : Timestamp(1678300947, 29)
}

Despite movePrimary of the config db being disallowed, running removeShard on "config" appears to succeed:

mongos> db.adminCommand({removeShard : "config"})
{
    "msg" : "draining started successfully",
    "state" : "started",
    "shard" : "config",
    "note" : "you need to drop or movePrimary these databases",
    "dbsToMove" : [ ],
    "ok" : 1,
    "$clusterTime" : { "clusterTime" : Timestamp(1678302328, 3), "signature" : { "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="), "keyId" : NumberLong(0) } },
    "operationTime" : Timestamp(1678302328, 3)
}
mongos> db.adminCommand({removeShard : "config"})
{
    "msg" : "removeshard completed successfully",
    "state" : "completed",
    "shard" : "config",
    "ok" : 1,
    "$clusterTime" : { "clusterTime" : Timestamp(1678302331, 5), "signature" : { "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="), "keyId" : NumberLong(0) } },
    "operationTime" : Timestamp(1678302331, 5)
}
mongos> db.adminCommand({removeShard : "config"})
{
    "ok" : 0,
    "errmsg" : "Shard config does not exist",
    "code" : 70,
    "codeName" : "ShardNotFound",
    "$clusterTime" : { "clusterTime" : Timestamp(1678302405, 2), "signature" : { "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="), "keyId" : NumberLong(0) } },
    "operationTime" : Timestamp(1678302405, 2)
}

Post removal, mongos is left in an inconsistent state: getShardMap still reports the shard as existing, but its entry has been removed from the config.shards collection (which is still accessible despite the shard holding the config database having been removed):

mongos> db.adminCommand({getShardMap: 1})
{
    "map" : { "jamesRepl" : "jamesRepl/localhost:27030", "config" : "configRepl/localhost:27020" },
    "hosts" : { "localhost:27030" : "jamesRepl", "localhost:27020" : "config" },
    "connStrings" : { "configRepl/localhost:27020" : "config", "jamesRepl/localhost:27030" : "jamesRepl" },
    "ok" : 1,
    "$clusterTime" : { "clusterTime" : Timestamp(1678302405, 2), "signature" : { "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="), "keyId" : NumberLong(0) } },
    "operationTime" : Timestamp(1678302405, 2)
}
mongos> use config
switched to db config
mongos> db.shards.find()
{ "_id" : "jamesRepl", "host" : "jamesRepl/localhost:27030", "state" : 1, "topologyTime" : Timestamp(1678302331, 2) }
mongos>
xgen-internal-githook commented on Tue, 4 Apr 2023 18:42:26 +0000:
Author: {'name': 'wenqinYe', 'email': 'wenqin908@gmail.com', 'username': 'wenqinYe'}
Message: SERVER-74705: removeShard should not be allowed for config shard
Branch: master
https://github.com/mongodb/mongo/commit/e0a1eb3ce7cb84669666e94c9f37d1d3bffe53ec

james.wahlin@10gen.com commented on Wed, 8 Mar 2023 20:06:30 +0000:
Also worth mentioning: occasionally when I try this, instead of the config shard removal completing (as per the removeShard return messaging), it appears stuck in the "draining ongoing" phase. When in this state, trying to remove the other shard fails with "Operation not allowed because it would remove the last shard", indicating that we are in an in-progress state for config shard removal.

Edit: It looks like removeShard gets stuck if I set up my 2-shard config-shard cluster and let it sit idle for a number of minutes before attempting to call removeShard on the config shard.

Edit 2: This appears to be delayed rather than stuck. It is due to a slow migration of chunks for config.system.sessions. Despite this being an almost unused cluster, there are 1000 chunks being migrated at around 1 per second on my Mac (a sketch for watching the per-shard chunk counts appears at the end of this section).

Edit 3: The removeShard completed, but it raises a question: is it correct that a removeShard on the config shard should result in a move of all config.system.sessions chunks to the non-config shard? My expectation would instead be that we migrate chunks back to the config shard and make the collection unsharded as part of moving back to a dedicated config shard.

james.wahlin@10gen.com commented on Wed, 8 Mar 2023 19:07:44 +0000:
Update: The description above was changed to reflect this comment.
Even stranger, on a second test removeShard claims that the config shard has been removed, but it still shows up on getShardMap invocation. I can still read from collections in the config database as well and see data.
mongos> db.adminCommand({removeShard : "config"})
{
    "msg" : "draining started successfully",
    "state" : "started",
    "shard" : "config",
    "note" : "you need to drop or movePrimary these databases",
    "dbsToMove" : [ ],
    "ok" : 1,
    "$clusterTime" : { "clusterTime" : Timestamp(1678302328, 3), "signature" : { "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="), "keyId" : NumberLong(0) } },
    "operationTime" : Timestamp(1678302328, 3)
}
mongos> db.adminCommand({removeShard : "config"})
{
    "msg" : "removeshard completed successfully",
    "state" : "completed",
    "shard" : "config",
    "ok" : 1,
    "$clusterTime" : { "clusterTime" : Timestamp(1678302331, 5), "signature" : { "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="), "keyId" : NumberLong(0) } },
    "operationTime" : Timestamp(1678302331, 5)
}
mongos> db.adminCommand({removeShard : "config"})
{
    "ok" : 0,
    "errmsg" : "Shard config does not exist",
    "code" : 70,
    "codeName" : "ShardNotFound",
    "$clusterTime" : { "clusterTime" : Timestamp(1678302405, 2), "signature" : { "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="), "keyId" : NumberLong(0) } },
    "operationTime" : Timestamp(1678302405, 2)
}
mongos> db.adminCommand({getShardMap: 1})
{
    "map" : { "jamesRepl" : "jamesRepl/localhost:27030", "config" : "configRepl/localhost:27020" },
    "hosts" : { "localhost:27030" : "jamesRepl", "localhost:27020" : "config" },
    "connStrings" : { "configRepl/localhost:27020" : "config", "jamesRepl/localhost:27030" : "jamesRepl" },
    "ok" : 1,
    "$clusterTime" : { "clusterTime" : Timestamp(1678302405, 2), "signature" : { "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="), "keyId" : NumberLong(0) } },
    "operationTime" : Timestamp(1678302405, 2)
}
// Notice that getShardMap still sees "config" whereas it has been removed from config.shards
mongos> use config
switched to db config
mongos> db.shards.find()
{ "_id" : "jamesRepl", "host" : "jamesRepl/localhost:27030", "state" : 1, "topologyTime" : Timestamp(1678302331, 2) }
mongos>
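For the slow config.system.sessions migration called out in the 20:06 comment above, draining progress can be watched from the shell. A minimal sketch, assuming a 5.0+ config schema in which config.chunks references collections by uuid rather than namespace:

mongos> db.adminCommand({removeShard : "config"})
// While draining is in progress, removeShard reports "state" : "ongoing" along with
// a "remaining" document that includes the outstanding chunk count.

mongos> var sessionsColl = db.getSiblingDB("config").collections.findOne({_id: "config.system.sessions"})
mongos> db.getSiblingDB("config").chunks.aggregate([
...     { $match: { uuid: sessionsColl.uuid } },              // chunks are keyed by collection uuid (schema assumption)
...     { $group: { _id: "$shard", chunks: { $sum: 1 } } }    // remaining chunks per shard
... ])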