...
We are running a 3-shard cluster with the latest MongoDB v4 (v4.0.27). In the logs of one shard we constantly see that chunks of the config.system.sessions collection cannot be moved due to differing collection UUIDs:

2021-09-17T09:52:36.600+0200 W SHARDING [conn350] Chunk move failed :: caused by :: OperationFailed: Data transfer error: migrate failed: InvalidUUID: Cannot create collection config.system.sessions because we already have an identically named collection with UUID cfebe605-b114-47d7-a490-691a3668a2d7, which differs from the donor's UUID b7fa2327-0b02-40fe-bf5b-1c1e0e32fefb. Manually drop the collection on this shard if it contains data from a previous incarnation of config.system.sessions

The log suggests dropping the collection, but MongoDB's documentation clearly states not to drop collections from the config database. We cannot even execute a find on this collection because the user is not authorized:

offerStoreIT01:PRIMARY> db.system.sessions.find()
Error: error: {
    "operationTime" : Timestamp(1631867031, 157),
    "ok" : 0,
    "errmsg" : "not authorized on config to execute command { find: \"system.sessions\", filter: {}, lsid: { id: UUID(\"fd897f41-b167-4ee4-82ca-e037d2b7bada\") }, $clusterTime: { clusterTime: Timestamp(1631866971, 1435), signature: { hash: BinData(0, FC3E1B6F3AC0342900DEC1131A12EC3191387A9F), keyId: 6948315370398680269 } }, $db: \"config\" }",
    "code" : 13,

However, the user has the following roles:

"roles" : [
    { "role" : "clusterManager", "db" : "admin" },
    { "role" : "restore", "db" : "admin" },
    { "role" : "clusterMonitor", "db" : "admin" },
    { "role" : "dbAdmin", "db" : "admin" },
    { "role" : "readAnyDatabase", "db" : "admin" },
    { "role" : "userAdminAnyDatabase", "db" : "admin" },
    { "role" : "hostManager", "db" : "admin" },
    { "role" : "dbAdminAnyDatabase", "db" : "admin" },
    { "role" : "readWriteAnyDatabase", "db" : "admin" },
    { "role" : "userAdmin", "db" : "admin" },
    { "role" : "readWrite", "db" : "admin" },
    { "role" : "clusterAdmin", "db" : "admin" },
    { "role" : "root", "db" : "admin" }
]

Obviously, the suggestion in the log file to drop the collection does not seem to work and also contradicts MongoDB's documentation. So how do we fix this UUID issue?
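For reference, the mismatch can be confirmed without reading any session documents by listing collection metadata instead. This is only a sketch and assumes a direct, authenticated connection to each shard primary with a user defined locally on that shard:

    // On each shard primary: show the UUID the shard has recorded locally
    // for config.system.sessions (uses listCollections; no read on the documents is needed)
    db.getSiblingDB('config').getCollectionInfos({ name: 'system.sessions' })

    // On a mongos: show the sharding metadata entry for the collection;
    // it normally carries the UUID the config server considers authoritative
    db.getSiblingDB('config').collections.find({ _id: 'config.system.sessions' }).pretty()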
renctan commented on Fri, 11 Feb 2022 19:48:51 +0000:
It is not advisable to touch config.system.sessions, but this case is extraordinary and this is perhaps the only way to get out of the situation.

I suspect the reason the user you were using to touch the config collection is getting auth errors is that the user is defined at the cluster level. Such a user only works when connected through mongos. If you are connecting directly to a shard, you will need a different, shard-local user to perform privileged tasks. If you're using keyFile authentication, you can check this doc that explains how to add local users to the shard you're trying to connect to. If you're using x509, you can check this doc explaining how to connect to the shard with authentication.

Once you have confirmed that you can perform operations on config.system.sessions, you can perform the steps below:

1. Open mongo shell connections to one mongos (any mongos is fine), the CSRS primary, and all shard primaries, so that the drop can be performed quickly.
2. In another window, tail the CSRS primary's logs and grep for LogicalSessionCacheRefresh.
3. Wait for a group of those log messages to show up. From that point, you have around 5 minutes (the default session refresh interval) to perform the drop and metadata removal in the next two steps. (Ensure steps 4 and 5 are done before the next group of LogicalSessionCacheRefresh messages appears.)
4. On the CSRS primary and on each shard primary, drop the collection: db.getSiblingDB('config').system.sessions.drop()
5. On the mongos, remove the metadata: db.getSiblingDB('config').chunks.remove({ns: 'config.system.sessions'}) and db.getSiblingDB('config').collections.remove({_id: 'config.system.sessions'})
6. Clear the routing cache for the sessions collection on every mongos by running the command {flushRouterConfig: 'config.system.sessions'}.

Note that this procedure can potentially cause active sessions to expire. It can also cause some errors regarding sessions until the config server recreates config.system.sessions.
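For reference, the commands from steps 4-6 above, consolidated into a single sketch. This is not an official runbook; which node each command must be run on is noted in the comments, and db.adminCommand() is assumed as the way to issue flushRouterConfig:

    // Step 4: run on the CSRS primary and on every shard primary
    db.getSiblingDB('config').system.sessions.drop()

    // Step 5: run on one mongos to remove the stale sharding metadata
    db.getSiblingDB('config').chunks.remove({ ns: 'config.system.sessions' })
    db.getSiblingDB('config').collections.remove({ _id: 'config.system.sessions' })

    // Step 6: run on every mongos to clear the routing cache for the sessions collection
    db.adminCommand({ flushRouterConfig: 'config.system.sessions' })

Remember that steps 4 and 5 have to finish within one session refresh interval (roughly 5 minutes by default), as described above.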
JIRAUSER1257066 commented on Fri, 8 Oct 2021 18:44:12 +0000:
Thank you for your patience as I further investigate this issue, and for following up with your discussion on our community forums. I agree that MongoDB should not be able to reach this configuration where the config.system.sessions collection has inconsistent UUIDs across shards. I'm passing this on to the appropriate team to investigate steps to remediate this issue.

kmm.telecom@gmail.com commented on Thu, 7 Oct 2021 14:02:44 +0000:
This happened to me as well after a sharded cluster upgrade from 3.6.10 to 4.0.16 ...

kay.agahd@idealo.de commented on Thu, 7 Oct 2021 09:26:45 +0000:
It looks like other users have encountered this issue as well. To overcome it, they simply dropped the collection even though that contradicts MongoDB's documentation. See: https://www.mongodb.com/community/forums/t/cannot-create-collection-config-system-sessions-because-we-already-have-an-identic/125229

aloplop85@gmail.com commented on Thu, 7 Oct 2021 08:58:59 +0000:
Hi Edwin and Kay, I also agree with Kay that it seems to be a bug present in MongoDB 4.0.x or earlier versions. We have exactly the same issue, and the operations were carried out following MongoDB's recommendations. I guess we should find a way to recreate the collection on the affected shards with the proper UUID in a "clean" way checked by MongoDB, since it seems that they are "previous incarnations" created at some uncertain point in time. Unfortunately, we haven't been able to reproduce it in a test cluster either. Edwin, could you check on the MongoDB side how this can be done? Thanks in advance, Álvaro.

kay.agahd@idealo.de commented on Thu, 7 Oct 2021 06:36:43 +0000:
Hi edwin.zhou, we do believe that this is a MongoDB bug, as a database system should never be in such a state, even if a user made an unintentional mistake. As I wrote already, we always followed MongoDB's documentation and are therefore not aware of any misconfiguration caused by our actions. The database system itself should prevent getting into such a state. Also, the log message contradicts MongoDB's documentation; at the very least this should be fixed. However, as you suggested, I posted the same issue on the MongoDB Developer Community Forums. Let's see how helpful this might be.

JIRAUSER1257066 commented on Wed, 6 Oct 2021 17:21:49 +0000:
Hi kay.agahd@idealo.de, for that question we'd like to encourage you to start by asking our community for help by posting on the MongoDB Developer Community Forums. If the discussion there leads you to suspect a bug in the MongoDB server, then we'd want to investigate it as a possible bug here in the SERVER project. Best, Edwin

kay.agahd@idealo.de commented on Wed, 6 Oct 2021 16:26:25 +0000:
Hi edwin.zhou, the cluster has been running since 2013. During this time it has been migrated to newer servers multiple times, with shards added or removed from time to time, and MongoDB and OS versions have been upgraded continuously. Obviously, something went wrong during these changes, even though we believe we always followed the procedures recommended and documented by MongoDB. We understand that you want to reproduce the issue to be able to fix the root cause, but we unfortunately don't know what might have caused the different collection UUIDs among the shards. So the question is rather: how do we get out of this misconfiguration? Thanks for your help!

JIRAUSER1257066 commented on Wed, 6 Oct 2021 15:57:52 +0000:
Hi kay.agahd@idealo.de and aloplop85@gmail.com, thanks for your report and description of this behavior. I haven't been able to reproduce it given the information you've provided so far. When adding a new shard to a cluster, MongoDB drops the new shard's system.sessions collection during the add process. I suspect there may have been a misconfiguration during the add shard process. Can you provide the steps that occurred prior to observing this behavior? Best, Edwin

aloplop85@gmail.com commented on Thu, 23 Sep 2021 10:14:36 +0000:
I confirm that it also occurs in the 4.2.14 Community version, after upgrading a sharded cluster with two shards from 4.0.23 Community, and it is not possible to delete documents in that collection.
The following trace appears almost every 10 seconds:

2021-09-23T11:38:36.915+0200 I SHARDING [conn21] moveChunk data transfer progress: { waited: true, active: false, ns: "config.system.sessions", from: "rs0/host0:port0", fromShardId: "rs0", min: { _id: MinKey }, max: { _id: { id: UUID("00400000-0000-0000-0000-000000000000") } }, shardKeyPattern: { _id: 1 }, state: "fail", errmsg: "migrate failed: InvalidUUID: Cannot create collection config.system.sessions because we already have an identically named collection with UUID 9ac9d8e...", counts: { cloned: 0, clonedBytes: 0, catchup: 0, steady: 0 }, ok: 1.0, $gleStats: { lastOpTime: { ts: Timestamp(1632389914, 3), t: 7 }, electionId: ObjectId('7fffffff0000000000000007') }, lastCommittedOpTime: Timestamp(1632389914, 3), $configServerState: { opTime: { ts: Timestamp(1632389916, 6), t: 23 } }, $clusterTime: { clusterTime: Timestamp(1632389916, 6), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, operationTime: Timestamp(1632389914, 3) } mem used: 0 documents remaining to clone: 0

[...]

SHARDING [conn21] Chunk move failed :: caused by :: OperationFailed: Data transfer error: migrate failed: InvalidUUID: Cannot create collection config.system.sessions because we already have an identically named collection with UUID 9ac9d8e7-e34c-4dc3-ba00-d1fee861f6e4, which differs from the donor's UUID d5a22dfd-b732-449e-9006-25a13570bfb5. Manually drop the collection on this shard if it contains data from a previous incarnation of config.system.sessions

Then, it is possible to check the different UUIDs on both shards for "ns" : "config.system.sessions":

rs0:PRIMARY> db.getCollectionInfos()
"info" : { "readOnly" : false, "uuid" : UUID("d5a22dfd-b732-449e-9006-25a13570bfb5") },

rs1:PRIMARY> db.getCollectionInfos()
"info" : { "readOnly" : false, "uuid" : UUID("9ac9d8e7-e34c-4dc3-ba00-d1fee861f6e4") },

However, in the config DB:

rs0:PRIMARY> db.cache.collections.find()
{ "_id" : "config.system.sessions", "epoch" : ObjectId("5c779b6486045749004853e8"), "key" : { "_id" : 1 }, "refreshing" : false, "unique" : false, "uuid" : UUID("d5a22dfd-b732-449e-9006-25a13570bfb5"), "lastRefreshedCollectionVersion" : Timestamp(2, 1024) }

rs1:PRIMARY> db.cache.collections.find()
{ "_id" : "config.system.sessions", "epoch" : ObjectId("5c779b6486045749004853e8"), "key" : { "_id" : 1 }, "refreshing" : false, "unique" : false, "uuid" : UUID("d5a22dfd-b732-449e-9006-25a13570bfb5"), "lastRefreshedCollectionVersion" : Timestamp(2, 1024) }
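To make the comparison above easier to repeat, something along these lines can be run in each shard primary's shell. It is only a sketch and assumes the shard has a config.cache.collections entry for the sessions collection, as shown in the output above:

    // Compare the shard's local collection UUID with the UUID cached in its
    // sharding metadata (config.cache.collections). Run on each shard primary.
    var local = db.getSiblingDB('config').getCollectionInfos({ name: 'system.sessions' })[0];
    var cached = db.getSiblingDB('config').getCollection('cache.collections').findOne({ _id: 'config.system.sessions' });
    print('local  UUID: ' + tojson(local.info.uuid));
    print('cached UUID: ' + tojson(cached.uuid));
    // The shard whose local UUID differs from the UUID known to the config server is
    // the one holding the "previous incarnation" referenced in the error message.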