...
The issue is that the callback is running in the fixed executor and then also tries doing a blocking call inside the fixed executor here. So if the change callback for different replica sets happened roughly at the same time, all of them will be blocked waiting for a worker thread to be available (which are none, since all of them are waiting for the same thing) Original description: Hi I setup the cluster with MongoDB 4.0 without problems, but I am struggling to setup a cluster with MongoDB 4.2. As soon as I add more than 3 shards to the cluster (I would like to have up to 24), I am not able to run a mongos anymore. It looks fine until the first restart of the mongos process. When I remove all shards and only keep 3 of them, everything is fine again. It does not matter which shard I keep, it seems something with the number of > 3. I tried a lot of things, removed auth, removed encryption, removed compression: no luck. I cannot believe that this is a bug in 4.2.0 because even in the highest loglevel with exception tracing I cannot see any errors, the mongos is just not starting up until the point when it should start listening on all interfaces (with net.bindIpAll true) or on localhost (started with no special interface binding). Can you give me some help regarding this problem I am facing. Thank you
martin.gasser@comfone.com commented on Thu, 5 Sep 2019 08:39:24 +0000: Please let us know if we can support you with additional information or logfiles. BTW: the shards are in PSA (primary-secondary-arbiter) mode:

/* 1 */
{
    "set" : "shard0003",
    "date" : ISODate("2019-09-05T08:36:50.943Z"),
    "myState" : 1,
    "term" : NumberLong(1),
    "syncingTo" : "",
    "syncSourceHost" : "",
    "syncSourceId" : -1,
    "heartbeatIntervalMillis" : NumberLong(2000),
    "optimes" : {
        "lastCommittedOpTime" : { "ts" : Timestamp(1567672608, 1), "t" : NumberLong(1) },
        "lastCommittedWallTime" : ISODate("2019-09-05T08:36:48.660Z"),
        "readConcernMajorityOpTime" : { "ts" : Timestamp(1567672608, 1), "t" : NumberLong(1) },
        "readConcernMajorityWallTime" : ISODate("2019-09-05T08:36:48.660Z"),
        "appliedOpTime" : { "ts" : Timestamp(1567672608, 1), "t" : NumberLong(1) },
        "durableOpTime" : { "ts" : Timestamp(1567672608, 1), "t" : NumberLong(1) },
        "lastAppliedWallTime" : ISODate("2019-09-05T08:36:48.660Z"),
        "lastDurableWallTime" : ISODate("2019-09-05T08:36:48.660Z")
    },
    "members" : [
        {
            "_id" : 0,
            "name" : "cmodb812.togewa.com:27018",
            "ip" : "10.108.2.42",
            "health" : 1.0,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 1210127,
            "optime" : { "ts" : Timestamp(1567672608, 1), "t" : NumberLong(1) },
            "optimeDate" : ISODate("2019-09-05T08:36:48.000Z"),
            "syncingTo" : "",
            "syncSourceHost" : "",
            "syncSourceId" : -1,
            "infoMessage" : "",
            "electionTime" : Timestamp(1566462529, 1),
            "electionDate" : ISODate("2019-08-22T08:28:49.000Z"),
            "configVersion" : 2,
            "self" : true,
            "lastHeartbeatMessage" : ""
        },
        {
            "_id" : 1,
            "name" : "cmodb813.togewa.com:27018",
            "ip" : "10.108.2.43",
            "health" : 1.0,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 1210092,
            "optime" : { "ts" : Timestamp(1567672608, 1), "t" : NumberLong(1) },
            "optimeDurable" : { "ts" : Timestamp(1567672608, 1), "t" : NumberLong(1) },
            "optimeDate" : ISODate("2019-09-05T08:36:48.000Z"),
            "optimeDurableDate" : ISODate("2019-09-05T08:36:48.000Z"),
            "lastHeartbeat" : ISODate("2019-09-05T08:36:50.708Z"),
            "lastHeartbeatRecv" : ISODate("2019-09-05T08:36:49.104Z"),
            "pingMs" : NumberLong(3),
            "lastHeartbeatMessage" : "",
            "syncingTo" : "cmodb812.togewa.com:27018",
            "syncSourceHost" : "cmodb812.togewa.com:27018",
            "syncSourceId" : 0,
            "infoMessage" : "",
            "configVersion" : 2
        },
        {
            "_id" : 2,
            "name" : "cmodb805.togewa.com:27025",
            "ip" : "10.108.2.35",
            "health" : 1.0,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 1210092,
            "lastHeartbeat" : ISODate("2019-09-05T08:36:49.979Z"),
            "lastHeartbeatRecv" : ISODate("2019-09-05T08:36:49.855Z"),
            "pingMs" : NumberLong(0),
            "lastHeartbeatMessage" : "",
            "syncingTo" : "",
            "syncSourceHost" : "",
            "syncSourceId" : -1,
            "infoMessage" : "",
            "configVersion" : 2
        }
    ],
    "ok" : 1.0,
    "$gleStats" : {
        "lastOpTime" : Timestamp(0, 0),
        "electionId" : ObjectId("7fffffff0000000000000001")
    },
    "lastCommittedOpTime" : Timestamp(1567672608, 1),
    "$configServerState" : {
        "opTime" : { "ts" : Timestamp(1567672605, 1), "t" : NumberLong(1) }
    },
    "$clusterTime" : {
        "clusterTime" : Timestamp(1567672608, 1),
        "signature" : {
            "hash" : { "$binary" : "AAAAAAAAAAAAAAAAAAAAAAAAAAA=", "$type" : "00" },
            "keyId" : NumberLong(0)
        }
    },
    "operationTime" : Timestamp(1567672608, 1)
}

ben.caimano commented on Wed, 4 Sep 2019 19:31:03 +0000: This doesn't look like a network or topology failure to me. I'm handing it back to kaloian.manassiev to analyze how far the mongos got into its initialization.

ben.caimano commented on Wed, 4 Sep 2019 19:30:09 +0000: Thanks for those details, martin.gasser@comfone.com!
The following line marks the point where mongos starts accepting connections. It shows up once, before the restart at 06:34:50 GMT, which means the mongos does not finish its initialization after the restart.

2019-09-04T06:33:49.192+0000 I NETWORK [mongosMain] waiting for connections on port 27017

Notably, we never get to FTDC initialization (example log line below), which means we have not made it through sharding initialization.

2019-09-04T06:33:49.191+0000 I FTDC [mongosMain] Initializing full-time diagnostic data capture with directory '/var/log/mongodb/mongos.diagnostic.data'

I do see that we have lines like:

2019-09-04T06:35:20.851+0000 D1 NETWORK [ReplicaSetMonitor-TaskExecutor] Refreshing replica set shard0000 took 0ms
2019-09-04T06:35:20.851+0000 D1 NETWORK [ReplicaSetMonitor-TaskExecutor] Refreshing replica set shard0001 took 0ms
2019-09-04T06:35:20.856+0000 D1 NETWORK [ReplicaSetMonitor-TaskExecutor] Refreshing replica set shard0003 took 4ms
2019-09-04T06:35:20.856+0000 D1 NETWORK [ReplicaSetMonitor-TaskExecutor] Refreshing replica set shard0002 took 5ms
2019-09-04T06:35:21.348+0000 D1 NETWORK [ReplicaSetMonitor-TaskExecutor] Refreshing replica set configrs took 0ms

These continue well after the restart, so the topology scanning is in good shape.

martin.gasser@comfone.com commented on Wed, 4 Sep 2019 07:03:35 +0000: Found this:

[root@cmodb801 admin]# cat /var/log/mongodb/mongos.log | grep maximum
2019-09-04T06:33:47.192+0000 D2 EXECUTOR [ShardRegistry] Not starting new thread in pool Sharding-Fixed because it already has 4, its maximum
2019-09-04T06:33:47.192+0000 D2 EXECUTOR [ShardRegistry] Not starting new thread in pool Sharding-Fixed because it already has 4, its maximum
2019-09-04T06:34:50.856+0000 D2 EXECUTOR [ShardRegistry] Not starting new thread in pool Sharding-Fixed because it already has 4, its maximum
2019-09-04T06:34:50.856+0000 D2 EXECUTOR [ShardRegistry] Not starting new thread in pool Sharding-Fixed because it already has 4, its maximum
2019-09-04T06:34:50.856+0000 D2 EXECUTOR [ShardRegistry] Not starting new thread in pool Sharding-Fixed because it already has 4, its maximum
2019-09-04T06:34:50.856+0000 D2 EXECUTOR [ShardRegistry] Not starting new thread in pool Sharding-Fixed because it already has 4, its maximum
2019-09-04T06:34:50.856+0000 D2 EXECUTOR [ShardRegistry] Not starting new thread in pool Sharding-Fixed because it already has 4, its maximum
2019-09-04T06:34:52.852+0000 D2 EXECUTOR [ShardRegistry] Not starting new thread in pool Sharding-Fixed because it already has 4, its maximum

But this is probably not an error. Cheers Martin

martin.gasser@comfone.com commented on Wed, 4 Sep 2019 06:58:42 +0000: Added the config server log: mongod_config_cmodb803.zip

martin.gasser@comfone.com commented on Wed, 4 Sep 2019 06:57:56 +0000: Hi Benjamin

Thanks for investigating this issue. You are correct; I was not aware that I was running with the logAppend: false option, so everything is now in the mongos log. I repeated the steps with the logging options you suggested:

1. Started the mongos, connected with the mongo shell, and executed sh.status()
2. Added shard0003: sh.addShard("shard0003/cmodb812.togewa.com:27018,cmodb813.togewa.com:27018")
3. Ran sh.status() again, which gives the correct output with 4 shards
4. Quit the shell and restarted the mongos service
5. No longer able to connect to the router

Attached you will find the logs of the mongos router (cmodb801), the commands from the shell, and the log of one of the config servers during that time (cmodb803).
mongos.log mongoshell.txt mongod_config_cmodb803.log

Maybe this will help you to see any hints in these files. Thanks, Martin

ben.caimano commented on Tue, 3 Sep 2019 21:57:47 +0000: Hey martin.gasser@comfone.com,

I only see one startup event and one shutdown event in that mongos log. Given that there is a "command: addShard" line in mongos.log.2019-08-30T11-20-05, I'm assuming that is a before-restart log. I'm particularly interested in the log for mongos after your restart in this procedure. You should see a sequence of messages about it contacting each shard and updating its topology information.

On the note of the "D3 CONNPOOL [ShardRegistry] ..." lines, I personally added these to track our internal connection pool at a high verbosity level. They actually indicate that your mongos is maintaining healthy connections to cmodb802:27019 and cmodb804:27019 during that period. Fascinatingly, cmodb803:27019 isn't listed there. If it happens to be inaccessible, then your issue may be SERVER-43015. SERVER-42790 restricts these log lines further, to D4. D3 and above for NETWORK, ASIO, and CONNPOOL are very verbose, to help us diagnose complicated network behavior. If you want to avoid seeing these lines, you can modify your mongos.conf to be less verbose on network events:

systemLog:
  destination: file
  logAppend: false
  path: /var/log/mongodb/mongos.log
  traceAllExceptions: true
  verbosity: 5
  component:
    network:
      verbosity: 2

martin.gasser@comfone.com commented on Fri, 30 Aug 2019 11:30:40 +0000: Hi Kal

Thanks for the lightning-fast answer. Attached you will find a) the shell log and b) the mongos log. In the shell log you can see the steps done:

1. open mongo shell
2. add shard No. 4
3. quit shell
4. restart mongos -> no longer able to connect

mongo_shell.txt mongos.log.2019-08-30T11-20-05

The addShard starts in the logs at:
2019-08-30T11:19:56.595+0000 D1 TRACKING [conn29] Cmd: addShard, TrackingId: 5d69065c3c00392d854ff620

Thanks a lot! Martin

kaloian.manassiev commented on Fri, 30 Aug 2019 11:14:44 +0000: Hi rui.ribeiro@comfone.com and martin.gasser@comfone.com,

Thank you for the report and for including the repro steps. I don't have access to 3 different machines, but I tried a quick repro on my laptop and wasn't able to reproduce this problem. In order to help us investigate what is going on, would it be possible to attach the complete mongos log, covering the entire timeline of when you added the shards and also after the restart?

Best regards, -Kal.

martin.gasser@comfone.com commented on Fri, 30 Aug 2019 10:58:16 +0000: I can provide additional info for this setup.
No special configs were done; I just installed 3 config servers (CentOS 7.6), started with /etc/mongod.conf:

net:
  bindIp: 0.0.0.0
  port: 27019
processManagement:
  fork: true
  pidFilePath: /var/run/mongodb/mongod.pid
  timeZoneInfo: /usr/share/zoneinfo
storage:
  dbPath: /data/db
  directoryPerDB: true
  journal:
    enabled: true
  wiredTiger:
    engineConfig:
      directoryForIndexes: true
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log
security:
  authorization: disabled
replication:
  oplogSizeMB: 1024
  replSetName: configrs
sharding:
  clusterRole: configsvr

Configured the config RS:

rs.initiate(
  {
    _id: "configrs",
    configsvr: true,
    members: [
      { _id : 0, host : "cmodb802.togewa.com:27019" },
      { _id : 1, host : "cmodb803.togewa.com:27019" },
      { _id : 2, host : "cmodb804.togewa.com:27019" }
    ]
  }
)
var cfg = rs.conf();
cfg.members[0].priority = 2;
cfg.members[1].priority = 1;
cfg.members[2].priority = 1;
cfg.members[0].votes = 1;
cfg.members[1].votes = 1;
cfg.members[2].votes = 1;
rs.reconfig(cfg);

Then started mongos with /etc/mongos.conf:

# where to write logging data.
systemLog:
  destination: file
  logAppend: false
  path: /var/log/mongodb/mongos.log
  verbosity: 5
  traceAllExceptions: true

# how the process runs
processManagement:
  fork: true  # fork and run in background
  pidFilePath: /var/run/mongodb/mongos.pid  # location of pidfile

# network interfaces
net:
  port: 27017
  bindIp: localhost, 10.108.2.15

sharding:
  configDB: configrs/cmodb802.togewa.com:27019,cmodb803.togewa.com:27019,cmodb804.togewa.com:27019

Added the shards:

sh.addShard("shard0000/cmodb806.togewa.com:27018,cmodb807.togewa.com:27018")
.. (4 shards more)

And after adding more than three shards, the mongos no longer binds to its IP. No hints or messages in the log. After adding shard No. 4 (shard0003):

mongos> sh.status()
--- Sharding Status ---
  sharding version: {
    "_id" : 1,
    "minCompatibleVersion" : 5,
    "currentVersion" : 6,
    "clusterId" : ObjectId("5d5e45a7ac9313827bdd8ab9")
  }
  shards:
    { "_id" : "shard0000", "host" : "shard0000/cmodb806.togewa.com:27018,cmodb807.togewa.com:27018", "state" : 1 }
    { "_id" : "shard0001", "host" : "shard0001/cmodb808.togewa.com:27018,cmodb809.togewa.com:27018", "state" : 1 }
    { "_id" : "shard0002", "host" : "shard0002/cmodb810.togewa.com:27018,cmodb811.togewa.com:27018", "state" : 1 }
    { "_id" : "shard0003", "host" : "shard0003/cmodb812.togewa.com:27018,cmodb813.togewa.com:27018", "state" : 1 }
  active mongoses:
    "4.2.0" : 1
  autosplit:
    Currently enabled: yes
  balancer:
    Currently enabled: yes
    Currently running: no
    Failed balancer rounds in last 5 attempts: 0
    Migration Results for the last 24 hours:
      No recent migrations
  databases:
    { "_id" : "config", "primary" : "config", "partitioned" : true }
      config.system.sessions
        shard key: { "_id" : 1 }
        unique: false
        balancing: true
        chunks:
          shard0000  1
        { "_id" : { "$minKey" : 1 } } -->> { "_id" : { "$maxKey" : 1 } } on : shard0000 Timestamp(1, 0)

Then do:

[root@cmodb801 mongodb]# systemctl restart mongos
[root@cmodb801 mongodb]# systemctl status mongos
● mongos.service - High-performance, schema-free document-oriented database
   Loaded: loaded (/usr/lib/systemd/system/mongos.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2019-08-30 10:54:47 UTC; 6s ago
     Docs: https://docs.mongodb.org/manual
  Process: 28418 ExecStartPre=/usr/bin/chmod 0755 /var/run/mongodb (code=exited, status=0/SUCCESS)
  Process: 28414 ExecStartPre=/usr/bin/chown mongod:mongod /var/run/mongodb (code=exited, status=0/SUCCESS)
  Process: 28412 ExecStartPre=/usr/bin/mkdir -p /var/run/mongodb (code=exited, status=0/SUCCESS)
 Main PID: 28421 (mongos)
   CGroup: /system.slice/mongos.service
           ├─28421 /usr/bin/mongos -f /etc/mongos.conf
           ├─28423 /usr/bin/mongos -f /etc/mongos.conf
           └─28424 /usr/bin/mongos -f /etc/mongos.conf

Aug 30 10:54:47 cmodb801.togewa.com systemd[1]: Starting High-performance, schema-free document-oriented database...
Aug 30 10:54:47 cmodb801.togewa.com systemd[1]: Started High-performance, schema-free document-oriented database.
Aug 30 10:54:47 cmodb801.togewa.com mongos[28421]: about to fork child process, waiting until server is ready for connections.
Aug 30 10:54:47 cmodb801.togewa.com mongos[28421]: forked process: 28424

And I am no longer able to connect to the mongos. The mongos log only repeats:

2019-08-30T10:55:48.153+0000 D3 CONNPOOL [ShardRegistry] Updating controller for cmodb802.togewa.com:27019 with State: { requests: 0, ready: 1, pending: 0, active: 0, isExpired: false }
2019-08-30T10:55:48.153+0000 D3 CONNPOOL [ShardRegistry] Comparing connection state for cmodb802.togewa.com:27019 to Controls: { maxPending: 2, target: 1, }
2019-08-30T10:55:49.153+0000 D3 CONNPOOL [ShardRegistry] Updating controller for cmodb804.togewa.com:27019 with State: { requests: 0, ready: 1, pending: 0, active: 0, isExpired: false }
2019-08-30T10:55:49.153+0000 D3 CONNPOOL [ShardRegistry] Comparing connection state for cmodb804.togewa.com:27019 to Controls: { maxPending: 2, target: 1, }
2019-08-30T10:55:49.153+0000 D3 CONNPOOL [ShardRegistry] Updating controller for cmodb802.togewa.com:27019 with State: { requests: 0, ready: 1, pending: 0, active: 0, isExpired: false }
2019-08-30T10:55:49.153+0000 D3 CONNPOOL [ShardRegistry] Comparing connection state for cmodb802.togewa.com:27019 to Controls: { maxPending: 2, target: 1, }
2019-08-30T10:55:49.155+0000 D3 CONNPOOL [ShardRegistry] Triggered refresh timeout for cmodb804.togewa.com:27019
2019-08-30T10:55:49.155+0000 D3 CONNPOOL [ShardRegistry] Refreshing connection to cmodb804.togewa.com:27019
2019-08-30T10:55:49.155+0000 D3 NETWORK [ShardRegistry] Compressing message with snappy
2019-08-30T10:55:49.155+0000 D3 CONNPOOL [ShardRegistry] Updating controller for cmodb804.togewa.com:27019 with State: { requests: 0, ready: 0, pending: 1, active: 0, isExpired: false }
2019-08-30T10:55:49.155+0000 D3 CONNPOOL [ShardRegistry] Comparing connection state for cmodb804.togewa.com:27019 to Controls: { maxPending: 2, target: 1, }
2019-08-30T10:55:49.155+0000 D3 NETWORK [ShardRegistry] Decompressing message with snappy
2019-08-30T10:55:49.155+0000 D3 CONNPOOL [ShardRegistry] Finishing connection refresh for cmodb804.togewa.com:27019
2019-08-30T10:55:49.155+0000 D3 CONNPOOL [ShardRegistry] Updating controller for cmodb804.togewa.com:27019 with State: { requests: 0, ready: 1, pending: 0, active: 0, isExpired: false }
2019-08-30T10:55:49.155+0000 D3 CONNPOOL [ShardRegistry] Comparing connection state for cmodb804.togewa.com:27019 to Controls: { maxPending: 2, target: 1, }
2019-08-30T10:55:50.153+0000 D3 CONNPOOL [ShardRegistry] Updating controller for cmodb804.togewa.com:27019 with State: { requests: 0, ready: 1, pending: 0, active: 0, isExpired: false }
2019-08-30T10:55:50.153+0000 D3 CONNPOOL [ShardRegistry] Comparing connection state for cmodb804.togewa.com:27019 to Controls: { maxPending: 2, target: 1, }
2019-08-30T10:55:50.153+0000 D3 CONNPOOL [ShardRegistry] Updating controller for cmodb802.togewa.com:27019 with State: { requests: 0, ready: 1, pending: 0, active: 0, isExpired: false }
2019-08-30T10:55:50.153+0000 D3 CONNPOOL [ShardRegistry] Comparing connection state for cmodb802.togewa.com:27019 to Controls: { maxPending: 2, target: 1, }
2019-08-30T10:55:51.153+0000 D3 CONNPOOL [ShardRegistry] Updating controller for cmodb804.togewa.com:27019 with State: { requests: 0, ready: 1, pending: 0, active: 0, isExpired: false }
2019-08-30T10:55:51.153+0000 D3 CONNPOOL [ShardRegistry] Comparing connection state for cmodb804.togewa.com:27019 to Controls: { maxPending: 2, target: 1, }
2019-08-30T10:55:51.153+0000 D3 CONNPOOL [ShardRegistry] Updating controller for cmodb802.togewa.com:27019 with State: { requests: 0, ready: 1, pending: 0, active: 0, isExpired: false }
2019-08-30T10:55:51.153+0000 D3 CONNPOOL [ShardRegistry] Comparing connection state for cmodb802.togewa.com:27019 to Controls: { maxPending: 2, target: 1, }
2019-08-30T10:55:52.153+0000 D3 CONNPOOL [ShardRegistry] Updating controller for cmodb804.togewa.com:27019 with State: { requests: 0, ready: 1, pending: 0, active: 0, isExpired: false }
2019-08-30T10:55:52.153+0000 D3 CONNPOOL [ShardRegistry] Comparing connection state for cmodb804.togewa.com:27019 to Controls: { maxPending: 2, target: 1, }
2019-08-30T10:55:52.153+0000 D3 CONNPOOL [ShardRegistry] Updating controller for cmodb802.togewa.com:27019 with State: { requests: 0, ready: 1, pending: 0, active: 0, isExpired: false }
2019-08-30T10:55:52.153+0000 D3 CONNPOOL [ShardRegistry] Comparing connection state for cmodb802.togewa.com:27019 to Controls: { maxPending: 2, target: 1, }
2019-08-30T10:55:53.153+0000 D3 CONNPOOL [ShardRegistry] Updating controller for cmodb804.togewa.com:27019 with State: { requests: 0, ready: 1, pending: 0, active: 0, isExpired: false }
2019-08-30T10:55:53.153+0000 D3 CONNPOOL [ShardRegistry] Comparing connection state for cmodb804.togewa.com:27019 to Controls: { maxPending: 2, target: 1, }
2019-08-30T10:55:53.153+0000 D3 CONNPOOL [ShardRegistry] Updating controller for cmodb802.togewa.com:27019 with State: { requests: 0, ready: 1, pending: 0, active: 0, isExpired: false }
2019-08-30T10:55:53.153+0000 D3 CONNPOOL [ShardRegistry] Comparing connection state for cmodb802.togewa.com:27019 to Controls: { maxPending: 2, target: 1, }
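A closing note on the "Not starting new thread in pool Sharding-Fixed because it already has 4, its maximum" messages grepped out earlier: per the triage note at the top of this ticket, they are the key symptom rather than harmless noise. The sketch below shows the kind of capacity check that emits such a message; it is illustrative only (PoolOptions and maybeStartWorker are hypothetical names, not the actual mongo::ThreadPool API). Once the pool is at its maximum and every worker is blocked inside a callback, queued tasks, including the nested ones those very callbacks are waiting on, can never run, and the mongos never finishes sharding initialization.

    // Hypothetical sketch of a grow-on-demand pool's capacity check.
    #include <cstddef>
    #include <iostream>
    #include <string>

    struct PoolOptions {
        std::string name = "Sharding-Fixed";
        std::size_t maxThreads = 4;
    };

    // Called when work is queued and no worker is idle.
    bool maybeStartWorker(const PoolOptions& opts, std::size_t currentThreads) {
        if (currentThreads >= opts.maxThreads) {
            std::cout << "Not starting new thread in pool " << opts.name
                      << " because it already has " << currentThreads
                      << ", its maximum\n";
            return false;  // the task stays queued behind the blocked workers
        }
        // ...spawn a new worker thread here...
        return true;
    }

    int main() {
        maybeStartWorker(PoolOptions{}, 4);  // reproduces the D2 EXECUTOR message
    }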