...
[SERVER-53898] High frequency try to connect mongod which shutdown
JIRAUSER1257066 commented on Tue, 2 Feb 2021 18:16:31 +0000: Hi bingfeng198878@163.com, I'm currently unable to reproduce your issue but I'll provide my reproduction steps below so you evaluate what can be missing. Using MongoDB 3.6.3, I launched a PSS sharded replica set with 1 mongos. I launched the mongos with the following parameters: --maxConns 1000 --setParameter ShardingTaskExecutorPoolMaxSize=1 --setParameter taskExecutorPoolSize=1 --setParameter ShardingTaskExecutorPoolMinSize=2 --localThreshold 3 --fork I then insert 10mil documents db = db.getSiblingDB("test") coll = db.getCollection("test_col") coll.drop(); bulk = coll.initializeUnorderedBulkOp(); for (i = 0; i < 10000000; i++) { bulk.insert({ _id: i, localField: i }); } bulk.execute(); and perform a find query with readPref set to "secondary" db = db.getSiblingDB("test") coll = db.getCollection("test_col") db.getMongo().setReadPref("secondary"); coll.find({ localField: 1 }) and shutdown the secondary thats performing the op mongo "localhost:27019/admin" --eval "db.shutdownServer()" Looking into mongos.log, I don't see many log lines printed out. On the other hand, the primary mongod.log is printing out these 11 lines every 2 seconds: 2021-02-02T13:14:19.088-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to localhost:27020 2021-02-02T13:14:19.090-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to localhost:27020 - HostUnreachable: Connection refused 2021-02-02T13:14:19.090-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to localhost:27020 due to failed operation on a connection 2021-02-02T13:14:19.090-0500 I REPL_HB [replexec-1] Error in heartbeat (requestId: 982) to localhost:27020, response status: HostUnreachable: Connection refused 2021-02-02T13:14:19.090-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to localhost:27020 2021-02-02T13:14:19.091-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to localhost:27020 - HostUnreachable: Connection refused 2021-02-02T13:14:19.091-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to localhost:27020 due to failed operation on a connection 2021-02-02T13:14:19.091-0500 I REPL_HB [replexec-3] Error in heartbeat (requestId: 984) to localhost:27020, response status: HostUnreachable: Connection refused 2021-02-02T13:14:19.091-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to localhost:27020 2021-02-02T13:14:19.092-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to localhost:27020 - HostUnreachable: Connection refused 2021-02-02T13:14:19.092-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to localhost:27020 due to failed operation on a connection 2021-02-02T13:14:19.092-0500 I REPL_HB [replexec-3] Error in heartbeat (requestId: 986) to localhost:27020, response status: HostUnreachable: Connection refused However, that doesn't cover the issue you're seeing in the mongos.log file. Is there anything missing in my reproduction steps that could help narrow down whats causing the issue you're observing? Thanks, Edwin bingfeng198878@163.com commented on Mon, 1 Feb 2021 02:33:06 +0000: Hi Edwin Zhou, in production: 1.I use some query request to secondary 2.shutdown secondary sometimes,mongos will has this problem,but it‘s not inevitable. So, I use debug and add breakpoint to reproduction the problem. JIRAUSER1257066 commented on Fri, 29 Jan 2021 19:35:54 +0000: Hi bingfeng198878@163.com, Unfortunately, without a $dbpath covering the timeline of the incident, it is difficult to diagnose why excessive log lines are being printed in production. Are you able to provide a series of steps or events leading up to the incident in production? Best, Edwin bingfeng198878@163.com commented on Fri, 29 Jan 2021 03:17:33 +0000: Hi Edwin Zhou, It's a problem in production, I reproduce it in debug. But I can't provide “dbpath” in production.In production , we have a flood of requests to "secondary", when "secondary" shut down, some mongos will have the problem and other mongos won't. JIRAUSER1257066 commented on Thu, 28 Jan 2021 16:16:02 +0000: Hi bingfeng198878@163.com, We typically diagnose issues that are in production. Is this something you're able to reproduce this issue without setting any breakpoints? Can you reproduce this issue in production? Thanks, Edwin bingfeng198878@163.com commented on Thu, 28 Jan 2021 02:26:38 +0000: Hi Edwin Zhou, I don't understand. What do you mean. I appear the issue on MacOS with Xcode. Step 2 ,I make a breakpoint at "NetworkInterfaceASIO::_asyncRunCommand" funciton ,"asyncSendMessage(cmd.conn().stream(), &cmd.toSend(), std::move(sendMessageCallback));". For example: I query data like: db.uuid.find({userId:1}) , I know "userId:1" should to shard1 and port 16451(a secondary), when the breakpoint was hit,and "op->request().target.port()" is equal 16451,kill the process with listen port 16451. JIRAUSER1257066 commented on Wed, 27 Jan 2021 15:18:42 +0000: Hi bingfeng198878@163.com, I requested it as it helps to pinpoint where and when the incident occurs, and gives us context surrounding the incident you're observing. However, I'm unable to find the incident that you're referring to in your data. Could you clarify step 2 of your reproduction step? 2.make breakpoint at NetworkInterfaceASIO::_asyncRunCommand ‘s asyncSendMessage(cmd.conn().stream(), &cmd.toSend(), std::move(sendMessageCallback)); I'm not sure here where you are setting your breakpoint. Could you go into more detail? Thanks, Edwin bingfeng198878@163.com commented on Wed, 27 Jan 2021 02:55:45 +0000: Hi Edwin Zhou, I have upload dbpath, but it's not important,any db can appear. JIRAUSER1257066 commented on Tue, 26 Jan 2021 18:18:27 +0000: Hi bingfeng198878@163.com, Would you please archive (tar or zip) the $dbpath/diagnostic.data directory (the contents are described here) and attach it to this ticket? Kind regards, Edwin bingfeng198878@163.com commented on Fri, 22 Jan 2021 08:28:24 +0000: supplement: mongos run command ./mongos --configdb rs_config/127.0.0.1:15350,127.0.0.1:15351,127.0.0.1:15352 --bind_ip 0.0.0.0 --port 2910 --logpath /Users/xinli/mongodb_dir/mongos/logs --maxConns 1000 --setParameter ShardingTaskExecutorPoolMaxSize=1 --setParameter taskExecutorPoolSize=1 --setParameter ShardingTaskExecutorPoolMinSize=2 --localThreshold 3 if "ShardingTaskExecutorPoolMinSize" set 1,the problem won't appear.
Online,I call "rs.remove()" repeatedly,mongos will print huge amounts of log like this(1GB+ logs everyminute ): 2020-05-25T18:30:42.175+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452 2020-05-25T18:30:42.175+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452 2020-05-25T18:30:42.175+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused 2020-05-25T18:30:42.175+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176106 finished with response: HostUnreachable: Connection refused 2020-05-25T18:30:42.175+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176106, db:admin, request.cmdobj: \{ isMaster: 1 },request.metadata:{} 2020-05-25T18:30:42.175+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Dropping all pooled connections to 127.0.0.1:15452 due to failed operation on a connection 2020-05-25T18:30:42.175+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused 2020-05-25T18:30:42.176+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176107 finished with response: HostUnreachable: Connection refused 2020-05-25T18:30:42.176+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176107, db:admin, request.cmdobj: \{ isMaster: 1 },request.metadata:{} 2020-05-25T18:30:42.176+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452 2020-05-25T18:30:42.176+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452 2020-05-25T18:30:42.177+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused 2020-05-25T18:30:42.177+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176108 finished with response: HostUnreachable: Connection refused 2020-05-25T18:30:42.177+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176108, db:admin, request.cmdobj: \{ isMaster: 1 },request.metadata:{} 2020-05-25T18:30:42.177+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Dropping all pooled connections to 127.0.0.1:15452 due to failed operation on a connection 2020-05-25T18:30:42.177+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused 2020-05-25T18:30:42.177+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176109 finished with response: HostUnreachable: Connection refused 2020-05-25T18:30:42.177+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176109, db:admin, request.cmdobj: \{ isMaster: 1 },request.metadata:{} 2020-05-25T18:30:42.178+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452 2020-05-25T18:30:42.178+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452 2020-05-25T18:30:42.179+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused 2020-05-25T18:30:42.179+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176110 finished with response: HostUnreachable: Connection refused 2020-05-25T18:30:42.179+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176110, db:admin, request.cmdobj: \{ isMaster: 1 },request.metadata:{} 2020-05-25T18:30:42.179+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Dropping all pooled connections to 127.0.0.1:15452 due to failed operation on a connection 2020-05-25T18:30:42.179+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused 2020-05-25T18:30:42.179+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176111 finished with response: HostUnreachable: Connection refused 2020-05-25T18:30:42.179+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176111, db:admin, request.cmdobj: \{ isMaster: 1 },request.metadata:{} 2020-05-25T18:30:42.179+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452 2020-05-25T18:30:42.179+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452 2020-05-25T18:30:42.180+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused 2020-05-25T18:30:42.180+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176112 finished with response: HostUnreachable: Connection refused 2020-05-25T18:30:42.180+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176112, db:admin, request.cmdobj: \{ isMaster: 1 },request.metadata:{} 2020-05-25T18:30:42.180+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Dropping all pooled connections to 127.0.0.1:15452 due to failed operation on a connection 2020-05-25T18:30:42.180+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused 2020-05-25T18:30:42.180+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176113 finished with response: HostUnreachable: Connection refused 2020-05-25T18:30:42.180+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176113, db:admin, request.cmdobj: \{ isMaster: 1 },request.metadata:{} 2020-05-25T18:30:42.180+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452 2020-05-25T18:30:42.180+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452 I reproduce the problem local. 1.mongos set db.getMongo().setReadPref('secondary') 2.make breakpoint at NetworkInterfaceASIO::_asyncRunCommand ‘s asyncSendMessage(cmd.conn().stream(), &cmd.toSend(), std::move(sendMessageCallback)); 3.exec find operate,like:db.test.find({userId:3000}) 4.when hit the breakpoint(step 2),affirm op is "IsMaster" and shutdown the secondary which the find operate target。 5.Then huge amounts of log will print and will not stop 6. When shut down the target physical machine ,because 5(huge amounts try to connect)from "connect refused" to "physical machine offline", NetworkInterfaceASIO-TaskExecutorPool will very slow to process other op.