...
Outside of TSAN and ASAN builds, shutting down the service entry point immediately returns true without running any shutdown code. A separate interface, shutdownAndWait, was introduced to actually shut down the service entry point:

bool ServiceEntryPointImpl::shutdown(Milliseconds timeout) {
#if __has_feature(address_sanitizer) || __has_feature(thread_sanitizer)
    // When running under address sanitizer, we get false positive leaks due to disorder around
    // the lifecycle of a connection and request. When we are running under ASAN, we try a lot
    // harder to dry up the server from active connections before going on to really shut down.
    return shutdownAndWait(timeout);
#else
    return true;
#endif
}

We should remove the special-case handling in shutdown and have it run the body of shutdownAndWait. This would also obviate the need for shutdownAndWait.

Acceptance criteria:
- Change the code so the #if lives at a higher level of the stack.
- Make sure we link this work to the relevant shutdown project.
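To make the acceptance criteria concrete, here is a minimal, self-contained sketch of the proposed shape. The type and function names are placeholders rather than the actual MongoDB transport code, and the caller-side #if only illustrates what "moving the #if to a higher level of the stack" could look like.

#include <chrono>
#include <iostream>

using Milliseconds = std::chrono::milliseconds;  // stand-in for mongo::Milliseconds

struct ServiceEntryPoint {
    // After the proposed change, shutdown() always runs the draining logic (the former body
    // of shutdownAndWait()) instead of returning true immediately on non-sanitizer builds.
    bool shutdown(Milliseconds timeout) {
        std::cout << "terminating sessions and draining for up to " << timeout.count() << "ms\n";
        return true;  // placeholder for the real terminate-and-wait logic
    }
};

// If a sanitizer-specific policy is still wanted, it moves to the caller (the "higher level
// of the stack" from the acceptance criteria).
#if defined(__has_feature)
#if __has_feature(address_sanitizer) || __has_feature(thread_sanitizer)
#define SHUTDOWN_UNDER_SANITIZER 1
#endif
#endif

int main() {
    ServiceEntryPoint sep;
#ifdef SHUTDOWN_UNDER_SANITIZER
    // Under ASAN/TSAN, allow a longer drain so connection lifecycles finish and leak
    // reports stay quiet.
    sep.shutdown(Milliseconds{10000});
#else
    sep.shutdown(Milliseconds{1000});
#endif
    return 0;
}

In the real code, the bulk of the change would amount to deleting the #else branch shown in the description so shutdown and shutdownAndWait no longer diverge.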
JIRAUSER1263891 commented on Tue, 1 Mar 2022 17:46:02 +0000: After investigation, this ticket was addressed by SERVER-63674 and the upcoming initiative to fix the server's shutdown. ServiceEntryPointImpl::shutdownAndWait is now held behind a server parameter flag and triggered manually in Evergreen.

JIRAUSER1263891 commented on Fri, 14 Jan 2022 16:15:37 +0000: Per discussion with blake.oler, this ticket isn't important enough to spend more time on; the fix does introduce some hangs, specifically for the Windows flavor. Here are the related error messages from before something crashed and burned:

[js_test:reconfig_fails_no_cwwc_set_sharding] d20271| 2022-01-13T02:56:55.932+00:00 I NETWORK 5742701 [eventTerminate] "Entering ServiceEntryPointImpl::shutdown(...)"
[js_test:reconfig_fails_no_cwwc_set_sharding] d20271| 2022-01-13T02:56:56.190+00:00 I STORAGE 22263 [TimestampMonitor] "Timestamp monitor is stopping","attr":{"error":{"code":11600,"codeName":"InterruptedAtShutdown","errmsg":"interrupted at shutdown"}}
[js_test:reconfig_fails_no_cwwc_set_sharding] d20271| 2022-01-13T02:56:57.190+00:00 I STORAGE 22263 [TimestampMonitor] "Timestamp monitor is stopping","attr":{"error":{"code":11600,"codeName":"InterruptedAtShutdown","errmsg":"interrupted at shutdown"}}
---
[js_test:reconfig_fails_no_cwwc_set_sharding] d20271| 2022-01-13T02:56:57.903+00:00 I NETWORK 22944 [conn3] "Connection ended","attr":{"remote":"10.122.19.223:62131","uuid":"3747f031-d0ea-4a2f-8583-ddefbdcc365a","connectionId":3,"connectionCount":2}
[js_test:reconfig_fails_no_cwwc_set_sharding] d20271| 2022-01-13T02:56:58.190+00:00 I STORAGE 22263 [TimestampMonitor] "Timestamp monitor is stopping","attr":{"error":{"code":11600,"codeName":"InterruptedAtShutdown","errmsg":"interrupted at shutdown"}}
[js_test:reconfig_fails_no_cwwc_set_sharding] d20271| 2022-01-13T02:56:59.190+00:00 I STORAGE 22263 [TimestampMonitor] "Timestamp monitor is stopping","attr":{"error":{"code":11600,"codeName":"InterruptedAtShutdown","errmsg":"interrupted at shutdown"}}
[js_test:reconfig_fails_no_cwwc_set_sharding] d20271| 2022-01-13T02:57:00.190+00:00 I STORAGE 22263 [TimestampMonitor] "Timestamp monitor is stopping","attr":{"error":{"code":11600,"codeName":"InterruptedAtShutdown","errmsg":"interrupted at shutdown"}}
[js_test:reconfig_fails_no_cwwc_set_sharding] d20271| 2022-01-13T02:57:01.191+00:00 I STORAGE 22263 [TimestampMonitor] "Timestamp monitor is stopping","attr":{"error":{"code":11600,"codeName":"InterruptedAtShutdown","errmsg":"interrupted at shutdown"}}
[js_test:reconfig_fails_no_cwwc_set_sharding] d20271| 2022-01-13T02:57:02.191+00:00 I STORAGE 22263 [TimestampMonitor] "Timestamp monitor is stopping","attr":{"error":{"code":11600,"codeName":"InterruptedAtShutdown","errmsg":"interrupted at shutdown"}}
[js_test:reconfig_fails_no_cwwc_set_sharding] d20271| 2022-01-13T02:57:03.191+00:00 I STORAGE 22263 [TimestampMonitor] "Timestamp monitor is stopping","attr":{"error":{"code":11600,"codeName":"InterruptedAtShutdown","errmsg":"interrupted at shutdown"}}
[js_test:reconfig_fails_no_cwwc_set_sharding] d20271| 2022-01-13T02:57:04.192+00:00 I STORAGE 22263 [TimestampMonitor] "Timestamp monitor is stopping","attr":{"error":{"code":11600,"codeName":"InterruptedAtShutdown","errmsg":"interrupted at shutdown"}}
[js_test:reconfig_fails_no_cwwc_set_sharding] d20271| 2022-01-13T02:57:05.193+00:00 I STORAGE 22263 [TimestampMonitor] "Timestamp monitor is stopping","attr":{"error":{"code":11600,"codeName":"InterruptedAtShutdown","errmsg":"interrupted at shutdown"}}
---
[js_test:reconfig_fails_no_cwwc_set_sharding] d20271| 2022-01-13T02:57:05.932+00:00 I NETWORK 22947 [eventTerminate] "shutdown: exhausted grace period active workers to drain; continuing with shutdown...","attr":{"workers":2}
[js_test:reconfig_fails_no_cwwc_set_sharding] d20271| 2022-01-13T02:57:05.932+00:00 I NETWORK 4907200 [eventTerminate] "Failed to shutdown ServiceExecutorSynchronous","attr":{"error":{"code":262,"codeName":"ExceededTimeLimit","errmsg":"passthrough executor couldn't shutdown all worker threads within time limit."}}
[js_test:reconfig_fails_no_cwwc_set_sharding] d20271| 2022-01-13T02:57:05.932+00:00 I NETWORK 5742702 [eventTerminate] "Total duration, and sanitizerer","attr":{"totalTurationMicros":10000000,"sanitizer":false}

My brief assessment tells me there is a lot of opportunity for improving the logs here. From reading the code, it looks like the thread calling shutdown asks every underlying session to "end" and then sits in a condition variable loop until either the connection counter reaches 0 or it times out. The issue seems to occur only on Windows, and more has to be understood about ASIO on Windows and how the code is written to end a session.

JIRAUSER1263891 commented on Thu, 13 Jan 2022 22:26:53 +0000: All failing tests related to repairShardedCollectionChunksHistory are documented in a known issue in 5.1 [1].

matthew.saltz mentioned: "It may be worth trying to measure the cost of clean shutdown."

A log line was added to measure the total cost of a clean shutdown. In tests [2] we could see shutdownAndWait taking up to 2 seconds; amirsaman.memaripour has a nice suggestion on how to improve the code [3].

[1] [BF-24011] failed: sharding_jscore_multiversion_passthrough on enterprise-rhel-80-64-bit-multiversion [mongodb-mongo-v5.1 @ f92817a4] (views_all_commands.js) - MongoDB Jira
[2] Evergreen - Version Patch 30 by daniel.morilha (mongodb.com)
[3] https://jira.mongodb.org/browse/SERVER-62402

JIRAUSER1263891 commented on Thu, 13 Jan 2022 18:34:04 +0000: amirsaman.memaripour helped me understand that repairShardedCollectionChunksHistory, found in jstests/sharding/database_versioning_all_commands.js, fails after applying the suggestion found here. The next step is to confirm whether this is really related. If so, Sam suggested reaching out to the sharding team, who might help with understanding the context. So be it!

JIRAUSER1263891 commented on Tue, 11 Jan 2022 14:36:58 +0000: After time spent properly setting up my Ubuntu box, which came all funky (thanks ryan.egesdahl and max.hirschhorn for the various ideas), I could compile, run, and analyze the reports, and they are valid! Apparently shutdownAndWait does guarantee proper destruction. Now the question is what the priority for that is, given that tests synthetically create and destroy ServiceStateMachine instances. I would imagine this isn't the case on Atlas, for instance.

JIRAUSER1263891 commented on Fri, 7 Jan 2022 15:26:20 +0000: As a side note from someone who is learning: by adopting the suggestion found in the description, I was able to see only a small number of failed tests, and one most likely related: https://logkeeper.mongodb.org/lobster/build/d1ede97d64fdfe11c11d2e6c5e880a4a/test/61d761b9c2ab681cd278f728#bookmarks=0%2C168&l=1 The next step is to rerun the test and investigate whether I can reproduce it locally.
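The 14 Jan 2022 comment above describes the shutdown thread asking every session to end and then waiting on a condition variable until the connection count reaches zero or a timeout fires. A minimal, self-contained sketch of that pattern follows; the class and method names are hypothetical stand-ins, not the actual MongoDB transport code.

#include <chrono>
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the real session bookkeeping: terminate every session, then
// block on a condition variable until the session count reaches zero or the deadline
// expires, whichever comes first.
class SessionRegistry {
public:
    void onSessionStarted() {
        std::lock_guard<std::mutex> lk(_mutex);
        ++_sessions;
    }

    void onSessionEnded() {
        std::lock_guard<std::mutex> lk(_mutex);
        --_sessions;
        _cv.notify_all();
    }

    // Returns true only if every session drained before the timeout.
    bool shutdownAndWait(std::chrono::milliseconds timeout,
                         const std::function<void()>& terminateAll) {
        terminateAll();  // ask every session to end; each calls onSessionEnded() as it goes
        std::unique_lock<std::mutex> lk(_mutex);
        return _cv.wait_for(lk, timeout, [&] { return _sessions == 0; });
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    int _sessions = 0;
};

int main() {
    SessionRegistry registry;
    registry.onSessionStarted();

    // Simulate a connection that goes away shortly after being told to terminate.
    std::thread worker([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        registry.onSessionEnded();
    });

    bool drained = registry.shutdownAndWait(std::chrono::seconds(1), [] { /* signal sessions */ });
    std::cout << (drained ? "all sessions drained" : "timed out waiting for sessions") << "\n";
    worker.join();
    return 0;
}

On a timeout this sketch merely reports failure; what the real code should do in that case is exactly what billy.donahue's comment below questions.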
JIRAUSER1263891 commented on Thu, 23 Dec 2021 16:11:55 +0000: Hmm, as far as my memory serves, file descriptors may leak if you don't "properly close" the ASIO sockets. While shutting down every associated session completely might be expensive, we can certainly investigate a way to close the local socket / file descriptor such that the remote peer perceives the connection as being aborted. I will have to dig back into the ins and outs of ASIO to confirm.

billy.donahue commented on Fri, 4 Jun 2021 09:01:24 +0000: There's another thing we discussed about this function that I just wanted to capture. The behavior of ServiceEntryPointImpl::shutdownAndWait, in the case when the Sessions don't terminate before a deadline, is pretty loose:

// ... If we haven't closed / drained all active operations within the deadline, just keep going
// with shutdown: the OS will do it for us when the process terminates.
_terminateAll(lk);
auto result = _waitForNoSessions(lk, deadline);

https://github.com/mongodb/mongo/blob/master/src/mongo/transport/service_entry_point_impl.cpp#L260-L262

This is pretty strange. If it's significant that the terminated sessions remove themselves, then I'd think we would abort if we can't achieve that on time. If we don't care about the terminated sessions removing themselves, then why wait for any of them?
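As an editorial aside, the two alternatives contrasted in the comment above can be made concrete with a small, self-contained sketch; the function-object parameters are hypothetical stand-ins for _terminateAll and _waitForNoSessions, not the real MongoDB code.

#include <cassert>
#include <chrono>
#include <functional>

// Option A: if it matters that terminated sessions remove themselves, treat a missed
// deadline as a fatal error instead of quietly continuing with shutdown.
bool shutdownStrict(const std::function<void()>& terminateAll,
                    const std::function<bool(std::chrono::milliseconds)>& waitForNoSessions,
                    std::chrono::milliseconds deadline) {
    terminateAll();
    const bool drained = waitForNoSessions(deadline);
    assert(drained && "sessions failed to drain before the shutdown deadline");
    return drained;
}

// Option B: if it does not matter, skip the wait entirely and let the OS reclaim sockets
// and threads when the process exits.
void shutdownFast(const std::function<void()>& terminateAll) {
    terminateAll();
}

Either shape would remove the ambiguity of the current behavior, which waits, times out, and then continues anyway.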