
OPERATIONAL DEFECT DATABASE
...

...
Create a collection with a single 1 kB document; start 1000 threads doing a findOne() on that collection. Observed query rate ~55 k/s in 3.4.3, drops to ~35 k/s in 3.4.4.
xgen-internal-githook commented on Tue, 22 Aug 2017 16:49:54 +0000: Author: {'username': 'spencerjackson', 'email': 'spencer.jackson@mongodb.com', 'name': 'Spencer Jackson'} Message: SERVER-30643: Ensure thread IDs observed by OpenSSL are uniformly distributed (cherry picked from commit d40503d57d9fd27d1ad59e4d74aa827191d77564) Branch: v3.4 https://github.com/mongodb/mongo/commit/9747381e76070df44d824b037e67a0bfc3472502 xgen-internal-githook commented on Tue, 22 Aug 2017 16:37:15 +0000: Author: {'username': 'spencerjackson', 'email': 'spencer.jackson@mongodb.com', 'name': 'Spencer Jackson'} Message: SERVER-30643: Ensure thread IDs observed by OpenSSL are uniformly distributed Branch: master https://github.com/mongodb/mongo/commit/d40503d57d9fd27d1ad59e4d74aa827191d77564 bruce.lucas@10gen.com commented on Wed, 16 Aug 2017 14:58:27 +0000: The following change, partially reverting SERVER-28530, fixes the performance regression; tbd why. diff --git a/src/mongo/util/net/ssl_manager.cpp b/src/mongo/util/net/ssl_manager.cpp index da8af7e..b26d958 100644 --- a/src/mongo/util/net/ssl_manager.cpp +++ b/src/mongo/util/net/ssl_manager.cpp @@ -139,6 +139,10 @@ IMPLEMENT_ASN1_ENCODE_FUNCTIONS_const_fname(ASN1_SEQUENCE_ANY, ASN1_SET_ANY, ASN * terminated, it must call ERR_remove_state or ERR_remove_thread_state. These functions may * themselves invoke the application provided callback. */ + +AtomicUInt32 _next; +MONGO_TRIVIALLY_CONSTRUCTIBLE_THREAD_LOCAL int _my_id = _next.fetchAndAdd(1); + class SSLThreadInfo { public: static unsigned long getID() { @@ -147,10 +151,13 @@ public: #ifdef _WIN32 return GetCurrentThreadId(); #else + /* static_assert(sizeof(void*) == sizeof(unsigned long), "OpenSSL needs the address of a thread-unique object to be castable to" "unsigned long"); return reinterpret_cast(&errno); + */ + return _my_id; #endif } spencer.jackson@10gen.com commented on Wed, 16 Aug 2017 00:02:22 +0000: SSL_read is supposed to return SSL_ERROR_WANT_READ on basically every read. When we see it, we have to recv a socket, and dump the contents into our network BIO. When we re-call SSL_read, it pulls that data out of the other half of that BIO. It may additionally return SSL_ERROR_WANT_WRITE, to force us to shuttle back TLS overhead to the client. Manually stepping across all calls to ERR_get_state during a single query for such a document as described in the description doesn't reveal any behavioral differences between 3.4.3 and 3.4.4. bruce.lucas@10gen.com commented on Mon, 14 Aug 2017 17:52:34 +0000: In 3.4.3 most of the threads are waiting here for the next request from a client: #0 0x00007f4b595a901f in __libc_recv (fd=21, buf=0x7f4b4fc89cd0, n=5, flags=16384) at ../sysdeps/unix/sysv/linux/x86_64/recv.c:33 #1 0x00005563dc482874 in mongo::(anonymous namespace)::SSLManager::_flushNetworkBIO(mongo::SSLConnection*) [clone .isra.158] () #2 0x00005563dc482abf in mongo::(anonymous namespace)::SSLManager::_doneWithSSLOp(mongo::SSLConnection*, int) [clone .isra.159] () #3 0x00005563dc482e63 in mongo::(anonymous namespace)::SSLManager::SSL_read(mongo::SSLConnection*, void*, int) () #4 0x00005563dc47cec7 in mongo::Socket::recv(char*, int) () Whereas in 3.4.4 most are waiting here: #0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 #1 0x00007f60bada3d82 in __GI___pthread_mutex_lock (mutex=0x556803dc9810) at ../nptl/pthread_mutex_lock.c:115 #2 0x0000556800ff5809 in mongo::(anonymous namespace)::SSLThreadInfo::lockingCallback(int, int, char const*, int) () #3 0x00007f60bb9458e7 in CRYPTO_add_lock () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #4 0x00007f60bb9ff462 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #5 0x00007f60bb9ffa0c in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #6 0x00007f60bba003b4 in ERR_get_state () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #7 0x00007f60bba006b6 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 #8 0x00007f60bbd66170 in SSL_get_error () from /lib/x86_64-linux-gnu/libssl.so.1.0.0 #9 0x0000556800ff6104 in mongo::(anonymous namespace)::SSLManager::_doneWithSSLOp(mongo::SSLConnection*, int) [clone .isra.155] () #10 0x0000556800ff64c3 in mongo::(anonymous namespace)::SSLManager::SSL_read(mongo::SSLConnection*, void*, int) () #11 0x0000556800ff0937 in mongo::Socket::recv(char*, int) () Since there are no actual SSL errors logged, does this mean that for some reason in 3.4.4 SSL_read is returning SSL_ERROR_WANT_READ frequently resulting in more calls to SSL_get_error, creating a bottleneck?
MongoDB Integration
Learn more about where this data comes from
Bug Scrub Advisor
Streamline upgrades with automated vendor bug scrubs
BugZero Enterprise
Wish you caught this bug sooner? Get proactive today.