OS: CentOS Stream release 9 (Linux xxx-test-db-2.azr.etn 5.14.0-472.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jun 27 20:15:53 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux)
Mongo version: 7.0.12 (the same behaviour occurs with the build from your yum repository, mongodb-org-server-7.0.12-1.el9.x86_64, and with our custom build with debug info)
HW: Azure VM Standard F16s v2 (16 vcpus, 32 GiB memory)
Clients:
Mongodb-exporter-0.39.0-0.el9.x86_64
mongo-java-client-4.6.1

Current configuration:

systemLog:
  destination: file
  logAppend: true
  logRotate: reopen
  path: /var/log/mongodb/mongod.log
storage:
  dbPath: /var/lib/mongo
  engine: wiredTiger
  directoryPerDB: false
processManagement:
  fork: false
  pidFilePath: /var/run/mongodb/mongod.pid
  timeZoneInfo: /usr/share/zoneinfo
# network interfaces
net:
  bindIp: 0.0.0.0  # Listen to local interface only, comment to listen on all interfaces.
  tls:
    mode: requireTLS
    certificateKeyFile: /etc/ssl/xxx-test-db-2.azr.etn.pem
    CAFile: /etc/ssl/xxx-test-db-2.azr.etn.CA.pem
    allowConnectionsWithoutCertificates: true
    allowInvalidCertificates: true
    logVersions: TLS1_0,TLS1_1,TLS1_2,TLS1_3
  ipv6: false
  maxIncomingConnections: 500
  port: 27017
replication:
  oplogSizeMB: 1024
  replSetName: repl-xxx-test
security:
  authorization: enabled
  keyFile: /etc/mongo.key

We have identified a bug in reading a TLS stream. Mongo tries to read from a malfunctioning TLS stream (error SSL_ERROR_SYSCALL), and the connection thread then enters an infinite loop. The bug happens whether Mongo runs in a replica set (it can happen on both primary and secondary nodes) or as a single instance.

The system shows increased load, but there is no significant IO activity. The load is generated by connection threads. Those threads do not appear in the db.currentOp() output. The stack trace (pmp_strace.log) shows that these threads are mostly in the SSL handling parts of the code (i.e. functions like ERR_clear_error and SSL_read) or in the memory free function (tc_free, called from ERR_clear_error).
Our investigation starts in the function engine::perform. We identified that the SSL error returns status 0x5 (SSL_ERROR_SYSCALL; as described at https://www.openssl.org/docs/man3.0/man3/SSL_get_error.html, it means "Some non-recoverable, fatal I/O error occurred."). The asio::error_code we get varies between connection threads. Examples:

After matching the condition ssl_error == SSL_ERROR_SYSCALL, the function returns 0 (want_nothing).

The next interesting part is in the asio::detail::read_buffer_sequence function. Mongo detects that the buffer is not empty and then calls the read_some function, which brings us back to engine::perform again. The buffer contains the following data:
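To make the control flow in the analysis above concrete, here is a toy Python model of it. It is not MongoDB, asio, or OpenSSL code. The names toy_perform and toy_read, the max_spins cap, and the treat_zero_progress_as_eof guard are all hypothetical. The sketch also assumes that the error code reaching the read loop evaluates as "no error" (0) in the spinning threads; the report does not state this, it only says the error code varies between threads.

```python
SSL_ERROR_SYSCALL = 5  # status value quoted in the report (0x5)
WANT_NOTHING = 0       # the report says perform returns 0 (want_nothing)


def toy_perform(ssl_error, sys_error):
    """Toy stand-in for engine::perform; returns (want, error_code, bytes_read)."""
    if ssl_error == SSL_ERROR_SYSCALL:
        # Assumption: the error code is taken from sys_error, so a value of 0
        # looks like success to the caller even though nothing was read.
        return WANT_NOTHING, sys_error, 0
    raise ValueError("only the SSL_ERROR_SYSCALL path is modelled")


def toy_read(needed, sys_error, max_spins=1000, treat_zero_progress_as_eof=False):
    """Toy stand-in for the read loop; returns (bytes_read, spins, gave_up)."""
    total = 0
    spins = 0
    while total < needed:  # the caller still wants more bytes
        if spins == max_spins:
            return total, spins, True  # safety cap so this sketch terminates
        spins += 1
        _want, error_code, n = toy_perform(SSL_ERROR_SYSCALL, sys_error)
        if error_code != 0:
            break  # a real error code ends the loop
        if treat_zero_progress_as_eof and n == 0:
            break  # hypothetical guard: no error and no progress means stop
        total += n
    return total, spins, False


if __name__ == "__main__":
    print(toy_read(16, sys_error=0))    # spins until the safety cap
    print(toy_read(16, sys_error=104))  # a non-zero error code stops at once
    print(toy_read(16, sys_error=0, treat_zero_progress_as_eof=True))
```

Under these assumptions, a read that reports neither an error nor any progress keeps the loop busy without blocking, which would match the observed CPU load with no IO activity. The guard is only an illustration of one way such a loop could terminate, not the fix that was shipped.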
amirsaman.memaripour commented on Wed, 2 Jul 2025 17:45:31 +0000:
The issue is now fixed and backported to all major releases of the database: 8.1.0-rc0, 8.0.5, 7.0.17, and 6.0.21. Marking as "Gone away" since this is no longer an issue.

amirsaman.memaripour commented on Thu, 16 Jan 2025 22:08:39 +0000:
Hey jean_nsilva@hotmail.com, thanks for the details! About the fix - we've actually got it working in our master branch and have backported it to 8.0 (it'll be in the upcoming 8.0.5 release). We're still working on bringing it back to 7.0 - that should take a few more weeks. In the meantime, I recommend sticking with the OpenSSL downgrade as a workaround since it addresses the CPU spikes. Once we have the fix backported to 7.0, we will update this ticket to let you know, so you can upgrade OpenSSL again. Let me know if you have any other questions!

jean_nsilva@hotmail.com commented on Thu, 16 Jan 2025 00:00:36 +0000:
Hi! Just adding to the discussion, the same issue was observed on:
Red Hat Enterprise Linux release 9.5 (Plow) - 5.14.0-503.14.1.el9_5.x86_64
MongoDB 7.0.12 - Sharded Cluster
In this scenario, the CPU usage spike occurred on the hosts where mongoSes were running. Downgrading the OpenSSL package fixed the problem. That said, can we expect a future fix from the MongoDB side (and if so, is there any ETA)? Or should downgrading the OpenSSL package be considered the permanent solution? Thank you!

petr.medonos@etnetera.cz commented on Mon, 29 Jul 2024 05:30:37 +0000:
Thank you for pointing out the difference. I still live in a world where RHEL and CentOS are compatible, but they are not. I have downgraded openssl to 3.0.7 and this bug has not appeared. So this problem is related to openssl version 3.2 (3.2.2). I am currently unable to test the same behavior on ARM, but there is the same version of openssl (of course, a lot of the behavior may be architecture specific).

JIRAUSER1265262 commented on Fri, 26 Jul 2024 15:28:02 +0000:
Looking at our compatibility matrix, it doesn't look like we support CentOS 9 on x86_64 (though we support it on arm64). I wonder if this is related to unsupported SSL library versions that come with the OS. Would you be able to share some information about this? I would expect most of these libraries to be similar to what is on RHEL 9, so this is surprising. It's unclear to me if this issue could reproduce on a supported OS; it sounds like "maybe, if an unsupported library is installed" but I am unsure.
To set expectations: if CentOS 9 on x86_64 is not explicitly supported, we would be looking at this problem based on its potential impact to supported configurations (or this means we need to update the compatibility page).
MongoDB version 7.0 is built and tested against RHEL 7.9. Earlier versions of MongoDB are tested against RHEL 7 and assume forward compatibility.

petr.medonos@etnetera.cz commented on Fri, 26 Jul 2024 10:32:51 +0000:
chris.kelly@mongodb.com The same issue was triggered on Mongo 7.0.11.

petr.medonos@etnetera.cz commented on Thu, 25 Jul 2024 09:02:36 +0000:
chris.kelly@mongodb.com The issue occurred on a project where we are upgrading from CentOS 7 to CentOS 9 in a test environment. db2 is running on a newly installed CentOS 9 with Mongo 7.0.12 with no previous upgrades. We configured a 2-node replica set just to have a place to debug the issue without impacting the running applications. The problem occurred on a single-node instance (CentOS 9 with Mongo 7.0.12) as well. We will downgrade Mongo to version 7.0.11 on CentOS 9 and will keep you posted on the result.

JIRAUSER1265262 commented on Wed, 24 Jul 2024 22:45:11 +0000:
Thank you for providing this data! I think we have enough to assign this out and investigate further. Detailed reports like this help make MongoDB better for everybody, and we really appreciate it. I took an initial look.
Some observations:
This looks like a 2 node replica set without an arbiter.
db1 is running 7.0.11. It does not see this issue.
db2 is running 7.0.12. It is having these issues.
db2 immediately starts using max CPU as soon as it is started at B (2024-07-23T07:00:01.085Z).

Data
Interestingly, we don't see 16 or more threads waiting around at startup, but we clearly see max user CPU used by the mongod. It also appears there was no primary between points B (2024-07-23T07:00:01.085Z) and C (2024-07-23T07:00:10.135Z).

Calltrees
Browser-formatted: stack.html
Text: stack.txt

Question for Reporter
Did you observe these TLS issues immediately after upgrading from 7.0.11 to 7.0.12? Does it resolve if you temporarily return to 7.0.11?

petr.medonos@etnetera.cz commented on Wed, 24 Jul 2024 04:27:50 +0000:
chris.kelly@mongodb.com thank you for the quick response. I uploaded the requested file through the upload portal. The problematic server is db2; the stack trace is included in its mongo log.
Petr

JIRAUSER1265262 commented on Tue, 23 Jul 2024 22:14:31 +0000:
Thanks for your detailed investigation and report, petr.medonos@etnetera.cz. Can you please collect a stack trace with SIGUSR2 at the time the incident occurs?
kill -s USR2 $pid
After doing that, can we please get logs and diagnostic data to review? This will help us further investigate what the server is doing during this time.
I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.
For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:
the mongod logs
the $dbpath/diagnostic.data directory (the contents are described here)
Chris
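
For readers who want to check a deployment against the fixed releases named in the 2 Jul 2025 comment (8.1.0-rc0, 8.0.5, 7.0.17, 6.0.21), the comparison can be sketched as below. The helper name has_tls_spin_fix is hypothetical, and two things are assumptions rather than statements from the ticket: that every release after 8.1.0-rc0 carries the fix, and that release lines not named in the comment should be reported as unknown (None).

```python
# Minimum patch level per release line, taken from the 2 Jul 2025 comment.
FIXED_PATCH = {(6, 0): 21, (7, 0): 17, (8, 0): 5}


def has_tls_spin_fix(version):
    """Return True/False for release lines named in the ticket, None otherwise."""
    # Drop a suffix such as "-rc0" before splitting into numeric parts.
    major, minor, patch = (int(part) for part in version.split("-")[0].split("."))
    if (major, minor) >= (8, 1):
        return True  # assumption: 8.1.0-rc0 and everything later has the fix
    if (major, minor) in FIXED_PATCH:
        return patch >= FIXED_PATCH[(major, minor)]
    return None  # release line not mentioned in the ticket


if __name__ == "__main__":
    for v in ("7.0.12", "7.0.17", "8.0.5", "8.1.0-rc0", "5.0.3"):
        print(v, has_tls_spin_fix(v))
```

On versions where this returns False, the workaround reported in the thread was to downgrade OpenSSL (3.0.7 did not show the problem, 3.2.2 did).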