...
We migrated our MongoDB server from AWS r5d.16xlarge to r6gd.16xlarge (Intel Xeon Platinum 8000 -> AWS Graviton 2) and saw significant performance degradation. Performance degrades when many queries of any kind run in parallel (we noticed it on aggregation, distinct, and filter queries, so don't assume the issue is tied to a particular kind of query). The screenshot below shows metrics for two identical databases in one replica set: the green one is ARM (r6gd.16xlarge), the yellow one is x86_64 (r5d.16xlarge), and the vertical red bar marks the time when we switched primaries (from ARM to x86_64). Our clients read only from primaries. As you can see, the same kind of load produces 100% load on the ARM server but ~10% load on the x86_64 server. Load is constant within a day.
JIRAUSER1259576 commented on Mon, 17 May 2021 19:35:28 +0000: Hey, Dmitry! Looks like we got GCC 8.5 released - https://gcc.gnu.org/releases.html Can we expect a fix for our issue soon?

JIRAUSER1259576 commented on Fri, 30 Apr 2021 15:10:49 +0000: Hey! We prefer to proceed with 2nd option - (Waiting for the release of SERVER-56347 ) Thanks for help!

dmitry.agranat commented on Wed, 28 Apr 2021 09:06:19 +0000: Hi ivan.takarlikov@sensortower.com, I have a few follow-up options in regards to the next steps. In order to progress this investigation, these are the options that should bring us closer to the reported issue resolution:
1. Providing you with a custom 4.4.5 build which targets armv8.1-a or armv8.2-a, where the LSE intrinsics are built-in. If this is a valid option for you, could you provide the exact OS version and MongoDB edition you would need? I saw that you were using Amazon Linux release 2 (Karoo) but I am not sure if this is an enterprise or community edition.
2. Waiting for the release of SERVER-56347 (Currently, I am unable to provide the ETA for its release).
3. Providing us with a simple reproducer that shows the reported difference between the ARM and x86_64 architecture.
Please let us know which option works best for you. Dima

JIRAUSER1259576 commented on Tue, 27 Apr 2021 13:48:12 +0000: Thanks, Dmitry! Should we wait for the new mongo minor release (4.4.6 for example) after the release of GCC 8.5?

dmitry.agranat commented on Tue, 27 Apr 2021 13:40:48 +0000: Thanks ivan.takarlikov@sensortower.com for providing the requested information. It turns out the reason you have experienced such an issue on the ARM instance is because of SERVER-56347. And we are currently waiting for GCC 8.5 release. You can start watching SERVER-56347 for updates and please do let us know if you have any questions. Regards, Dima

JIRAUSER1259576 commented on Mon, 26 Apr 2021 18:08:37 +0000: Attached files with logs/diagnostic.data and perf data.
Perf data was collected during the period 2021-04-26T14:24:26Z - 2021-04-26T14:31:33Z (this is also reflected in the file names):
[ec2-user@itunes-sales-reports-arm-db-master perf_data]$ while true; do sudo perf record -a -g -F 99 -o perf.data.$(date -u +%FT%TZ) sleep 60; done
[ perf record: Woken up 265 times to write data ]
[ perf record: Captured and wrote 74.550 MB perf.data.2021-04-26T14:24:26Z (265712 samples) ]
[ perf record: Woken up 380 times to write data ]
[ perf record: Captured and wrote 106.594 MB perf.data.2021-04-26T14:25:26Z (378341 samples) ]
[ perf record: Woken up 324 times to write data ]
[ perf record: Captured and wrote 88.790 MB perf.data.2021-04-26T14:26:27Z (315432 samples) ]
[ perf record: Woken up 379 times to write data ]
[ perf record: Captured and wrote 106.679 MB perf.data.2021-04-26T14:27:28Z (378378 samples) ]
[ perf record: Woken up 373 times to write data ]
[ perf record: Captured and wrote 103.880 MB perf.data.2021-04-26T14:28:29Z (368561 samples) ]
[ perf record: Woken up 374 times to write data ]
[ perf record: Captured and wrote 106.713 MB perf.data.2021-04-26T14:29:30Z (378749 samples) ]
[ perf record: Woken up 376 times to write data ]
[ perf record: Captured and wrote 106.685 MB perf.data.2021-04-26T14:30:32Z (378772 samples) ]
[ perf record: Woken up 124 times to write data ]
[ perf record: Captured and wrote 47.446 MB perf.data.2021-04-26T14:31:33Z (167866 samples) ]
About `Slow query` - yes, I removed it because it contains some sensitive info. If that info is important to you, I can provide an example of those logs with the sensitive data replaced by random values and the log structure preserved. BTW, thanks for the investigation!

dmitry.agranat commented on Mon, 26 Apr 2021 09:14:16 +0000: Thanks ivan.takarlikov@sensortower.com, after inspecting the provided data, I see what you mean.
I have a couple of clarification points at this stage:
1. Is it possible to collect perf during a similar event when running on the Graviton 2 instance?
2. The provided logs do not contain any "Slow query" information. Is that because all the data was redacted?

How to record perf call stack samples and generate text output:
# capture in separate files of 60 seconds each
while true; do perf record -a -g -F 99 -o perf.data.$(date -u +%FT%TZ) sleep 60; done
# then run perf script as above on the subset of files of interest
for fn in ...; do perf script -i $fn >$fn.txt; done

After the perf data is collected, we will need the exact timestamp when the perf data was collected, a fresh set of diagnostic.data and mongod logs covering the time of the event. Note that it is important to run perf script on the same node where perf.data was generated so that it can be correctly symbolized using the addresses on that machine. Also, the perf utility, which is a part of the linux-tools package, is not installed by default.

JIRAUSER1259576 commented on Fri, 23 Apr 2021 19:48:07 +0000: There are two log files and data directories, from the x86 and ARM servers. The ARM server was primary and experienced problems from Fri Apr 23 09:30:00 UTC 2021, so we switched the primary back to x86 at Fri Apr 23 15:33:10 UTC 2021 (see screenshot for details). After the switch, the request rate and the nature of the queries to mongo stayed the same, but load decreased significantly on the x86 server. MongoDB versions are the same on both instances - 4.4.5

dmitry.agranat commented on Wed, 21 Apr 2021 20:25:39 +0000: Hi ivan.takarlikov@sensortower.com, Would you please archive (tar or zip) the mongod.log files covering the incident and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location? Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.
Please mention the exact timestamp (start/end) and the timezone of the event you'd like us to investigate. So that we can compare and comment on the reported degradation, please upload the requested data separately for:
1. AWS r5d.16xlarge (Intel Xeon Platinum 8000)
2. AWS r6gd.16xlarge (AWS Graviton 2)
One clarifying question at this time: was the MongoDB version the same on these two instances? Dima

JIRAUSER1259576 commented on Wed, 21 Apr 2021 16:43:24 +0000: The screenshot from the description is in the attachments.
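
The comments above mention builds that target armv8.1-a or armv8.2-a, where LSE is built in. A quick way to confirm that a given instance's CPU advertises LSE atomics is sketched below. It assumes a Linux aarch64 kernel that reports the capability as `atomics` on the `Features` line of /proc/cpuinfo; `has_lse` is a hypothetical helper name, not part of any MongoDB tooling.

```shell
# Sketch: report whether a cpuinfo-style file advertises LSE atomics.
# Assumption: Linux on aarch64 lists the capability as "atomics" on the
# "Features" line. x86_64 has no such line, so the check fails there.
has_lse() {
  cpuinfo_file="${1:-/proc/cpuinfo}"
  # Take the first Features line and look for "atomics" as a whole word.
  grep -m1 '^Features' "$cpuinfo_file" 2>/dev/null | grep -qw 'atomics'
}

if has_lse; then
  echo "LSE atomics: advertised"
else
  echo "LSE atomics: not advertised"
fi
```

This only tells you what the CPU supports. Whether a particular mongod binary actually uses those instructions depends on how it was compiled.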
1. Deploy two MongoDB databases, one on an ARM AWS server and the other on an x86_64 server.
2. Make the ARM server primary.
3. Put a heavy concurrent load on it and notice 100% load.
4. Switch the primary to x86_64 and notice the reduced load.
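
The concurrent-load step above could be driven by something like the following sketch. `run_parallel` is a hypothetical helper, not something from this ticket: it starts N copies of whatever command you give it, waits for all of them, and prints how many failed. The query command itself is left to the reader.

```shell
# Hypothetical helper: run N copies of a command concurrently and wait.
# Usage: run_parallel <workers> <command> [args...]
# Prints a one-line summary: "workers=N failed=F".
run_parallel() {
  n="$1"; shift
  pids=""
  i=0
  while [ "$i" -lt "$n" ]; do
    # Worker output is discarded so only the summary reaches stdout.
    "$@" >/dev/null 2>&1 &
    pids="$pids $!"
    i=$((i + 1))
  done
  failed=0
  for pid in $pids; do
    # wait returns the worker's exit status; count non-zero exits.
    wait "$pid" || failed=$((failed + 1))
  done
  echo "workers=$n failed=$failed"
}
```

For example, `run_parallel 64 mongo --quiet --eval 'db.coll.distinct("field")'` would start 64 shell clients at once; the collection and field names here are placeholders. Running the same invocation against each primary in turn keeps the load comparable between the ARM and x86_64 servers.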