BugZero | MongoDB BugID 1953500 - SBE Multiplanning can be slow when suboptimal plan...

MongoDB - Defect ID: 1953500

SBE Multiplanning can be slow when suboptimal plan runs first

Last updated on January 29th, 2025

BugZero Risk Score
8.1 High

Overall: 8.1

Severity: 8.2

Community: 8.5

Lifecycle: 9.1

Bug Details

Priority: Critical - P2
Status: Closed
Views: 86

Description

Info

Issue Status as of Apr 18, 2024

ISSUE DESCRIPTION AND IMPACT

Queries that use the Slot Based Execution (SBE) engine can experience long query optimization times (this ticket) and/or have non-optimal plans selected (SERVER-83196).

This happens because SBE is designed for fast execution of the winning plan, instead of efficient round-robin execution of candidate plans during planning. Ultimately, SBE planning time can be proportional to the longest plan, instead of the shortest plan. Because of this inefficiency, the SBE planner also has less time available to gather information about the candidate plans, which can lead to a worse decision when the planning period ends. More technical details can be found in this README.

This issue has been fixed in MongoDB 8.0.0, which avoids using the SBE planner for multiplanning: instead the server always uses the Classic Engine for multiplanning, even when SBE is used to execute the winning plan.

DIAGNOSIS AND AFFECTED VERSIONS

The issue is present in MongoDB 6.0, 7.0 and 7.3. It is fixed in MongoDB 8.0.0.

A query affected by this bug will use SBE and will spend a lot of time planning. Both symptoms are visible in "Slow query" log lines:

- "queryFramework":"sbe" means the query is using SBE.
- "planningTimeMicros" shows how much time was spent planning the query.

When looking at an explain plan, the presence of a "slotBasedPlan" field means the query uses SBE.

WORKAROUNDS

As an immediate workaround on MongoDB versions affected by the bug, users can disable SBE by setting the internalQueryFrameworkControl parameter to "forceClassicEngine". Since SBE often outperforms the Classic Engine, this option may affect the performance of queries which formerly executed in SBE.

Another workaround is to hint the affected queries, using the hint() method or an index filter.
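
The diagnosis and the first workaround above can be sketched in a few lines of Python. This is a minimal sketch, not an official tool: it assumes the server writes structured JSON log lines in which "queryFramework" and "planningTimeMicros" sit under an "attr" object, and the function names, the 100 ms threshold, and the sample log entries are all hypothetical.

```python
import json


def find_slow_sbe_planning(log_lines, min_planning_micros=100_000):
    """Return (planningTimeMicros, namespace) pairs for "Slow query" entries
    that ran in SBE and spent at least min_planning_micros planning.

    The threshold is an arbitrary assumption; tune it for your workload.
    """
    hits = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured JSON
        if not isinstance(entry, dict) or entry.get("msg") != "Slow query":
            continue
        attr = entry.get("attr", {})
        if attr.get("queryFramework") != "sbe":
            continue
        planning = attr.get("planningTimeMicros", 0)
        if planning >= min_planning_micros:
            hits.append((planning, attr.get("ns")))
    # Longest planning time first.
    return sorted(hits, key=lambda hit: hit[0], reverse=True)


def force_classic_engine_command():
    """Build the setParameter command document for the first workaround.

    The command name must be the first key; Python dicts keep insertion order.
    """
    return {
        "setParameter": 1,
        "internalQueryFrameworkControl": "forceClassicEngine",
    }


# Hypothetical log entries, trimmed to the fields inspected above.
sample_log = [
    json.dumps({"msg": "Slow query", "attr": {
        "ns": "app.orders", "queryFramework": "sbe",
        "planningTimeMicros": 2_500_000}}),
    json.dumps({"msg": "Slow query", "attr": {
        "ns": "app.users", "queryFramework": "classic",
        "planningTimeMicros": 900_000}}),
    "not a JSON line",
]
suspects = find_slow_sbe_planning(sample_log)
```

The command document would be sent to the admin database through whichever driver or shell is in use, for example a driver's generic "run command" helper. Only the SBE entry in the sample is reported, because the second entry did not run in SBE.
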
An index filter allows you to hint a specific query without changing the application, but it only exists for the duration of the server process and does not persist after shutdown. Note that we plan to deprecate index filters starting in version 8.0, in favor of Persistent Query Settings (SERVER-17625).

Original description:

Currently, the strategy used in SBE multiplanning is as follows:

1. We run non-blocking plans before blocking ones.
2. We run each plan's trial period to completion (i.e. until we return 101 documents or we use up the plan's budget).
3. We use the number of reads performed by said plan to bound the number of reads used by any remaining plans.

The problem with this approach is that if the first plan we run is not the optimal one, we are stuck running it and we can potentially use all of the reads.

As an example, consider two plans, A and B. Plan A needs to perform 10k storage engine reads to get 101 documents, while plan B needs to perform 101 reads to get 101 documents. If Plan B runs first, we have no problems: we will set the reads limit for plan A to 101, and it will stop running after 101 reads. If Plan A runs first, however, we will be stuck running plan A for all 10k reads. Though we'll eventually run plan B and it will be chosen, this negatively impacts the performance of queries which need to use the multiplanner.
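
The effect of plan ordering in the example above can be checked with a toy model. This is a simplified sketch and not the server's implementation: each plan is reduced to the number of reads it needs to return 101 documents, and the 10,000-read starting budget and the function name are assumptions made for illustration.

```python
def sequential_trial_reads(plan_costs, budget=10_000):
    """Model the trial periods of plans run one after another.

    plan_costs[i] is the number of reads plan i needs to return 101
    documents. Each plan stops when it is done or when it hits the current
    limit, and the reads it performed become the limit for later plans.
    Returns the reads performed by each plan, in run order.
    """
    limit = budget
    reads_per_plan = []
    for cost in plan_costs:
        used = min(cost, limit)
        reads_per_plan.append(used)
        limit = min(limit, used)  # later plans may not exceed this
    return reads_per_plan


PLAN_A = 10_000  # reads plan A needs for 101 documents
PLAN_B = 101     # reads plan B needs for 101 documents

a_first = sequential_trial_reads([PLAN_A, PLAN_B])
b_first = sequential_trial_reads([PLAN_B, PLAN_A])
print("A first:", a_first, "total", sum(a_first))
print("B first:", b_first, "total", sum(b_first))
```

Under this model, running plan A first costs 10,101 reads in total, while running plan B first costs 202, because plan A is cut off at the 101-read limit set by plan B.
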

Top User Comments

JIRAUSER1254095 commented on Fri, 1 Dec 2023 16:26:06 +0000:
David, Ivan thank you both it means a lot to me as a customer that MongoDB is tackling these issues with high priority.

david.storch commented on Fri, 1 Dec 2023 00:01:16 +0000:
To add onto what Ivan said, I'm going to move this ticket back to the "Open" state – we are no longer "Investigating" this issue, but rather are executing on an engineering project to fix the problem. The solution will be delivered against a sequence of related Jira tickets rather than developing directly against this ticket. However, we can provide high-level progress updates here.

JIRAUSER1270969 commented on Thu, 30 Nov 2023 11:42:21 +0000:
Yes, I am aware. We are working on a solution. However, it requires redesign of the whole multi planning process with SBE and will take some time to develop, test and release. To improve customer experience in the meantime, we are planning a change in default configuration via SERVER-83470. As well we are going to improve our testing process not to miss this again with the next SBE release.

JIRAUSER1254095 commented on Thu, 30 Nov 2023 11:25:31 +0000:
ivan.fefer@mongodb.com are you also aware of this issue? This is another one we are seeing related to SBE.

JIRAUSER1254095 commented on Wed, 8 Nov 2023 19:14:03 +0000:
Following from my report in SERVER-82549, I'd like to underscore how app-breaking and frustrating this issue is in MongoDB 7. (We didn't see any effects in Mongo 5 or 6.) I anticipate many CRUD apps on MongoDB will be affected by this. When we initially upgraded to MongoDB 7 ~3 weeks ago we saw it breaking our app in a number of critical places. Please give this issue the attention it deserves.

JIRAUSER1257467 commented on Tue, 12 Apr 2022 15:08:46 +0000:
Waiting on other tickets first

david.storch commented on Thu, 7 Apr 2022 20:41:47 +0000:
Returning this to the triage queue. At the moment our efforts related to this problem fall under SERVER-63642 and SERVER-63641, so there is no action for me currently planned against this umbrella ticket.

david.storch commented on Mon, 14 Feb 2022 23:19:22 +0000:
Another quick update. We have filed two additional offshoot tickets:

- SERVER-63642 "Add serverStatus metrics to measure multi-planning performance". The work for this ticket would add telemetry to help us to understand the performance of the SBE multi-planner across the Atlas fleet.
- SERVER-63641 "Improve SBE multi-planning by choosing which plan to work next based on a priority metric". This ticket tracks the improvement to the SBE multi-planning algorithm proposed by mihai.andrei which I have already summarized above. The ticket description contains a more detailed writeup of the proposed change. This work could help to improve the performance of SBE multi-planning beyond what was already achieved in SERVER-62981.

Folks interested in this ticket may wish to watch these two new related ones. This ticket will continue to serve as the umbrella. There is no specific engineering work planned against the umbrella ticket at this time, but SERVER-63642 is scheduled and SERVER-63641 will be triaged by the Query Execution team.

david.storch commented on Fri, 28 Jan 2022 18:13:00 +0000:
Related ticket SERVER-62981 has now been completed for versions 5.3.0 and 5.2.1 which we anticipate will help a lot with the problem described by this ticket.

david.storch commented on Tue, 25 Jan 2022 23:58:57 +0000:
The Query Team has been internally brainstorming several potential solutions to this problem. We have generated a handful of ideas of various implementation complexity which I will describe below, mostly for the benefit of query engineering. However, we think there is one simple change that we should implement immediately, which we expect should go a long way towards mitigating the problem described here: SERVER-62981. I suggest that folks interested in this ticket also watch SERVER-62981.

Once SERVER-62981 is complete, we could consider pursuing one of the following additional changes in the future:

- mihai.andrei's idea: during SBE multi-planning, always call getNext() on whichever plan currently seems the most promising. This shouldn't be too hard to implement, but it still suffers from the problem where a single call to getNext() for an unselective plan could expend the entire reads budget.
- christopher.harris points out that we could use the classic multi-planner for plan selection, but then hand the winning plan off to the SBE engine for execution. In this scheme, we would continue to use SBE when recovering plans from the plan cache. One downside of this approach is that any partial results computed during the trial period would have to be completely thrown out, and execution started from the beginning in SBE. It would also be quite complex to implement. However, it would mean that enabling SBE would not change the behavior of the system with regards to plan selection. It would also give us an opportunity to restore some of the more useful aspects of the explain format at "allPlansExecution" verbosity.
- I propose a two-phase approach in the SBE multi-planner. The first round would run the trial plan for each candidate much like the SBE multi-planner's current process, but it would use a much smaller reads budget. For instance, this reads budget could be on the order of 100 or 200. If any plan hits EOF or produces its first batch of results, then a winner is chosen according to the current ranking formula. Otherwise, we move onto the second round using the much larger reads budget of 10,000. The idea is to make sure that multi-planning terminates as quickly as possible without exploring bad candidate plans when one of the available candidates is very cheap.
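
The two-phase proposal in the last comment can be illustrated with a small simulation. This is a hedged sketch under stated assumptions, not the design that was eventually shipped: each plan is reduced to the reads it needs for its first batch, the second round is assumed to restart plans from scratch, and "fewest reads wins" stands in for the real ranking formula. All function names are hypothetical.

```python
def run_round(plan_costs, budget):
    """Run every plan once, in order, under a shrinking reads limit.

    Returns (total_reads, finished), where finished maps a plan's index to
    the reads it needed to produce its first batch within the limit.
    """
    limit = budget
    total = 0
    finished = {}
    for index, cost in enumerate(plan_costs):
        used = min(cost, limit)
        total += used
        if cost <= limit:
            finished[index] = used
        limit = min(limit, used)
    return total, finished


def two_phase(plan_costs, small_budget=200, large_budget=10_000):
    """Try a cheap first round; fall back to the large budget only if no
    plan finished. Returns (total_reads, winner_index or None).
    """
    total, finished = run_round(plan_costs, small_budget)
    if not finished:
        # Assumption: the second round restarts each plan from the beginning.
        more, finished = run_round(plan_costs, large_budget)
        total += more
    # Stand-in for the ranking formula: the finished plan with fewest reads.
    winner = min(finished, key=finished.get) if finished else None
    return total, winner


cheap_candidate = two_phase([10_000, 101])   # plan A runs first, B is cheap
no_cheap_candidate = two_phase([5_000, 3_000])
print(cheap_candidate, no_cheap_candidate)
```

With one cheap candidate, the bad plan is cut off after 200 reads and planning ends after 301 reads in total, even when the bad plan runs first. When no candidate is cheap, the first round adds a small fixed overhead (400 reads here) before the large-budget round.
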

Relevant Products

Affected versions: 5.1.1, 5.2.0-rc1, 6.0.12, 7.0.4

Fixed versions: 8.0.0-rc0
