BugZero | MongoDB BugID 380397 - Regex query with indexed $in clause use incorrect ...

MongoDB - Defect ID: 380397

Regex query with indexed $in clause use incorrect index

MongoDB - Defect ID: 380397

Regex query with indexed $in clause use incorrect index

Last updated on 10/27/2023

Overall: 5.95.9

Severity: 6.46.4

Community: 66.0

Lifecycle: 9.19.1

What is the BugZero Risk Score?

Vendor details

Priority: Major - P3
Status: Closed

Overall: 5.95.9

Severity: 6.46.4

Community: 66.0

Lifecycle: 9.19.1

What is the BugZero Risk Score?

Vendor details

Priority: Major - P3
Status: Closed

Info

We appear to be hitting an issue where MongoDB is choosing to use a less efficient index when performing a regex query, that it actually could. Our data has the following structure { title : "String title", cms_title_sort : "string title", tags : ["foo", "bar", "baz"], rank : 1 } With two indexes: ensureIndex({ cms_title_sort : 1 }) ensureIndex({ tags : 1, rank : 1, cms_title_sort : 1 }) If we perform find({ cms_title_sort : /keyword/, tags : { $in : ["foo"] } }) sometimes it uses the cms_title_sort index, other times it uses the tags index. The table contains between 15k and 100k records, so not the largest of tables. When it uses the cms_title_sort index, it's very fast and finishes in cms_title_sort index, it performs the regex in memory, and then does the tag portion on the documents. Because the regex has a very high cardinality the result set is significantly smaller, and it goes quickly. In general I would think that most regex queries have high cardinality, so then, should that not instruct mongo to possible lean towards those indices? We have tried a few different index formats. According to the docs, it would seem that an index which looks like ensureIndex({ tags : 1, cms_title_sort : 1 }) should allow for a query to be entirely performed in memory. Even in that case, the system still performs the cms_title_sort component in a FETCH step, rather than in the index step where it checks the array. In the docs for indexed arrays, it mentions some other caveats, if they don't work with regex it may be best to call that as well. I have created a test script which shows the issue. Consistently with this test data the cms_title_sort index performs the most optimally for all 3 queries, yet mongo sometimes chooses the tags index. In this example it's not as big of a problem, both queries are fairly fast, but as the data gets larger and more documents, then things can slow into the seconds. From my tests the index chosen changes depending on how many records you put in the db and the specific keys you attempt to search on. Lastly, we are using 3.2.6, this behavior may have changed in newer versions. If it is better in newer versions that would be great to know!

Top User Comments

mark.agarunov commented on Fri, 19 May 2017 21:49:43 +0000: Hello owenallenaz, Thank you for the report. Looking over the information you've provided, it seems that this is expected behavior. What you are describing may be due to the query planner caching the query plan. The query planner will cache the plan for better performance, however if there is a large difference in the number of results returned between queries of similar shape, the query planner may be re run and choose a different index which would be more optimal for that query. This plan is then cached and may cause future queries to use this same index. To avoid this, you may be able to use planCacheSetFilter to have more deterministic query execution. Thanks, Mark

Steps to Reproduce

5.9Defect ID: 2956672
Some time-series tests implicitly rely on measurement insertion order for unordered inserts when checking bucket catalog stats
6.14Defect ID: 2965528
Remove push, publish_packages, and crypt_push tasks from Graviton 4 variants in v7.0 and v8.0
6.14Defect ID: 2947969
[SBE] Release storage engine resources when saveState() or restoreState() throws
5.68Defect ID: 2919474
StackLocator broken by v5 toolchain ASAN
5.88Defect ID: 2968769
Make new write path helper functions use acquireAndValidateBucketsCollection instead of acquireCollection

Ready to prevent the next vendor outage?

Get a demo

OPERATIONAL DEFECT DATABASE

MongoDB - Defect ID: 380397

Regex query with indexed $in clause use incorrect index

MongoDB - Defect ID: 380397

Regex query with indexed $in clause use incorrect index

Last updated on 10/27/2023

Vendor details

Vendor details

Description

Info

Top User Comments

Steps to Reproduce

Links

Top MongoDB defects by risk score

Ready to prevent the next vendor outage?