...
We appear to be hitting an issue where MongoDB is choosing to use a less efficient index when performing a regex query, that it actually could. Our data has the following structure { title : "String title", cms_title_sort : "string title", tags : ["foo", "bar", "baz"], rank : 1 } With two indexes: ensureIndex({ cms_title_sort : 1 }) ensureIndex({ tags : 1, rank : 1, cms_title_sort : 1 }) If we perform find({ cms_title_sort : /keyword/, tags : { $in : ["foo"] } }) sometimes it uses the cms_title_sort index, other times it uses the tags index. The table contains between 15k and 100k records, so not the largest of tables. When it uses the cms_title_sort index, it's very fast and finishes in cms_title_sort index, it performs the regex in memory, and then does the tag portion on the documents. Because the regex has a very high cardinality the result set is significantly smaller, and it goes quickly. In general I would think that most regex queries have high cardinality, so then, should that not instruct mongo to possible lean towards those indices? We have tried a few different index formats. According to the docs, it would seem that an index which looks like ensureIndex({ tags : 1, cms_title_sort : 1 }) should allow for a query to be entirely performed in memory. Even in that case, the system still performs the cms_title_sort component in a FETCH step, rather than in the index step where it checks the array. In the docs for indexed arrays, it mentions some other caveats, if they don't work with regex it may be best to call that as well. I have created a test script which shows the issue. Consistently with this test data the cms_title_sort index performs the most optimally for all 3 queries, yet mongo sometimes chooses the tags index. In this example it's not as big of a problem, both queries are fairly fast, but as the data gets larger and more documents, then things can slow into the seconds. From my tests the index chosen changes depending on how many records you put in the db and the specific keys you attempt to search on. Lastly, we are using 3.2.6, this behavior may have changed in newer versions. If it is better in newer versions that would be great to know!
mark.agarunov commented on Fri, 19 May 2017 21:49:43 +0000: Hello owenallenaz, Thank you for the report. Looking over the information you've provided, it seems that this is expected behavior. What you are describing may be due to the query planner caching the query plan. The query planner will cache the plan for better performance, however if there is a large difference in the number of results returned between queries of similar shape, the query planner may be re run and choose a different index which would be more optimal for that query. This plan is then cached and may cause future queries to use this same index. To avoid this, you may be able to use planCacheSetFilter to have more deterministic query execution. Thanks, Mark