...
This issue was identified by joe.caswell, who observed the following correctness problem after creating a wildcard index: > db.c.drop() true > db.c.insert({a: [["test"]]}) WriteResult({ "nInserted" : 1 }) > db.c.find({"a.0.0": "test"}) { "_id" : ObjectId("5d40b2c8a1925e90174dbd65"), "a" : [ [ "test" ] ] } // Document matches as expected. > db.c.createIndex({"$**": 1}) { "createdCollectionAutomatically" : false, "numIndexesBefore" : 1, "numIndexesAfter" : 2, "commitQuorum" : 1, "ok" : 1 } > db.c.find({"a.0.0": "test"}) // Document does not match! This is a bug due to an incorrect plan. When a wildcard index is present, the planner attempts to use it in order to satisfy the equality predicate over the path "a.0.0". In this example, both "0" path components resolve to array indexes. The access plan, however, does not correctly account for these array index path components. This leads to missing results. This issue should only impact queries which attempt to match a nested array by using multiple subsequent array indexes. Operations which do not query by array index, or which do not have multiple adjacent array index path components (e.g. "a.0" or "a.0.b.1") are not affected.
xgen-internal-githook commented on Thu, 22 Aug 2019 16:29:30 +0000: Author: {'username': 'dstorch', 'email': 'david.storch@10gen.com', 'name': 'David Storch'} Message: SERVER-42518 Prevent paths with successive array indexes from using a wildcard index. Consider a query on a generic path "c1.c2...arr.0.1...cn" consisting of n path components. Somewhere in the middle of this path, there is a component "arr" that is known to be multikey, followed by multiple successive array indexes ("0" and "1" in this example). Any predicate whose path matches this description cannot correctly be answered with a wildcard index. This is because wildcard indexes do not recursively expand directly nested arrays. That is, when indexing an array such as [[1, 2], [3, 4]], the index keys will be [1, 2] and [3, 4], as opposed to 1, 2, 3, and 4. Since the index does not contain keys for individual nested array elements, it cannot be cheaply used for queries against nested array elements. This change fixes the planner to avoid attempting to use a wildcard index in such cases, thereby fixing a query correctness bug. (cherry picked from commit 0f0463d0442b2c77cb650a8e84fea4d634fb63e6) Branch: v4.2 https://github.com/mongodb/mongo/commit/ad7bb60d1f0ab0dd33a228a138aded97003006c2 xgen-internal-githook commented on Thu, 22 Aug 2019 14:04:34 +0000: Author: {'username': 'dstorch', 'email': 'david.storch@10gen.com', 'name': 'David Storch'} Message: SERVER-42518 Prevent paths with successive array indexes from using a wildcard index. Consider a query on a generic path "c1.c2...arr.0.1...cn" consisting of n path components. Somewhere in the middle of this path, there is a component "arr" that is known to be multikey, followed by multiple successive array indexes ("0" and "1" in this example). Any predicate whose path matches this description cannot correctly be answered with a wildcard index. This is because wildcard indexes do not recursively expand directly nested arrays. That is, when indexing an array such as [[1, 2], [3, 4]], the index keys will be [1, 2] and [3, 4], as opposed to 1, 2, 3, and 4. Since the index does not contain keys for individual nested array elements, it cannot be cheaply used for queries against nested array elements. This change fixes the planner to avoid attempting to use a wildcard index in such cases, thereby fixing a query correctness bug. Branch: master https://github.com/mongodb/mongo/commit/0f0463d0442b2c77cb650a8e84fea4d634fb63e6 david.storch commented on Tue, 30 Jul 2019 21:35:51 +0000: After review, the Server Query Team believes that the appropriate fix for this issue is to prevent queries which have multiple subsequent positional path components from using a wildcard index when the previous path component is known to be an array. For example, point queries over the a path like "a.0.0" will not be eligible to use a wildcard index if the index metadata indicates that "a" is multikey. In such cases, the index does not record metadata about whether "a.0" or "a.0.0" are arrays; we must defensively presume that they are array indexes, as opposed to field names, due to their spelling. If we pursue such a fix, wildcard indexes will no longer be useful for schemas which involve field names spelled like array indexes. For example, imagine a schema with documents like {a: [{"1": "foo"}}]}. The query find({"a.0.1": "foo"}) will no longer be able to use a wildcard index, since the planner cannot know that the "1" always refers to a field name in the user's schema. It looks like this new behavior can be implemented by extending the validateNumericPathComponents() function from planner_wildcard_helpers.cpp. The reason that queries like this cannot easily use the index is due to the internal index format. In a wildcard index, each key has a leading component which encodes the path as well as a trailing component which encodes the value. That is, each document is shredded into (path, value) pairs for indexing. For instance, the index keys for the document {a: 2, b: 3} with the wildcard index {"$**": 1} would be {"": "a", "": 2} and {"": "b", "": 3}. When the document has an indexed array, we follow the pre-existing behavior for multikey indexes by producing an index key for every array element. For example, the document {a: 2, b: [3, 4]} will result in three index keys: {"": "a", "": 2}, {"": "b", "": 3}, and {"": "b", "": 4}. Just like all other multikey indexes, nested arrays are not recursively expanded for indexing. Therefore, for your test document {a: [["test"]]} the index will contain only the key {"": "a", "": ["test"]}. Note that the array has been expanded just one level, as opposed to being recursively expanded. This design choice was important to make the semantics for using a multikey wildcard index to accelerate a query as similar as possible to how we build access plans for regular multikey indexes. Since nested array are not expanded, and since the array indexes are not explicitly encoded into the index keys, we cannot build useful index bounds for the query find({"a.0.0": "test"}). We could, of course, search for the key {"": "a", "": ["test"]}, but there is no good way for the planner to know a priori that this is a good idea. Should we also search for the key {"": "a", ["test", "foo"]} which could also be present for a matching document? It's possible that we could build clever index bounds that would incorporate all such index keys. This might only work well for array index "0", since the bounds necessary to capture all index keys containing arrays which have a particular value at element "n" would likely be inefficient.