...
Background — $sort + $match When $match follows $sort it gets moved to be ahead of the $sort. This makes sense because it reduces the N in the O(NlogN) of the sort, ie. the aggregation: > db.foo.explain().aggregate( [ { $group: { _id: '$a' } }, { $sort: { b: 1 } }, { $match: { c: 1 } } ] ) can be more efficiently implementated (with identical results) by: > db.foo.explain().aggregate( [ { $group: { _id: '$a' } }, { $match: { c: 1 } }, { $sort: { b: 1 } } ] ) And indeed both of these aggregations have the same explain plan: > db.foo.explain().aggregate( [ { $group: { _id: '$a' } }, { $sort: { b: 1 } }, { $match: { c: 1 } } ] ) { "stages" : [ { "$cursor" : { "query" : { }, "fields" : { "a" : 1, "_id" : 0 }, "queryPlanner" : { "plannerVersion" : 1, "namespace" : "test.foo", "indexFilterSet" : false, "parsedQuery" : { }, "winningPlan" : { "stage" : "COLLSCAN", "direction" : "forward" }, "rejectedPlans" : [ ] } } }, { "$group" : { "_id" : "$a" } }, { "$match" : { "c" : 1 } }, { "$sort" : { "sortKey" : { "b" : 1 } } } ], "ok" : 1 } > db.foo.explain().aggregate( [ { $group: { _id: '$a' } }, { $match: { c: 1 } }, { $sort: { b: 1 } } ] ) { "stages" : [ { "$cursor" : { "query" : { }, "fields" : { "a" : 1, "_id" : 0 }, "queryPlanner" : { "plannerVersion" : 1, "namespace" : "test.foo", "indexFilterSet" : false, "parsedQuery" : { }, "winningPlan" : { "stage" : "COLLSCAN", "direction" : "forward" }, "rejectedPlans" : [ ] } } }, { "$group" : { "_id" : "$a" } }, { "$match" : { "c" : 1 } }, { "$sort" : { "sortKey" : { "b" : 1 } } } ], "ok" : 1 } Background — $sort + $limit When $sort is followed by $limit (a common use case), the $limit gets pushed into the $sort stage. This allows a more efficient top-k algorithm to be used, which is usually O(Nlogk) (which reduces to O(N) if k is small) — compared to O(NlogN) to naively sort the documents and then apply the limit. eg. in the following aggregation, the maximum document output by $group can be found after a single pass: > db.foo.explain().aggregate( [ { $group: { _id: '$a' } }, { $sort: { b: 1 } }, { $limit: 1 } ] ) { "stages" : [ { "$cursor" : { "query" : { }, "fields" : { "a" : 1, "_id" : 0 }, "queryPlanner" : { "plannerVersion" : 1, "namespace" : "test.foo", "indexFilterSet" : false, "parsedQuery" : { }, "winningPlan" : { "stage" : "COLLSCAN", "direction" : "forward" }, "rejectedPlans" : [ ] } } }, { "$group" : { "_id" : "$a" } }, { "$sort" : { "sortKey" : { "b" : 1 }, "limit" : NumberLong(1) } } ], "ok" : 1 } Problem — $sort + $limit + $match Now consider when a $match occurs after $sort + $limit. This table shows the number of documents processed after the $group stage, with a $limit of 1, for the best case ($match matches no documents) and worst case ($match matches all documents): aggregation best case worst case [ { $group: { _id: '$a' } }, { $match: { c: 1 } }, { $sort: { b: 1 } }, { $limit: 1 } ] N 2N + 1 [ { $group: { _id: '$a' } }, { $sort: { b: 1 } }, { $limit: 1 }, { $match: { c: 1 } } ] N N + 2 This shows that for $sort + $limit: 1 it is better (and never worse) for the $match to appear afterwards. For larger k, this may not be the case, and for k = N, the problem degenerates to sorting without a limit (and hence the $match is better off before the $sort). Unfortunately, the behaviour of the server is to always move $match ahead of the $sort + $limit (even though sometimes this is worse): > db.foo.explain().aggregate( [ { $group: { _id: '$a' } }, ... { $sort: { b: 1 } }, { $limit: 1 }, ... { $match: { c: 1 } } ] ) { "stages" : [ { "$cursor" : { "query" : { }, "fields" : { "a" : 1, "_id" : 0 }, "queryPlanner" : { "plannerVersion" : 1, "namespace" : "test.foo", "indexFilterSet" : false, "parsedQuery" : { }, "winningPlan" : { "stage" : "COLLSCAN", "direction" : "forward" }, "rejectedPlans" : [ ] } } }, { "$group" : { "_id" : "$a" } }, { "$match" : { "c" : 1 } }, { "$sort" : { "sortKey" : { "b" : 1 }, "limit" : NumberLong(1) } } ], "ok" : 1 } Conclusion Ideally there are two changes here: Avoid moving $match ahead of a $sort with limit. Move $match to appear after a $sort with limit of 1. The second could be optional, since the first would allow users to manually adjust their aggregation (based on their N and k). The server should avoid moving $match ahead of a $sort with limit. This change may be harder if the $match move happens before the $limit is coalesced into the $sort.
xgen-internal-githook commented on Mon, 26 Jun 2017 20:42:58 +0000: Author: {u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'} Message: SERVER-29647 Fix $match swapping to avoid moving before $sort-$limit. (cherry picked from commit 8b8845703f0c92bafca58ce5d9f36fd2327301d1) Conflicts: jstests/core/views/views_aggregation.js Branch: v3.4 https://github.com/mongodb/mongo/commit/8bc18f2367d616ef1e68b3be4acfaa1d2e8daa6e xgen-internal-githook commented on Fri, 23 Jun 2017 20:01:03 +0000: Author: {u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'} Message: SERVER-29647 Fix $match swapping to avoid moving before $sort-$limit. Branch: master https://github.com/mongodb/mongo/commit/8b8845703f0c92bafca58ce5d9f36fd2327301d1 asya commented on Mon, 19 Jun 2017 16:50:52 +0000: Right, since $match can't be moved ahead of $limit ever, it also can't be moved in front of sort-with-limit. kevin.pulo@10gen.com commented on Mon, 19 Jun 2017 07:33:43 +0000: Nice counter-example! I've updated the "Conclusion" section of the description accordingly. charlie.swanson commented on Thu, 15 Jun 2017 19:46:55 +0000: Missed the second point in your conclusion: Move $match to appear after a $sort with limit of 1. I think this is also incorrect, as it could possibly change the sort order. charlie.swanson commented on Thu, 15 Jun 2017 19:44:39 +0000: I think this is actually even simpler. I think it is a straight-up bug to move a $match before a $sort if the $sort has absorbed a $limit: (mongod-3.4.4) test> db.bar.aggregate([{$sort: {x: -1}}, {$limit: 1}, {$match: {x: {$mod: [2, 1]}}}]) { "_id": ObjectId("5942e317b8328df32ca4fdb2"), "x": 3 } (mongod-3.4.4) test> db.bar.aggregate([ {$sort: {x: -1}}, {$limit: 1}, {$project: {y: "$x"}} /* renames x to trick $match to stay after $sort */, {$match: {y: {$mod: [2, 1]}}} ]) // No results