...
As of DRIVERS-501 the driver CRUD specification defines a pipeline to be used to count documents without resorting to the count command (due to potential inaccurate counts without query predicates). db.foo.drop(); db.foo.insertMany([{},{},{},{},{}]); // Driver CRUD spec definition for countDocuments //////////////////// filter = {} skip = null limit = null pipeline = [{'$match': filter}] if (skip) { pipeline.push({'$skip': skip}) } if (limit) { pipeline.push({'$limit': limit}) } pipeline.push({'$group': {'_id': 1, 'n': {'$sum': 1}}}) ////////////////////////////////////////////////////////////////////// db.foo.explain("executionStats").aggregate(pipeline) Without specifying a filter the empty $match stage would result in a COLLSCAN plan winning, though a COUNT_SCAN should be appropriate. /* "nReturned": 5, "executionTimeMillis": 0, "totalKeysExamined": 0, "totalDocsExamined": 5, "executionStages": { "stage": "COLLSCAN", */ This could be worked around by unshifting a {$sort: { _id: 1 }} stage to the pipeline, however after discussing with david.storch@mongodb.com this may be addressable server-side.
JIRAUSER1273272 commented on Tue, 15 Aug 2023 11:58:39 +0000: Marking this as blocked by SERVER-23406 as the current performance numbers don't allow this patch to be merged as long as the index scans are significantly slower than a col scan. JIRAUSER1255446 commented on Mon, 7 Aug 2023 11:21:21 +0000: Local benchmark shows that for collections with a relatively small number of fields the collection scan is more efficient than the COUNT_SCAN over the _id index. Bellow are results for collections of size 10000, 100000, and 1 000 000. The documents have 6 fields {_id: i, a: i * 5, b: 100, c: 1000, d: "xxx", e: "abc".repeat(100) }. Each test was run 7 times and the first and last measurements were removed when computing the average. The following results were obtained when running with the SBE engine. Running for collection size 10000 Testing scan for hint { "$natural" : 1 } Average execution time (ms): 3 Median execution time (ms): 3 Testing scan for hint { "_id" : 1 } Average execution time (ms): 5 Median execution time (ms): 5 Testing scan for hint { "a" : 1 } Average execution time (ms): 4 Median execution time (ms): 4 Running for collection size 100000 Testing scan for hint { "$natural" : 1 } Average execution time (ms): 29.6 Median execution time (ms): 30 Testing scan for hint { "_id" : 1 } Average execution time (ms): 47 Median execution time (ms): 47 Testing scan for hint { "a" : 1 } Average execution time (ms): 42 Median execution time (ms): 42 Running for collection size 1000000 Testing scan for hint { "$natural" : 1 } Average execution time (ms): 304.6 Median execution time (ms): 305 Testing scan for hint { "_id" : 1 } Average execution time (ms): 470.6 Median execution time (ms): 468 Testing scan for hint { "a" : 1 } Average execution time (ms): 415.2 Median execution time (ms): 415 Similar results for collections with 3 fields, one idea faster over the large collection sizes. charlie.swanson commented on Tue, 6 Sep 2022 16:34:41 +0000: This does seem to be pretty easily addressed. This is a pretty hack-y patch I drummed up which seems to "fix" it and prefer a full COUNT_SCAN of the "_id" index over a COLLSCAN: Unable to find source-code formatter for language: diff. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yaml diff --git a/src/mongo/db/query/query_planner.cpp b/src/mongo/db/query/query_planner.cpp index ba398f446cb..4e23e3aa435 100644 --- a/src/mongo/db/query/query_planner.cpp +++ b/src/mongo/db/query/query_planner.cpp @@ -1494,6 +1494,17 @@ StatusWith>> QueryPlanner::plan( } } + // If there are no indexed alternatives, but this is a count request without a query predicate, + // we can use a COUNT_SCAN of the entire _id index (if it exists). + if (out.empty() && (params.options & QueryPlannerParams::IS_COUNT) && + query.getQueryObj().isEmpty()) { + for (auto&& index : params.indices) { + if (isIdIndex(index.keyPattern)) { + out.push_back(buildWholeIXSoln(index, query, params)); + } + } + } + // Check whether we're eligible to use the columnar index, assuming no other indexes can be // used. if (out.empty()) { I'll nominate this for a quick win. It won't be super easy since I believe we'll want to benchmark it and possibly consider whether a non-id index should be eligible? Only non-partial, non-multi key indexes would work. I'm not sure how we'd effectively choose amongst them though, so the "_id" index may be the sensible default. charlie.swanson commented on Tue, 6 Sep 2022 16:31:19 +0000: Oops! Disregard my last comment. I confused myself by looking at the example in that linked ticket, which does not include a filter or a field path access: db.items.aggregate([ {$group: {_id: category, count: {$sum: 1}}}]) It was later clarified that this was a typo though and "category" is meant to be a field path reference: db.items.aggregate([ {$group: {_id: "$category", count: {$sum: 1}}}]) That's a pretty different query than the pure count in this ticket, so I will re-open and undo the duplication. I'll also fix up the example on that ticket charlie.swanson commented on Tue, 6 Sep 2022 16:12:58 +0000: Thanks christopher.harris@mongodb.com and nicholas.zolnierz@mongodb.com. This does appear to be a duplicate of SERVER-29444. The empty predicate is the same as no $match, and the two are both impacted by this bug of not generating an entire-index COUNT_SCAN. The ticket christopher.harris@mongodb.com linked is worth keeping in mind for any change like this, but I think described a filtered COUNT, which makes it eligible for index scans.