BugZero | MongoDB BugID 534518 - Aggregation Index Performance

MongoDB - Defect ID: 534518

Aggregation Index Performance

MongoDB - Defect ID: 534518

Aggregation Index Performance

Last updated on 5/21/2018

Overall: 6.16.1

Severity: 6.46.4

Community: 6.46.4

Lifecycle: 9.19.1

What is the BugZero Risk Score?

Vendor details

Priority: Major - P3
Status: Closed

Overall: 6.16.1

Severity: 6.46.4

Community: 6.46.4

Lifecycle: 9.19.1

What is the BugZero Risk Score?

Vendor details

Priority: Major - P3
Status: Closed

Info

It seems that Index usage during $lookup phases is not fully optimized for $pipeline+$expr lookups.

Top User Comments

asya commented on Mon, 21 May 2018 18:21:25 +0000: I'm closing this issue as incomplete because it appears that the exact same query plan is being used in both cases. Feel free to reopen with additional details if there appears to be a problem here I've missed. I've also linked a related ticket that would allow the query to use a compound index on the two fields if one was present SERVER-34926 which could potentially improve performance if this query in some cases. asya commented on Mon, 30 Apr 2018 21:49:19 +0000: So I see the external query in loop and lookup which uses $expr are both using the same index to lookup data: "inputStage" : { "stage" : "IXSCAN", "keyPattern" : { "Borrower.Email" : 1 }, "indexName" : "Borrower.Email_1", "isMultiKey" : false, "isUnique" : false, "isSparse" : false, "isPartial" : false, "indexVersion" : 1, "direction" : "forward", "indexBounds" : { "Borrower.Email" : [ "[\"danielp@bompracredito.com.br\", \"danielp@bompracredito.com.br\"]" ] } } It's not immediately clear to me why it would be significantly slower. Could you run both to completion and then include total stats for how long it took? danielp@bompracredito.com.br commented on Sun, 29 Apr 2018 20:43:49 +0000: This time, I didn't bother to 'obfuscate' collection and propery names. The query planner outputs were so large that I attached a file with all info to this issue (explains.txt) asya commented on Sat, 28 Apr 2018 20:43:10 +0000: In that case can you please provide the output of this query: db.Collection.explain(true).find({'User.Email':,SimulationDate:{$gte:, $lte:}}) // and db.Collection.explain(true).find({$expr:{$and:[ {$eq: [, "$User.Email"]}, {$gte: ["$SimulationDate", ]}, {$lte: ["$SimulationDate",]} ]}}) In both cases please fill in some reasonable values for , and danielp@bompracredito.com.br commented on Sat, 28 Apr 2018 20:15:04 +0000: Sorry, I renamed FullProposals (our actual collection name) to "Collection" (a generic name) in all instances but in the "find" example. asya commented on Sat, 28 Apr 2018 20:05:42 +0000: We have several tickets tracking improving performance of $lookup stages, however $expr introduced a difficulty I described (several comparison operators do not have exactly the same meaning in find vs. aggregation expression. I'm trying to understand if there is a bug here (in the example you describe): your aggregation queries "Collection" and then does a $lookup from "Collection" - your JS on the other hand after query on "Collection" does a subsequent lookup (via find) on "FullProposals". Please clarify the intention of the aggregation so that I can ask you for correct "explain" output to diagnose this performance issue. danielp@bompracredito.com.br commented on Fri, 27 Apr 2018 20:31:48 +0000: @Asya Kamsky, User.Email is a string property. From an end-user experience standpoint, aggregations + lookups should behave exactly as SQL INNER JOINS. If indexes area available, they should be leveraged to achieve maximum performance. More to the point, I am seeing several scenarios, as a MongoDB developer, where I am experiencing better performance using finds + JavaScript than with aggregations with lookups. The expectation is that aggregations should be much faster than finds+javascript, since there's no "chat" going on between the client and MongoDB. That's especially frustrating for users coming from an SQL background (say, statisticians) that are used to pretty fast SQL joins. Not my case, but I am having to support those users. Are there any plans to improve aggregation lookup performance? asya commented on Fri, 27 Apr 2018 18:27:19 +0000: danielp@bompracredito.com.br please note that the JS example is not 100% identical to $lookup with expressive pipeline due to greater-than semantics in find and aggregation being different. This is why find can use an index for inequality when aggregation $expr cannot. You can see SERVER-31804 for details about how the wrong results can be returned if an index is used for inequality. In addition, even equality semantics are not the same in aggregation and find as find looks inside arrays and aggregation does not (on straight $eq), is "User.Email" field an array in your case? If not then then $lookup should still be relatively fast if there is an index to use for equality. You can see some of that discussion in SERVER-31760

Steps to Reproduce

The following aggregation: db.Collection.aggregate([ { $match: { "SimulationDate": {$gte: ISODate("2018-03-21 00:00:00")}, "ParentId": null } }, { $project: { "CPF": "$User.CPF", "Data": "$SimulationDate", "Email": "$User.Email", "TelCelular": {$concat: ["$User.MobilePhone.AreaCode","-","$Borrower.MobilePhone.Number"]}, "TelOutro": {$concat: ["$User.HomePhone.AreaCode","-","$Borrower.HomePhone.Number"]}, } }, { $lookup: { from: "Collection", let: { id: "$Email", data: "$Data" }, pipeline: [ { $match: { $expr: {$and: [ {$eq: ["$$id", "$User.Email"]}, {$gte: ["$SimulationDate", ISODate("2018-03-21 00:00:00")]}, {$lte: ["$SimulationDate", "$$data"]}, ]} } }, { $project: { "CPF_Email": "$User.CPF", "Data_Email": "$SimulationDate" } } ], as: "Email" } } ]) seems to run much slower (I couldn't even wait for the first line to return) than the (similar?) find+Javascript script: var ds = new Date(2018,2,21) db.Collection.find({"SimulationDate": {$gte: ds},"ParentId": null},{'SimulationDate':1,'User.Email':1}).forEach(p=>{ db.Collection.find({'User.Email':p.User.Email,SimulationDate:{$gte:ds, $lte:p.SimulationDate}}).forEach(pb=>{ print (pb.User.Email+","+pb.SimulationDate) }) })

5.9Defect ID: 2956672
Some time-series tests implicitly rely on measurement insertion order for unordered inserts when checking bucket catalog stats
6.14Defect ID: 2965528
Remove push, publish_packages, and crypt_push tasks from Graviton 4 variants in v7.0 and v8.0
6.14Defect ID: 2947969
[SBE] Release storage engine resources when saveState() or restoreState() throws
5.68Defect ID: 2919474
StackLocator broken by v5 toolchain ASAN
5.88Defect ID: 2968769
Make new write path helper functions use acquireAndValidateBucketsCollection instead of acquireCollection

Ready to prevent the next vendor outage?

Get a demo

OPERATIONAL DEFECT DATABASE

MongoDB - Defect ID: 534518

Aggregation Index Performance

MongoDB - Defect ID: 534518

Aggregation Index Performance

Last updated on 5/21/2018

Vendor details

Vendor details

Description

Info

Top User Comments

Steps to Reproduce

Links

Top MongoDB defects by risk score

Ready to prevent the next vendor outage?