BugZero | MongoDB BugID 472647 - Use of options:x and comments with $regex search i...

MongoDB - Defect ID: 472647

Use of options:x and comments with $regex search including \n can lead to incorrect documents being returned

MongoDB - Defect ID: 472647

Use of options:x and comments with $regex search including \n can lead to incorrect documents being returned

Last updated on 10/27/2023

Overall: 5.95.9

Severity: 6.46.4

Community: 66.0

Lifecycle: 9.19.1

What is the BugZero Risk Score?

Vendor details

Priority: Major - P3
Status: Closed

Overall: 5.95.9

Severity: 6.46.4

Community: 66.0

Lifecycle: 9.19.1

What is the BugZero Risk Score?

Vendor details

Priority: Major - P3
Status: Closed

Info

With a simple set of documents: {_id: 1, f1: "realapp"} {_id: 2, f1: "rea"} {_id: 3, f1: "rea\napp"} {_id: 4, f1: "rea\nlapp"} {_id: 5, f1: "suretrea"} these two $regex expressions (identical except for use of quotes or slashes): find({f1:{$regex:'rea \n# comment 1\nl',$options:"x"}}) find({f1:{$regex:/rea \n# comment 1\nl/,$options:"x"}}) give different results, both incorrect, despite our $regex documentation suggesting that the syntaxes with the search strings wrapped in quotes or slashes are equivalent. Removing the option:x, the comment and the space leads to correct results from both forms.

Top User Comments

david.storch commented on Fri, 29 Dec 2017 19:51:54 +0000: Hi william.byrne, After digging into the details, I actually believe that this is working as designed. There's a lot going on here, so let me unpack some of the details by focusing on two queries highlighted in the ticket description: find({f1:{$regex:'rea \n# comment 1\nl',$options:"x"}}) find({f1:{$regex:/rea \n# comment 1\nl/,$options:"x"}}) First query Let's start by considering the first problem query: > db.regexOptionsX.find({f1: {$regex:'rea \n# comment 1\nl', $options: 'x'}}); { "_id" : 1, "f1" : "realapp" } If I understand correctly, you claim that this is the incorrect result set. Instead, you would expect the following: > db.regexOptionsX.find({f1: {$regex:'rea \n# comment 1\nl', $options: 'x'}}); { "_id" : 4, "f1" : "rea\nlapp" } In other words, you expect the newline character to be significant, but instead it is ignored. To the contrary, my understanding is that we would expect the "\n" character to be ignored here. The PCRE "x" option turns on "extended mode". Quoting from the documentation for PCRE_EXTENDED: If this bit is set, most white space characters in the pattern are totally ignored except when escaped or inside a character class. The "\n" is neither escaped nor inside a character class, so it seems correct that it is not considered part of the search pattern. Second query When I run this query against a recent version of 3.4, I get the following: > db.regexOptionsX.find({f1: {$regex:/rea \n# comment 1\nl/, $options: 'x'}}); { "_id" : 3, "f1" : "rea\napp" } { "_id" : 4, "f1" : "rea\nlapp" } However, when I run the query against 3.6.0, it fails: > db.regexOptionsX.find({f1: {$regex:/rea \n# comment 1\nl/, $options: 'x'}}); Error: error: { "ok" : 0, "errmsg" : "Regular expression is invalid: missing )", "code" : 2, "codeName" : "BadValue" } This is due to the changes made in SERVER-30986. Prior to this fix, the server would fail to return an error on regular expressions that are invalid from the perspective of the PCRE library. The incorrect results seen on 3.4 are a consequence of the MongoDB server mistakenly using an invalid regex for matching. The 3.6 behavior is correct. This begs a few questions. What does the error message "missing )" mean? Why is the regex valid when specified with quotes but invalid when specified with slashes? The "missing )" error message I can't explain: this is fairly cryptic, but it's just the error string that we have surfaced from the underlying PCRE compilation of the regex. Regarding the latter question: the difference in behavior between quotes versus slashes has to do with how the "\n" sequence is interpreted and the semantics of the "#" character when PCRE_EXTENDED is enabled. When using the syntax {$regex: "\n"} in the shell, the server receives 0x0a, the ASCII newline character. On the other hand, when using the syntax {$regex: /\n/}, the server receives 0x5c 0x6e, the ASCII codes for "\" and "n". I believe this is just JavaScript behavior, rather than something MongoDB-specific. In any case, this encoding difference is produced by the shell, not the server. PCRE_EXTENDED causes the "#" character to be interpreted specially as a comment. Quoting again from the PCRE manual: PCRE_EXTENDED also causes characters between an unescaped # outside a character class and the next newline, inclusive, to be ignored. ... Note that the end of this type of comment is a literal newline sequence in the pattern; escape sequences that happen to represent a newline do not count. There are two interesting facts here. The first is that a newline character is used together with "#" in order to terminate a comment. It appears that PCRE requires this terminator, and the regular expression is considered invalid without one: // This query fails. > db.c.find({a: {$regex: "#", $options: "x"}}); Error: error: { "ok" : 0, "errmsg" : "Regular expression is invalid: missing )", "code" : 2, "codeName" : "BadValue" } // This query succeeds. > db.c.find({a: {$regex: "#\n", $options: "x"}}); Second, only the character 0x0a can terminate a comment; the sequence 0x5c 0x6e does not cut it. Putting everything together, we see that this query correctly fails because the comment is not terminated. Conclusion I believe the above should also explain the behavior of the remaining queries in the repro script. I am closing this ticket as Works as Designed and linking as related to SERVER-30986. Best, Dave

Steps to Reproduce

Pass the following in a file to the mongo shell: db.regexOptionsX.drop(); db.regexOptionsX.insert([ {_id: 1, f1: "realapp" }, {_id: 2, f1: "rea" }, {_id: 3, f1: "rea\napp" }, {_id: 4, f1: "rea\nlapp" }, {_id: 5, f1: "suretrea" } ]) ; print("\nAll: find()"); var a = db.regexOptionsX.find() ; while (a.hasNext()) {printjson(a.next())} print("\nQuotes: find({f1:{$regex:'rea \\n# comment 1\\nl',$options:'x'}})"); var s = db.regexOptionsX.find({f1:{$regex:'rea \n# comment 1\nl',$options:'x'}}) while (s.hasNext()) {printjson(s.next())} print("\nSlashes: find({f1:{$regex:/rea \\n# comment 1\\nl/,$options:'x'}})"); var x = db.regexOptionsX.find({f1:{$regex:/rea \n# comment 1\nl/,$options:'x'}}) while (x.hasNext()) {printjson(x.next())} print("\nNo Comment and Quotes: find({f1:{$regex:'rea \\nl',$options:'x'}})"); var s = db.regexOptionsX.find({f1:{$regex:'rea \nl',$options:'x'}}) while (s.hasNext()) {printjson(s.next())} print("\nNo Comment and Slashes: find({f1:{$regex:/rea \\nl/,$options:'x'}})"); var x = db.regexOptionsX.find({f1:{$regex:/rea \nl/,$options:'x'}}) while (x.hasNext()) {printjson(x.next())} print("\nNo space, no comment, and Quotes: find({f1:{$regex:'rea\\nl',$options:'x'}})"); var s = db.regexOptionsX.find({f1:{$regex:'rea\nl',$options:'x'}}) while (s.hasNext()) {printjson(s.next())} print("\nNo space, no comment, and Slashes: find({f1:{$regex:/rea\\nl/,$options:'x'}})"); var x = db.regexOptionsX.find({f1:{$regex:/rea\nl/,$options:'x'}}) while (x.hasNext()) {printjson(x.next())} print("\nNo options:x (or spaces or comments) and Quotes: find({f1:{$regex:'rea\\nl'}})"); var s = db.regexOptionsX.find({f1:{$regex:'rea\nl'}}) while (s.hasNext()) {printjson(s.next())} print("\nNo options:x (or spaces or comments) and Slashes: find({f1:{$regex:/rea\\nl/}})"); var x = db.regexOptionsX.find({f1:{$regex:/rea\nl/}}) while (x.hasNext()) {printjson(x.next())} Summary of output: The only document that should be returned is {"_id": 4, "f1": "rea\nlapp"} as it is the only one with the "rea" followed by a newline and an "l". If the $regex uses quotes and $options:'x', then this document is returned: {"_id": 1, "f1": "realapp"}, meaning the newline is being seen as a formatting "space" and not something to be searched for. With no #comment or $options:'x' present it gets the right results. If the $regex uses slashes, $options:'x', and has a comment, then this document: {"_id": 3, "f1": "rea\napp"} is returned as well as the correct one. With no #comment it gets the right results, even with $options:'x' present.

5.9Defect ID: 2956672
Some time-series tests implicitly rely on measurement insertion order for unordered inserts when checking bucket catalog stats
6.14Defect ID: 2965528
Remove push, publish_packages, and crypt_push tasks from Graviton 4 variants in v7.0 and v8.0
6.14Defect ID: 2947969
[SBE] Release storage engine resources when saveState() or restoreState() throws
5.68Defect ID: 2919474
StackLocator broken by v5 toolchain ASAN
5.88Defect ID: 2968769
Make new write path helper functions use acquireAndValidateBucketsCollection instead of acquireCollection

Ready to prevent the next vendor outage?

Get a demo

OPERATIONAL DEFECT DATABASE

MongoDB - Defect ID: 472647

Use of options:x and comments with $regex search including \n can lead to incorrect documents being returned

MongoDB - Defect ID: 472647

Use of options:x and comments with $regex search including \n can lead to incorrect documents being returned

Last updated on 10/27/2023

Vendor details

Vendor details

Description

Info

Top User Comments

Steps to Reproduce

Links

Top MongoDB defects by risk score

Ready to prevent the next vendor outage?