...
With a simple set of documents: {_id: 1, f1: "realapp"} {_id: 2, f1: "rea"} {_id: 3, f1: "rea\napp"} {_id: 4, f1: "rea\nlapp"} {_id: 5, f1: "suretrea"} these two $regex expressions (identical except for use of quotes or slashes): find({f1:{$regex:'rea \n# comment 1\nl',$options:"x"}}) find({f1:{$regex:/rea \n# comment 1\nl/,$options:"x"}}) give different results, both incorrect, despite our $regex documentation suggesting that the syntaxes with the search strings wrapped in quotes or slashes are equivalent. Removing the option:x, the comment and the space leads to correct results from both forms.
david.storch commented on Fri, 29 Dec 2017 19:51:54 +0000: Hi william.byrne, After digging into the details, I actually believe that this is working as designed. There's a lot going on here, so let me unpack some of the details by focusing on two queries highlighted in the ticket description: find({f1:{$regex:'rea \n# comment 1\nl',$options:"x"}}) find({f1:{$regex:/rea \n# comment 1\nl/,$options:"x"}}) First query Let's start by considering the first problem query: > db.regexOptionsX.find({f1: {$regex:'rea \n# comment 1\nl', $options: 'x'}}); { "_id" : 1, "f1" : "realapp" } If I understand correctly, you claim that this is the incorrect result set. Instead, you would expect the following: > db.regexOptionsX.find({f1: {$regex:'rea \n# comment 1\nl', $options: 'x'}}); { "_id" : 4, "f1" : "rea\nlapp" } In other words, you expect the newline character to be significant, but instead it is ignored. To the contrary, my understanding is that we would expect the "\n" character to be ignored here. The PCRE "x" option turns on "extended mode". Quoting from the documentation for PCRE_EXTENDED: If this bit is set, most white space characters in the pattern are totally ignored except when escaped or inside a character class. The "\n" is neither escaped nor inside a character class, so it seems correct that it is not considered part of the search pattern. Second query When I run this query against a recent version of 3.4, I get the following: > db.regexOptionsX.find({f1: {$regex:/rea \n# comment 1\nl/, $options: 'x'}}); { "_id" : 3, "f1" : "rea\napp" } { "_id" : 4, "f1" : "rea\nlapp" } However, when I run the query against 3.6.0, it fails: > db.regexOptionsX.find({f1: {$regex:/rea \n# comment 1\nl/, $options: 'x'}}); Error: error: { "ok" : 0, "errmsg" : "Regular expression is invalid: missing )", "code" : 2, "codeName" : "BadValue" } This is due to the changes made in SERVER-30986. Prior to this fix, the server would fail to return an error on regular expressions that are invalid from the perspective of the PCRE library. The incorrect results seen on 3.4 are a consequence of the MongoDB server mistakenly using an invalid regex for matching. The 3.6 behavior is correct. This begs a few questions. What does the error message "missing )" mean? Why is the regex valid when specified with quotes but invalid when specified with slashes? The "missing )" error message I can't explain: this is fairly cryptic, but it's just the error string that we have surfaced from the underlying PCRE compilation of the regex. Regarding the latter question: the difference in behavior between quotes versus slashes has to do with how the "\n" sequence is interpreted and the semantics of the "#" character when PCRE_EXTENDED is enabled. When using the syntax {$regex: "\n"} in the shell, the server receives 0x0a, the ASCII newline character. On the other hand, when using the syntax {$regex: /\n/}, the server receives 0x5c 0x6e, the ASCII codes for "\" and "n". I believe this is just JavaScript behavior, rather than something MongoDB-specific. In any case, this encoding difference is produced by the shell, not the server. PCRE_EXTENDED causes the "#" character to be interpreted specially as a comment. Quoting again from the PCRE manual: PCRE_EXTENDED also causes characters between an unescaped # outside a character class and the next newline, inclusive, to be ignored. ... Note that the end of this type of comment is a literal newline sequence in the pattern; escape sequences that happen to represent a newline do not count. There are two interesting facts here. The first is that a newline character is used together with "#" in order to terminate a comment. It appears that PCRE requires this terminator, and the regular expression is considered invalid without one: // This query fails. > db.c.find({a: {$regex: "#", $options: "x"}}); Error: error: { "ok" : 0, "errmsg" : "Regular expression is invalid: missing )", "code" : 2, "codeName" : "BadValue" } // This query succeeds. > db.c.find({a: {$regex: "#\n", $options: "x"}}); Second, only the character 0x0a can terminate a comment; the sequence 0x5c 0x6e does not cut it. Putting everything together, we see that this query correctly fails because the comment is not terminated. Conclusion I believe the above should also explain the behavior of the remaining queries in the repro script. I am closing this ticket as Works as Designed and linking as related to SERVER-30986. Best, Dave
Pass the following in a file to the mongo shell: db.regexOptionsX.drop(); db.regexOptionsX.insert([ {_id: 1, f1: "realapp" }, {_id: 2, f1: "rea" }, {_id: 3, f1: "rea\napp" }, {_id: 4, f1: "rea\nlapp" }, {_id: 5, f1: "suretrea" } ]) ; print("\nAll: find()"); var a = db.regexOptionsX.find() ; while (a.hasNext()) {printjson(a.next())} print("\nQuotes: find({f1:{$regex:'rea \\n# comment 1\\nl',$options:'x'}})"); var s = db.regexOptionsX.find({f1:{$regex:'rea \n# comment 1\nl',$options:'x'}}) while (s.hasNext()) {printjson(s.next())} print("\nSlashes: find({f1:{$regex:/rea \\n# comment 1\\nl/,$options:'x'}})"); var x = db.regexOptionsX.find({f1:{$regex:/rea \n# comment 1\nl/,$options:'x'}}) while (x.hasNext()) {printjson(x.next())} print("\nNo Comment and Quotes: find({f1:{$regex:'rea \\nl',$options:'x'}})"); var s = db.regexOptionsX.find({f1:{$regex:'rea \nl',$options:'x'}}) while (s.hasNext()) {printjson(s.next())} print("\nNo Comment and Slashes: find({f1:{$regex:/rea \\nl/,$options:'x'}})"); var x = db.regexOptionsX.find({f1:{$regex:/rea \nl/,$options:'x'}}) while (x.hasNext()) {printjson(x.next())} print("\nNo space, no comment, and Quotes: find({f1:{$regex:'rea\\nl',$options:'x'}})"); var s = db.regexOptionsX.find({f1:{$regex:'rea\nl',$options:'x'}}) while (s.hasNext()) {printjson(s.next())} print("\nNo space, no comment, and Slashes: find({f1:{$regex:/rea\\nl/,$options:'x'}})"); var x = db.regexOptionsX.find({f1:{$regex:/rea\nl/,$options:'x'}}) while (x.hasNext()) {printjson(x.next())} print("\nNo options:x (or spaces or comments) and Quotes: find({f1:{$regex:'rea\\nl'}})"); var s = db.regexOptionsX.find({f1:{$regex:'rea\nl'}}) while (s.hasNext()) {printjson(s.next())} print("\nNo options:x (or spaces or comments) and Slashes: find({f1:{$regex:/rea\\nl/}})"); var x = db.regexOptionsX.find({f1:{$regex:/rea\nl/}}) while (x.hasNext()) {printjson(x.next())} Summary of output: The only document that should be returned is {"_id": 4, "f1": "rea\nlapp"} as it is the only one with the "rea" followed by a newline and an "l". If the $regex uses quotes and $options:'x', then this document is returned: {"_id": 1, "f1": "realapp"}, meaning the newline is being seen as a formatting "space" and not something to be searched for. With no #comment or $options:'x' present it gets the right results. If the $regex uses slashes, $options:'x', and has a comment, then this document: {"_id": 3, "f1": "rea\napp"} is returned as well as the correct one. With no #comment it gets the right results, even with $options:'x' present.