...
A string search or comparison query like the following fails to return any documents that contains special (unprintable?) characters: Comparison: { "string": "substring ���콻�� substring" } Search: { "string": /.*substring.*/i } Example document: { "_id" : ObjectId("593708460121722b9463f7a1"), "string" : "substring ���콻�� substring" } This causes me to be unable to directly find such documents in my database, which is very annoying. If you wonder why, from time to time such special characters can occur in user-generated data in my DB. This is not a bug and by design / allowed. The original document is saved using the C driver - if you copy and paste it from here to insert into a database the bug will not show as the unprintable characters seem to become UTF-8 character which are fine (e.g. there are no issues when searching a string with just Chinese in it).
david.storch commented on Tue, 12 Sep 2017 19:37:15 +0000: Hi bastian, Apologies for the delay in our response. I looked into this and have some information to report. I think there are two issues at play here. The first is that our standard BSON validation routine, which is invoked whenever the server ingests BSON off the wire, does not validate UTF-8 strings. We have an open ticket tracking the work to add such UTF-8 validation: SERVER-12204. The second issue has to do with the behavior of regular expression matching. The server code includes the PCRE library as its regular expression engine, and interfaces with this library through a C++ wrapper: see pcrecpp.h. As far as I can tell, we enable UTF-8 when we build PCRE, and the expected behavior is that an error will be thrown when either the subject string or the regex pattern consists of invalid UTF-8. However, the C++ wrapper appears to discard this error when we call pcrecpp::RE::PartialMatch(). As a result, the server always fails to match when applying a regex to an invalid UTF-8 string. We could improve this behavior in one of two ways: Return an error when attempting to match invalid UTF-8. Implement the ability for partial matching to work, if the invalid portion of the string is not involved in the match. Both of these changes appear to require upstream changes in the PCRE library. While the former seems fairly straightforward, I suspect it is unlikely that we would pursue the latter. I'm copying mark.benvenuto, a MongoDB engineer who knows more about our PCRE integration, in case he has anything to add. Best, Dave ian@10gen.com commented on Fri, 25 Aug 2017 14:51:05 +0000: assigning to dave to investigate for triage. ian@10gen.com commented on Thu, 3 Aug 2017 15:00:42 +0000: david.storch can you check on where and how we do UTF-8 validation? mark.agarunov commented on Wed, 19 Jul 2017 21:03:52 +0000: Hello bastian, Thank you for the information. I've now been able to reproduce this behavior using the C driver. I've set the fixVersion to "Needs Triage" for this new feature to be scheduled against our currently planned work. Updates will be posted on this ticket as they happen. Thanks, Mark bastian commented on Mon, 3 Jul 2017 14:48:10 +0000: Sorry for the delay. I'm afraid I cannot upload my database for you and unfortunately copying the entire collection also solves the issue, so I'm afraid I can only suggest what I mentioned above: Try to insert a document containing a string such as "substring \xB7\x73\xBC\x57\xB8\xEA\xAE\xC6\xA7\xA8 substring" into a collection using the C driver (mongoc_collection_insert) and then try to find this document using the search query targeting "substring". You should be able to reproduce it 100% like this. mark.agarunov commented on Thu, 22 Jun 2017 20:59:21 +0000: Hello bastian, We still need additional information to diagnose the problem. If this is still an issue for you, would you please provide: The complete log files from mongod If possible, the database that contains the offending characters. Thanks, Mark mark.agarunov commented on Fri, 9 Jun 2017 20:31:33 +0000: Hello bastian, Thank you for the quick response. I would like to take a deeper look at this behavior so I'd like to request a couple things. Could you please provide: The complete log files from mongod If possible, the database that contains the offending characters. I've generated a secure upload portal so that you can send us these files privately. Thanks, Mark bastian commented on Fri, 9 Jun 2017 19:54:14 +0000: I did some research using raw file logs for the string in question and found the following: A string containing the bytes "B7 73 BC 57 B8 EA AE C6 A7 A8" was saved using the C driver. The problem with some of those bytes (e.g. B7) is that they are completely invalid in terms of UTF-8. When showing that document in my GUI, it shows as: "substring �s�W��Ƨ�" - the latter part represents these bytes. What happens is that the GUI or web browser simply shows these invalid bytes as the UTF-8 replacement character "�". Therefore, when copying this string from the browser or re-saving the printed document (I tried that), these invalid bytes will turn into valid UTF-8 characters and hence the string can suddenly be found using the aforementioned query. That also means that you can't reproduce the issue using any sort of UI (including console) or by copy&pasting but only using actual code where you can specify such bytes. So it's definitely an input issue when saving the document on my side, i.e. a string should never contain such invalid characters which the C driver seems to successfully save. I will fix that. The question remains if you consider it a bug that the regex string search fails because of these invalid characters in the string/document? I assume that the regex code fails somewhere because of these chars and hence this should be considered a "feature", i.e. no fix coming? Thanks again. mark.agarunov commented on Fri, 9 Jun 2017 18:39:47 +0000: Hello bastian, Thank you for the report. Unfortunately I haven't been have to reproduce the behavior you've described. After inserting the document from your example, the document is returned with a find() query: > db.test.insert({"string" : "substring ���콻�� substring"}) WriteResult({ "nInserted" : 1 }) > db.test.find({"string": /.*substring.*/i}) { "_id" : ObjectId("593ae9c39c8807f43fdd9180"), "string" : "substring ���콻�� substring" } If you are still seeing this issue, please append .explain(true) to the find() query and provide the output, for example: db.test.find({"string": /.*substring.*/i}).explain(true) Thanks, Mark bastian commented on Tue, 6 Jun 2017 20:31:30 +0000: Just realized: Please ignore the comparison part as that query naturally only contains printable/UTF-8 characters, hence it will never be equal to the actual string in the document which contains unprintable chars. The critical issue here is that the search query is not working.