...
Using multibyte UTF-8 Unicode characters in a database name causes directory creation to fail when directoryperdb is enabled.

Because the BSON spec defines strings to be stored in UTF-8, strings in the server are also UTF-8. Windows, however, uses UTF-16 for its implementation of Unicode and as the input encoding for its APIs. This means we must convert between our internal 8-bit characters and Windows' 16-bit characters before making API calls. For file operations, we do this in two ways.

The first is mongo::File. When open is called on a path, MultiByteToWideChar is called on the path, converting the UTF-8 encoded string to UTF-16.

The second is boost::filesystem::path. This class uses C++'s locale system. A std::locale is an object that describes the properties a localization might have. These properties are called facets. One such facet is codecvt, which handles conversion between different string encodings. boost::filesystem::path makes a copy of the global std::locale and overrides its codecvt with a custom converter object. This locale is then saved globally for use in path operations; the original std::locale is left as is. When a path is created or appended to, the codecvt is used, if necessary, to convert the provided string into the operating system's native character format.

Unfortunately, boost::filesystem's implementation of the codecvt, windows_file_codecvt, is incomplete. It treats the 8-bit input as being in either the ANSI code page or the OS's OEM code page rather than UTF-8, so the conversion of a UTF-8 string is invalid.

Because two mechanisms are used, it appears that we build an incorrect directory name using boost::filesystem::path, create that incorrect directory, and then attempt to create a file under the correct path. The directory named in the file path does not exist, so file creation fails.

FileAllocator's makeTempFileName and run functions will need to be modified. makeTempFileName produces a path as a string. Though it uses boost::filesystem::path internally, it translates the path back into 8-bit characters when it converts to std::string. run then calls c_str on that std::string without any width conversion.

A plausible solution might be to use Boost's locale library to generate a new std::locale object with a correct codecvt, as described in the Boost.Locale documentation here: http://www.boost.org/doc/libs/1_51_0/libs/locale/doc/html/default_encoding_under_windows.html
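To illustrate the mismatch described above, here is a minimal, portable sketch. It is not the server's code or Boost's. The function names utf8ToUtf16, singleByteToUtf16 and demo are hypothetical, and Latin-1 is used as a stand-in for a Windows ANSI/OEM code page (an assumption; real code pages map bytes 0x80-0xFF differently, but the effect is the same in kind). It shows that a single multibyte UTF-8 character becomes one UTF-16 unit when decoded as UTF-8, but several unrelated units when each byte is widened through a single-byte code page, which is how the two directory names can diverge.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode UTF-8 bytes into UTF-16 code units (what a correct conversion yields).
// Sketch only: assumes well-formed input and performs no validation.
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t len;
        if (b < 0x80)      { cp = b;        len = 1; }  // ASCII
        else if (b < 0xE0) { cp = b & 0x1F; len = 2; }  // 2-byte sequence
        else if (b < 0xF0) { cp = b & 0x0F; len = 3; }  // 3-byte sequence
        else               { cp = b & 0x07; len = 4; }  // 4-byte sequence
        // Fold in the continuation bytes, 6 payload bits each.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp >= 0x10000) {  // outside the BMP: emit a surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}

// Widen each byte as if it were one character in a single-byte code page.
// Assumption: Latin-1 stands in for the ANSI/OEM code page.
std::u16string singleByteToUtf16(const std::string& in) {
    std::u16string out;
    for (unsigned char b : in) out.push_back(static_cast<char16_t>(b));
    return out;
}

// Usage sketch: "\xC3\xA9" is the UTF-8 encoding of U+00E9.
void demo() {
    const std::string dbName = "\xC3\xA9";
    assert(utf8ToUtf16(dbName).size() == 1);        // one unit: U+00E9
    assert(singleByteToUtf16(dbName).size() == 2);  // two units: U+00C3, U+00A9
    assert(utf8ToUtf16("abc") == singleByteToUtf16("abc"));  // ASCII agrees
}
```

For pure ASCII names the two conversions agree, which would be consistent with the failure only showing up for database names containing multibyte characters.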
xgen-internal-githook commented on Fri, 16 Sep 2016 18:47:10 +0000:
Author: Mark Benvenuto (markbenvenuto) <mark.benvenuto@mongodb.com>
Message: SERVER-16725 Incorrect character conversion between UTF-8 and UTF-16
Branch: master
https://github.com/mongodb/mongo/commit/f0d958c747cfc42dd831eb2f088e963475c0ed54
On Windows, start a mongod with directoryperdb. Create a database with a single multibyte UTF-8 character as its name. Insert a document.