...
Dear Sir/Madam, I hope you can help with this problem. I have a problem with Unicode for the Korean language: I am trying to import a Korean Wikipedia corpus into MongoDB on Linux, but when I search for a word in MongoDB through my Java application, no matching word is found. What should I do? I tried converting the corpus to UTF-8 both in my query and in mongo, but the results were the same.
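One detail worth noting about the UTF-8 conversion described above: a Java String is already Unicode (UTF-16 code units internally), so encoding it to UTF-8 bytes and decoding those bytes back produces an equal String. Below is a minimal sketch illustrating this, using the Korean term quoted later in this thread; the class name is illustrative and not part of the original report.

    import java.nio.charset.StandardCharsets;

    public class Utf8RoundTrip {
        public static void main(String[] args) {
            String term = "\uC815\uC2E0\uBCD1\uC6D0"; // 정신병원, the term used in the driver example below
            // Encoding to UTF-8 and decoding back is a no-op for a well-formed String,
            // so this conversion alone cannot change whether a query matches.
            byte[] utf8Bytes = term.getBytes(StandardCharsets.UTF_8);
            String roundTripped = new String(utf8Bytes, StandardCharsets.UTF_8);
            System.out.println(term.equals(roundTripped)); // true
        }
    }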
thomas.schubert commented on Thu, 19 Nov 2015 20:01:53 +0000:
Hi 30yamak,
Sorry for the long delay getting back to you. I have imported your data and successfully queried the term field. Please see the examples below:

    db.terms.findOne({term : {$regex : "지수적"}})
    { "_id" : ObjectId("55e8a06cf02a8168ed428d2f"), "term" : "{\"term\":\"지수적\",\"vector\":{\"197761\":0.036434002220630646,\"296370\":0.04846245050430298,\"237083\":0.010118533857166767,\"57801\":0.1235201507806778,\"300077\":0.055651936680078506,\"62474\":0.007019163109362125,\"300030\":0.2067071944475174,\"165881\":0.011536050587892532,\"31741\":0.002140911528840661,\"238158\":0.05690254271030426,\"244254\":0.18086878955364227}}", "vector" : BinData(0,"AAAAAA==") }

    db.terms.findOne({term : {$regex : "\u110c\u1175\u1109\u116e\u110c\u1165\u11A8"}})
    { "_id" : ObjectId("55e8a06cf02a8168ed428d2f"), "term" : "{\"term\":\"지수적\",\"vector\":{\"197761\":0.036434002220630646,\"296370\":0.04846245050430298,\"237083\":0.010118533857166767,\"57801\":0.1235201507806778,\"300077\":0.055651936680078506,\"62474\":0.007019163109362125,\"300030\":0.2067071944475174,\"165881\":0.011536050587892532,\"31741\":0.002140911528840661,\"238158\":0.05690254271030426,\"244254\":0.18086878955364227}}", "vector" : BinData(0,"AAAAAA==") }

It's worth noting that some fonts render two different Unicode sequences as the same glyph. Depending on your font these two symbols may appear identical: 지 지. However, one of them is encoded as two code points, whereas the other is a single precomposed code point. The code points stored in the document must match those in the query.
I am closing this ticket since we can't reproduce this issue. If you can share a runnable reproduction script, preferably in JavaScript, we'll be happy to take another look.
Thank you,
Thomas

30yamak commented on Fri, 25 Sep 2015 10:29:23 +0000:
Is there any news?

30yamak commented on Thu, 17 Sep 2015 11:32:48 +0000:
Dear Sam,
I uploaded the mongodump to my Dropbox, because its size was more than the attachment allowance.
https://dl.dropboxusercontent.com/u/6149013/terms.bson.gz
https://dl.dropboxusercontent.com/u/6149013/terms.metadata.json.gz
https://dl.dropboxusercontent.com/u/6149013/system.indexes.bson.gz
With Best Wishes,
Siamak

samk commented on Wed, 16 Sep 2015 16:44:11 +0000:
Can you provide some of your data in the form of a mongodump .bson file? This will allow me to try my reproduction with your data.
Regards,
sam

30yamak commented on Mon, 14 Sep 2015 11:42:59 +0000:
Dear Sam,
Thank you for your help. About your questions:
1. Yes, I tried it, but the result was the same: nothing.
2. I tried two versions of the driver, 2.12.4 and 3.0.3, and for both versions the result was the same.
3. I tried it; the result was the same.
4. No, unfortunately. (I inserted with 2.12.4 and tried to retrieve with both 2.12.4 and 3.0.3.)
5. I tried several strings; the results were the same.
I will attach my data from mongo.
With Best Wishes,
Siamak

samk commented on Fri, 11 Sep 2015 15:03:31 +0000:
Sorry for not getting back to you sooner. I've been trying to reproduce this issue with the mongo shell, without luck. You can see my attempt to translate your example here:

    (function() {
        "use strict";
        var coll = db.getCollection('testColl');
        coll.drop();
        assert.eq(0, coll.count());

        var ustr = "\uC815\uC2E0\uBCD1\uC6D0";
        coll.insert({"_id": "one", "data": ustr});
        assert.eq(1, coll.count());
        assert.eq(1, coll.count({"data": ustr}));
    }());

I have some more questions about your issue:
Are you able to reproduce your problem in the mongo shell?
Which driver are you using where you see this issue?
Which version of that driver are you using?
Are you able to reproduce this issue with another driver?
If you insert a document with your driver, can you successfully retrieve it with a different client?
Do all of the Korean strings exhibit this error, or is it only some of them?
Are you able to use strings pulled from other sources?
Thanks again for your help.
Regards,
sam

30yamak commented on Wed, 2 Sep 2015 18:50:46 +0000:
I also tried to find a specific string using \uXXXX escapes, but the return value was still null:

    String original = "\uC815\uC2E0\uBCD1\uC6D0";
    searchQuery.put("term", original);
    DBCursor cursor1 = collection.find(searchQuery);
    while (cursor1.hasNext()) {
        System.out.println(cursor1.next());
    }

30yamak commented on Wed, 2 Sep 2015 15:03:08 +0000:
I have a problem with Unicode for the Korean language in MongoDB. I am trying to import a Korean Wikipedia corpus into MongoDB on Linux, but when I search for a word in MongoDB through my Java application, no matching word is found. What should I do? I tried converting the corpus to UTF-8 in my query and in mongo, but the results were the same.
This is the code where I convert the string to UTF-8 when I insert my data into MongoDB and when I query it:

    // For inserting into MongoDB:
    byte[] utf8Bytes = term.get("term").toString().getBytes("UTF-8");
    dbDao.insertVector(new String(utf8Bytes, "UTF-8"), parseVector((Map) term.get("vector")));

    // For finding in MongoDB:
    byte[] utf8Bytes = term.get("term").toString().getBytes("UTF-8");
    DBObject dbo = termsCollection.findOne(new BasicDBObject(TERM, new String(utf8Bytes, "UTF-8")));
    // The return value is always NULL!!!
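Thomas's point above, that the code points stored in the document must match those in the query, is the key detail: the same Hangul syllable can be stored either as one precomposed code point or as a sequence of conjoining jamo, and an exact-match query on one form will not find the other. Below is a minimal, self-contained Java sketch of that mismatch; the class name is illustrative, and whether a normalization mismatch is actually the cause in this corpus is an assumption, not something the thread confirms.

    import java.text.Normalizer;

    public class HangulNormalizationCheck {
        public static void main(String[] args) {
            // The same visible syllable 지, stored two different ways:
            String precomposed = "\uC9C0";         // one precomposed code point
            String decomposed  = "\u110C\u1175";   // choseong + jungseong jamo

            // Exact string comparison (what a plain MongoDB equality query does) fails:
            System.out.println(precomposed.equals(decomposed)); // false

            // Normalizing both sides to the same form (NFC here) makes them equal:
            String a = Normalizer.normalize(precomposed, Normalizer.Form.NFC);
            String b = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
            System.out.println(a.equals(b)); // true
        }
    }

If the corpus mixes the two forms, one way to rule this out would be to apply Normalizer.normalize(..., Normalizer.Form.NFC) to the term both before the insertVector call and before building the BasicDBObject query shown above, so the stored and queried code points stay identical.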