...
With a unique index on a field containing Korean content, the driver raises an error when I insert a duplicate document. See below:

cswcsy@niklane-Samsung-Ubuntu:~/crawlers/CrawlerPlatform/utils$ python mongo_test.py
cswcsy@niklane-Samsung-Ubuntu:~/crawlers/CrawlerPlatform/utils$ python mongo_test.py
Traceback (most recent call last):
  File "mongo_test.py", line 24, in <module>
    result = col.insert_one(script)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 625, in insert_one
    bypass_doc_val=bypass_document_validation),
  File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 530, in _insert
    check_keys, manipulate, write_concern, op_id, bypass_doc_val)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/collection.py", line 512, in _insert_one
    check_keys=check_keys)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/pool.py", line 218, in command
    self._raise_connection_failure(error)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/pool.py", line 346, in _raise_connection_failure
    raise error
bson.errors.InvalidBSON: 'utf8' codec can't decode byte 0xeb in position 230: invalid continuation byte
cswcsy@niklane-Samsung-Ubuntu:~/crawlers/CrawlerPlatform/utils$

As shown above, the error reproduces 100% of the time when I insert the same document twice into the unique Korean field. It does not reproduce with other Korean content. Here is my test code:

# -*- coding: utf-8 -*-
from pprint import pprint
from pymongo import ReplaceOne
from pymongo import InsertOne
import pymongo
from pymongo import MongoClient
from utils.mongomanager import MongoManager
from pymongo.errors import BulkWriteError

mongo = MongoClient('localhost', 27017)
db = mongo['bigdata']
col = db['test']

script = {'brand_name': u'\ub77c\uc628',
          'category0': u'\uc0dd\ud65c/\uac74\uac15',
          'category1': u'\uacf5\uad6c',
          'category2': u'\ubaa9\uacf5\uacf5\uad6c',
          'category3': u'\ub300\ud328',
          'entity': [],
          'price': 9300,
          'title': u'\uad6c \uad6d\uc0b0 \ub300\ud328 \uc190\ub300\ud328 \ubaa9\uacf5\uacf5\uad6c \ubbf8\ub2c8\ub300\ud328 \ubaa8\uc11c\ub9ac\ub300\ud328 \ub300\ud328\ub0a0 \ubaa9\uacf5\uad6c \uc804\ub3d9\ub300\ud328 \ubaa9\uc218\uacf5\uad6c \ubaa9\uacf5\uc608 \ud648\ub300\ud328 DIY\uacf5\uad6c \ud3c9\uba74 \ub2e4\ub4ec\uae30'}

result = col.insert_one(script)
pprint(result)

If you need more information or have a solution for this issue, please reply. Thanks a lot.
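For reference, when the server's duplicate key message decodes cleanly, the second insert is expected to raise DuplicateKeyError rather than InvalidBSON. A minimal sketch of that expected behaviour, assuming the unique index on 'title' described in the last comment below (db.test.createIndex({title: 1}, {unique: true})) already exists:

# Minimal sketch of the expected behaviour.
# Assumption: a unique index on 'title' already exists on bigdata.test.
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

col = MongoClient('localhost', 27017)['bigdata']['test']
doc = {'title': u'\ud3c9\uba74 \ub2e4\ub4ec\uae30', 'price': 9300}

col.insert_one(dict(doc))           # first insert succeeds
try:
    col.insert_one(dict(doc))       # second insert violates the unique index on 'title'
except DuplicateKeyError as exc:
    # Expected outcome; in this report the driver instead fails with
    # bson.errors.InvalidBSON while decoding the server's error message.
    print('duplicate key error: %s' % exc)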
cswcsy commented on Fri, 3 Jun 2016 01:37:11 +0000:
Thanks a lot!

behackett commented on Wed, 1 Jun 2016 20:49:26 +0000:
Hi cswcsy, we've committed a workaround to PyMongo master, slated for the next PyMongo release, 3.3. See PYTHON-1090 for details.

behackett commented on Tue, 3 May 2016 14:48:22 +0000:
Unfortunately, CodecOptions isn't applied to server write responses in the bulk write API. It also isn't applied to non-bulk write responses when using the legacy write operations (MongoDB 2.4). I've opened PYTHON-1090 to add a workaround. We'll likely just always use the replace error handler for server write responses in the future (but not for query responses).

cswcsy commented on Tue, 3 May 2016 05:29:15 +0000:
@Bernie Hackett When I use bulk_write in PyMongo, your workaround (setting the error handler to 'replace') doesn't work. It does work when I use insert_one. If you have a moment, could you check it?

# -*- coding: utf-8 -*-
from pprint import pprint
from pymongo import ReplaceOne
from pymongo import InsertOne
import pymongo
from pymongo import MongoClient
from utils.mongomanager import MongoManager
from pymongo.errors import BulkWriteError
from bson.codec_options import CodecOptions

mongo = MongoClient('localhost', 27017)
db = mongo['bigdata']
col = db['test'].with_options(
    codec_options=CodecOptions(unicode_decode_error_handler='replace'))

bulk = []
script = {'brand_name': u'\ub77c\uc628',
          'category0': u'\uc0dd\ud65c/\uac74\uac15',
          'category1': u'\uacf5\uad6c',
          'category2': u'\ubaa9\uacf5\uacf5\uad6c',
          'category3': u'\ub300\ud328',
          'entity': [],
          'price': 9300,
          'title': '구 국산 대패 손대패 목공공구 미니대패 모서리대패 대패날 목공구 전동대패 목수공구 목공예 홈대패 DIY공구 평면 다듬기'}
bulk.append(InsertOne(script))

result = col.bulk_write(bulk, ordered=False, bypass_document_validation=True)
pprint(result)

cswcsy commented on Tue, 3 May 2016 02:45:56 +0000:
Thanks for the workaround!

behackett commented on Mon, 2 May 2016 18:21:22 +0000:
You can work around this in PyMongo by changing the error handler for unicode decode errors:

>>> from bson.codec_options import CodecOptions
>>> coll = c.test.test.with_options(codec_options=CodecOptions(unicode_decode_error_handler='replace'))
>>> coll.insert_one(doc)
>>> del doc['_id']
>>> try:
...     coll.insert_one(doc)
... except Exception as exc:
...     exc.details
...
{u'index': 0, u'code': 11000, u'errmsg': u'E11000 duplicate key error collection: test.test index: title_1 dup key: { : "\uad6c \uad6d\uc0b0 \ub300\ud328 \uc190\ub300\ud328 \ubaa9\uacf5\uacf5\uad6c \ubbf8\ub2c8\ub300\ud328 \ubaa8\uc11c\ub9ac\ub300\ud328 \ub300\ud328\ub0a0 \ubaa9\uacf5\uad6c \uc804\ub3d9\ub300\ud328 \ubaa9\uc218\uacf5\uad6c \ubaa9\uacf5\uc608 \ud648\ub300\ud328 DIY\uacf5\uad6c \ud3c9\ufffd..." }'}

Note that '\ud3c9\uba74' is being replaced with '\ud3c9\ufffd...' ('\ufffd' being the unicode replacement character). If we instead use the 'ignore' handler we get '\ud3c9...'. I'm guessing this is just an unfortunate choice by the server of the byte count at which to truncate the key value. The server appears to be truncating in the middle of a code point. This can probably be fixed by counting characters, rather than bytes, when deciding where to truncate the string.

behackett commented on Mon, 2 May 2016 18:06:24 +0000:
I've tested this back to MongoDB 2.4. It appears this issue has always existed. My guess is the server is creating mojibake for the duplicate key error message.
PyMongo can query for and display the document without issue:

>>> c.test.test.find_one()
{u'category1': u'\uacf5\uad6c', u'category0': u'\uc0dd\ud65c/\uac74\uac15', u'category3': u'\ub300\ud328', u'category2': u'\ubaa9\uacf5\uacf5\uad6c', u'title': u'\uad6c \uad6d\uc0b0 \ub300\ud328 \uc190\ub300\ud328 \ubaa9\uacf5\uacf5\uad6c \ubbf8\ub2c8\ub300\ud328 \ubaa8\uc11c\ub9ac\ub300\ud328 \ub300\ud328\ub0a0 \ubaa9\uacf5\uad6c \uc804\ub3d9\ub300\ud328 \ubaa9\uc218\uacf5\uad6c \ubaa9\uacf5\uc608 \ud648\ub300\ud328 DIY\uacf5\uad6c \ud3c9\uba74 \ub2e4\ub4ec\uae30', u'price': 9300, u'brand_name': u'\ub77c\uc628', u'entity': [], u'_id': ObjectId('57279630fba52269ef009e0d')}

cswcsy commented on Mon, 2 May 2016 15:12:06 +0000:
Thanks for the reply! I know a Korean title is not a good choice for a unique index (for performance, etc.), so I need to find an alternative approach. Either way, I hope this issue gets analyzed too. Thanks again.

behackett commented on Mon, 2 May 2016 14:46:50 +0000:
This is very strange. The problem is that MongoDB is returning a duplicate key error because a document matching the unique index already exists. The message that MongoDB returns includes the value that caused the error, but the server seems to have encoded it incorrectly, so Python can't decode it as UTF-8. In the server logs we have:

2016-05-02T07:41:59.453-0700 D WRITE [conn3] Caught Assertion in query, continuing :: caused by :: E11000 duplicate key error collection: test.test index: title_1 dup key: { : "구 국산 대패 손대패 목공공구 미니대패 모서리대패 대패날 목공구 전동대패 목수공구 목공예 홈대패 DIY공구 평�..." }
2016-05-02T07:41:59.453-0700 I COMMAND [conn3] command test.test command: insert { insert: "test", ordered: true, documents: [ { category1: "공구", category0: "생활/건강", category3: "대패", category2: "목공공구", title: "구 국산 대패 손대패 목공공구 미니대패 모서리대패 대패날 목공구 전동대패 목수공구 목공예 홈대패 DIY공구 평�...", price: 9300, _id: ObjectId('57276737fa5bd81c3ca2f5d8'), brand_name: "라온", entity: [] } ] } ninserted:0 keyUpdates:0 writeConflicts:0 exception: E11000 duplicate key error collection: test.test index: title_1 dup key: { : "구 국산 대패 손대패 목공공구 미니대패 모서리대패 대패날 목공구 전동대패 목수공구 목공예 홈대패 DIY공구 평�..." } code:11000 numYields:0 reslen:334 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } } } protocol:op_query 0ms

This seems like it must be a bug in the server, but I'll have to do some research. Thanks for reporting this!

cswcsy commented on Mon, 2 May 2016 14:23:11 +0000:
I used the following to create the unique index:

db.test.createIndex({title: 1}, {unique: true})
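behackett's diagnosis above is that the server truncates the offending key value by byte count and cuts a multi-byte UTF-8 character in half before appending "...". A small illustration (not the server's code; the string and cut position are chosen only for the example) of how that produces exactly the "invalid continuation byte" failure from the traceback:

# -*- coding: utf-8 -*-
# Illustration only: truncating a UTF-8 byte string in the middle of a
# multi-byte character and then appending '...' yields bytes that cannot
# be decoded, matching the InvalidBSON error in the report.
title = u'\ud3c9\uba74'                  # u'평면'; each syllable is 3 UTF-8 bytes
encoded = title.encode('utf-8')          # '\xed\x8f\x89\xeb\xa9\xb4'
truncated = encoded[:4] + b'...'         # cut inside the second character
try:
    truncated.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc)  # ... can't decode byte 0xeb in position 3: invalid continuation byte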