...
My cluster is on 4.4.8 with three shards; it was upgraded from 4.4.4 to 4.4.8 and the cluster was healthy afterwards. About an hour later, a secondary crashed, reporting a duplicate _id key:

{"t":{"$date":"2021-08-17T18:54:43.182+08:00"},"s":"F","c":"REPL","id":21238,"ctx":"ReplWriterWorker-0","msg":"Writer worker caught exception","attr":{"error":"DuplicateKey{ keyPattern: { _id: 1 }, keyValue: { _id: \"0be47ed91b91474f8a96429fa7d3cfec\" } }: E11000 duplicate key error collection: customerService.TrackerDetail index: _id_ dup key: { _id: \"0be47ed91b91474f8a96429fa7d3cfec\" } "
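A minimal sketch of how one might check whether the conflicting _id is actually present on the crashed node, by starting it standalone (omitting --replSet so no replication runs); the dbpath and port are placeholders, not values from this ticket:

mongod --dbpath /data/shard2 --port 27018   # no --replSet: runs standalone
mongo --port 27018
> use customerService
> db.TrackerDetail.find({ _id: "0be47ed91b91474f8a96429fa7d3cfec" })   # is the document there?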
601290552@qq.com commented on Sat, 25 Sep 2021 12:50:48 +0000:
Hi Eric: I read those two issues about duplicate keys. They describe a db crash or an unclean shutdown, but I shut the db down normally, not with an abort. For the secondary, they say "The bug can cause a Duplicate Key error on startup and prevent the node from starting," yet my 4.4.8 secondary starts up normally and runs for a few hours before it hits the Duplicate Key error — not at startup.

eric.sedor commented on Thu, 23 Sep 2021 16:18:58 +0000:
Hi 601290552@qq.com, I'm very sorry for the delay. I believe you should upgrade to 4.4.9 (just released), as it is very likely from the occurrence of duplicate key errors that you've been impacted by either WT-7995 or WT-7984, which are fixed in this version. Eric

601290552@qq.com commented on Fri, 20 Aug 2021 08:45:14 +0000:
Hi Eric: Have you made any progress on this? Thanks.

601290552@qq.com commented on Wed, 18 Aug 2021 03:36:47 +0000:
Hi Eric: My cluster was upgraded from 4.4.4 to 4.4.8. The cluster has three shards, and every shard has three members (PSS). I upgraded in the order config servers, then shards, then mongos. I will upload the shard2 data.

Upgrade steps for shard2: first I upgraded all the secondary members — I shut each instance down normally, then started it with the 4.4.8 mongod binary. To upgrade the primary I ran rs.stepDown(600), because it has the highest priority; after the primary became a secondary, I shut it down normally and started it with the 4.4.8 binary (a sketch of these commands follows the timeline below). After the upgrade completed, the system ran for more than 8 hours.

As to "can you elaborate on the timeline of what has occurred here?" — yes. The affected node is the shard2 secondary (srvdb303.yto.cloud):

1. Shutdown using 4.4.4, at 2021-08-17T16:12:36.840+08:00:
{"t":{"$date":"2021-08-17T16:12:36.840+08:00"},"s":"I","c":"CONTROL","id":23138,"ctx":"conn362334","msg":"Shutting down","attr":{"exitCode":0}}

2. Startup using 4.4.8, at 2021-08-17T16:12:44.132+08:00:
{"t":{"$date":"2021-08-17T16:12:44.132+08:00"},"s":"I","c":"CONTROL","id":20698,"ctx":"main","msg":"***** SERVER RESTARTED *****"}

3. Crash using 4.4.8, at 2021-08-18T10:07:19.822+08:00. [From the following log, the duplicate _id came from an op:u — an update led to this duplicate.]
{"t":{"$date":"2021-08-18T10:07:19.821+08:00"},"s":"F","c":"REPL","id":21238,"ctx":"ReplWriterWorker-1777","msg":"Writer worker caught exception","attr":{"error":"DuplicateKey{ keyPattern: { _id: 1 }, keyValue: { _id: \"a148c2bd1c274559b674ef3eddb46d01\" } }: E11000 duplicate key error collection: customerService.TrackerDetail index: _id_ dup key: { _id: \"a148c2bd1c274559b674ef3eddb46d01\" } ","oplogEntry":{"op":"u","ns":"customerService.TrackerDetail","ui":{"$uuid":"6345c1d9-8e32-4d82-a2bb-b6c6313f17c1"},"o":{"$v":1,"$set":{"opTime":{"$date":"2021-08-18T02:07:19.000Z"},"opUt":{"$date":"2021-08-18T02:07:19.824Z"}}},"o2":{"no":"xxxx","_id":"a148c2bd1c274559b674ef3eddb46d01"},"ts":{"$timestamp":{"t":1629252439,"i":252}},"t":8,"v":2,"wall":{"$date":"2021-08-18T02:07:19.819Z"}}}}
{"t":{"$date":"2021-08-18T10:07:19.822+08:00"},"s":"F","c":"-","id":23095,"ctx":"OplogApplier-0","msg":"Fatal assertion","attr":{"msgid":34437,"error":"DuplicateKey{ keyPattern: { _id: 1 }, keyValue: { _id: \"a148c2bd1c274559b674ef3eddb46d01\" } }: E11000 duplicate key error collection: customerService.TrackerDetail index: _id_ dup key: { _id: \"a148c2bd1c274559b674ef3eddb46d01\" } ","file":"src/mongo/db/repl/oplog_applier_impl.cpp","line":510}}
{"t":{"$date":"2021-08-18T10:07:19.822+08:00"},"s":"F","c":"-","id":23096,"ctx":"OplogApplier-0","msg":"\n\n***aborting after fassert() failure\n\n"}
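A sketch of the rolling-upgrade commands described above — the exact commands were not given in the ticket, and the port and config path here are placeholders:

# on each secondary in turn: clean shutdown, then restart with the 4.4.8 binary
mongo --port 27018 --eval 'db.adminCommand({ shutdown: 1 })'
mongod -f /etc/mongod.conf                    # config path pointing at the 4.4.8 install
# on the primary: step it down first, since it holds the highest priority
mongo --port 27018 --eval 'rs.stepDown(600)'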
To the three things requested:

1. "The logs for the affected nodes, including before, leading up to, and after the first sign of corruption."
2. "A clear timeline of activities (with timestamps and timezones) including server restarts, replica set changes, version changes, etc."

For both, see the timeline above (shutdown on 4.4.4, startup on 4.4.8, then the crash on 4.4.8) and the uploaded log file.

3. "The output of validate() on each affected node."

I need to start the node standalone so that I can execute validate:

db.runCommand( { validate: "TrackerDetail", full: true } )
"advice" : "A corrupt namespace has been detected. See http://dochub.mongodb.org/core/data-recovery for recovery steps.",

Why is it corrupted when I shut it down normally?
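With the node started standalone (as sketched earlier in this thread), the full validation above can be run and inspected like this; the field names come from the quoted output, the variable is illustrative:

> use customerService
> var res = db.runCommand({ validate: "TrackerDetail", full: true })  // full: true also checks indexes
> res.valid     // false for a corrupt namespace
> res.advice    // the recovery guidance quoted above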
I have uploaded three files to the support uploader location: diagnostic.data.tar, shard2.log, and validate.txt.

601290552@qq.com commented on Wed, 18 Aug 2021 02:25:06 +0000:
Hi Eric: Gradually, all the secondary instances started reporting duplicate key errors and going down, so I doubt this is logical corruption — it hits every secondary but never the primary. These instances are small; otherwise no resync would ever complete, and only the primary instance would be left alive, eventually becoming read-only.

601290552@qq.com commented on Wed, 18 Aug 2021 01:46:56 +0000:
Hi Eric: I will sort through the process for you. It needs some time.

eric.sedor commented on Tue, 17 Aug 2021 22:28:33 +0000:
Hi 601290552@qq.com, This error message leads us to suspect logical corruption. Please make a complete copy of each affected node's $dbpath directory to safeguard, so that you can work off of the current $dbpath. Our ability to determine the source of this corruption depends greatly on your ability to provide:

- The logs for the affected nodes, including before, leading up to, and after the first sign of corruption.
- A clear timeline of activities (with timestamps and timezones) including server restarts, replica set changes, version changes, etc.
- The output of validate() on each affected node.

Would you please archive (tar or zip) the validate output, mongod.log files, and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location? Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Then, can you elaborate on the timeline of what has occurred here?

The ideal resolution will likely be to perform a clean resync from an unaffected node.

Thank you, Eric
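A minimal sketch of the clean resync Eric recommends — stop the member, set the old files aside as the safeguard copy, and restart with an empty dbPath so the node performs an initial sync from the replica set. The service name and paths are placeholders:

systemctl stop mongod                 # clean shutdown of the affected secondary
mv /data/shard2 /data/shard2.bak      # safeguard a complete copy of the old $dbpath
mkdir /data/shard2
systemctl start mongod                # an empty dbPath triggers an initial sync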
601290552@qq.com commented on Tue, 17 Aug 2021 19:51:37 +0000:
I checked from the primary: this record was created at "createTime" : ISODate("2021-08-14T03:03:00.557Z"), and its "currentTime" : ISODate("2021-08-17T17:40:25Z") matches the secondary's crash time, so it was probably updated on the primary. But I cannot find this record on the secondary — so why does it report a duplicate key?

"createTime" : ISODate("2021-08-14T03:03:00.557Z"),
"rBranch" : "769902",
"rRegion" : "755920",
"_class" : "cn.yto.wdgj.rw.pojo.mongodo.WantedMonitorNewDO",
"contractOrg" : null,
"currentOrg" : "769902",
"currentTime" : ISODate("2021-08-17T17:40:25Z"),

601290552@qq.com commented on Tue, 17 Aug 2021 19:43:22 +0000:
Another secondary, this time in shard3, has crashed:

{"t":{"$date":"2021-08-18T03:40:38.987+08:00"},"s":"F","c":"REPL","id":21235,"ctx":"initandlisten","msg":"Failed to apply batch of operations","attr":{"numOperationsInBatch":1013,"firstOperation":{"lsid":{"id":{"$uuid":"e0556e1e-9897-41c1-a4dc-6afa6d9e50db"},"uid":{"$binary":{"base64":"kafFGlFnFmnb4qY6Zo1wZcN0z6AuZ9x3brpDqqhzH/U=","subType":"0"}}},"txnNumber":2339,"op":"u","ns":"returnMonitor.exp_want_new_monitor","ui":{"$uuid":"0814628c-47c0-4210-a9d4-1c7c0a4161ee"},"o":{"$v":1,"$set":{"backFlag":1,"currentOp":751,"currentTime":{"$date":"2021-08-17T17:43:19.000Z"},"dealStatus":19,"dealTime":{"$date":"2021-08-17T17:43:19.207Z"},"opList.7":162922219900751,"orgBranch":"579038"}},"o2":{"_id":{"$oid":"6118aaf405f34e22a4f681a4"}},"ts":{"$timestamp":{"t":1629222199,"i":6}},"t":6,"v":2,"wall":{"$date":"2021-08-17T17:43:19.212Z"},"stmtId":0,"prevOpTime":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1}},"lastOperation":{"lsid":{"id":{"$uuid":"783598dd-63df-4489-9505-1ab1923be921"},"uid":{"$binary":{"base64":"kafFGlFnFmnb4qY6Zo1wZcN0z6AuZ9x3brpDqqhzH/U=","subType":"0"}}},"txnNumber":3065,"op":"u","ns":"returnMonitor.exp_want_new_monitor","ui":{"$uuid":"0814628c-47c0-4210-a9d4-1c7c0a4161ee"},"o":{"$v":1,"$set":{"currentOp":171,"currentOrg":"270902","currentTime":{"$date":"2021-08-17T17:44:43.000Z"},"nextOrg":"712001","opList.5":162922228300171,"warnLevel":20}},"o2":{"_id":{"$oid":"6119bda405f34e22a4f78351"}},"ts":{"$timestamp":{"t":1629222294,"i":2}},"t":6,"v":2,"wall":{"$date":"2021-08-17T17:44:54.404Z"},"stmtId":0,"prevOpTime":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1}},"failedWriterThread":12,"error":"DuplicateKey{ keyPattern: { _id: 1 }, keyValue: { _id: ObjectId('611732d899faf22fbb509820') } }: E11000 duplicate key error collection: returnMonitor.exp_want_new_monitor index: _id_ dup key: { _id: ObjectId('611732d899faf22fbb509820') } "}}

When I start it standalone and run a find, the database does not contain this record:

> db.exp_want_new_monitor.find({ _id: ObjectId('611732d899faf22fbb509820') })
> db.exp_want_new_monitor.count();
17806952

601290552@qq.com commented on Tue, 17 Aug 2021 11:16:38 +0000:
When I re-initialize the secondary, the log shows:

{"t":{"$date":"2021-08-17T19:14:52.076+08:00"},"s":"I","c":"NETWORK","id":22990,"ctx":"conn245","msg":"DBException handling request, closing client connection","attr":{"error":"NotWritablePrimary: Not-primary error while processing 'find' operation on 'returnMonitor' database via fire-and-forget command execution."}}
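One way to see what the primary did to the conflicting _id from the shard3 crash above would be to search its oplog. This query was not run in the ticket — it is an assumption, reusing the namespace and ObjectId from that log, with a placeholder port:

mongo --port 27017        # connect to the shard3 primary
> use local
> db.oplog.rs.find({
    ns: "returnMonitor.exp_want_new_monitor",
    $or: [ { "o._id":  ObjectId("611732d899faf22fbb509820") },    // inserts/deletes
           { "o2._id": ObjectId("611732d899faf22fbb509820") } ]   // updates
  }).sort({ $natural: -1 })   // newest entries first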