...
We deployed a replica set across multiple locations (data centers), and we use MongoDB 3.2. There is a problem on some machines: the config servers go down a few minutes after starting. We created a 12-member replica set for the config servers, in which 7 members vote for the primary and the rest have votes=0 and priority=0. How can I fix this issue?

configsvr logs:

2015-12-26T23:46:39.737+0330 I REPL     [ReplicationExecutor] Error in heartbeat request to shard2:47041; HostUnreachable Connection refused
2015-12-26T23:46:40.425+0330 F REPL     [rsBackgroundSync] can't rollback this command yet: { applyOps: [ { op: "u", b: true, ns: "config.chunks", o: { _id: "INN.destinations-destination_code_MinKey", lastmod: Timestamp 1000|1, lastmodEpoch: ObjectId('567eb3043d88f022d5c0cf86'), ns: "INN.destinations", min: { destination_code: MinKey }, max: { destination_code: 0.0 }, shard: "shard1_1" }, o2: { _id: "INN.destinations-destination_code_MinKey" } }, { op: "u", b: true, ns: "config.chunks", o: { _id: "INN.destinations-destination_code_0.0", lastmod: Timestamp 1000|2, lastmodEpoch: ObjectId('567eb3043d88f022d5c0cf86'), ns: "INN.destinations", min: { destination_code: 0.0 }, max: { destination_code: MaxKey }, shard: "shard1_1" }, o2: { _id: "INN.destinations-destination_code_0.0" } } ], maxTimeMS: 30000 }
2015-12-26T23:46:40.425+0330 I REPL     [rsBackgroundSync] cmdname=applyOps
2015-12-26T23:46:40.425+0330 E REPL     [rsBackgroundSync] replica set fatal exception
2015-12-26T23:46:40.425+0330 I REPL     [rsBackgroundSync] rollback finished
2015-12-26T23:46:40.425+0330 I -        [rsBackgroundSync] Fatal assertion 28723 UnrecoverableRollbackError need to rollback, but unable to determine common point between local and remote oplog: replica set fatal exception @ 18752
2015-12-26T23:46:40.425+0330 I -        [rsBackgroundSync] ***aborting after fassert() failure
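For reference, a minimal mongo shell sketch of the topology described above; the replica set name and hostnames are placeholders, not taken from the report:

// Hypothetical initiation of the reported 12-member config server
// replica set: members 0-6 can vote for the primary, members 7-11
// carry votes: 0 and priority: 0, as in the description.
var members = [];
for (var i = 0; i < 12; i++) {
    members.push({
        _id: i,
        host: "cfg" + (i + 1) + ".example.net:27019",  // placeholder hosts
        votes: i < 7 ? 1 : 0,
        priority: i < 7 ? 1 : 0
    });
}
rs.initiate({ _id: "configReplSet", configsvr: true, members: members });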
xgen-internal-githook commented on Fri, 29 Jan 2016 17:09:58 +0000:
Author: Andy Schwerin (andy10gen) <schwerin@mongodb.com>
Message: SERVER-22016 Support rolling back certain applyOps oplog entries in replication. Specifically, support rolling back insert, update and delete applyOps operations such as those generated by sharding's moveChunk, splitChunk and mergeChunks operations.
(cherry picked from commit fc81fdee1da1d949f80075c8a88998fa3b0c5e78)
Branch: v3.2
https://github.com/mongodb/mongo/commit/5ca8d6e28255e9519db61d5d25807d70a3000ba7

ali.hallaji commented on Tue, 19 Jan 2016 19:49:30 +0000:
Thank you all, friends, for your comments.

dan@10gen.com commented on Tue, 19 Jan 2016 19:27:08 +0000:
Yes, you can and should upgrade to 3.2.1. The problem fixed in this ticket is very specific and has a workaround/fix if you were to hit it again. The upgrade from 3.2.1 to 3.2.2 is an upgrade to the executable only and should not be a problem.

ali.hallaji commented on Tue, 19 Jan 2016 17:31:49 +0000:
Can I install MongoDB 3.2.1 in production and wait for the new version (3.2.2)? Can MongoDB 3.2.1 be upgraded to 3.2.2 without crashes, data loss, or other problems when the new version is released?

dan@10gen.com commented on Sun, 17 Jan 2016 16:06:14 +0000:
3.2.2 won't be ready until mid-next month.

ali.hallaji commented on Sun, 17 Jan 2016 06:00:03 +0000:
Hi Andy, when will version 3.2.2 be released? We'll wait for the new version, but how long until it comes out?

xgen-internal-githook commented on Mon, 4 Jan 2016 21:05:52 +0000:
Author: Andy Schwerin (andy10gen) <schwerin@mongodb.com>
Message: SERVER-22016 Support rolling back certain applyOps oplog entries in replication. Specifically, support rolling back insert, update and delete applyOps operations such as those generated by sharding's moveChunk, splitChunk and mergeChunks operations.
Branch: master
https://github.com/mongodb/mongo/commit/fc81fdee1da1d949f80075c8a88998fa3b0c5e78

schwerin commented on Wed, 30 Dec 2015 19:56:06 +0000:
This failure occurred because an election took place in the config server replica set and switched which node was primary while a chunk split was being performed. The final config server writes performed by chunk split, merge and move operations are bundled into a single operation from the perspective of the replication subsystem. That operation is called "applyOps", and is used only by sharding and certain backup/restore tools. When the election occurred, the applyOps operation in question had replicated to at most half of the voting nodes in the config server replica set, and so was not fully committed. The new primary had not yet received the operation, so the nodes that had received it were forced to roll it back. Unfortunately, the replication system in 3.2.0 and 3.2.1 does not know how to roll back these operations. We'll fix this for 3.2.2.

In the meantime, you could probably reduce your exposure to this type of error by having fewer voting members in your config server replica set. The number of voters you need should be a function of the number of single-node failures you want to tolerate while still allowing chunk migrations, collection creates and other metadata operations to proceed. We typically recommend 3 voting nodes, which allows the config servers to keep accepting writes when up to 1 node fails. Since you can keep accepting metadata reads as long as at least 1 config server is up (voting or non-voting), and metadata reads are all you need to accept document reads and writes, this is probably sufficient for most applications.

Also, when this error happens, you can always remove the data files for the config server that entered this fatal state and let it resynchronize with the other nodes. Since the config databases are typically small, this should not be a slow process.
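Following schwerin's recommendation, a minimal mongo shell sketch of reconfiguring a config server replica set down to 3 voting members; which hosts to keep as voters, and their names, are assumptions for illustration:

// Keep 3 voting members (e.g. two in site A, one in site B) and make
// every other member non-voting and non-electable. Run against the
// current primary of the config server replica set.
var cfg = rs.conf();
var voters = ["cfgA1.example.net:27019",   // placeholder hostnames
              "cfgA2.example.net:27019",
              "cfgB1.example.net:27019"];
cfg.members.forEach(function (m) {
    if (voters.indexOf(m.host) >= 0) {
        m.votes = 1;
        m.priority = 1;
    } else {
        m.votes = 0;
        m.priority = 0;  // a non-voting member must also have priority 0
    }
});
rs.reconfig(cfg);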
ali.hallaji commented on Mon, 28 Dec 2015 07:25:49 +0000:
Hi Dan, we deployed a new sharded cluster from scratch. We added 13 config server members to the replica set; I put 6 members in location A and the rest in site B. We have 7 voting members and the others have votes=0 and priority=0. Please guide me on a deployment strategy for the config server replica set across multiple data centers. Can I use two replica sets for the config servers?

dan@10gen.com commented on Mon, 28 Dec 2015 00:18:03 +0000:
Hi Ali, what is the history of your cluster? Are you upgrading from v3.0 or creating a new sharded cluster from scratch here?