...
Issue Status as of Aug 6, 2014

ISSUE SUMMARY
When cleaning up after a chunk migration, if the shard is a replica set, the primary node may wait for the cleanup to be replicated to secondaries. If during this time the primary becomes non-primary, the node throws an exception and aborts. The only configurations affected by this issue are sharded clusters where the shards are replica sets, the balancer is enabled, and chunk migrations have occurred.

USER IMPACT
Under the conditions described above, the cleanup operation fails with an exception and the node aborts.

WORKAROUNDS
A possible workaround is to enable synchronous migration cleanups:

use config
db.settings.update({ _id: 'balancer' }, { $set: { _waitForDelete: true }});

Note however that this makes migration cleanups slower, so this workaround has a performance impact. (A shell sketch of this workaround appears after the original description below.)

AFFECTED VERSIONS
MongoDB 2.6 production releases up to 2.6.3 are affected by this issue.

FIX VERSION
The fix is included in the 2.6.4 production release.

RESOLUTION DETAILS
Catch the exception and log a message to the log file. The affected node is no longer primary, so the operation should not be restarted.

Original description

One of our mongods crashed. The last log lines are:

2014-06-16T01:06:07.662+0200 [conn170044] command admin.$cmd command: serverStatus { serverStatus: 1 } keyUpdates:0 numYields:0 locks(micros) r:39 reslen:3982 3926153ms
2014-06-16T01:06:07.662+0200 [conn170732] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after cursors: 0, after extra_info: 0, after globalLock: 0, after indexCounters: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after recordStats: 838061, after repl: 838351, at end: 838351 }
2014-06-16T01:06:07.662+0200 [conn170732] command admin.$cmd command: serverStatus { serverStatus: 1 } keyUpdates:0 numYields:0 locks(micros) r:45 reslen:3982 857713ms
2014-06-16T01:06:07.664+0200 [conn170015] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after cursors: 0, after extra_info: 0, after globalLock: 0, after indexCounters: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after recordStats: 4003515, after repl: 4003805, at end: 4003805 }
2014-06-16T01:06:07.664+0200 [conn170015] command admin.$cmd command: serverStatus { serverStatus: 1 } keyUpdates:0 numYields:0 locks(micros) r:62 reslen:3982 4063527ms
2014-06-16T01:06:07.679+0200 [conn170894] end connection 172.16.64.36:57234 (50 connections now open)
2014-06-16T01:06:07.680+0200 [conn170839] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:53226]
2014-06-16T01:06:07.680+0200 [conn170008] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:34421]
2014-06-16T01:06:07.680+0200 [conn170072] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:40371]
2014-06-16T01:06:07.680+0200 [conn170120] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:44809]
2014-06-16T01:06:07.680+0200 [conn169952] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:45250]
2014-06-16T01:06:07.680+0200 [conn169973] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:45466]
2014-06-16T01:06:07.680+0200 [conn170677] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:37881]
2014-06-16T01:06:07.680+0200 [conn170193] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:51427]
2014-06-16T01:06:07.680+0200 [conn169990] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:33256]
2014-06-16T01:06:07.680+0200 [conn170545] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:50537]
2014-06-16T01:06:07.680+0200 [conn170836] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:52316]
2014-06-16T01:06:07.680+0200 [conn170578] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:50862]
2014-06-16T01:06:07.680+0200 [conn169975] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:59916]
2014-06-16T01:06:07.680+0200 [conn170811] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:49906]
2014-06-16T01:06:07.680+0200 [conn170094] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:42575]
2014-06-16T01:06:07.708+0200 [RangeDeleter] 0x11c0e91 0x1163109 0x11477e6 0xebb7ab 0xeb5cfd 0xde2830 0xddd2dd 0x1205829 0x7f9f89db38ca 0x7f9f8916692d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91]
 /usr/bin/mongod(_ZN5mongo10logContextEPKc+0x159) [0x1163109]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0xe6) [0x11477e6]
 /usr/bin/mongod(_ZN5mongo13SlaveTracking18opReplicatedEnoughENS_6OpTimeERKSs+0x8b) [0xebb7ab]
 /usr/bin/mongod(_ZN5mongo18opReplicatedEnoughENS_6OpTimeENS_11BSONElementE+0xdd) [0xeb5cfd]
 /usr/bin/mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeERKNS_10StringDataERKNS_7BSONObjES6_S6_bPSs+0xbb0) [0xde2830]
 /usr/bin/mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x86d) [0xddd2dd]
 /usr/bin/mongod() [0x1205829]
 /lib/libpthread.so.0(+0x68ca) [0x7f9f89db38ca]
 /lib/libc.so.6(clone+0x6d) [0x7f9f8916692d]
2014-06-16T01:06:07.731+0200 [RangeDeleter] SEVERE: terminate() called, printing stack (if implemented for platform): 0x11c0e91 0x11c0118 0x7f9f8995c166 0x7f9f8995c193 0x7f9f8995c28e 0x114788e 0xebb7ab 0xeb5cfd 0xde2830 0xddd2dd 0x1205829 0x7f9f89db38ca 0x7f9f8916692d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91]
 /usr/bin/mongod() [0x11c0118]
 /usr/lib/libstdc++.so.6(+0xcb166) [0x7f9f8995c166]
 /usr/lib/libstdc++.so.6(+0xcb193) [0x7f9f8995c193]
 /usr/lib/libstdc++.so.6(+0xcb28e) [0x7f9f8995c28e]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x18e) [0x114788e]
 /usr/bin/mongod(_ZN5mongo13SlaveTracking18opReplicatedEnoughENS_6OpTimeERKSs+0x8b) [0xebb7ab]
 /usr/bin/mongod(_ZN5mongo18opReplicatedEnoughENS_6OpTimeENS_11BSONElementE+0xdd) [0xeb5cfd]
 /usr/bin/mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeERKNS_10StringDataERKNS_7BSONObjES6_S6_bPSs+0xbb0) [0xde2830]
 /usr/bin/mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x86d) [0xddd2dd]
 /usr/bin/mongod() [0x1205829]
 /lib/libpthread.so.0(+0x68ca) [0x7f9f89db38ca]
 /lib/libc.so.6(clone+0x6d) [0x7f9f8916692d]

The crashed mongod is a member of a replSet consisting of 3 members. The replSet is one of 5 shards.
The cluster is running with authentication.
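
For reference, below is a minimal mongo shell sketch of the _waitForDelete workaround from the summary above, run against a mongos of the cluster. Only the _waitForDelete field and the config.settings update come from this report; the inspection step, the upsert option, and the revert step are illustrative assumptions, not part of the original instructions.

// Switch to the config database on a mongos.
use config

// Inspect the current balancer settings document (it may not exist yet).
db.settings.find({ _id: 'balancer' })

// Enable synchronous migration cleanup: migrations will wait for the
// range deletion instead of handing it to the background RangeDeleter.
// upsert: true (an assumption) creates the document if it is missing.
db.settings.update(
    { _id: 'balancer' },
    { $set: { _waitForDelete: true } },
    { upsert: true }
)

// After upgrading to 2.6.4 or later, the setting can be removed again:
db.settings.update({ _id: 'balancer' }, { $unset: { _waitForDelete: '' } })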
milkie commented on Tue, 17 Jun 2014 11:57:16 +0000:
The error can happen whenever a node steps down for any reason. By not executing the replSetFreeze command, you are likely, but not guaranteed, to avoid the problem. Randolph's suggestion of using synchronous balancer deletes, on the other hand, will completely avoid the problem.

kay.agahd@idealo.de commented on Mon, 16 Jun 2014 21:45:01 +0000:
If the node stepped down for other reasons, it wouldn't have received a replSetFreeze command, since we send this command only manually. So the error shouldn't occur then. Does this make sense?

renctan commented on Mon, 16 Jun 2014 21:13:11 +0000:
kay.agahd@idealo.de You can do that as well. But the issue can potentially occur again if the node steps down for other reasons.

kay.agahd@idealo.de commented on Mon, 16 Jun 2014 20:46:12 +0000:
Thank you for the workaround, Randolph! Could we also avoid the error by NOT executing the replSetFreeze command on the PRIMARY? We could adapt our compress script to do so, because the primary does not need to be frozen since it will be stepped down in any case. What do you think? Thanks!

renctan commented on Mon, 16 Jun 2014 16:31:34 +0000:
kay.agahd@idealo.de One way to work around this issue while waiting for the fix is to change the balancer setting so that it always performs synchronous migration cleanup. This can be done by setting the _waitForDelete field in the config.settings document to true, like this:

use config
db.settings.update({ _id: 'balancer' }, { $set: { _waitForDelete: true }});

The side effect is that migrations will now take longer to complete because each one will wait for the deletion to finish.

renctan commented on Mon, 16 Jun 2014 15:20:30 +0000:
This is a bug and it is scheduled to be fixed in 2.6.3.

kay.agahd@idealo.de commented on Mon, 16 Jun 2014 15:15:14 +0000:
Am I right that this is a bug? Will you fix it, or should we avoid the replSetFreeze command? We compact our databases on a regular basis. When the primary needs to be compacted, we use the replSetFreeze command on the secondaries with the least free RAM. Thus, once the primary has stepped down, the new primary will be the server with the most free RAM.

renctan commented on Mon, 16 Jun 2014 14:39:33 +0000:
Based on the logs, the node stepped down after receiving a replSetFreeze command. The RangeDeleter thread asserted because the node is no longer primary, but the exception was never caught, so it terminated the server.

kay.agahd@idealo.de commented on Mon, 16 Jun 2014 14:20:18 +0000:
Greg, I've uploaded the log file mongod.log.1.tgz to you:

root@s124:/home/admin# scp -P 722 /data/mongod/log/mongod.log.1.tgz SERVER-14261@www.mongodb.com:
The authenticity of host '[www.mongodb.com]:722 ([75.101.156.249]:722)' can't be established.
DSA key fingerprint is 8d:cf:ca:17:1e:00:66:e0:a0:9f:08:0b:e6:71:e1:ff.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[www.mongodb.com]:722,[75.101.156.249]:722' (DSA) to the list of known hosts.
SERVER-14261@www.mongodb.com's password:
mongod.log.1.tgz
root@s124:/home/admin#

If you need more information, just tell me please. Thanks!

greg_10gen commented on Mon, 16 Jun 2014 14:04:44 +0000:
Is there any more of the log file you can provide us from this mongod, ideally something close to an hour before/after? This looks like it happened due to a replica set state change while waiting for replication, but more context would help us further confirm this.
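
To illustrate the maintenance sequence discussed in the comments above (kay.agahd's compact procedure and milkie's caveat), here is a rough mongo shell sketch. The host names, port, and durations are hypothetical and authentication is omitted; this only shows the shape of the freeze-then-step-down sequence and is not the reporter's actual script. As milkie notes, skipping replSetFreeze merely makes the crash less likely on affected versions, whereas the _waitForDelete workaround avoids it entirely.

// Hypothetical hosts; on each secondary that should NOT become primary
// (e.g. the ones with the least free RAM), make it ineligible for election:
var secondary = new Mongo('secondary-b.example.net:27017').getDB('admin');
secondary.runCommand({ replSetFreeze: 300 });   // frozen for 300 seconds

// Then ask the current primary to step down; the shell connection is
// usually dropped when the command succeeds:
var primary = new Mongo('primary-a.example.net:27017').getDB('admin');
primary.runCommand({ replSetStepDown: 120 });

// Once a new primary has been elected, the former primary (now a
// secondary) can be compacted, e.g.:
// db.runCommand({ compact: 'collectionName' })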