
OPERATIONAL DEFECT DATABASE
...

...
Original title: rs.stepDown during migration causes fassert in logOp

Original description:
At 3am cron issues an rs.stepDown() as part of weekly compaction - having read SERVER-14186 I assume the migrate chunk errors are not relevant, but I've left them in the logs in case:

2014-10-04T03:04:57.225+0000 [migrateThread] doc_metadata.metadata caught assertion addKeysToIndex doc_metadata.metadata.$url_1_sourceKey_1_id: ObjectId('519239f8e4b0be945d6b9fe6')
2014-10-04T03:04:57.225+0000 [migrateThread] scoped connection to replica_set1/10.40.131.234:27018,10.40.213.215:27018 not being returned to the pool
2014-10-04T03:04:57.225+0000 [migrateThread] about to log metadata event: { _id: "ip-10-60-18-179-2014-10-04T03:04:57-542f63d9417de929885b7e2e", server: "ip-10-60-18-179", clientAddr: ":27017", time: new Date(1412391897225), what: "moveChunk.to", ns: "doc_metadata.metadata", details: { min: { _id: ObjectId('5191d4f3e4b07927e24302bc') }, max: { _id: ObjectId('51926d79e4b056bb2e3823fc') }, step 1 of 5: 2, step 2 of 5: 5296, note: "aborted" } }
2014-10-04T03:04:57.258+0000 [migrateThread] ERROR: migrate failed: insertDocument :: caused by :: 17280 Btree::insert: key too large to index, failing doc_metadata.metadata.$url_1_sourceKey_1 1163 { : "http://www.shippingonline.cn/Yellowpages/world_list.asp?page=164&user_addr='''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''...", : "www.google.com.search.64.84." }
2014-10-04T03:05:01.748+0000 [conn528628] end connection 10.226.114.59:33504 (272 connections now open)
2014-10-04T03:05:01.748+0000 [initandlisten] connection accepted from 10.226.114.59:33525 #528632 (273 connections now open)
2014-10-04T03:05:01.887+0000 [conn478926] remotely refreshing metadata for doc_metadata.metadata based on current shard version 660|1||53da748318304c3808174ba3, current metadata version is 660|7||53da748318304c3808174ba3
2014-10-04T03:05:01.889+0000 [conn478926] metadata of collection doc_metadata.metadata already up to date (shard version : 660|1||53da748318304c3808174ba3, took 1ms)
2014-10-04T03:05:01.903+0000 [migrateThread] starting receiving-end of migration of chunk { _id: ObjectId('5191d4f3e4b07927e24302bc') } -> { _id: ObjectId('51926d79e4b056bb2e3823fc') } for collection doc_metadata.metadata from replica_set1/10.40.131.234:27018,10.40.213.215:27018 at epoch 53da748318304c3808174ba3
2014-10-04T03:05:01.985+0000 [conn528629] end connection 10.245.78.137:48391 (272 connections now open)
2014-10-04T03:05:01.987+0000 [initandlisten] connection accepted from 10.245.78.137:48402 #528633 (273 connections now open)
2014-10-04T03:05:05.819+0000 [initandlisten] connection accepted from 127.0.0.1:37058 #528634 (274 connections now open)
2014-10-04T03:05:05.824+0000 [conn528634] end connection 127.0.0.1:37058 (273 connections now open)
2014-10-04T03:05:05.900+0000 [initandlisten] connection accepted from 127.0.0.1:37059 #528635 (274 connections now open)
2014-10-04T03:05:05.929+0000 [conn528635] 10.226.114.59:27019 is now closest at 542f63e1:37
2014-10-04T03:05:05.929+0000 [conn528635] replSet info stepping down as primary secs=60
2014-10-04T03:05:05.939+0000 [conn528635] replSet relinquishing primary state
2014-10-04T03:05:05.939+0000 [conn528635] replSet SECONDARY
(lots of)
2014-10-04T03:05:05.940+0000 [conn478632] end connection 10.40.213.215:59486 (273 connections now open)
2014-10-04T03:05:05.951+0000 [migrateThread] replSet error : logOp() but not primary
2014-10-04T03:05:05.951+0000 [migrateThread] doc_metadata.metadata Fatal Assertion 17405
2014-10-04T03:05:06.000+0000 [initandlisten] connection accepted from 10.40.131.234:33182 #528636 (17 connections now open)
2014-10-04T03:05:06.008+0000 [initandlisten] connection accepted from 127.0.0.1:37060 #528637 (18 connections now open)
2014-10-04T03:05:06.067+0000 [conn528635] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [127.0.0.1:37059]
2014-10-04T03:05:06.081+0000 [initandlisten] connection accepted from 10.226.114.59:33553 #528638 (18 connections now open)
2014-10-04T03:05:06.126+0000 [migrateThread] doc_metadata.metadata 0x11e6111 0x1187e49 0x116a97d 0xe577aa 0xe5000e 0xa4d139 0xfd5594 0xfda178 0xfc5839 0x122aaa9 0x7f72b0715851 0x7f72afab811d
 mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11e6111]
 mongod(_ZN5mongo10logContextEPKc+0x159) [0x1187e49]
 mongod(_ZN5mongo13fassertFailedEi+0xcd) [0x116a97d]
 mongod() [0xe577aa]
 mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_PbbPS3_+0xee) [0xe5000e]
 mongod(_ZN5mongo7Helpers11removeRangeERKNS_8KeyRangeEbbPNS0_11RemoveSaverEbb+0xb09) [0xa4d139]
 mongod(_ZN5mongo13MigrateStatus3_goEv+0x1644) [0xfd5594]
 mongod(_ZN5mongo13MigrateStatus2goEv+0x28) [0xfda178]
 mongod(_ZN5mongo13migrateThreadEv+0x59) [0xfc5839]
 mongod() [0x122aaa9]
 /lib64/libpthread.so.0(+0x7851) [0x7f72b0715851]
 /lib64/libc.so.6(clone+0x6d) [0x7f72afab811d]
2014-10-04T03:05:06.126+0000 [migrateThread] ***aborting after fassert() failure
2014-10-04T03:05:06.133+0000 [migrateThread] SEVERE: Got signal: 6 (Aborted). Backtrace: 0x11e6111 0x11e54ee 0x7f72afa02920 0x7f72afa028a5 0x7f72afa04085 0x116a9ea 0xe577aa 0xe5000e 0xa4d139 0xfd5594 0xfda178 0xfc5839 0x122aaa9 0x7f72b0715851 0x7f72afab811d
 mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11e6111]
 mongod() [0x11e54ee]
 /lib64/libc.so.6(+0x32920) [0x7f72afa02920]
 /lib64/libc.so.6(gsignal+0x35) [0x7f72afa028a5]
 /lib64/libc.so.6(abort+0x175) [0x7f72afa04085]
 mongod(_ZN5mongo13fassertFailedEi+0x13a) [0x116a9ea]
 mongod() [0xe577aa]
 mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_PbbPS3_+0xee) [0xe5000e]
 mongod(_ZN5mongo7Helpers11removeRangeERKNS_8KeyRangeEbbPNS0_11RemoveSaverEbb+0xb09) [0xa4d139]
 mongod(_ZN5mongo13MigrateStatus3_goEv+0x1644) [0xfd5594]
 mongod(_ZN5mongo13MigrateStatus2goEv+0x28) [0xfda178]
 mongod(_ZN5mongo13migrateThreadEv+0x59) [0xfc5839]
 mongod() [0x122aaa9]
 /lib64/libpthread.so.0(+0x7851) [0x7f72b0715851]
 /lib64/libc.so.6(clone+0x6d) [0x7f72afab811d]

(Other issue - the primary node was running at 200% repeating the same log messages over and over again 10x per second:
2014-10-24T13:02:10.289+0000 [rsHealthPoll] warning: Failed to connect to 10.60.18.179:27019, reason: errno:111 Connection refused
2014-10-24T13:02:10.289+0000 [rsHealthPoll] couldn't connect to 10.60.18.179:27019: couldn't connect to server 10.60.18.179:27019 (10.60.18.179) failed, connection attempt failed)
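For context, the 3am cron described above amounts to a stepDown-then-compact maintenance step on the shard's primary. A minimal sketch of what such a weekly job might look like in the mongo shell follows; it is an illustration only, and the script name, port, and timings are assumptions rather than details from this ticket (only the doc_metadata.metadata namespace comes from the logs):

// Hypothetical weekly maintenance script, e.g. run as: mongo --port 27018 weekly_compact.js
// Port, filename, and timings are illustrative assumptions.
try {
    rs.stepDown(60);  // ask the primary to relinquish its role for 60 seconds
} catch (e) {
    // rs.stepDown() drops client connections, so the shell typically reports an
    // error here and reconnects on the next operation.
    print("stepDown: " + e);
}
// Compact a collection on this member now that it has stepped down to secondary.
db.getSiblingDB("doc_metadata").runCommand({ compact: "metadata" });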
xgen-internal-githook commented on Mon, 22 Dec 2014 20:34:59 +0000:
Author: {u'username': u'renctan', u'name': u'Randolph Tan', u'email': u'randolph@10gen.com'}
Message: SERVER-15798 Helpers::removeRange does not check if node is primary
Branch: master
https://github.com/mongodb/mongo/commit/bf0cdcee577aa5a05ba65506d8bd47f33eac30d2

renctan commented on Wed, 12 Nov 2014 22:29:08 +0000:
Note: other write ops like insert, delete and update perform this check after grabbing the exclusive lock.

apiggott@ikanow.com commented on Wed, 12 Nov 2014 21:01:47 +0000:
Thanks! I might leave that one running overnight...

ramon.fernandez commented on Wed, 12 Nov 2014 20:57:55 +0000:
Hi apiggott@ikanow.com, sorry to hear you experienced another crash. You can find the documents with large values using a query like the following:
db.foo.find({$where:"this.url.length > 1023"})
Replace the value of url above with the appropriate key path in your documents. I'm trying to reproduce the issue on this end, and a developer is looking at the code as well.

apiggott@ikanow.com commented on Wed, 12 Nov 2014 20:33:08 +0000:
Incidentally, is there a slick way I can identify which records in my DB have this too-long-key problem (apart from looking in the logs each time it crashes - it crashed again today!) and delete the specific one it cared about (forcing it to do a table scan lookup against the URL somehow)?

apiggott@ikanow.com commented on Tue, 28 Oct 2014 20:31:57 +0000:
(Incidentally, isn't failIndexKeyTooLong the opposite of what I want, since it will allow documents with long index keys to be inserted moving forward? My problem is that I have existing docs with long index keys because 2.4 and earlier allowed the insertion of such fields. At least now no new problematic documents are getting inserted; I just need to fix the ones that are already in the DB.) [EDIT: oh, or do you mean that the chunk migration won't fail any more if failIndexKeyTooLong is false? That would make sense! Though I think discarding any future "corrupt" docs is the better default.]

apiggott@ikanow.com commented on Tue, 28 Oct 2014 20:27:54 +0000:
Thanks for looking into it. To clarify: I don't care about the first error at all; I just included it in case it was the root cause of the fassert. Preventing the insertion of long fields is certainly my problem to fix! However, even if I am trying to insert keys that are too long, that shouldn't crash the DB, so the purpose of this issue is to let you know that it is happening (my guess is that any migration can cause the problem; it was just more likely to occur here given the repeated attempts caused by the long-index failure). Unfortunately the database on which this occurred is operational, so I can't mess about trying to make it fail. Presumably the steps would be:
1) create a DB with multiple shards
2) insert a record with a too-long index key
3) migrate the chunk containing that record
4) confirm in the logs that the chunk move is repeatedly tried and failing
5) try the rs.stepDown()
Though presumably that's what you've been trying without luck? Since this is the first time the DB has crashed in about 2 years, and each of the 2 shards has been stepping down every week (and I think I've seen "index too long" errors since forever), I'm guessing it might be quite a low-probability event... Did you ping the dev who fixed SERVER-14186 to see if he just missed a case in his patch?
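The repro steps listed in the comment above translate roughly into the following mongo shell sequence. This is only a sketch under assumed names: the test database/collection (repro.docs), the field name (url), the destination shard name, and the way the oversized key gets into the collection are all illustrative rather than taken from the ticket:

// Run against a test sharded cluster via mongos; all names here are illustrative.
sh.enableSharding("repro");
sh.shardCollection("repro.docs", { _id: 1 });
db.getSiblingDB("repro").docs.ensureIndex({ url: 1 });   // secondary index that enforces the 2.6 key-size limit

// A document whose indexed value exceeds the index-entry limit. On 2.6 this insert is
// normally rejected, so to match the ticket the oversized document would have to come
// from a pre-2.6 data set (or be inserted with failIndexKeyTooLong=false).
db.getSiblingDB("repro").docs.insert({ _id: 1, url: new Array(1200).join("x") });

// Trigger a migration of the chunk holding that document, watch the recipient shard log
// repeated "key too large to index" migrate failures, then run rs.stepDown(60) on the
// recipient shard's primary while a retry is in flight.
sh.moveChunk("repro.docs", { _id: 1 }, "replica_set2");  // destination shard name is an assumption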
ramon.fernandez commented on Tue, 28 Oct 2014 19:20:25 +0000:
Hi apiggott@ikanow.com, the first error I see in the log snippet you sent is related to a migration failure:
2014-10-04T03:04:57.258+0000 [migrateThread] ERROR: migrate failed: insertDocument :: caused by :: 17280 Btree::insert: key too large to index, failing doc_metadata.metadata.$url_1_sourceKey_1 1163 { : "http://www.shippingonline.cn/Yellowpages/world_list.asp?page=164&user_addr='''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''...", : "www.google.com.search.64.84." }
Please see the list of compatibility changes in MongoDB 2.6 for more details about this issue. In your particular case:
Chunk Migration will fail: Migrations will fail for a chunk that has a document with an indexed field whose index entry exceeds the limit. If left unfixed, the chunk will repeatedly fail migration, effectively ceasing chunk balancing for that collection. Or, if chunk splits occur in response to the migration failures, this response would lead to an unnecessarily large number of chunks and an overly large config database.
You may need to set failIndexKeyTooLong to false and/or fix the offending documents first:
db.getSiblingDB('admin').runCommand( { setParameter: 1, failIndexKeyTooLong: false } )
Then there's the fatal assertion message:
2014-10-04T03:05:05.951+0000 [migrateThread] replSet error : logOp() but not primary
2014-10-04T03:05:05.951+0000 [migrateThread] doc_metadata.metadata Fatal Assertion 17405
I'm trying to reproduce this error, unsuccessfully so far, so if you have a reproducing script it would be of great help. In the meantime, can you please review the issue described above with long keys and let us know if rs.stepDown() still triggers the assertion?

apiggott@ikanow.com commented on Fri, 24 Oct 2014 16:03:19 +0000:
(When I say "exactly the same issue", I mean "the same problem except during migration, not map/reduce"! Migration wasn't mentioned in SERVER-14186, so you probably just missed a few places in the original fix?)
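Putting the two suggestions from this thread together (find the offending documents, and optionally relax failIndexKeyTooLong), a clean-up sketch might look like the following. The field names url and sourceKey are inferred from the url_1_sourceKey_1 index name, and the 1000-character threshold is a conservative stand-in for the 1024-byte index-entry limit (which counts bytes plus per-key overhead, not JavaScript string characters), so treat the numbers as approximate:

// List candidate documents whose compound index entry (url + sourceKey) is likely too large.
db.getSiblingDB("doc_metadata").metadata.find(
    { $where: "(this.url + '').length + (this.sourceKey + '').length > 1000" },
    { _id: 1, url: 1, sourceKey: 1 }
);

// Inspect and toggle the server parameter suggested above, then restore the default
// once the offending documents have been fixed or removed.
db.adminCommand({ getParameter: 1, failIndexKeyTooLong: 1 });
db.adminCommand({ setParameter: 1, failIndexKeyTooLong: false });
db.adminCommand({ setParameter: 1, failIndexKeyTooLong: true });   // 2.6 default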
4 servers containing 2 replica sets, running Amazon Linux (~CentOS 6), all on MongoDB 2.6.4:
[root@ip-10-60-18-179 mongo]# rpm -qa | grep mongo
mongodb-org-mongos-2.6.4-1.x86_64
mongodb-org-2.6.4-1.x86_64
mongodb-org-shell-2.6.4-1.x86_64
mongodb-org-tools-2.6.4-1.x86_64
mongodb-org-server-2.6.4-1.x86_64