...
We have a 3-node replica set running v3.4.7. The primary runs 6 map/reduce jobs every minute or so, and due to circumstances we had to re-sync one of the secondary nodes. However, at one of the last steps of the re-sync we get the following error:

2017-09-05T17:56:04.348+0000 E REPL [repl writer worker 5] Error applying operation: OplogOperationUnsupported: Applying renameCollection not supported in initial sync: { ts: Timestamp 1504588941000|592, h: 4948566672906734558, v: 2, op: "c", ns: "graphs.$cmd", o: { renameCollection: "graphs.tmp.agg_out.989", to: "graphs.graphs_temp", stayTemp: false, dropTarget: true } } ({ ts: Timestamp 1504588941000|592, h: 4948566672906734558, v: 2, op: "c", ns: "graphs.$cmd", o: { renameCollection: "graphs.tmp.agg_out.989", to: "graphs.graphs_temp", stayTemp: false, dropTarget: true } })
2017-09-05T17:56:04.348+0000 E REPL [replication-168] Failed to apply batch due to 'OplogOperationUnsupported: error applying batch: Applying renameCollection not supported in initial sync: { ts: Timestamp 1504588941000|592, h: 4948566672906734558, v: 2, op: "c", ns: "graphs.$cmd", o: { renameCollection: "graphs.tmp.agg_out.989", to: "graphs.graphs_temp", stayTemp: false, dropTarget: true } }'
2017-09-05T17:56:04.348+0000 I ASIO [NetworkInterfaceASIO-RS-0] Ending connection to host graphs1-mongo3:27017 due to bad connection status; 2 connections to that host remain open
2017-09-05T17:56:04.348+0000 I REPL [replication-167] Finished fetching oplog during initial sync: CallbackCanceled: Callback canceled. Last fetched optime and hash: { ts: Timestamp 1504634159000|4672, t: -1 }[-4041766555669726456]
2017-09-05T17:56:04.348+0000 I REPL [replication-168] Initial sync attempt finishing up.

After this, MongoDB cleans up the files and starts the re-sync again; it is now basically stuck in a very big loop. This used to work fine with 2.x, when we re-synced quite a few times. I'm not sure what to do about this; we can't stop the map/reduce jobs for the duration of the resync, as it takes about 8 hours to get to this point.
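For context, the tmp.agg_out.* namespace in the log above is the temporary collection that an aggregation $out stage writes to before issuing an internal renameCollection onto the target collection, which is the operation initial sync refuses to apply. A job of roughly the following shape would produce that oplog entry; the source collection, pipeline, and field names here are illustrative assumptions, not taken from the reporter's deployment:

    // Aggregation with $out: results are first written to a temporary
    // tmp.agg_out.NNN collection and then renamed over the target,
    // producing the renameCollection op that initial sync rejects.
    db.getSiblingDB("graphs").source_data.aggregate(
      [
        { $group: { _id: "$series", total: { $sum: "$value" } } },
        { $out: "graphs_temp" }
      ],
      { allowDiskUse: true }
    )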
ramon.fernandez commented on Thu, 7 Sep 2017 21:43:15 +0000:

I think there's a better workaround than downgrading (which may have other issues). You can use output to a collection with an action. For example:

    db.outputcollection.drop()
    // The output collection can't be empty, so insert a marker document
    db.outputcollection.insert({marker: 1})
    db.mycollection.mapReduce(myMapFunction, myReduceFunction, { out: { merge: "outputcollection" } })

Using this approach avoids the implicit renameCollection operation used by the out option of mapReduce altogether, and should help with the initial sync issue originally reported.

Regards,
Ramón.

thijs commented on Thu, 7 Sep 2017 11:03:18 +0000:

Unfortunately we can't seem to sync a 3.2.11 node; the sync fails with an error message about missing indexes. We've started work on a solution to completely move away from using aggregations in our processing pipeline.

I think this change was communicated extremely poorly. It was introduced in 3.2.12 with the benign-sounding description "renameCollection ‘c’ op should restart initial sync upon application". It's not even mentioned on the overview page (https://docs.mongodb.com/manual/release-notes/3.2/); you have to click through to the detailed list to even find it. There is no mention at all that this change would break initial syncs when you use big aggregations. This is a change with huge impact. I really don't understand why this change was made in a point release and didn't come with a huge warning in the change log.

dan@10gen.com commented on Wed, 6 Sep 2017 18:40:12 +0000:

Glad you have an escape valve. Please look through the caveats Spencer called out, and remember that the version must be 3.2.11 or lower.

thijs commented on Wed, 6 Sep 2017 18:27:13 +0000:

Thanks for the thoughtful reply Spencer. Great news that this can be mitigated by running 3.2; we were under the impression that this issue was also present in 3.2. We can still run 3.2 since we haven't performed the auth upgrade yet. We are trying the downgrade approach now and will let you know how it goes.

spencer commented on Wed, 6 Sep 2017 16:04:09 +0000:

Hi thijs and robert@80beans.com,

To provide some more context here, the underlying problem is that aggregation and mapReduce with $out perform a rename collection internally, to rename the output collection from a temporary name to its final name. Unfortunately, due to SERVER-4941, any collection that is renamed while it is being initial sync'ed can wind up with incomplete or incorrect data on the node that did the initial sync. Earlier versions of MongoDB allowed this; in SERVER-26117 we changed initial sync to no longer allow processing a renameCollection and to instead fail and restart the initial sync, to avoid possible data corruption. 3.6 will avoid this problem by using new features from the storage engine to ensure that initial sync can properly handle renameCollection without losing data.

Until 3.6 is released, however, I understand you are in a tricky situation. The best thing to do is one of the workarounds that Thomas describes: either cease running the aggregations and mapReduces causing the problem for the duration of the initial sync, or rework the app so it does not use $out to have the agg or m/r write directly to the output collection, but instead streams the results back to your app, which then manually writes the data out to the output collection.
If neither of those are an option, you can downgrade the initially syncing node to 3.2.11, or wait for SERVER-29772 to be completed, backported, and released. Both options will allow the initial sync to complete even if it encounters a renameCollection. In that case, however, the data for any collections that were renamed during the initial sync may be corrupt (see SERVER-4941 for more details). So if you were to do this, you would need to make sure to drop any collections generated by an agg or m/r as soon as the initial sync completes. You would also want to ensure that you're not reading from any of those collections, to guarantee the integrity of the data returned. Once the initial sync is complete and all the agg and m/r output collections are dropped, it would then be safe to run new aggregations or mapReduces and use the resulting data.

I hope this clarifies the issue and the steps to work around it. I recognize that this is a difficult bug to work around and can cause real operational hurdles for you, and for that I sincerely apologize on behalf of MongoDB. We are working hard to get this fixed for 3.6.

-Spencer

thijs commented on Wed, 6 Sep 2017 12:17:08 +0000:

Hi,

We're in a very tight spot at the moment because of this issue: we cannot stop the aggregations and/or map/reduces that are creating the temporary collections. Without allowing disk use on aggregations, our jobs cannot complete. Making a filesystem snapshot takes more than an hour, which also breaks everything. Basically we need a fix for this, otherwise we have to do an extremely hasty project to take this entire system out of commission.

Do you have any other suggestions? Are nodes running WiredTiger also affected by this?

To me it seems like this issue is not a duplicate of SERVER-4941. We're not renaming any collections or databases. The aggregation system creates temporary collections which I imagine don't even need to be synced to the secondaries. I therefore feel it makes more sense to keep this issue open. Why is a fix for this issue, which causes a real risk of data loss, not given more priority?

Thijs

thomas.schubert commented on Tue, 5 Sep 2017 21:38:21 +0000:

Hi robert@80beans.com,

Unfortunately, there is no simple workaround for this issue. There are two possible workarounds, the first of which you've already identified and ruled out:

1. Disable whatever application/process runs the map-reduce commands until the initial sync can complete.
2. Update the application/process that's running the map-reduce to not use out, but instead get the results via a cursor and manually write the results back out to the output collection.

Kind regards,
Kelsey

robert@appsignal.com commented on Tue, 5 Sep 2017 20:00:55 +0000:

Will do, but that issue looks unresolved. In the meantime we're down one replica set member. Is there any workaround to get an initial sync to complete?

thomas.schubert commented on Tue, 5 Sep 2017 19:55:16 +0000:

Hi robert@80beans.com,

Thanks for the report. This issue is being tracked under SERVER-4941 and the fix will be included in MongoDB 3.6.

Kind regards,
Kelsey

robert@appsignal.com commented on Tue, 5 Sep 2017 19:46:01 +0000:

Additional information: the map/reduce jobs are preceded by an aggregation that also creates a temp collection, because the aggregation is a lot faster than the map/reduce jobs.
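For readers hitting the same problem, the cursor-based workaround suggested by Spencer and Kelsey (run the aggregation without $out and write the results from the client) might look roughly like the sketch below; the database, collection, and field names are illustrative assumptions, not taken from this deployment:

    // Sketch of the cursor-based workaround: run the aggregation without
    // $out and upsert the results yourself. Only ordinary inserts/updates
    // are replicated, so no temporary collection or renameCollection occurs.
    var target = db.getSiblingDB("graphs").graphs_temp;
    db.getSiblingDB("graphs").source_data.aggregate(
      [ { $group: { _id: "$series", total: { $sum: "$value" } } } ],
      { allowDiskUse: true }
    ).forEach(function (doc) {
      target.replaceOne({ _id: doc._id }, doc, { upsert: true });
    });

The trade-off is that the results are streamed through the shell or application instead of being written server-side, but every write replicates as a normal operation that initial sync can apply.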
Steps to reproduce:
1. Run a 3-node cluster.
2. Run map/reduce jobs that require a temp collection.
3. Re-sync one of the nodes; the re-sync will fail and start again.
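A mapReduce of roughly the following shape is one way to provide step 2; all collection names and functions here are hypothetical placeholders:

    // mapReduce with a non-merge out action ("replace" shown here) writes to a
    // temporary collection and then renames it over the target collection;
    // an initial sync that observes this rename in the oplog fails and restarts.
    db.getSiblingDB("graphs").events.mapReduce(
      function () { emit(this.series, this.value); },           // map
      function (key, values) { return Array.sum(values); },     // reduce
      { out: { replace: "mr_results" } }
    )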