...
Note: I know that only a WiredTiger CSRS is supported since 3.6, but kaloian.manassiev recommended filing a bug since this causes the server to crash.

If you start and initiate an mmapv1 CSRS on 3.7.5, the secondaries crash with the following fassert:

    2018-04-25T11:35:01.441-0400 I REPL     [replication-0] Initial Sync Attempt Statistics: {
        failedInitialSyncAttempts: 9,
        maxFailedInitialSyncAttempts: 10,
        initialSyncStart: new Date(1524670402251),
        initialSyncAttempts: [
            { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" },
            (this identical attempt entry appears 9 times in total)
        ]
    }
    2018-04-25T11:35:01.441-0400 E REPL     [replication-0] Initial sync attempt failed -- attempts left: 0 cause: InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.
    2018-04-25T11:35:01.442-0400 F REPL     [replication-0] The maximum number of retries have been exhausted for initial sync.
    2018-04-25T11:35:01.443-0400 I STORAGE  [replication-0] Finishing collection drop for local.temp_oplog_buffer (9c32c293-5d0b-491b-921a-6c2b83700ff9).
    2018-04-25T11:35:01.444-0400 E REPL     [replication-0] Initial sync failed, shutting down now. Restart the server to attempt a new initial sync.
    2018-04-25T11:35:01.445-0400 F -        [replication-0] Fatal assertion 40088 InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync. at src/mongo/db/repl/replication_coordinator_impl.cpp 685
    2018-04-25T11:35:01.445-0400 F -        [replication-0] ***aborting after fassert() failure

Then, if you try to restart that process after the fassert, you get a different FCV fassert:

    2018-04-25T11:47:13.205-0400 F STORAGE  [initandlisten] Unable to start up mongod due to missing featureCompatibilityVersion document.
    2018-04-25T11:47:13.205-0400 F STORAGE  [initandlisten] Please run with --journalOptions 4 to recover the journal. Then run with --repair to restore the document.
    2018-04-25T11:47:13.206-0400 F -        [initandlisten] Fatal Assertion 40652 at src/mongo/db/repair_database_and_check_version.cpp 527
    2018-04-25T11:47:13.206-0400 F -        [initandlisten] ***aborting after fassert() failure

Attached are the full log files for all 3 members of the CSRS.
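For reference, the second fassert's own log message spells out the recovery path. A minimal sketch of those steps, assuming the dbPath from the repro config below (the --dbpath and --storageEngine arguments are my addition; --journalOptions 4 and --repair are taken verbatim from the message, and I have not verified this sequence against this exact failure mode):

    # Recover the journal, as the startup error instructs
    mongod --dbpath /tmp/mms-automation/data/config1 --storageEngine mmapv1 --journalOptions 4

    # Then run a repair pass to restore the featureCompatibilityVersion document
    mongod --dbpath /tmp/mms-automation/data/config1 --storageEngine mmapv1 --repair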
milkie commented on Thu, 26 Apr 2018 00:24:27 +0000:
A brand new server that is started with --replSet or --shardsvr cannot simply write an FCV document without first becoming part of a replica set or sharded system, because FCV determination is synchronized in those environments.

kaloian.manassiev commented on Wed, 25 Apr 2018 23:51:38 +0000:
Ah ok, this makes sense - thank you for the explanation! Regardless of this, shouldn't a brand new server write its FCV as the first thing it does after recovery? This would avoid situations like this.

milkie commented on Wed, 25 Apr 2018 23:48:58 +0000:
I think the reason this is happening is that initial sync, upon exhausting its retry attempts, does not do a final "clean everything out" before it shuts down the server. I don't think it's worth it to fix that.

schwerin commented on Wed, 25 Apr 2018 23:09:38 +0000:
There's nothing wrong here other than maybe the text of the message. Config servers have to use WT as their storage engine.

kaloian.manassiev commented on Wed, 25 Apr 2018 20:37:32 +0000:
Oh sorry, I didn't see there was a question. milkie, what seemed odd to me in this failure is the message "Unable to start up mongod due to missing featureCompatibilityVersion document", which indicates that a node managed to create some collections before the FCV document had been written. Apart from that, I don't care if the secondaries fassert due to being unable to perform initial sync.

louisa.berger@10gen.com commented on Wed, 25 Apr 2018 20:27:19 +0000:
Assuming you want kaloian.manassiev to weigh in here, but from a Cloud perspective I really don't have an opinion; it's not blocking us, I just came across this while testing.

milkie commented on Wed, 25 Apr 2018 17:24:42 +0000:
It appears that initial sync doesn't work because, due to the existing conversion logic that helps users convert from mirrored config servers to CSRS, any MMAP nodes are automatically moved to the "REMOVED" state when they join a cluster. This state prohibits them from completing initial sync. I'm not sure what the expected behavior here is. Would you want the servers to not shut down, but simply remain running as REMOVED members?

milkie commented on Wed, 25 Apr 2018 17:20:50 +0000:
Note that the server isn't "rendered unusable"; the log message immediately following the error tells you how to fix it, by restarting the server with some different parameters to override the error detection and fix the problem. However, it is still a problem that initial sync is failing in this case. We'll still need to investigate that.

kaloian.manassiev commented on Wed, 25 Apr 2018 16:01:03 +0000:
Assigning to the storage team, because it looks like the "Unable to start up mongod due to missing featureCompatibilityVersion document" message appears as a result of the server being left in an inconsistent state, where collections were created before the FCV document was written. Not sure if this is a scenario we care about, but rendering the server unusable in this state seems wrong.
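For context on the document the comments are discussing: the FCV is stored in the admin.system.version collection under the _id "featureCompatibilityVersion". A minimal shell sketch for inspecting it on a healthy node:

    // Read the on-disk FCV document directly
    use admin
    db.system.version.find({ _id: "featureCompatibilityVersion" })

    // Or ask the server for its in-memory view of the value
    db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })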
Start up 3 config servers on mmapv1 with the following config files:

    net:
      bindIp: 0.0.0.0
      port: 5006
    processManagement:
      fork: "true"
    replication:
      replSetName: csrs_set
    sharding:
      clusterRole: configsvr
    storage:
      dbPath: /tmp/mms-automation/data/config1
      engine: mmapv1
    systemLog:
      destination: file
      path: /tmp/mms-automation/logs/config1_run.log

Initiate the replica set (on 5006):

    > rs.initiate({_id : "csrs_set", "members" : [ { _id : 0, host : "louisamac:5006"}, {_id : 1, host : "louisamac:5007"}, {_id : 2, host : "louisamac:5008" } ] })
    2018-04-25T11:33:20.237-0400 E QUERY    [js] Error: error doing query: failed: network error while attempting to run command 'replSetInitiate' on host '127.0.0.1:5006' :
    DB.prototype.runCommand@src/mongo/shell/db.js:168:1
    DB.prototype.adminCommand@src/mongo/shell/db.js:186:16
    rs.initiate@src/mongo/shell/utils.js:1270:12
    @(shell):1:1
    2018-04-25T11:33:20.238-0400 I NETWORK  [js] trying reconnect to 127.0.0.1:5006 failed
    2018-04-25T11:33:20.242-0400 I NETWORK  [js] reconnect 127.0.0.1:5006 ok

Wait a while, and the first initial sync fassert appears on the secondaries. Restart the secondary with the same options, and the second FCV fassert appears immediately on the secondaries.
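For comparison, per schwerin's comment above that config servers have to use WT, a sketch of the supported variant of the same config file changes only the storage.engine value (all other values reused from the repro above); with this, the fasserts above are not expected to occur:

    net:
      bindIp: 0.0.0.0
      port: 5006
    processManagement:
      fork: "true"
    replication:
      replSetName: csrs_set
    sharding:
      clusterRole: configsvr
    storage:
      dbPath: /tmp/mms-automation/data/config1
      engine: wiredTiger
    systemLog:
      destination: file
      path: /tmp/mms-automation/logs/config1_run.log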