...
When the first document returned by the GTE query against our sync source does not match our most recent optime (i.e., the term and timestamp of our most recent oplog entry), we currently return OplogStartMissing unconditionally and go into ROLLBACK. However, the GTE query filters only on the timestamp, not the term. We should also check the term: if the optime (including term) of our most recent oplog entry is higher than the optime returned by the GTE query, we should not go into rollback but should instead choose a new sync source.
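A minimal sketch of the check this ticket proposes, assuming hypothetical names (OpTime, check_remote_oplog_start, and the returned status strings are illustrative; the real logic lives in the server's C++ oplog fetcher):

```python
# Sketch only: illustrates the proposed decision, not the actual server code.
# Optimes are ordered by (term, timestamp), Raft-style up-to-dateness.
from collections import namedtuple

OpTime = namedtuple("OpTime", ["term", "timestamp"])

def check_remote_oplog_start(last_fetched: OpTime, remote_first: OpTime) -> str:
    """Decide what to do with the first document of the GTE oplog query."""
    if remote_first == last_fetched:
        return "OK"                    # oplogs match; keep fetching from this source
    if last_fetched > remote_first:
        # Our entry is newer (e.g., same timestamp, higher term): the sync source
        # is behind or divergent in an old term, so pick a new sync source
        # instead of rolling back.
        return "ChooseNewSyncSource"
    # The sync source's entry is genuinely ahead of ours: we have diverged.
    return "OplogStartMissing"         # triggers ROLLBACK

# Example: our last entry is (term 2, ts 3); the GTE query on the sync source
# returns (term 1, ts 3). Today this causes rollback; with the term check it
# would just re-select a sync source.
print(check_remote_oplog_start(OpTime(2, 3), OpTime(1, 3)))  # ChooseNewSyncSource
```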
siyuan.zhou@10gen.com commented on Fri, 30 Dec 2016 21:14:30 +0000:

Judah's proposal sounds correct given that we close connections on rollback (so the RBID serves as a version). I think there might be an easier fix, though. If the GTE response starts from our last fetched optime, then all of the oplog entries up to our last fetched document are the same as those on our sync source, due to the Log Matching property in Raft: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index. In addition, we need to compare the last applied optime on the sync source and make sure it is ahead of our last fetched optime; this requirement is the same as in Raft. The last applied optime on the sync source is already available in the ReplSetMetadata, since we always request it in the oplog fetcher.

For each GTE response, we make sure that:
1. (Oplog Matching) its first entry is the same as our last fetched entry, and
2. (Oplog Freshness) the sync source's last applied optime (ReplSetMetadata::getLastOpVisible()) is greater than our last fetched optime.

If Oplog Matching is false but Oplog Freshness is true, we have to roll back. If Oplog Freshness is false, choose another sync source. As a result, correctness is guaranteed by these two conditions even if our sync source rolls back, and choosing a sync source is only a matter of performance. Maintaining the RBID is then not necessary in steady-state replication.

judah.schvimer commented on Thu, 29 Dec 2016 22:09:33 +0000:

Per our discussion we will be doing the following. The SyncSourceResolver will now make sure that our sync source is safe to use, both by checking the Rollback ID (RBID) of the sync source candidate and by looking at its last applied OpTime. The SyncSourceResolver will now:

1. Choose a sync source as it currently does. The sync source selection algorithm will make sure that we only choose sync sources that are ahead of our last applied OpTime.
2. Get the sync source candidate's RBID and save it for later. Work on SERVER-27050 will do this, but only if a minValid is present; this ticket must make it an unconditional check.
3. Get the sync source candidate's last applied OpTime and check that it is greater than or equal to our last fetched OpTime. (This must be added by this ticket.)
4. Get the sync source candidate's oldest oplog entry and see whether we are even able to sync from this sync source.
5. Finally, check whether the sync source candidate contains minValid, if one exists.

As part of SERVER-27050, the RBID will be checked again in BackgroundSync before the sync source is used. This check should be moved by this ticket into OplogFetcher::checkRemoteOplogStart() so that it is performed on the first batch only, and before we decide to go into rollback.

Finally, OplogFetcher::checkRemoteOplogStart() is now mostly fine. Since our sync source cannot have rolled back since we chose it, it appears impossible for our GTE query to return no documents, so we will add an fassert there instead of returning RemoteOplogStale. The OplogStartMissing cases should be fine now. If we fall off the back of the sync source's oplog between probing the sync source and getting our first batch, we will return an OplogStartMissing error here and go into rollback, leading to an inevitable UnrecoverableRollbackException since we will not be able to find a common point. That is a small window, and users rarely fall off the back of their oplog. Further work could check whether this is the case by re-querying the sync source's oldest oplog entry before going into rollback, but that is out of scope for this ticket.
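A minimal sketch of the Oplog Matching / Oplog Freshness conditions siyuan.zhou describes above, assuming hypothetical names (classify_batch and the status strings are illustrative; in the server the inputs would come from OpTime and ReplSetMetadata::getLastOpVisible()):

```python
# Sketch of the two per-batch conditions; optime ordering is assumed to be
# term-first (Raft-style), which is an assumption of this sketch.
from collections import namedtuple

OpTime = namedtuple("OpTime", ["term", "timestamp"])

def classify_batch(last_fetched: OpTime, batch_first: OpTime,
                   source_last_applied: OpTime) -> str:
    """
    last_fetched:        optime of our newest locally fetched oplog entry
    batch_first:         optime of the first entry in the GTE response
    source_last_applied: sync source's last applied optime, taken from the
                         ReplSetMetadata returned with the response
    """
    oplog_matching = (batch_first == last_fetched)
    oplog_freshness = (source_last_applied > last_fetched)

    if oplog_matching:
        return "ContinueFetching"     # Raft log matching: histories agree so far
    if oplog_freshness:
        return "Rollback"             # source is ahead but diverges: we must roll back
    return "ChooseNewSyncSource"      # source is not ahead of us: just resync elsewhere

# Example from the thread: Node C's last op is (term 2, ts 3); Node B's first
# response entry is (term 1, ts 3), but B's last applied is (term 3, ts 5),
# so C really is behind and must roll back.
print(classify_batch(OpTime(2, 3), OpTime(1, 3), OpTime(3, 5)))  # Rollback
```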
spencer commented on Tue, 20 Dec 2016 19:24:48 +0000:

You're right, that is a problem. For reference, the original problem this ticket is trying to avoid is node A or B rolling back in the following situation:

Timestamp      1  2  3  4  5  6
Node A (term)  1  1  1  1  3
Node B (term)  1  1  1  1  3
Node C (term)  1  1  2  2  2  2

siyuan.zhou@10gen.com commented on Mon, 19 Dec 2016 21:05:19 +0000:

That's where our implementation differs from Raft. Even if a majority of nodes know about term 2, they can still replicate logs from the old primary, or from nodes that synced from the old primary. Even in Raft, I still think this is possible. At the beginning, Node C steps up in term 2:

Timestamp      1  2  3  4  5
Node A (term)  1  1  1  1
Node B (term)  1  1
Node C (term)  1  1  2

Then Node A steps up again in term 3:

Timestamp      1  2  3  4  5
Node A (term)  1  1  1  1  3
Node B (term)  1  1
Node C (term)  1  1  2

Node A replicates and commits logs from old terms (term 1):

Timestamp      1  2  3  4  5
Node A (term)  1  1  1  1  3
Node B (term)  1  1  1  1  3
Node C (term)  1  1  2

spencer commented on Mon, 19 Dec 2016 20:23:50 +0000:

The scenario you describe isn't possible. For node C to have been elected in term 2, a majority of nodes must have voted for it, so a majority of nodes must know about term 2, and all future committed writes must happen in a term greater than or equal to 2.

siyuan.zhou@10gen.com commented on Mon, 19 Dec 2016 07:33:07 +0000:

Should we compare our last optime against the last applied optime on our sync source rather than against the first optime in the GTE response? Imagine the following scenario (timestamps are shown as integers for simplicity). When Node C queries the oplog on Node B for anything greater than or equal to (ts: 3), it will get (ts: 3, term: 1), whose term is less than the term of C's own last oplog entry (ts: 3, term: 2). But Node B is actually ahead of Node C.

Timestamp      1  2  3  4  5
Node A (term)  1  1  1  1  3
Node B (term)  1  1  1  1  3
Node C (term)  1  1  2

Having the sync source's last applied optime in the response metadata and checking it against our last fetched optime is a robust way to detect rollback.

spencer commented on Tue, 13 Dec 2016 21:04:41 +0000:

I think what we should do is break the case where we check whether the oplog entry returned by our sync source matches our most recent oplog entry into 3 cases:
1. If the sync source's entry is newer, return OplogStartMissing and go into rollback, as we currently do.
2. If our entry is newer, return RemoteOplogStale, and update bgsync to not trigger rollback on RemoteOplogStale but to change sync sources instead (which is basically implementing SERVER-26253).
3. If the two entries have the same OpTime but different hashes, fassert, since that should be impossible.
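A minimal sketch of the three-way split spencer describes above, assuming hypothetical names (OplogEntry, compare_remote_oplog_start, and the status strings are illustrative; the real logic would live in OplogFetcher::checkRemoteOplogStart() and bgsync):

```python
# Sketch of the proposed three-case handling of the first remote oplog entry.
from collections import namedtuple

OplogEntry = namedtuple("OplogEntry", ["term", "timestamp", "hash"])

def optime(entry: OplogEntry):
    # Assumes term-first (Raft-style) optime ordering.
    return (entry.term, entry.timestamp)

def compare_remote_oplog_start(ours: OplogEntry, theirs: OplogEntry) -> str:
    if optime(theirs) > optime(ours):
        return "OplogStartMissing"   # sync source is ahead of us: go into rollback
    if optime(ours) > optime(theirs):
        return "RemoteOplogStale"    # we are ahead: bgsync should switch sync sources
    if ours.hash != theirs.hash:
        # Same OpTime but different hashes should be impossible; the server
        # would fassert here.
        raise AssertionError("same OpTime, different hashes")
    return "OK"                      # identical entry: oplogs agree at this point
```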
milkie commented on Tue, 13 Dec 2016 19:40:04 +0000:

I suspect that changing the query this way may break oplogReplay, unfortunately. Our only other option is to do a local round of filtering after we get the data back.

spencer commented on Tue, 13 Dec 2016 19:33:12 +0000:

I believe we could do this with a change to the oplog query, as milkie suggests, by changing it from {ts: {$gte: op.timestamp}} to {$or: [{t: {$gt: op.t}}, {$and: [{t: op.t}, {ts: {$gte: op.ts}}]}]}, i.e. find the ops where either the term is greater than that of my last op, or the term is the same and the timestamp is greater than or equal. If we take this approach, then implementing SERVER-26253 becomes mandatory as well, or we will still trigger rollback in these situations. My bigger concern with this approach is whether it will negatively impact the performance of the oplog queries. david.storch, can you comment on what effect changing the query we use to fetch oplog entries for replication is likely to have? This is the main query we use for fetching oplog entries, and it runs with tailable, oplogReplay, and awaitData all set to true.

milkie commented on Tue, 13 Dec 2016 16:37:36 +0000:

I would prefer, if possible, to construct the query such that data isn't returned if it is less than what we are looking for. As proposed, we would do a GTE query that would leave out some, but not all, of the documents we care about not seeing, and then after getting the data back we would do another filtering pass ourselves.
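A small sketch of the query shape spencer proposes, written as a MongoDB-style filter document built in Python, plus an equivalent in-memory predicate so the effect on the example oplogs is visible (build_oplog_filter and matches are hypothetical helper names; whether the server's oplogReplay optimization can still apply to this shape is exactly the open question in the thread):

```python
# Sketch only: the proposed replacement for the {ts: {$gte: op.timestamp}} oplog
# query. 'op' is our last fetched oplog entry, with 't' = term, 'ts' = timestamp.
def build_oplog_filter(op):
    return {"$or": [
        {"t": {"$gt": op["t"]}},                                 # any entry from a newer term
        {"$and": [{"t": op["t"]}, {"ts": {"$gte": op["ts"]}}]},  # same term, same/later timestamp
    ]}

# Equivalent in-memory predicate, just to show which entries would come back.
def matches(entry, op):
    return entry["t"] > op["t"] or (entry["t"] == op["t"] and entry["ts"] >= op["ts"])

# Node A's last op in spencer's example is (ts: 5, t: 3). Against Node C's
# divergent term-2 entries the filter returns nothing, so A would no longer see
# a mismatched first entry and roll back; it would instead have to switch sync
# sources, which is why SERVER-26253 becomes mandatory with this approach.
last_op = {"t": 3, "ts": 5}
node_c_oplog = [{"t": 2, "ts": 3}, {"t": 2, "ts": 4}, {"t": 2, "ts": 5}, {"t": 2, "ts": 6}]
print(build_oplog_filter(last_op))
print([e for e in node_c_oplog if matches(e, last_op)])  # -> []
```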