...
BugZero found this defect 2688 days ago.
The configuration: client (C), router (R), replica set shard1 (S1), replica set shard2 (S2), balancer (B).

The scenario: The client submits a write command to the router and learns the operationTime T1 returned from this write, which was performed on S1. Then the balancer initiates a migration of data from S1 to S2 that includes the data written by the write. The client issues a read with (afterClusterTime = T1, readPreference = secondary). The read will end up on a different shard than the write, but T1 may not be a sufficient bound, because S2 has its own oplog, which may already contain entries at time T1. In this case the read will return a result that is not causally consistent with the write, as it will not contain the written values.

One possible solution is to modify afterClusterTime on the router if the requested time is less than the time of the routing information change for the requested data. I.e., shards may have different clusterTimes, as we cannot assume that they are communicating. However, the routing data change is an indirect communication that the client is unaware of, so afterClusterTime should be adjusted accordingly. An analogy is adjusting a clock when moving between time zones.

Implementation details: There are 2 parts:
1. Keep the operationTime of the routing metadata refresh.
2. Inspect all incoming messages that have afterClusterTime and, if there is a chance that the requested data has been moved, update afterClusterTime to the operationTime of the routing metadata refresh.
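
The two parts above could be sketched roughly as follows. This is only an illustration of the proposal, not router code: ClusterTime, RouterSketch, on_routing_refresh, adjust_after_cluster_time and the data_may_have_moved flag are all hypothetical names, and cluster times are modeled as plain (seconds, increment) pairs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True, order=True)
class ClusterTime:
    """Hypothetical stand-in for a cluster timestamp: (seconds, increment)."""
    seconds: int
    increment: int = 0


class RouterSketch:
    """Hypothetical router-side state for the proposal; not real router code."""

    def __init__(self) -> None:
        # Part 1: operationTime of the most recent routing metadata refresh.
        self.last_refresh_time: Optional[ClusterTime] = None

    def on_routing_refresh(self, operation_time: ClusterTime) -> None:
        # Remember the latest refresh time seen so far.
        if self.last_refresh_time is None or operation_time > self.last_refresh_time:
            self.last_refresh_time = operation_time

    def adjust_after_cluster_time(
        self, after_cluster_time: ClusterTime, data_may_have_moved: bool
    ) -> ClusterTime:
        # Part 2: if the requested data may have moved and the client's
        # afterClusterTime predates the routing change, raise it to the
        # operationTime of the routing metadata refresh.
        if (
            data_may_have_moved
            and self.last_refresh_time is not None
            and after_cluster_time < self.last_refresh_time
        ):
            return self.last_refresh_time
        return after_cluster_time
```

The adjustment only ever moves afterClusterTime forward, so a read whose requested time is already later than the refresh is forwarded unchanged.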
dianna.hohensee commented on Wed, 27 Sep 2017 19:05:45 +0000:
I created SERVER-31275 (linked) with the new scenario, so I'm closing this ticket.

dianna.hohensee commented on Tue, 26 Sep 2017 19:38:44 +0000:
Reading the ticket summary, the scenario described is indeed safe because the shard version protocol extends to secondaries. I had a different scenario in mind for causal consistency being broken than the one you're describing, however, so I'm going to have to revisit my notes/scribbles.

esha.maharishi@10gen.com commented on Tue, 26 Sep 2017 18:14:18 +0000:
I believe misha.tyulenev's comment is correct: if the read is versioned, the secondary will wait until it has received the routing table updates that correspond to the fresh version, at which point it must have already received the data that corresponds to the fresh version. I think there is only one known unversioned read, which is geoNear. The safe_secondary_reads_drop_recreate.js test specifies what kind of behavior (versioned, unversioned, or unsharded only) each read command has.

misha.tyulenev commented on Tue, 26 Sep 2017 16:25:01 +0000:
An update based on the offline discussion with redbeard0531, dianna.hohensee and esha.maharishi: The scenario in the description should work with versioned reads, because secondary chunk awareness will guarantee that the data on the secondary of S2 is consistent with the metadata on R. I.e., if R has updated the metadata and the targeted chunk is on S2, then the S2 secondary must also have the data (i.e., the replication has already been performed); if R has the old metadata, then it will get a stale sharding error, refresh the metadata and retry. Hence there is no outstanding issue in this scenario.
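
For readers of this thread, the versioned-read argument can be illustrated with a toy model. Everything here is a hypothetical simplification (StaleShardVersion, ShardSketch, VersionedRouterSketch, an integer routing version, a dict standing in for the config metadata), and it assumes the targeted secondary has already replicated the metadata for the version the router sends.

```python
from typing import Dict, Optional


class StaleShardVersion(Exception):
    """Hypothetical error: the router's routing version is older than the shard's."""


class ShardSketch:
    def __init__(self, version: int, docs: Dict[str, int]) -> None:
        self.version = version  # routing version this node has applied
        self.docs = docs

    def versioned_read(self, router_version: int, key: str) -> Optional[int]:
        if router_version < self.version:
            raise StaleShardVersion()
        # A real secondary would also wait until it has replicated the
        # metadata for router_version; the sketch assumes it already has.
        return self.docs.get(key)


class VersionedRouterSketch:
    def __init__(self, config: dict, shards: Dict[str, ShardSketch]) -> None:
        self.config = config  # authoritative {"version": int, "owner": {key: shard}}
        self.shards = shards
        self.refresh()

    def refresh(self) -> None:
        # Copy the authoritative routing metadata into the router's cache.
        self.version = self.config["version"]
        self.owner = dict(self.config["owner"])

    def read(self, key: str) -> Optional[int]:
        for _ in range(2):  # at most one refresh and retry
            shard = self.shards[self.owner[key]]
            try:
                return shard.versioned_read(self.version, key)
            except StaleShardVersion:
                self.refresh()
        raise RuntimeError("routing metadata still stale after refresh")
```

In this model a router holding pre-migration metadata targets S1, is rejected with the stale error, refreshes, and then reads the migrated value from S2, which is the retry path described in the comments.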