...
The Ruby, Java, and Python drivers (at least) have seen test failures when running an aggregation with $out against server version 8.1.0 in a replica-set topology. The failures do not happen against a sharded topology. The error message is: "PlanExecutor error during aggregation :: caused by :: indexes of target collection db0.coll1 changed during processing". The only indexes are those on _id, and there are no concurrent operations running.

A failing task for the Java driver, with server logs attached, can be found here: https://spruce.mongodb.com/task/mongo_java_driver_tests_jdk_secure__version~latest_os~linux_topology~replicaset_auth~auth_ssl~ssl_jdk~jdk21_test_1816e3cc9bef5e2321505f1d2b087fe90996dad5_24_04_30_17_57_13/tests?execution=0&sortBy=STATUS&sortDir=ASC

A client-side log from the Ruby driver, showing the commands issued during one of these failing tests, is attached to this ticket.
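For reference, here is a minimal sketch (in Python with pymongo) of the operation shape described above, assuming a client connected to a replica set running server 8.1.0. The connection string and the source collection name "coll0" are illustrative assumptions; db0.coll1 is the $out target named in the error message.

```python
# Minimal sketch of the failing operation shape, assuming pymongo and a
# replica-set deployment on server 8.1.0. The connection string and the
# source collection name ("coll0") are illustrative; db0.coll1 is the
# $out target named in the error message.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client["db0"]

# Seed a source collection; only the default _id indexes exist anywhere.
db["coll0"].insert_many([{"_id": i, "x": i} for i in range(3)])

# Aggregate with a $out stage targeting db0.coll1. Against an 8.1.0 replica
# set this intermittently fails with:
#   "PlanExecutor error during aggregation :: caused by :: indexes of target
#    collection db0.coll1 changed during processing"
db["coll0"].aggregate([{"$match": {}}, {"$out": "coll1"}])
```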
jeff.yemin commented on Tue, 21 May 2024 12:44:02 +0000: kaitlin.mahar@mongodb.com it seems like a viable workaround would be simply to remove the initialData section for coll1, e.g. this bit: https://github.com/mongodb/specifications/blob/master/source/crud/tests/unified/aggregate-write-readPreference.yml#L54-L56. If we do that, the test will not attempt to create the collection at all, and $out will take care of it. And since $out replaces any existing documents in the collection with the output of the $out stage, the outcome will be the same. What do you think?

JIRAUSER1270969 commented on Tue, 21 May 2024 08:04:29 +0000: After PM-3489 is enabled on both the master and 8.0 branches, acknowledgment of a {w: majority} command only means that the command is durable on a majority of nodes; the command may not yet have been applied on a majority of nodes. Reading from a secondary after a majority write without causal consistency is therefore not guaranteed to see the majority-committed document, yet some of our tests rely on exactly that. As a result, the change from PM-3489 can cause some jstest failures, and we have fixed a number of these tests as we saw them. However, we could still see further jstest failures coming in as cold BFs because they reproduce at a lower frequency. We want the team to be aware of this failure pattern, which may make your debugging process easier; please also keep it in mind when adding new jstests.

The typical failure pattern in these jstests is: start a replica set (most likely 2 nodes), execute a command using {w: majority}, then read from the secondary and expect to see the effect of that command.

A few ways to fix these tests (in order of preference; a sketch of option 2 follows below):
1. Call ReplSet.awaitReplication() before reading from the secondary.
2. Change the test to use a causally consistent session.
3. Change the write command to use a numbered write concern such as {w: 2} or {w: 3} instead of {w: majority}. Numbered write concerns wait for the write to be applied on the secondaries.
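To illustrate option 2 above, here is a hedged sketch written with Python and pymongo rather than the mongo shell's jstest helpers; the connection string, database, and collection names are illustrative assumptions, not taken from a specific test.

```python
# Sketch of fix (2), a causally consistent session, written with pymongo
# rather than the mongo shell's jstest helpers; names are illustrative.
from pymongo import MongoClient, ReadPreference, WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client["db0"]

with client.start_session(causal_consistency=True) as session:
    # Majority write performed inside the session.
    db.get_collection("coll1", write_concern=WriteConcern(w="majority")).insert_one(
        {"_id": 1}, session=session
    )

    # A secondary read in the same session waits until the secondary has
    # applied the write, so the document is guaranteed to be visible.
    doc = db.get_collection(
        "coll1", read_preference=ReadPreference.SECONDARY
    ).find_one({"_id": 1}, session=session)
    assert doc is not None
```

Options 1 and 3 instead make the test wait for replication explicitly, either through the test harness before the secondary read or by using {w: 2}/{w: 3} on the write itself.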