...
MongoDB version : v4.4.5 MongoDB Java Driver Version: mongodb-driver-sync 4.1.2 A transaction T1 failed to commit due to the WritableServerSelector-Timeout exception. However, another transaction T2 read the updates of T1. I use the withTransaction API and T1 retries for many times. The exception T1 throws are: 2022-08-25 00:35:04,237{GMT} WARN [jepsen worker 46] jepsen.mongodb.rw-register: #error { :cause Timed out after 1000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, servers=[{address=public-cd-a5.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=34.4 ms, state=CONNECTED}, {address=public-cd-a1.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=156.8 ms, state=CONNECTED}, {address=public-cd-a4.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=38.3 ms, state=CONNECTED}, {address=public-cd-a2.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=35.9 ms, state=CONNECTED}, {address=public-cd-a3.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=35.1 ms, state=CONNECTED}] :via [{:type com.mongodb.MongoTimeoutException :message Timed out after 1000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, servers=[{address=public-cd-a5.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=34.4 ms, state=CONNECTED}, {address=public-cd-a1.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=156.8 ms, state=CONNECTED}, {address=public-cd-a4.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=38.3 ms, state=CONNECTED}, {address=public-cd-a2.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=35.9 ms, state=CONNECTED}, {address=public-cd-a3.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=35.1 ms, state=CONNECTED}] :at [com.mongodb.internal.connection.BaseCluster createTimeoutException BaseCluster.java 407]}] :trace [[com.mongodb.internal.connection.BaseCluster createTimeoutException BaseCluster.java 407] [com.mongodb.internal.connection.BaseCluster selectServer BaseCluster.java 118] [com.mongodb.internal.connection.AbstractMultiServerCluster selectServer AbstractMultiServerCluster.java 52] [com.mongodb.internal.binding.ClusterBinding$ClusterBindingConnectionSource ClusterBinding.java 103] [com.mongodb.internal.binding.ClusterBinding$ClusterBindingConnectionSource ClusterBinding.java 99] [com.mongodb.internal.binding.ClusterBinding getWriteConnectionSource ClusterBinding.java 91] [com.mongodb.client.internal.ClientSessionBinding getWriteConnectionSource ClientSessionBinding.java 93] [com.mongodb.internal.operation.OperationHelper withReleasableConnection OperationHelper.java 618] [com.mongodb.internal.operation.CommandOperationHelper executeRetryableCommand CommandOperationHelper.java 705] [com.mongodb.internal.operation.TransactionOperation execute TransactionOperation.java 69] [com.mongodb.internal.operation.CommitTransactionOperation execute CommitTransactionOperation.java 133] [com.mongodb.internal.operation.CommitTransactionOperation execute CommitTransactionOperation.java 54] [com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor execute MongoClientDelegate.java 195] [com.mongodb.client.internal.ClientSessionImpl commitTransaction ClientSessionImpl.java 129] [com.mongodb.client.internal.ClientSessionImpl withTransaction ClientSessionImpl.java 210] [jdk.internal.reflect.GeneratedMethodAccessor29 invoke nil -1] [jdk.internal.reflect.DelegatingMethodAccessorImpl invoke DelegatingMethodAccessorImpl.java 43] [java.lang.reflect.Method invoke Method.java 566] [clojure.lang.Reflector invokeMatchingMethod Reflector.java 167] [clojure.lang.Reflector invokeInstanceMethod Reflector.java 102] [jepsen.mongodb.rw_register.Client$fn__2566 invoke rw_register.clj 137] [jepsen.mongodb.rw_register.Client invoke_BANG_ rw_register.clj 128] [jepsen.client.Validate invoke_BANG_ client.clj 83] [jepsen.generator.interpreter.ClientWorker invoke_BANG_ interpreter.clj 62] [jepsen.generator.interpreter$spawn_worker$fn__9422$fn__9423 invoke interpreter.clj 137] [jepsen.generator.interpreter$spawn_worker$fn__9422 invoke interpreter.clj 120] [clojure.core$binding_conveyor_fn$fn__5739 invoke core.clj 2030] [clojure.lang.AFn call AFn.java 18] [java.util.concurrent.FutureTask run FutureTask.java 264] [java.util.concurrent.ThreadPoolExecutor runWorker ThreadPoolExecutor.java 1128] [java.util.concurrent.ThreadPoolExecutor$Worker run ThreadPoolExecutor.java 628] [java.lang.Thread run Thread.java 829]]} The core message is Client view of cluster state is {type=REPLICA_SET, servers=[ {address=public-cd-a5.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=34.4 ms, state=CONNECTED}, {address=public-cd-a1.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=156.8 ms, state=CONNECTED}, {address=public-cd-a4.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=38.3 ms, state=CONNECTED}, {address=public-cd-a2.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=35.9 ms, state=CONNECTED}, {address=public-cd-a3.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=35.1 ms, state=CONNECTED} ] I have a mongo replica-set cluster consisting of 5 nodes. The primary is public-cd-a1.disalg.cn:37017. However, in this exception, they are all secondaries in the client's view. I have checked the mongod.log and there are no network issues or leader re-election. Node a1 is always the primary. Moreover, there are other types of nodes in the client's view except REPLICA_SET_SECONDARY. In this exception, I found REPLICA_SET_OTHER and UNKNOWN. :cause Timed out after 1000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, servers=[ {address=public-cd-a5.disalg.cn:37017, type=, roundTripTime=37.0 ms, state=CONNECTED}, {address=public-cd-a1.disalg.cn:37017, type=UNKNOWN, state=CONNECTING}, {address=public-cd-a4.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=822.4 ms, state=CONNECTED}, {address=public-cd-a2.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=34.1 ms, state=CONNECTED}, {address=public-cd-a3.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=195.9 ms, state=CONNECTED}] This issue makes us can not determine whether a transaction is successfully committed. Why does the client see a wrong view of the cluster state?
JIRAUSER1265262 commented on Mon, 9 Jan 2023 16:27:27 +0000: We haven’t heard back from you for some time, so I’m going to close this ticket. If you're ever able to, please provide additional information and we will reopen the ticket. Christopher JIRAUSER1265262 commented on Thu, 29 Dec 2022 19:47:58 +0000: Ouyang, After getting your project running, we found that it defaults to running against MongoDB 4.2.8, which appears to run successfully and with no failures. However, it is unclear what parameters you ran the test with against 4.4.5. I am referring to the GitHub repository you linked. Per the repository's sample test: lein run test-all -w wr -d replica-set \ --nodes-file ~/nodes-replica -r 1000 --concurrency 3n --timeout-per-txn 10000 --max-txn-length 8 \ --time-limit 60 --max-writes-per-key 128 \ --txn-read-concern snapshot --txn-write-concern majority \ --nemesis-interval 1 --nemesis partition --test-count 1 --leave-db-running --version 4.4.5 Running against 4.4.5 seemed to require a change to db.clj. When running against 4.4.5 after modifying the /src/jepsen/mongodb/db.clj file to hit https://repo.mongodb.org/apt/debian/dists/buster/mongodb-org/4.4/main/binary-amd64/mongodb-org-mongos_4.4.5_amd64.deb instead of https://repo.mongodb.org/apt/debian/dists/buster/mongodb-org/4.2/main/binary-amd64/mongodb-org-mongos_4.2.8_amd64.deb The project gets stuck at INFO jepsen node n4 - jepsen.mongodb.db Starting mongod WARN [2022-12-29 19:32:55,930] main - jepsen.core Test crashed! clojure.lang.ExceptionInfo: throw+: {:type :jepsen.mongodb.client/timed-out-awaiting-connection, :node "n4", :port 27017} at slingshot.support$stack_trace.invoke(support.clj:201) at jepsen.mongodb.client$await_open.invokeStatic(client.clj:96) at jepsen.mongodb.client$await_open.invoke(client.clj:92) at jepsen.mongodb.db$join_BANG_.invokeStatic(db.clj:195) at jepsen.mongodb.db$join_BANG_.invoke(db.clj:188) at jepsen.mongodb.db$replica_set_db$reify__561.setup_BANG_(db.clj:245) at jepsen.mongodb.repl_db.ReplicaDB.setup_BANG_(repl_db.clj:25) at jepsen.db$fn__3288$G__3267__3292.invoke(db.clj:11) at jepsen.db$fn__3288$G__3266__3297.invoke(db.clj:11) at clojure.core$partial$fn__5824.invoke(core.clj:2625) at jepsen.control$on_nodes$fn__3161.invoke(control.clj:430) at clojure.lang.AFn.applyToHelper(AFn.java:154) at clojure.lang.AFn.applyTo(AFn.java:144) at clojure.core$apply.invokeStatic(core.clj:665) at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1973) at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1973) at clojure.lang.RestFn.applyTo(RestFn.java:142) at clojure.core$apply.invokeStatic(core.clj:669) at clojure.core$bound_fn_STAR_$fn__5734.doInvoke(core.clj:2003) at clojure.lang.RestFn.invoke(RestFn.java:408) at dom_top.core$real_pmap_helper$build_thread__213$fn__214.invoke(core.clj:146) at clojure.lang.AFn.applyToHelper(AFn.java:152) at clojure.lang.AFn.applyTo(AFn.java:144) at clojure.core$apply.invokeStatic(core.clj:665) at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1973) at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1973) at clojure.lang.RestFn.invoke(RestFn.java:425) at clojure.lang.AFn.applyToHelper(AFn.java:156) at clojure.lang.RestFn.applyTo(RestFn.java:132) at clojure.core$apply.invokeStatic(core.clj:669) at clojure.core$bound_fn_STAR_$fn__5734.doInvoke(core.clj:2003) at clojure.lang.RestFn.invoke(RestFn.java:397) at clojure.lang.AFn.run(AFn.java:22) at java.base/java.lang.Thread.run(Thread.java:829) Despite the fact it appears MongoDB is accepting connections from the logs: 0.11:32868","connectionId":157,"connectionCount":1}} {"t":{"$date":"2022-12-29T19:32:55.144+00:00"},"s":"I", "c":"NETWORK", "id":22943, "ctx":"listener","msg":"Connection accepted","attr":{"remote":"172.19.0.11:32884","connectionId":158,"connectionCount":2}} {"t":{"$date":"2022-12-29T19:32:55.144+00:00"},"s":"I", "c":"NETWORK", "id":51800, "ctx":"conn158","msg":"client metadata","attr":{"remote":"172.19.0.11:32884","client":"conn158","doc":{"driver":{"name":"mongo-java-driver|sync","version":"4.0.2"},"os":{"type":"Linux","name":"Linux","architecture":"amd64","version":"5.4.0-1088-aws"},"platform":"Java/Debian/11.0.16+8-post-Debian-1deb10u1"}}} {"t":{"$date":"2022-12-29T19:32:55.144+00:00"},"s":"I", "c":"NETWORK", "id":51800, "ctx":"conn157","msg":"client metadata","attr":{"remote":"172.19.0.11:32868","client":"conn157","doc":{"driver":{"name":"mongo-java-driver|sync","version":"4.0.2"},"os":{"type":"Linux","name":"Linux","architecture":"amd64","version":"5.4.0-1088-aws"},"platform":"Java/Debian/11.0.16+8-post-Debian-1deb10u1"}}} {"t":{"$date":"2022-12-29T19:32:55.645+00:00"},"s":"I", "c":"NETWORK", "id":22944, "ctx":"conn157","msg":"Connection ended","attr":{"remote":"172.19.0.11:32868","connectionId":157,"connectionCount":1}} {"t":{"$date":"2022-12-29T19:32:55.645+00:00"},"s":"I", "c":"NETWORK", "id":22944, "ctx":"conn158","msg":"Connection ended","attr":{"remote":"172.19.0.11:32884","connectionId":158,"connectionCount":0}} I figured this was due to it originally using the java driver 4.0.2 which is not compatible with MongoDB 4.4, however when changing the version to 4.4.1 it continued to persist. tsunaouyang@gmail.com, Can you please clarify how you are running this project against 4.4.5? CLI parameters you use Are there separate branches / uncommitted changes you were using for this? Can you possibly help extract the contents of the the $dbpath/diagnostic.data directory (the contents are described here) from this test? I see the mongo.log but no diagnostic data from the output of these tests. The diagnostic data will be instrumental for us to investigate the issue, and without it we may not be able to draw conclusions from this. Christopher JIRAUSER1265262 commented on Mon, 26 Sep 2022 21:11:55 +0000: Hi Ouyang, We still need additional information to help diagnose the problem. If you're able, please attach some of the requested information to the ticket so we can investigate further. Christopher JIRAUSER1265262 commented on Thu, 8 Sep 2022 05:38:07 +0000: Hi Ouyang, Thanks for your report. To start investigating this further, first and foremost it would be super helpful to describe the exact events in sequence to form a clear timeline of what is taking place leading up to the cluster state issue, such as specifically when you are doing writes, when they're retrying, how you're confirming that T2 can read the updates of T1 and so on (and also how you cannot verify if they are committed) Any of the following would also be helpful and appreciated: Uploading the following to the ticket of the nodes in the replicaset (most interested in the ones that are reporting strange types/states) the mongod logs the $dbpath/diagnostic.data directory (the contents are described here) Directions for utilizing your project to demonstrate the issue you are mentioning (Or a standalone java reproduction for more clarity) Christopher JIRAUSER1269693 commented on Thu, 1 Sep 2022 08:55:21 +0000: Hi Chris, This error occurs in almost every test we conduct when the RTT between the clients and the servers is about 30 ms. In fact, we are actively testing MongoDB transactions and the project could be found in Tsunaou/mongodb at txn-checking-4.4.5 (github.com). The test framework is based on jepsen-io/jepsen: A framework for distributed systems verification, with fault injection (github.com). Since our test framework is implemented by Clojure, a functional program language based on JVM. If you are interested in reproducing this issue, I can write a detailed tutorial to guide you through it. If not, maybe we can try to reproduce it by Java, but it will take some time and I can not estimate it. Regrads, Young JIRAUSER1265262 commented on Mon, 29 Aug 2022 18:13:21 +0000: Hi Ouyang, Thank you for your report. I have seen this error arise due to network issues before on the community forums. I have also seen speculation that this is due to the bindIp configuration. However, that doesn't sound exactly the same here, since you are stating you're able to read the updates despite the error. I think it's possible there is some network issue at play here, but it is odd that your client is reporting issues if the writes are actually committing and able to be read later. To start investigating: Please describe how this is triggered; is it sporadic, or does it happen from a certain query? Can you provide a code example that reproduces your issue? Regards, Christopher JIRAUSER1269693 commented on Fri, 26 Aug 2022 09:18:44 +0000: The rrt from client to server is about 30 ms.