...
Since upgrading from 4.2 to 4.4, we are experiencing extremely long shutdown times for shard replica mongod processes. We are seeing this in our production clusters, as well as our much smaller test clusters. With 4.2, mongod shutdowns typically took less than a minute. Now they often take 30-60 minutes. When they do take such a long time, it is always because they are executing a WiredTiger rollback-to-stable (RTS). You can see logs like the following spewing out every 20 seconds:

{"t":{"$date":"2023-02-23T02:43:14.628+00:00"},"s":"I", "c":"STORAGE", "id":22322, "ctx":"conn1315","msg":"Shutting down checkpoint thread"}
{"t":{"$date":"2023-02-23T02:43:14.628+00:00"},"s":"I", "c":"STORAGE", "id":22323, "ctx":"conn1315","msg":"Finished shutting down checkpoint thread"}
{"t":{"$date":"2023-02-23T02:43:14.631+00:00"},"s":"I", "c":"STORAGE", "id":4795902, "ctx":"conn1315","msg":"Closing WiredTiger","attr":{"closeConfig":"leak_memory=true,"}}
{"t":{"$date":"2023-02-23T02:43:20.914+00:00"},"s":"I", "c":"NETWORK", "id":22944, "ctx":"conn1292","msg":"Connection ended","attr":{"remote":"10.160.171.190:37458","connectionId":1292,"connectionCount":18}}
{"t":{"$date":"2023-02-23T02:43:21.988+00:00"},"s":"I", "c":"NETWORK", "id":22944, "ctx":"conn262","msg":"Connection ended","attr":{"remote":"10.160.170.250:49662","connectionId":262,"connectionCount":17}}
{"t":{"$date":"2023-02-23T02:43:34.634+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120214:634344][3628:0x7f70507e1700], file:cdpactivationtest/collection-0--6346230458143384831.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 20 seconds and has inspected 597611 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-02-23T02:43:47.381+00:00"},"s":"I", "c":"NETWORK", "id":22944, "ctx":"conn1290","msg":"Connection ended","attr":{"remote":"10.160.171.190:37008","connectionId":1290,"connectionCount":16}}
{"t":{"$date":"2023-02-23T02:43:54.636+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120234:636642][3628:0x7f70507e1700], file:config/collection-112432-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 40 seconds and has inspected 601768 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-02-23T02:44:14.637+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120254:637117][3628:0x7f70507e1700], file:config/collection-13319-559579135213939646.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 60 seconds and has inspected 605383 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-02-23T02:44:34.638+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120274:638360][3628:0x7f70507e1700], file:config/collection-153892-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 80 seconds and has inspected 609305 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-02-23T02:44:54.632+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120294:632279][3628:0x7f70507e1700], file:config/collection-153892-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 100 seconds and has inspected 613075 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-02-23T02:45:14.638+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120314:638068][3628:0x7f70507e1700], file:config/collection-196706-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 120 seconds and has inspected 616721 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-02-23T02:45:34.632+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120334:632897][3628:0x7f70507e1700], file:config/collection-220381-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 140 seconds and has inspected 620387 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-02-23T02:45:54.635+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120354:635610][3628:0x7f70507e1700], file:config/collection-244326-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 160 seconds and has inspected 624076 files. For more detailed logging, enable WT_VERB_RTS"}}
{"t":{"$date":"2023-02-23T02:46:14.632+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"conn1315","msg":"WiredTiger message","attr":{"message":"[1677120374:632253][3628:0x7f70507e1700], file:config/collection-270505-5941089686580730609.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_PROGRESS] Rollback to stable has been running for 180 seconds and has inspected 627528 files. For more detailed logging, enable WT_VERB_RTS"}}
...

I checked the system stats during the rollback-to-stable, and none of them were saturated. I also took a sampling of the system calls being made using strace:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0       910           read
  0.00    0.000000           0       503           write
  0.00    0.000000           0       461           open
  0.00    0.000000           0       591           close
  0.00    0.000000           0       585           stat
  0.00    0.000000           0         6           writev
  0.00    0.000000           0     85452           select
  0.00    0.000000           0       708           sched_yield
  0.00    0.000000           0       446           sendmsg
  0.00    0.000000           0       998           recvmsg
  0.00    0.000000           0       260           getdents
  0.00    0.000000           0         6           rename
  0.00    0.000000           0      3627           gettimeofday
  0.00    0.000000           0        65           getrusage
  0.00    0.000000           0       585           statfs
  0.00    0.000000           0     89303     37032 futex
  0.00    0.000000           0        65           sched_getaffinity
  0.00    0.000000           0        10        10 restart_syscall
  0.00    0.000000           0     44993           clock_gettime
  0.00    0.000000           0       130           openat
------ ----------- ----------- --------- --------- ----------------
100.00    0.000000                229704     37042 total

I used db.adminCommand( { "setParameter": 1, "wiredTigerEngineRuntimeConfig": "verbose=[rts]" } ) to enable verbose logging for WiredTiger RTS.
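For reference, here is that toggle as I ran it from the shell, plus the way I assume it can be turned back off afterwards (the revert string is my own guess, not something I have confirmed against the documentation):

    // enable verbose rollback-to-stable output on the member that is about to be shut down
    db.adminCommand({ setParameter: 1, wiredTigerEngineRuntimeConfig: "verbose=[rts]" })

    // assumption: passing an empty verbose list drops the extra RTS output again once the
    // logs have been collected; I have not checked what the engine's default categories are
    db.adminCommand({ setParameter: 1, wiredTigerEngineRuntimeConfig: "verbose=[]" })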
This produced a ton of output, but most of it consisted of variations on the following three log messages:

{"t":{"$date":"2023-02-16T21:46:07.465+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"SignalHandler","msg":"WiredTiger message","attr":{"message":"[1676583967:465207][24280:0x7f09ea138700], file:test2nto/index-1020619--8977920817120252160.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_RTS(session)] skip rollback to stable on file file:test2nto/index-4666-8107731100219616386.wt because its checkpoint address length is 0"}}
{"t":{"$date":"2023-02-16T21:46:07.459+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"SignalHandler","msg":"WiredTiger message","attr":{"message":"[1676583967:459380][24280:0x7f09ea138700], file:test2nto/index-1020619--8977920817120252160.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_RTS(session)] file:test2nto/index-1021448--8977920817120252160.wt: tree skipped with durable timestamp: (0, 0) and stable timestamp: (1676583941, 44) or txnid: 0"}}
{"t":{"$date":"2023-02-16T21:46:07.436+00:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"SignalHandler","msg":"WiredTiger message","attr":{"message":"[1676583967:436685][24280:0x7f09ea138700], file:test2nto/index-1020156--8977920817120252160.wt, txn rollback_to_stable: [WT_VERB_RECOVERY_RTS(session)] 0x5583f59294e0: page walk skipped"}}

This snippet from https://jira.mongodb.org/browse/SERVER-52815 appears to explain why these rollback-to-stables now happen on every shutdown in 4.4:

    This specifically became of interest because one of the costs of WT shutdown in 4.2 and earlier was typically bounded by how much data had come into the system since the last checkpoint (or how far the stable timestamp moved since the last checkpoint). But with durable history, the cost is now bounded by how much data exists ahead of the stable timestamp (which is much easier to grow). This is because WT now calls rollback_to_stable inside of WT_CONNECTION::close.

But it's not clear whether there is any way to change this behavior. We would like to go back to how things behaved in 4.2 and earlier, where we were able to shut down our mongod processes in a minute or so. An hour-long shutdown is unacceptable: it makes rolling restarts for upgrades and other maintenance take 50 times longer, and, even more importantly, it prevents us from dealing expediently with cluster availability and performance issues.

I have tried the following tweaks to the shutdown process in the hope that one of them might prevent the RTS, but none of them helped (a rough mongosh sketch of these attempts appears after the cluster description below):

- make the replica hidden before shutting it down (we always do this anyway)
- remove the replica from the replica set before shutting it down
- run db.fsyncLock() to disable writes prior to shutdown
- use db.shutdownServer({ force: true }) to shut down, rather than sending SIGTERM

I also tried stopping the server with "kill -9". This is the one thing that actually worked: the subsequent mongod restart took longer (5 minutes rather than the typical 1 minute), but no RTS was performed. Nevertheless, using SIGKILL seems like a bad idea, since it gives mongod no opportunity to gracefully close files, network connections, etc.

We see this issue about 4 out of 5 times in our test cluster, which consists of:

- 2 shards
- 4 replicas per shard: 3 active with local SSDs and 1 hidden, EBS-backed member used for backups
- 150 databases with about 100 collections each
- 250 GB of total data

There is a load generator running that ensures regular updates are being made to one of the databases, but the load is light.
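As promised above, here is a rough mongosh sketch of the shutdown variations I tried. Each one was attempted independently, and the member index, host:port, and priority settings are placeholders rather than our exact scripts:

    // 1. Hide the replica before shutting it down (run against the primary; we do this anyway)
    cfg = rs.conf()
    cfg.members[2].hidden = true     // index 2 is a placeholder for the member being restarted
    cfg.members[2].priority = 0      // a hidden member must also have priority 0
    rs.reconfig(cfg)

    // 2. Alternatively, remove the member from the replica set entirely (run against the primary)
    rs.remove("shard1-replica4.example.com:27018")   // placeholder host:port

    // 3. Flush dirty data and block writes on the member itself prior to shutdown
    db.fsyncLock()

    // 4. Shut the member down from the shell with force, instead of sending SIGTERM
    db.shutdownServer({ force: true })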
Please let me know if there's any other information I can provide to help you diagnose this issue. Thank you!
JIRAUSER1270794 commented on Tue, 14 Mar 2023 19:01:17 +0000:

Hi ian.springer@salesforce.com,

Thank you for your report. We have investigated this issue previously and expect it to be resolved in v4.4 by the backport of WT-8652, which is estimated to be included in the upcoming 4.4.20 release (rc0 targeted for April 4; see BACKPORT-14306). As a result, we are closing this ticket. If you have any further questions, please let us know and we will reopen it.

Regards,
Yuan