BugZero | MongoDB BugID 1903239 - Introduce a TemporarilyUnavailable error type

MongoDB - Defect ID: 1903239

Introduce a TemporarilyUnavailable error type

Last updated on 10/29/2023

Overall: 6.5

Severity: 6.4

Community: 8.8

Lifecycle: 9.1

Vendor details

Priority: Major - P3
Status: Closed

Info

The TemporarilyUnavailable error indicates that the operation has been aborted, likely due to excessive server load (e.g. transaction rolled back for eviction). This error is retried in the server with an increasingly larger backoff. Internal operations are retried indefinitely, user operations are retried up to a fixed number of attempts before returning TemporarilyUnavailable to the client.

Original title: Instead of WriteConflict, return a more specialized error when oldest transactions are rolled back for eviction

Original description: Currently, when a write operation is hitting the wt dirty threshold limit, we take the error from WiredTiger, a WT_ROLLBACK, and up-convert to a WriteConflict. This is misleading and should print something more specific instead. Something that would indicate the actual reason.
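The retry policy described above can be sketched in a few lines of Python. This is an illustration only, not the server's C++ implementation: the exception class, the function name `run_with_backoff`, the doubling schedule, and every numeric default are assumptions, since the ticket states only that the backoff grows and that user operations have a fixed attempt limit.

```python
import time


class TemporarilyUnavailable(Exception):
    """Hypothetical stand-in for the server's TemporarilyUnavailable error."""


def run_with_backoff(op, *, is_internal, max_user_attempts=10,
                     base_delay=0.001, max_delay=1.0, sleep=time.sleep):
    """Run op(), retrying on TemporarilyUnavailable with a growing delay.

    Internal operations retry indefinitely; user operations give up after
    max_user_attempts and re-raise the error to the caller.
    All numeric defaults are illustrative assumptions.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return op()
        except TemporarilyUnavailable:
            if not is_internal and attempt >= max_user_attempts:
                raise  # the error surfaces to the client
            # The delay doubles after each failed attempt, capped at max_delay.
            sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting; a caller would normally leave it at the default.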

Top User Comments

xgen-internal-githook commented on Tue, 15 Feb 2022 23:53:52 +0000:
Author: {'name': 'Josef Ahmad', 'email': 'josef.ahmad@mongodb.com', 'username': 'josefahmad'}
Message: SERVER-60839 Add TemporarilyUnavailable error
Introduce a TemporarilyUnavailable error and exception type for load shedding. This error indicates that the operation has been aborted, likely due to excessive server load. Errors are retried with an increasingly larger backoff. Internal operations are retried indefinitely, user operations up to a fixed number of attempts.
Branch: master
https://github.com/mongodb/mongo/commit/581c58c475a872e25b2e3bf7cf5ccd52425ef7c7

louis.williams commented on Mon, 7 Feb 2022 09:13:21 +0000:
kevin.jernigan, there are 2 cases to consider:
Tests that use multi-document transactions handle WriteConflictExceptions as a TransientTransactionError and retry indefinitely. This is what we tell users to do, and in fact, newer drivers do this automatically for users.
For non-multi-document transactions, this error is currently being retried indefinitely inside the server. The proposed behavior is to retry a finite number of times before eventually letting it escape.
The problem here is that our multi-document transactions tests were designed to handle this type of error, but the rest of our tests (i.e. most of them) are not.

JIRAUSER1258778 commented on Fri, 4 Feb 2022 17:52:37 +0000:
When this condition happens today, i.e. when a write operation hits the WiredTiger dirty threshold limit, we convert to a WriteConflict. How do we handle this in our test infrastructure - don't we fail entire tests for commands that aren't retryable? If so, then what changes if we return a more specialized error for this condition - won't the same tests fail that would fail without the changes in this ticket?

xgen-internal-githook commented on Wed, 2 Feb 2022 13:56:40 +0000:
Author: {'name': 'Josef Ahmad', 'email': 'josef.ahmad@mongodb.com', 'username': 'josefahmad'}
Message: SERVER-60839 Make wtRCToStatus require a WT_SESSION pointer
This is groundwork for further differentiating WT return codes.
Branch: master
https://github.com/mongodb/mongo/commit/f4aaa34d623e7385b2ac5b332ee07ece1f22c428

xgen-internal-githook commented on Wed, 2 Feb 2022 13:56:38 +0000:
Author: {'name': 'Josef Ahmad', 'email': 'josef.ahmad@mongodb.com', 'username': 'josefahmad'}
Message: SERVER-60839 Make wtRCToStatus require a WT_SESSION pointer
Branch: master
https://github.com/10gen/mongo-enterprise-modules/commit/7cfa78a4e20eb59c4d592bb12b6493c451b8dd13

milkie commented on Fri, 14 Jan 2022 20:38:28 +0000:
Thanks for the clarifications; I modified the title of this ticket for better specificity. Should we close SERVER-61454 as a duplicate?

louis.williams commented on Fri, 14 Jan 2022 09:06:25 +0000:
milkie, after discussing with keith.smith, he confirmed that there is only one scenario for a transaction being rolled back due to pinning cache space, and that is the "oldest pinned transaction ID rolled back for eviction". The "synchronous" case you described is just a generalization of the asynchronous case. When a very large transaction pins cache space and is unable to evict pages, WiredTiger will start to roll back transactions, starting from the oldest, until it gets to the large one. So these two cases that you described are not distinguishable from WiredTiger's perspective.

milkie commented on Thu, 13 Jan 2022 13:26:26 +0000:
It sounds like this ticket is starting to overlap with SERVER-61454. There are actually two similar cases for transaction rollback; one is asynchronous via other threads performing eviction and is based on transaction id age, and one I believe is synchronous within the transaction thread itself once that transaction pins too many pages with uncommitted writes, regardless of transaction age. I was assuming this ticket SERVER-60839 was dealing with the latter situation. In any event, I think we should treat these two cases differently with respect to retry logic.

louis.williams commented on Thu, 13 Jan 2022 10:17:16 +0000:
We should consider retrying internally once or twice in the existing writeConflictRetry path before ultimately letting this error escape. Additionally, we are considering labeling this error code as retryable so that drivers can retry once on their end. We won't be able to let this error escape internal threads. We can only let the error escape for user-originating operations.

louis.williams commented on Wed, 5 Jan 2022 09:29:09 +0000:
Using the work from WT-8290, we can now call WT_SESSION::get_rollback_reason after receiving a WT_ROLLBACK. If the reason is "oldest pinned transaction ID rolled back for eviction", we will return an error code indicating that the operation exceeded a memory limit. Perhaps the existing ExceededMemoryLimit would be a good error code to use.
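The classification discussed in the comments (inspect the rollback reason after a WT_ROLLBACK and choose a more specific error) can be illustrated with a short Python sketch. It is not the server's `wtRCToStatus`, which is C++ and takes a WT_SESSION pointer. The function name `classify_rollback` is hypothetical, the error names are returned as plain strings for illustration, and the numeric value of `WT_ROLLBACK` is quoted from memory of `wiredtiger.h` and should be treated as an assumption. Only the reason string is taken verbatim from the ticket.

```python
# Assumed value of WiredTiger's WT_ROLLBACK return code; verify against wiredtiger.h.
WT_ROLLBACK = -31800

# Reason string quoted in the ticket discussion.
EVICTION_REASON = "oldest pinned transaction ID rolled back for eviction"


def classify_rollback(ret, rollback_reason):
    """Map a WT_ROLLBACK plus its reason string to an error name.

    Hypothetical helper: returns "TemporarilyUnavailable" when the rollback
    was caused by eviction pressure, otherwise "WriteConflict".
    """
    if ret != WT_ROLLBACK:
        raise ValueError("classify_rollback only handles WT_ROLLBACK")
    if rollback_reason is not None and EVICTION_REASON in rollback_reason:
        return "TemporarilyUnavailable"
    return "WriteConflict"
```

A missing reason (`None`) falls back to WriteConflict, which mirrors the behavior the ticket describes as existing before this change.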

Steps to Reproduce

Top MongoDB defects by risk score

Defect ID: 2956672 (risk score 5.9)
Some time-series tests implicitly rely on measurement insertion order for unordered inserts when checking bucket catalog stats
Defect ID: 2965528 (risk score 6.14)
Remove push, publish_packages, and crypt_push tasks from Graviton 4 variants in v7.0 and v8.0
Defect ID: 2947969 (risk score 6.14)
[SBE] Release storage engine resources when saveState() or restoreState() throws
Defect ID: 2919474 (risk score 5.68)
StackLocator broken by v5 toolchain ASAN
Defect ID: 2968769 (risk score 5.88)
Make new write path helper functions use acquireAndValidateBucketsCollection instead of acquireCollection
