...
During rollback we query the remote oplog, fetching only a couple of small fields from each oplog entry, and return up to 16 MB per batch (roughly 600,000 entries). Filling one batch therefore requires reading up to 600,000 entire oplog entries on the remote end. If those entries are large and not in cache, this can mean reading a very substantial amount of data from disk (tens or hundreds of GB), which may take longer than the hard-coded 10-minute socket timeout. In that case the rollback times out and cannot complete.
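As a rough illustration of why a batch size cap matters here, the following pymongo sketch issues the same kind of projected oplog scan against a sync source. The host name, projected fields, and batch size are assumptions for illustration, not the server's actual rollback code; the point is that batch_size() bounds how many oplog entries the remote node has to read to fill each reply batch.

```python
# Minimal sketch of a projected remote oplog scan with a bounded batch size.
# Field names, host, and batch size are illustrative assumptions.
from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://sync-source.example.net:27017")  # hypothetical sync source
oplog = client["local"]["oplog.rs"]

cursor = (
    oplog.find({}, projection={"ts": 1, "t": 1, "_id": 0})  # only the small fields we need
         .sort("$natural", DESCENDING)                      # walk the oplog newest-first
         .batch_size(2000)                                   # cap per-batch work on the remote end
)

for entry in cursor:
    print(entry["ts"])
```

Without the batch_size() call, the server keeps scanning until the 16 MB reply limit is hit, which with tiny projected documents can mean reading hundreds of thousands of full oplog entries per batch.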
xgen-internal-githook commented on Tue, 26 Jun 2018 15:52:08 +0000:
Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>
Message: SERVER-32382 set a default rollback batch size (cherry picked from commit e716867bb5c36f7ad4686cf020f5f35b9cd9636e)
Branch: v3.6
https://github.com/mongodb/mongo/commit/1ba1a9fad2d065243a704b6338812406ac445eb0

xgen-internal-githook commented on Tue, 26 Jun 2018 15:52:07 +0000:
Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>
Message: SERVER-32382 add rollback remote oplog batch size (cherry picked from commit 9a112a8cb260bfc65bb2bfa3118044744e91a8cb)
Branch: v3.6
https://github.com/mongodb/mongo/commit/ba312234b81a51a397833d1438c2f83aa2a90aa1

xgen-internal-githook commented on Wed, 16 May 2018 15:47:13 +0000:
Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>
Message: SERVER-32382 set a default rollback batch size
Branch: master
https://github.com/mongodb/mongo/commit/e716867bb5c36f7ad4686cf020f5f35b9cd9636e

xgen-internal-githook commented on Tue, 15 May 2018 13:44:11 +0000:
Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>
Message: SERVER-32382 add rollback remote oplog batch size
Branch: master
https://github.com/mongodb/mongo/commit/9a112a8cb260bfc65bb2bfa3118044744e91a8cb

spencer commented on Tue, 6 Mar 2018 22:45:58 +0000: A 10-minute socket timeout already seems ridiculously large, but you're right that in this case increasing it would work around the real issue and be a very small code change, so I'm open to it. EDIT: although it's probably just as easy to make the batch size configurable as to make the timeout configurable, and changing the batch size is probably the better solution.

greg.mckeon commented on Tue, 6 Mar 2018 22:37:17 +0000: Back to triage to consider Cailin's comment.

cailin.nelson@10gen.com commented on Sat, 3 Mar 2018 16:42:14 +0000: I don't think it's super important. According to Judah's investigation on HELP-5504, we can probably avoid falling into this situation by switching to w:majority writes, so improving what happens in this situation is not a high priority. That said, why not make the magic 10-minute timeout a setParameter, so that customers have an emergency "out" if they need it?

greg.mckeon commented on Fri, 2 Mar 2018 21:35:28 +0000: cailin.nelson Could you comment on the impact of this ticket for Cloud? We're planning to prioritize based on how much pain this will cause you.

bruce.lucas@10gen.com commented on Mon, 18 Dec 2017 15:34:06 +0000: The issue as I understand it is that we already project only the needed fields, but don't set a batch size, so we may have to read a very large amount of oplog data in order to return a batch of 16 MB of very small documents extracted from those large oplog entries. Can we just set a small enough batch size on this query to limit the amount of data that has to be read on the remote end in order to return each batch?

judah.schvimer commented on Mon, 18 Dec 2017 15:17:58 +0000: All of this code is used in the new rollback algorithm for 3.8. I expect this problem to exist in 3.6 and continue to exist in 3.8 until addressed. We certainly can project out the needed fields very easily. I'm not sure of the best solution to the socket timeout problem.
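The commits above introduce a configurable rollback batch size rather than a configurable timeout. As a hedged sketch, assuming the knob is exposed as a server parameter named rollbackRemoteOplogQueryBatchSize (a name inferred from the commit messages, not verified here), it could be adjusted at runtime on the rolling-back node like this:

```python
# Sketch only: the parameter name below is an assumption based on the commit
# messages ("add rollback remote oplog batch size"); verify against your version.
from pymongo import MongoClient
from bson.son import SON

client = MongoClient("mongodb://localhost:27017")
client.admin.command(SON([
    ("setParameter", 1),
    ("rollbackRemoteOplogQueryBatchSize", 2000),  # smaller batches mean less remote I/O per reply
]))
```

The same parameter could presumably also be set at startup with mongod's --setParameter option; the trade-off is more network round trips per rollback in exchange for bounding how long the sync source spends filling any single batch.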