I am trying to set up a backup procedure according to "Back Up a Sharded Cluster with File System Snapshots". I have a 3-member config replica set (CSRS) and 4 shards; each shard is a PSA replica set. My application keeps running during the backup, of course.

I run `db.fsyncLock()` on one config SECONDARY. While that CSRS member is locked, my aggregation pipeline fails:

```javascript
db.getSiblingDB('lau01mipmed03').getCollection('sessions#0.096-1500').aggregate([
   { '$match': { 'n': { '$ne': 'dummy' }, 't': { '$gte': ISODate('2021-04-06T13:00:00.000Z'), '$lt': ISODate('2021-04-06T19:00:00.000Z') } } },
   { '$unset': '_id' },
   { '$merge': { 'into': { 'db': 'lau01mipmed03', 'coll': 'sessions#0' } } }
], { 'allowDiskUse': true })
```

Error:

```
Request 13963276 timed out, deadline was 2021-04-06T15:00:42.947+02:00, op was RemoteCommand 13963276 -- target:[d-mipmdb-cfg-01.xxx.xxxx.xxx:27019] db:config expDate:2021-04-06T15:00:42.947+02:00 cmd:{ find: "collections", filter: { _id: "lau01mipmed03.sessions#0" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp(1617714012, 1), t: 2 } }, limit: 1, maxTimeMS: 30000 }
```

Host d-mipmdb-cfg-01.xxx.xxxx.xxx is the locked config SECONDARY. The error does not seem to be related to the aggregation pipeline itself; the same error also appears on other commands, e.g.

```javascript
db.getSiblingDB('t-mipmed-as-01').getCollection('sessions.096-1500.stats').aggregate([
   { '$project': { 'stats': 1 } },
   { '$unwind': '$stats' },
   { '$replaceRoot': { 'newRoot': '$stats' } },
   { '$merge': { 'into': { 'db': 'mip', 'coll': 'session.stats.raw' } } }
], { allowDiskUse: true })
```
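For what it's worth, the failing operation in the error text is not the user's write itself but a mongos metadata lookup: a `find` on `config.collections` with `readConcern: { level: "majority", afterOpTime: ... }` and `maxTimeMS: 30000`. A small helper (hypothetical, plain JavaScript, not part of any MongoDB driver) that recognizes this timeout shape and pulls out the remote target can make triage easier:

```javascript
// Hypothetical triage helper (plain JavaScript, not a MongoDB driver API):
// recognize the "Request ... timed out ... op was RemoteCommand ..." error
// shape seen in this ticket and extract the remote target host and database.
function parseRemoteCommandTimeout(message) {
  if (!/timed out/.test(message) || !/RemoteCommand/.test(message)) {
    return null; // some other error; not the shape seen in this ticket
  }
  const target = /target:\[([^\]]+)\]/.exec(message);
  const db = /\bdb:(\S+)/.exec(message);
  return {
    targetHost: target ? target[1] : null, // the remote host the command was sent to
    database: db ? db[1] : null,           // the database of the remote command
  };
}
```

Run against the error message above, this yields `targetHost: "d-mipmdb-cfg-01.xxx.xxxx.xxx:27019"` and `database: "config"`, i.e. the remote command was aimed at the locked config secondary.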
JIRAUSER1257066 commented on Thu, 22 Apr 2021 19:45:25 +0000:
Hi wernfried.domscheit@sunrise.net,
Thanks for following up. I will go ahead and close this ticket.
Best,
Edwin

JIRAUSER1257089 commented on Wed, 21 Apr 2021 09:05:15 +0000:
Hi Edwin
Please see the attached logfile. Yes, the aggregation runs on the mongos. This is my application's log entry; it should help you find the relevant messages in the logfile:

```javascript
{
   "name" : "mergeSessions",
   "command" : "db.getSiblingDB('pzur01mipmed0').getCollection('sessions#0.111-1000').aggregate([{'$match':{'n':{'$ne':'dummy'},'t':{'$gte':'2021-04-21T08:00:00.000Z','$lt':'2021-04-21T14:00:00.000Z'}}},{'$unset':'_id'},{'$merge':{'into':{'db':'pzur01mipmed0','coll':'sessions#0'}}}], {'allowDiskUse':true})",
   "ts" : ISODate("2021-04-21T10:00:41.858+0200"),
   "state" : "error",
   "error" : "Request 107938050 timed out, deadline was 2021-04-21T10:00:41.851+02:00, op was RemoteCommand 107938050 -- target:[d-mipmdb-cfg-01.swi.srse.net:27019] db:config expDate:2021-04-21T10:00:41.851+02:00 cmd:{ find: \"collections\", filter: { _id: \"pzur01mipmed0.sessions#0\" }, readConcern: { level: \"majority\", afterOpTime: { ts: Timestamp(1618992011, 21576), t: 4 } }, limit: 1, maxTimeMS: 30000 }",
   "duration" : 30.13
}
```

Host "d-mipmdb-cfg-01.swi.srse.net" is the CSRS primary; host "d-mipmdb-cfg-02.swi.srse.net" is the locked CSRS secondary. Please note that by properly scheduling my backup I can prevent this error in my application. It's up to you whether you would like to investigate this issue further; of course, I will keep supporting you on this case.
Best Regards
Wernfried
[^mongos-exp.log.gz]

JIRAUSER1257066 commented on Tue, 20 Apr 2021 15:41:46 +0000:
Hi wernfried.domscheit@sunrise.net,
Thanks for following up with the added clarity. It's certainly strange that these commands time out only intermittently. Could you provide some more information? Are you running your aggregation on the mongos, and are you finding these log lines in mongos.log?
Can you share the log file that covers the log line you shared and the time window when the aggregation is run? If the log file doesn't reveal the version of MongoDB you're using, can you please provide that information as well? We look forward to hearing from you!
Best,
Edwin

JIRAUSER1257089 commented on Tue, 20 Apr 2021 06:15:25 +0000:
Hi Edwin
I locked only one (out of 3) CSRS secondaries; all the shards were left unlocked. I don't understand why a $merge command requires any write operation on the CSRS, since the target collections already exist. Why does it fail when 2 out of 3 CSRS members are up, running, and unlocked? Note again: most of the time these commands run fine; the timeout occurs only occasionally.
Best Regards
Wernfried

JIRAUSER1257066 commented on Mon, 19 Apr 2021 21:07:50 +0000:
Hi wernfried.domscheit@sunrise.net,
I was able to reproduce this error message by running db.fsyncLock() on all shard secondaries and CSRS secondaries. I suspect this behavior is intended when using db.fsyncLock() and attempting to perform a write using $merge. db.fsyncLock() prevents additional writes until the user releases the lock with a corresponding db.fsyncUnlock() command. Since $merge is a write operation, we'd expect issues to arise when attempting to write after using db.fsyncLock(). Upon performing db.fsyncUnlock(), the error messages ceased. Does my description clear up this behavior?
Best,
Edwin
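The fsyncLock/fsyncUnlock semantics described in the last comment can be pictured as a gate that holds writes back while locked and releases them on unlock. This is a toy model in plain JavaScript to illustrate the described behavior only; it is in no way MongoDB's actual implementation:

```javascript
// Toy model (NOT MongoDB internals): while "fsyncLocked", incoming writes are
// queued and do not complete; fsyncUnlock() releases the lock and flushes them.
class FsyncGate {
  constructor() {
    this.locked = false;
    this.pending = []; // write callbacks queued while locked
  }
  fsyncLock() { this.locked = true; }
  fsyncUnlock() {
    this.locked = false;
    while (this.pending.length > 0) this.pending.shift()(); // flush queued writes
  }
  write(apply) {
    if (this.locked) {
      this.pending.push(apply); // blocked until fsyncUnlock(), like a write hitting a locked node
      return 'queued';
    }
    apply();
    return 'applied';
  }
}
```

In this toy model a queued write simply waits; in the real cluster the waiting `find` against `config.collections` instead hits its 30-second `maxTimeMS` deadline and surfaces as the timeout error reported above.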
I am not able to reproduce the error in a reliable way. My application runs about 20 of these jobs in parallel; they are all the same, only the database and collection names differ. 5-10 of the jobs fail, the others run fine.
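Since only a minority of the roughly 20 parallel jobs hit the 30-second timeout, and the reporter notes that rescheduling the backup avoids the error, one workaround is to retry a failed job a bounded number of times so that it succeeds once the lock window has passed. This is a hypothetical sketch, not code from the ticket; `runJob` stands in for running one of the aggregations on the mongos:

```javascript
// Hypothetical workaround sketch: retry a transiently failing job a bounded
// number of times, sleeping between attempts. `runJob` is assumed to be an
// async function that runs one aggregation and rejects on a timeout error.
async function runWithRetry(runJob, maxAttempts = 3, delayMs = 1000) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await runJob(); // succeeds once the fsyncLock window has passed
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError; // still failing after all attempts
}
```

A retry only masks the symptom; whether mongos should fail a majority metadata read at all while 2 of 3 CSRS members are unlocked is exactly the open question in this ticket.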