I am trying to set up a backup procedure according to "Back Up a Sharded Cluster with File System Snapshots". I have a 3-member config replica set (CSRS) and 4 shards; each shard is a PSA replica set. My application keeps running during the backup, of course.

I run `db.fsyncLock()` on one config SECONDARY. While that CSRS member is locked, my aggregation pipeline fails:

```javascript
db.getSiblingDB('lau01mipmed03').getCollection('sessions#0.096-1500').aggregate([
   { '$match': { 'n': { '$ne': 'dummy' }, 't': { '$gte': ISODate('2021-04-06T13:00:00.000Z'), '$lt': ISODate('2021-04-06T19:00:00.000Z') } } },
   { '$unset': '_id' },
   { '$merge': { 'into': { 'db': 'lau01mipmed03', 'coll': 'sessions#0' } } }
], { 'allowDiskUse': true })
```

Error:

```
Request 13963276 timed out, deadline was 2021-04-06T15:00:42.947+02:00, op was RemoteCommand 13963276 -- target:[d-mipmdb-cfg-01.xxx.xxxx.xxx:27019] db:config expDate:2021-04-06T15:00:42.947+02:00 cmd:{ find: "collections", filter: { _id: "lau01mipmed03.sessions#0" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp(1617714012, 1), t: 2 } }, limit: 1, maxTimeMS: 30000 }
```

Host d-mipmdb-cfg-01.xxx.xxxx.xxx is the locked config SECONDARY. The error does not seem to be related to the aggregation pipeline itself; the same error also appears on other commands, e.g.

```javascript
db.getSiblingDB('t-mipmed-as-01').getCollection('sessions.096-1500.stats').aggregate([
   { '$project': { 'stats': 1 } },
   { '$unwind': '$stats' },
   { '$replaceRoot': { 'newRoot': '$stats' } },
   { '$merge': { 'into': { 'db': 'mip', 'coll': 'session.stats.raw' } } }
], { allowDiskUse: true })
```
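For what it's worth, the failing operation in the error text is not the user's write itself but a mongos metadata lookup: a `find` on `config.collections` with `readConcern: { level: "majority", afterOpTime: ... }` and `maxTimeMS: 30000`. A small helper (hypothetical, plain JavaScript, not part of any MongoDB driver) that recognizes this timeout shape and pulls out the remote target can make triage easier:

```javascript
// Hypothetical triage helper (plain JavaScript, not a MongoDB driver API):
// recognize the "Request ... timed out ... op was RemoteCommand ..." error
// shape seen in this ticket and extract the remote target host and database.
function parseRemoteCommandTimeout(message) {
  if (!/timed out/.test(message) || !/RemoteCommand/.test(message)) {
    return null; // some other error; not the shape seen in this ticket
  }
  const target = /target:\[([^\]]+)\]/.exec(message);
  const db = /\bdb:(\S+)/.exec(message);
  return {
    targetHost: target ? target[1] : null, // the remote host the command was sent to
    database: db ? db[1] : null,           // the database of the remote command
  };
}
```

Run against the error message above, this yields `targetHost: "d-mipmdb-cfg-01.xxx.xxxx.xxx:27019"` and `database: "config"`, i.e. the remote command was aimed at the locked config secondary.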
JIRAUSER1257066 commented on Thu, 22 Apr 2021 19:45:25 +0000:
Hi wernfried.domscheit@sunrise.net,
Thanks for following up. I will go ahead and close this ticket.
Best,
Edwin

JIRAUSER1257089 commented on Wed, 21 Apr 2021 09:05:15 +0000:
Hi Edwin
Please see the attached logfile. Yes, the aggregation runs on the mongos. This is my application's log entry; it should help you find the relevant messages in the logfile:

```javascript
{
   "name" : "mergeSessions",
   "command" : "db.getSiblingDB('pzur01mipmed0').getCollection('sessions#0.111-1000').aggregate([{'$match':{'n':{'$ne':'dummy'},'t':{'$gte':'2021-04-21T08:00:00.000Z','$lt':'2021-04-21T14:00:00.000Z'}}},{'$unset':'_id'},{'$merge':{'into':{'db':'pzur01mipmed0','coll':'sessions#0'}}}], {'allowDiskUse':true})",
   "ts" : ISODate("2021-04-21T10:00:41.858+0200"),
   "state" : "error",
   "error" : "Request 107938050 timed out, deadline was 2021-04-21T10:00:41.851+02:00, op was RemoteCommand 107938050 -- target:[d-mipmdb-cfg-01.swi.srse.net:27019] db:config expDate:2021-04-21T10:00:41.851+02:00 cmd:{ find: \"collections\", filter: { _id: \"pzur01mipmed0.sessions#0\" }, readConcern: { level: \"majority\", afterOpTime: { ts: Timestamp(1618992011, 21576), t: 4 } }, limit: 1, maxTimeMS: 30000 }",
   "duration" : 30.13
}
```

Host "d-mipmdb-cfg-01.swi.srse.net" is the CSRS primary; host "d-mipmdb-cfg-02.swi.srse.net" is the locked CSRS secondary. Please note that by properly scheduling my backup I can prevent this error in my application. It's up to you whether you would like to investigate this issue further; of course, I will keep supporting you on this case.
Best Regards
Wernfried
[^mongos-exp.log.gz]

JIRAUSER1257066 commented on Tue, 20 Apr 2021 15:41:46 +0000:
Hi wernfried.domscheit@sunrise.net,
Thanks for following up with the added clarity. It's certainly strange that these commands time out only intermittently. Could you provide some more information? Are you running your aggregation on the mongos, and are you finding these log lines in mongos.log?
Can you share the log file that covers the log line you shared and the time window when the aggregation is run? If the log file doesn't reveal the version of MongoDB you're using, can you please provide that information as well? We look forward to hearing from you!
Best,
Edwin

JIRAUSER1257089 commented on Tue, 20 Apr 2021 06:15:25 +0000:
Hi Edwin
I locked only one (out of 3) CSRS secondaries; all the shards were left unlocked. I don't understand why a $merge command requires any write operation on the CSRS, since the target collections already exist. Why does it fail when 2 out of 3 CSRS members are up, running, and unlocked? Note again: most of the time these commands run fine; the timeout occurs only occasionally.
Best Regards
Wernfried

JIRAUSER1257066 commented on Mon, 19 Apr 2021 21:07:50 +0000:
Hi wernfried.domscheit@sunrise.net,
I was able to reproduce this error message by running db.fsyncLock() on all shard secondaries and CSRS secondaries. I suspect this behavior is intended when using db.fsyncLock() and attempting to perform a write using $merge. db.fsyncLock() prevents additional writes until the user releases the lock with a corresponding db.fsyncUnlock() command. Since $merge is a write operation, we'd expect issues to arise when attempting to write after using db.fsyncLock(). Upon performing db.fsyncUnlock(), the error messages ceased. Does my description clear up this behavior?
Best,
Edwin
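The fsyncLock/fsyncUnlock semantics described in the last comment can be pictured as a gate that holds writes back while locked and releases them on unlock. This is a toy model in plain JavaScript to illustrate the described behavior only; it is in no way MongoDB's actual implementation:

```javascript
// Toy model (NOT MongoDB internals): while "fsyncLocked", incoming writes are
// queued and do not complete; fsyncUnlock() releases the lock and flushes them.
class FsyncGate {
  constructor() {
    this.locked = false;
    this.pending = []; // write callbacks queued while locked
  }
  fsyncLock() { this.locked = true; }
  fsyncUnlock() {
    this.locked = false;
    while (this.pending.length > 0) this.pending.shift()(); // flush queued writes
  }
  write(apply) {
    if (this.locked) {
      this.pending.push(apply); // blocked until fsyncUnlock(), like a write hitting a locked node
      return 'queued';
    }
    apply();
    return 'applied';
  }
}
```

In this toy model a queued write simply waits; in the real cluster the waiting `find` against `config.collections` instead hits its 30-second `maxTimeMS` deadline and surfaces as the timeout error reported above.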
I am not able to reproduce the error in a reliable way. My application runs about 20 of these jobs in parallel; they are all the same, only the database and collection names differ. 5-10 of the jobs fail, the others run fine.
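Since only a minority of the roughly 20 parallel jobs hit the 30-second timeout, and the reporter notes that rescheduling the backup avoids the error, one workaround is to retry a failed job a bounded number of times so that it succeeds once the lock window has passed. This is a hypothetical sketch, not code from the ticket; `runJob` stands in for running one of the aggregations on the mongos:

```javascript
// Hypothetical workaround sketch: retry a transiently failing job a bounded
// number of times, sleeping between attempts. `runJob` is assumed to be an
// async function that runs one aggregation and rejects on a timeout error.
async function runWithRetry(runJob, maxAttempts = 3, delayMs = 1000) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await runJob(); // succeeds once the fsyncLock window has passed
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError; // still failing after all attempts
}
```

A retry only masks the symptom; whether mongos should fail a majority metadata read at all while 2 of 3 CSRS members are unlocked is exactly the open question in this ticket.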