...
Intermittently my logs fill with "ClientCursor::staticYield can't unlock b/c of recursive lock ns ..." [millions of entries]. It happens on a lot of different collections, but always on a findAndModify. At its worst the lock lasts 6-7 seconds and the entire database is unresponsive during that time. I've read that it may be related to insufficient indexing, but I've checked that with .explain() and the cursor is a BtreeCursor. The issue is similar to SERVER-15583, but there an upgrade to 2.6.6 solved it, whereas I'm already on 2.6.6.
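Editor's note: a minimal shell sketch of the kind of check described above, with hypothetical collection, field, and value names borrowed from the log excerpts later in this thread. Because the findAndModify in those logs also sorts on { _id: 1 }, explaining the filter together with the sort is closer to the shape the command actually runs than explaining the filter alone (this is also the check Ramón requests further down):

// Explain the filter alone (the check described above); expected to show BtreeCursor yppie_1.
db.morethan160.find({ yppie: "pQlY" }).explain(true)

// Explain filter + sort, matching the findAndModify's query shape;
// on 2.6 this may select a different plan (e.g. an _id index scan).
db.morethan160.find({ yppie: "pQlY" }).sort({ _id: 1 }).explain(true)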
ramon.fernandez commented on Fri, 29 May 2015 19:28:53 +0000:

ricei, I believe you may be running into the same issue as described in SERVER-17054; please see this comment for an explanation of that issue. To be sure that your issue has the same root cause as that of SERVER-17054, I'd like to ask you to run the commands below and send the output:

// Save current logLevel, bump logLevel to 2
var oldLogLevel = db.adminCommand({getParameter:1, logLevel:1}).logLevel
db.adminCommand( { setParameter: 1, logLevel: 2 } )

db.issues.getPlanCache().clear()
db.col.find({ yppie: "pQ4c" }, {}).sort({ _id : 1}).explain(true)
db.col.getPlanCache().getPlansByQuery({ yppie: "pQ4c" }, {}, { _id : 1})
db.col.getPlanCache().listQueryShapes()

// Reset logLevel
db.adminCommand( { setParameter: 1, logLevel: oldLogLevel } )

You'll note that these commands increase the log level, so I'd also like to see the logs generated while these commands are run to better diagnose this issue. Thanks, Ramón.

ramon.fernandez commented on Thu, 21 May 2015 18:10:49 +0000:

Thanks for the additional information, ricei. The query plan cache is indeed our main suspect; the harder part is to trigger the exact behavior with a suitable dataset. I will try with unindexed queries as you suggest and report back with my findings.

ricei commented on Thu, 21 May 2015 16:26:22 +0000:

As an update to this, I discovered that the issue was triggered by a previous unindexed query. That is, a query is run on the collection that does not use an index but rather a scan, and subsequently the findAndModify, which should use an index, instead uses a scan. It seems the findAndModify has a race to select which method to use, and the results of the previous query influence the outcome of that race, resulting in the incorrect method being selected.

ramon.fernandez commented on Thu, 21 May 2015 16:17:24 +0000:

ricei, apologies for the long delay in getting back to you. While I have not been able to reproduce this behavior locally, I wanted to let you know that this behavior has been removed in v3.0, and findAndModify() queries can now yield. We'll continue to investigate why findAndModify() seems to ignore available indexes to satisfy the query, but if upgrading to 3.0.3 is an option for you, it would be helpful for us if you could let us know how 3.0.3 behaves compared to 2.6.9. Thanks, Ramón.

gforget commented on Sun, 29 Mar 2015 17:19:10 +0000:

We are on 2.6.9 and seeing a very similar issue. It's causing write lock contention and affecting the performance of our app.
2015-03-28T04:47:25.818-0500 [conn147815] warning: ClientCursor::staticYield can't unlock b/c of recursive lock ns: top: { opid: 292929752, active: true, secs_running: 0, microsecs_running: 372, op: "query", ns: "community.userProductActivities", query: { findandmodify: "userProductActivities", query: { pid: 470509812, uid: 26502826 }, fields: { view.cViews: 1 }, update: { $set: { lm: new Date(1427536045818), view.lm: new Date(1427536045818) }, $inc: { view.cViews: 1 } }, upsert: true }, client: "10.171.74.4:41770", desc: "conn147815", threadId: "0x7e800772a700", connectionId: 147815, locks: { ^: "w", ^community: "W" }, waitingForLock: false, numYields: 0, lockStats: { timeLockedMicros: {}, timeAcquiringMicros: { r: 0, w: 2 } } }

I have attached the outputs of the commands requested earlier.

ricei commented on Wed, 25 Feb 2015 12:05:54 +0000:

Some more logs:

2015-02-23T21:25:31.797+0000 [conn85394] warning: ClientCursor::staticYield can't unlock b/c of recursive lock ns: top: { opid: 23228747, active: true, secs_running: 6, microsecs_running: 6598838, op: "query", ns: "mapandmore", query: { findandmodify: "morethan160", query: { yppie: "pJ11" }, sort: { _id: 1 }, new: 1, remove: 0, upsert: 0, update: { $inc: { clickcount: 1 }, $push: { when: new Date(1424726725198), clickbucket: [ 1424649600000.0, "23-2-2015 21-hour", "23-2-2015-day", "2-2015-month", "2015-1-quarter", "2015-year" ] } } }, client: "46.137.175.200:32780", desc: "conn85394", threadId: "0x7fa593451700", connectionId: 85394, locks: { ^: "w", ^mapandmore: "W" }, waitingForLock: false, numYields: 0, lockStats: { timeLockedMicros: {}, timeAcquiringMicros: { r: 0, w: 4 } } }

2015-02-23T21:25:31.797+0000 [conn85394] query mapandmore.morethan160 query: { query: { yppie: "pJ11" }, orderby: { _id: 1 } } planSummary: IXSCAN { _id: 1 } ntoreturn:1 ntoskip:0 nscanned:2145159 nscannedObjects:2145159 keyUpdates:0 numYields:0 nreturned:1 reslen:3256 6598ms

2015-02-23T21:25:31.803+0000 [conn85394] command mapandmore.$cmd command: findAndModify { findandmodify: "morethan160", query: { yppie: "pJ11" }, sort: { _id: 1 }, new: 1, remove: 0, upsert: 0, update: { $inc: { clickcount: 1 }, $push: { when: new Date(1424726725198), clickbucket: [ 1424649600000.0, "23-2-2015 21-hour", "23-2-2015-day", "2-2015-month", "2015-1-quarter", "2015-year" ] } } } keyUpdates:0 numYields:0 locks(micros) w:6603612 reslen:3540 6604ms

2015-02-23T21:29:00.053+0000 [conn85398] warning: ClientCursor::staticYield can't unlock b/c of recursive lock ns: top: { opid: 23232763, active: true, secs_running: 6, microsecs_running: 6379877, op: "query", ns: "mapandmore", query: { $msg: "query not recording (too large)" }, client: "46.137.175.200:32784", desc: "conn85398", threadId: "0x7fa593552700", connectionId: 85398, locks: { ^: "w", ^mapandmore: "W" }, waitingForLock: false, numYields: 0, lockStats: { timeLockedMicros: {}, timeAcquiringMicros: { r: 0, w: 5 } } }

2015-02-23T21:29:00.053+0000 [conn85398] query mapandmore.morethan160 query: { query: { yppie: "pP4c" }, orderby: { _id: 1 } } planSummary: IXSCAN { _id: 1 } ntoreturn:1 ntoskip:0 nscanned:2163399 nscannedObjects:2163399 keyUpdates:0 numYields:0 nreturned:1 reslen:1911 6380ms

2015-02-23T21:29:58.975+0000 [conn85397] warning: ClientCursor::staticYield can't unlock b/c of recursive lock ns: top: { opid: 23234034, active: true, secs_running: 6, microsecs_running: 6412930, op: "query", ns: "mapandmore", query: { findandmodify: "morethan160", query: { yppie: "pQlY" }, sort: { _id: 1 }, new: 1, remove: 0, upsert: 0, update: { $inc: { clickcount: 1 }, $push: { when: new Date(1424726992562), clickbucket: [ 1424649600000.0, "23-2-2015 21-hour", "23-2-2015-day", "2-2015-month", "2015-1-quarter", "2015-year" ] } } }, client: "46.137.175.200:32783", desc: "conn85397", threadId: "0x7fa59324f700", connectionId: 85397, locks: { ^: "w", ^mapandmore: "W" }, waitingForLock: false, numYields: 0, lockStats: { timeLockedMicros: {}, timeAcquiringMicros: { r: 0, w: 3 } } }

2015-02-23T21:29:58.976+0000 [conn85397] query mapandmore.morethan160 query: { query: { yppie: "pQlY" }, orderby: { _id: 1 } } planSummary: IXSCAN { _id: 1 } ntoreturn:1 ntoskip:0 nscanned:2167279 nscannedObjects:2167279 keyUpdates:0 numYields:0 nreturned:1 reslen:3481 6413ms

2015-02-23T21:29:58.989+0000 [conn85397] command mapandmore.$cmd command: findAndModify { findandmodify: "morethan160", query: { yppie: "pQlY" }, sort: { _id: 1 }, new: 1, remove: 0, upsert: 0, update: { $inc: { clickcount: 1 }, $push: { when: new Date(1424726992562), clickbucket: [ 1424649600000.0, "23-2-2015 21-hour", "23-2-2015-day", "2-2015-month", "2015-1-quarter", "2015-year" ] } } } keyUpdates:0 numYields:0 locks(micros) w:6425980 reslen:3765 6426ms

It seems to be an issue with the findAndModify choosing to ignore the available index. After each of the long runs of "recursive lock" there is a statement that the planSummary is IXSCAN with nscanned:2167279 and nscannedObjects:2167279, but db..find({ yppie: "pQlY" }).explain(true) returns:

{
  "cursor" : "BtreeCursor yppie_1",
  "isMultiKey" : false,
  "n" : 1,
  "nscannedObjects" : 1,
  "nscanned" : 1,
  "nscannedObjectsAllPlans" : 1,
  "nscannedAllPlans" : 1,
  "scanAndOrder" : false,
  "indexOnly" : false,
  "nYields" : 0,
  "nChunkSkips" : 0,
  "millis" : 0,
  "indexBounds" : { "yppie" : [ [ "pQlY", "pQlY" ] ] },
  "allPlans" : [
    { "cursor" : "BtreeCursor yppie_1", "isMultiKey" : false, "n" : 1, "nscannedObjects" : 1, "nscanned" : 1, "scanAndOrder" : false, "indexOnly" : false, "nChunkSkips" : 0, "indexBounds" : { "yppie" : [ [ "pQlY", "pQlY" ] ] } }
  ],
  "server" : "apollo:27017",
  "filterSet" : false,
  "stats" : {
    "type" : "FETCH",
    "works" : 2,
    "yields" : 0,
    "unyields" : 0,
    "invalidates" : 0,
    "advanced" : 1,
    "needTime" : 0,
    "needFetch" : 0,
    "isEOF" : 1,
    "alreadyHasObj" : 0,
    "forcedFetches" : 0,
    "matchTested" : 0,
    "children" : [
      { "type" : "IXSCAN", "works" : 2, "yields" : 0, "unyields" : 0, "invalidates" : 0, "advanced" : 1, "needTime" : 0, "needFetch" : 0, "isEOF" : 1, "keyPattern" : "{ yppie: 1.0 }", "isMultiKey" : 0, "boundsVerbose" : "field #0['yppie']: [\"pQlY\", \"pQlY\"]", "yieldMovedCursor" : 0, "dupsTested" : 0, "dupsDropped" : 0, "seenInvalidated" : 0, "matchTested" : 0, "keysExamined" : 1, "children" : [ ] }
    ]
  }
}

ricei commented on Tue, 24 Feb 2015 16:09:17 +0000:

As some additional background: some of the collections have existed for ~4+ years. MongoDB has been upgraded through each of the major releases. Additionally, in January the storage was swapped over to SSD.

ramon.fernandez commented on Tue, 24 Feb 2015 15:02:17 +0000:

ricei, we're going to need more information to understand what's going on. My hunch is that, even if your findAndModify() queries are using an index, they may be using a sub-optimal index in some situations. In addition to server logs from around the time when this happens (spanning at least a few seconds before and after), can you please send us the output of the following commands?
db.collection.getPlanCache().listQueryShapes()
db.collection.getPlanCache().getPlansByQuery({query-from-findAndModify-here})
db.collection.getIndexes();
db.collection.find({query-from-findAndModify-here}).explain(true);
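Editor's note: for reference, a filled-in version of those diagnostics. The collection name and query values below are hypothetical, taken from the log excerpts above; substitute the actual query (and sort) from the slow findAndModify:

db.morethan160.getPlanCache().listQueryShapes()
// Passing the projection and sort as well makes the shape match the findAndModify.
db.morethan160.getPlanCache().getPlansByQuery({ yppie: "pQlY" }, {}, { _id: 1 })
db.morethan160.getIndexes()
db.morethan160.find({ yppie: "pQlY" }).sort({ _id: 1 }).explain(true)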
It's intermittent and I can't replicate it. It's not even specific to times of heavy load.
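Editor's note: to make the suspected trigger concrete, here is a minimal shell sketch of the sequence ricei describes, followed by the plan-cache reset Ramón's diagnostics start with. Collection, field, and value names are hypothetical (borrowed from the log excerpts above); this illustrates the reported scenario rather than a confirmed reproduction:

// 1. Per ricei's report, a query that cannot use an index runs first
//    ("unindexedField" is a hypothetical field with no index on it).
db.morethan160.find({ unindexedField: "x" }).toArray()

// 2. The findAndModify that should use the yppie index then reportedly
//    runs as a long scan instead, holding the write lock without yielding.
db.morethan160.findAndModify({
    query: { yppie: "pQlY" },
    sort: { _id: 1 },
    update: { $inc: { clickcount: 1 } },
    new: true
})

// While diagnosing, cached plans for the collection can be evicted so the
// next execution re-plans from scratch:
db.morethan160.getPlanCache().clear()

If the slow behavior disappears immediately after the clear(), that would point toward a cached or mis-ranked plan rather than a missing index.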