...
I found the following two relevant stacks:

Thread 68 (Thread 0x7f4b2bfdd700 (LWP 27095)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f4b66f0edbd in __GI___pthread_mutex_lock (mutex=0x55a85f914d38) at ../nptl/pthread_mutex_lock.c:80
#2  0x000055a85643e372 in __gthread_mutex_lock (__mutex=0x55a85f914d38) at /usr/include/x86_64-linux-gnu/c++/5/bits/gthr-default.h:748
#3  std::mutex::lock (this=0x55a85f914d38) at /usr/include/c++/5/mutex:135
#4  std::lock_guard::lock_guard (__m=..., this=) at /usr/include/c++/5/mutex:386
#5  mongo::ViewCatalog::lookup (this=0x55a85f914d38, txn=txn@entry=0x55a85feb80e0, ns=...) at src/mongo/db/views/view_catalog.cpp:376
#6  0x000055a855cdf92f in mongo::AutoGetCollectionOrViewForRead::AutoGetCollectionOrViewForRead (this=0x7f4b2bfd6c80, txn=0x55a85feb80e0, nss=...) at src/mongo/db/db_raii.cpp:164
#7  0x000055a855fe54ac in mongo::runQuery[abi:cxx11](mongo::OperationContext*, mongo::QueryMessage&, mongo::NamespaceString const&, mongo::Message&) (txn=txn@entry=0x55a85feb80e0, q=..., nss=..., result=...) at src/mongo/db/query/find.cpp:530
#8  0x000055a855ae4a79 in mongo::(anonymous namespace)::receivedQuery (c=..., m=..., dbResponse=..., nss=..., txn=0x55a85feb80e0) at src/mongo/db/assemble_response.cpp:312
#9  mongo::assembleResponse (txn=0x55a85feb80e0, m=..., dbresponse=..., remote=...) at src/mongo/db/assemble_response.cpp:551
#10 0x000055a855cdfcfc in mongo::DBDirectClient::call (this=0x55a8610b2f00, toSend=..., response=..., assertOk=, actualServer=) at src/mongo/db/dbdirectclient.cpp:128
#11 0x000055a855a7e72e in mongo::DBClientCursor::init (this=this@entry=0x55a860103800) at src/mongo/client/dbclientcursor.cpp:145
#12 0x000055a855a5f0c6 in mongo::DBClientBase::query (this=0x55a8610b2f00, ns="test.system.js", query=..., nToReturn=0, nToSkip=0, fieldsToReturn=0x0, queryOptions=4, batchSize=0) at src/mongo/client/dbclient.cpp:1020
#13 0x000055a855cdfb60 in mongo::DBDirectClient::query (this=, ns=..., query=..., nToReturn=, nToSkip=, fieldsToReturn=0x0, queryOptions=4, batchSize=0) at src/mongo/db/dbdirectclient.cpp:152
#14 0x000055a85661352c in mongo::Scope::loadStored (this=0x55a85fb55640, txn=0x55a85feb80e0, ignoreNotConnected=) at src/mongo/scripting/engine.cpp:222
#15 0x000055a85661265f in mongo::ScriptEngine::getPooledScope (this=0x55a85912c040, txn=txn@entry=0x55a85feb80e0, db="test", scopeType="where") at src/mongo/scripting/engine.cpp:548
#16 0x000055a855e4d1cc in mongo::WhereMatchExpression::init (this=this@entry=0x55a8612a1580, dbName=...) at src/mongo/db/matcher/expression_where.cpp:75
#17 0x000055a855e4d4ad in mongo::WhereMatchExpression::shallowClone (this=0x55a8612a0780) at src/mongo/db/matcher/expression_where.cpp:116
#18 0x000055a856044fa7 in mongo::QueryPlannerAccess::makeCollectionScan (query=..., tailable=tailable@entry=false, params=...) at src/mongo/db/query/planner_access.cpp:150
#19 0x000055a856057996 in mongo::buildCollscanSoln (query=..., tailable=tailable@entry=false, params=...) at src/mongo/db/query/query_planner.cpp:243
#20 0x000055a85605d927 in mongo::QueryPlanner::plan (query=..., params=..., out=out@entry=0x7f4b2bfd9a60) at src/mongo/db/query/query_planner.cpp:956
#21 0x000055a855ff1b0c in mongo::(anonymous namespace)::prepareExecution (opCtx=opCtx@entry=0x55a85feb80e0, collection=collection@entry=0x55a8612d7900, ws=0x55a8603cac80, canonicalQuery=std::unique_ptr containing 0x55a8607b70e0, plannerOptions=plannerOptions@entry=0) at src/mongo/db/query/get_executor.cpp:378
#22 0x000055a855ff5c88 in mongo::getExecutorDelete (txn=txn@entry=0x55a85feb80e0, opDebug=opDebug@entry=0x7f4b2bfda6f8, collection=0x55a8612d7900, parsedDelete=parsedDelete@entry=0x7f4b2bfda400) at src/mongo/db/query/get_executor.cpp:781
#23 0x000055a855e97238 in mongo::performSingleDeleteOp (op=..., ns=..., txn=0x55a85feb80e0) at src/mongo/db/ops/write_ops_exec.cpp:640
#24 mongo::performDeletes (txn=txn@entry=0x55a85feb80e0, wholeOp=...) at src/mongo/db/ops/write_ops_exec.cpp:692
#25 0x000055a855cba81f in mongo::CmdDelete::runImpl (this=, txn=0x55a85feb80e0, dbname=..., cmdObj=..., result=...) at src/mongo/db/commands/write_commands/write_commands.cpp:363
#26 0x000055a855cb5f9c in mongo::(anonymous namespace)::WriteCommand::run (this=, txn=0x55a85feb80e0, dbname=..., cmdObj=..., options=, errmsg=..., result=...) at src/mongo/db/commands/write_commands/write_commands.cpp:217
#27 0x000055a855beecb4 in mongo::Command::run (this=this@entry=0x55a857d870a0 , txn=txn@entry=0x55a85feb80e0, request=..., replyBuilder=replyBuilder@entry=0x7f4b2bfdb590) at src/mongo/db/commands/dbcommands.cpp:1396
#28 0x000055a855befce3 in mongo::execCommandDatabase (txn=0x55a85feb80e0, command=, request=..., replyBuilder=) at src/mongo/db/commands/dbcommands.cpp:1595
#29 0x000055a855bbdefc in std::function::operator()(mongo::OperationContext*, mongo::Command*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) const (__args#3=0x7f4b2bfdb590, __args#2=..., __args#1=0x55a857d870a0 , __args#0=0x55a85feb80e0, this=0x55a857d803e0 ) at /usr/include/c++/5/functional:2267
#30 mongo::Command::execCommand (txn=txn@entry=0x55a85feb80e0, command=command@entry=0x55a857d870a0 , request=..., replyBuilder=replyBuilder@entry=0x7f4b2bfdb590) at src/mongo/db/commands.cpp:401
#31 0x000055a85621f70d in mongo::runCommands (txn=txn@entry=0x55a85feb80e0, request=..., replyBuilder=replyBuilder@entry=0x7f4b2bfdb590) at src/mongo/db/run_commands.cpp:73
#32 0x000055a855ae4c5b in mongo::(anonymous namespace)::receivedRpc (client=..., message=..., dbResponse=..., txn=0x55a85feb80e0) at src/mongo/db/assemble_response.cpp:217
#33 mongo::assembleResponse (txn=0x55a85feb80e0, m=..., dbresponse=..., remote=...) at src/mongo/db/assemble_response.cpp:554
#34 0x000055a8559f074d in mongo::ServiceEntryPointMongod::_sessionLoop (this=this@entry=0x55a85939f300, session=std::shared_ptr (count 2, weak 2) 0x55a85d376140) at src/mongo/db/service_entry_point_mongod.cpp:135
#35 0x000055a8559f108d in mongo::ServiceEntryPointMongod::::operator() (session=..., __closure=) at src/mongo/db/service_entry_point_mongod.cpp:103
#36 std::_Function_handler&), mongo::ServiceEntryPointMongod::startSession(mongo::transport::SessionHandle):: >::_M_invoke(const std::_Any_data &, const std::shared_ptr &) (__functor=..., __args#0=...) at /usr/include/c++/5/functional:1871
#37 0x000055a856681382 in std::function const&)>::operator()(std::shared_ptr const&) const (__args#0=std::shared_ptr (count 2, weak 2) 0x55a85d376140, this=0x55a8603c6870) at /usr/include/c++/5/functional:2267
#38 mongo::(anonymous namespace)::runFunc (ptr=0x55a8603c6860) at src/mongo/transport/service_entry_point_utils.cpp:78
#39 0x00007f4b66f0c6ba in start_thread (arg=0x7f4b2bfdd700) at pthread_create.c:333
#40 0x00007f4b66c4282d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

And Thread 71 (Thread 0x7f4b2b6ff700 (LWP 27099)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1  0x000055a855cc7fe8 in __gthread_cond_timedwait (__abs_timeout=0x7f4b2b6fbb90, __mutex=, __cond=0x55a85d54e5d8) at /usr/include/x86_64-linux-gnu/c++/5/bits/gthr-default.h:871
#2  std::condition_variable::__wait_until_impl > > (__atime=..., __lock=, this=0x55a85d54e5d8) at /usr/include/c++/5/condition_variable:165
#3  std::condition_variable::wait_until > > (__atime=..., __lock=, this=0x55a85d54e5d8) at /usr/include/c++/5/condition_variable:105
#4  std::condition_variable::wait_for > (__rtime=..., __lock=, this=0x55a85d54e5d8) at /usr/include/c++/5/condition_variable:137
#5  mongo::CondVarLockGrantNotification::wait (this=this@entry=0x55a85d54e5a8, timeoutMs=timeoutMs@entry=500) at src/mongo/db/concurrency/lock_state.cpp:238
#6  0x000055a855ccc95e in mongo::LockerImpl::lockComplete (this=0x55a85d54e000, resId=..., mode=, timeoutMs=4294967295, checkDeadlock=false) at src/mongo/db/concurrency/lock_state.cpp:749
#7  0x000055a85643551b in mongo::DurableViewCatalogImpl::iterate(mongo::OperationContext*, std::function) (this=0x55a85f914d28, txn=0x55a85d7bd260, callback=...) at src/mongo/db/views/durable_view_catalog.cpp:75
#8  0x000055a85643da81 in mongo::ViewCatalog::_reloadIfNeeded_inlock (this=this@entry=0x55a85f914d38, txn=txn@entry=0x55a85d7bd260) at src/mongo/db/views/view_catalog.cpp:98
#9  0x000055a85643e274 in mongo::ViewCatalog::_lookup_inlock (this=this@entry=0x55a85f914d38, txn=0x55a85d7bd260, ns=...) at src/mongo/db/views/view_catalog.cpp:360
#10 0x000055a85643e38d in mongo::ViewCatalog::lookup (this=0x55a85f914d38, txn=txn@entry=0x55a85d7bd260, ns=...) at src/mongo/db/views/view_catalog.cpp:377
#11 0x000055a855cdf24f in mongo::AutoGetCollection::AutoGetCollection (this=0x7f4b2b6fc608, txn=0x55a85d7bd260, nss=..., modeDB=, modeColl=mongo::MODE_IX, viewMode=) at src/mongo/db/db_raii.cpp:59
#12 0x000055a855e994a8 in mongo::AutoGetCollection::AutoGetCollection (modeColl=, modeDB=mongo::MODE_IX, nss=..., txn=0x55a85d7bd260, this=0x7f4b2b6fc608) at src/mongo/db/db_raii.h:87
#13 boost::optional_detail::optional_base::emplace_assign(mongo::OperationContext*&, mongo::NamespaceString const&, mongo::LockMode&&, mongo::LockMode&&) (this=0x7f4b2b6fc600) at src/third_party/boost-1.60.0/boost/optional/optional.hpp:494
#14 boost::optional::emplace(mongo::OperationContext*&, mongo::NamespaceString const&, mongo::LockMode&&, mongo::LockMode&&) (this=0x7f4b2b6fc600) at src/third_party/boost-1.60.0/boost/optional/optional.hpp:981
#15 mongo::performSingleUpdateOp (op=..., ns=..., txn=0x55a85d7bd260) at src/mongo/db/ops/write_ops_exec.cpp:515
#16 mongo::performUpdates (txn=txn@entry=0x55a85d7bd260, wholeOp=...) at src/mongo/db/ops/write_ops_exec.cpp:585
#17 0x000055a855cba64f in mongo::CmdUpdate::runImpl (this=, txn=0x55a85d7bd260, dbname=..., cmdObj=..., result=...) at src/mongo/db/commands/write_commands/write_commands.cpp:295
#18 0x000055a855cb5f9c in mongo::(anonymous namespace)::WriteCommand::run (this=, txn=0x55a85d7bd260, dbname=..., cmdObj=..., options=, errmsg=..., result=...) at src/mongo/db/commands/write_commands/write_commands.cpp:217
#19 0x000055a855beecb4 in mongo::Command::run (this=this@entry=0x55a857d87180 , txn=txn@entry=0x55a85d7bd260, request=..., replyBuilder=replyBuilder@entry=0x7f4b2b6fd590) at src/mongo/db/commands/dbcommands.cpp:1396
#20 0x000055a855befce3 in mongo::execCommandDatabase (txn=0x55a85d7bd260, command=, request=..., replyBuilder=) at src/mongo/db/commands/dbcommands.cpp:1595
#21 0x000055a855bbdefc in std::function::operator()(mongo::OperationContext*, mongo::Command*, mongo::rpc::RequestInterface const&, mongo::rpc::ReplyBuilderInterface*) const (__args#3=0x7f4b2b6fd590, __args#2=..., __args#1=0x55a857d87180 , __args#0=0x55a85d7bd260, this=0x55a857d803e0 ) at /usr/include/c++/5/functional:2267
#22 mongo::Command::execCommand (txn=txn@entry=0x55a85d7bd260, command=command@entry=0x55a857d87180 , request=..., replyBuilder=replyBuilder@entry=0x7f4b2b6fd590) at src/mongo/db/commands.cpp:401
#23 0x000055a85621f70d in mongo::runCommands (txn=txn@entry=0x55a85d7bd260, request=..., replyBuilder=replyBuilder@entry=0x7f4b2b6fd590) at src/mongo/db/run_commands.cpp:73
#24 0x000055a855ae4c5b in mongo::(anonymous namespace)::receivedRpc (client=..., message=..., dbResponse=..., txn=0x55a85d7bd260) at src/mongo/db/assemble_response.cpp:217
#25 mongo::assembleResponse (txn=0x55a85d7bd260, m=..., dbresponse=..., remote=...) at src/mongo/db/assemble_response.cpp:554
#26 0x000055a8559f074d in mongo::ServiceEntryPointMongod::_sessionLoop (this=this@entry=0x55a85939f300, session=std::shared_ptr (count 2, weak 2) 0x55a85d3756c0) at src/mongo/db/service_entry_point_mongod.cpp:135
#27 0x000055a8559f108d in mongo::ServiceEntryPointMongod::::operator() (session=..., __closure=) at src/mongo/db/service_entry_point_mongod.cpp:103
#28 std::_Function_handler&), mongo::ServiceEntryPointMongod::startSession(mongo::transport::SessionHandle):: >::_M_invoke(const std::_Any_data &, const std::shared_ptr &) (__functor=..., __args#0=...) at /usr/include/c++/5/functional:1871
#29 0x000055a856681382 in std::function const&)>::operator()(std::shared_ptr const&) const (__args#0=std::shared_ptr (count 2, weak 2) 0x55a85d3756c0, this=0x55a8603c5610) at /usr/include/c++/5/functional:2267
#30 mongo::(anonymous namespace)::runFunc (ptr=0x55a8603c5600) at src/mongo/transport/service_entry_point_utils.cpp:78
#31 0x00007f4b66f0c6ba in start_thread (arg=0x7f4b2b6ff700) at pthread_create.c:333
#32 0x00007f4b66c4282d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

max.hirschhorn and I took a look at this for a bit and tried many different configurations to narrow down the scenario. We're not sure why this doesn't repro on a single mongod with 2 threads. Max did bring up the possibility that this could be related to the long-open server ticket SERVER-10747. Based on one of the stacks, we do know that the $where operator is involved in some obscure way. Also, the system.views collection needs to be refreshed from disk in order to trigger this deadlock.

So far, I know of two ways to reproduce this. I've attached the 8ef4-mdb_3aef-ent_8392-qa_c627a-1485290758685-63.js file, which should be placed in the /jstestfuzz/out directory. Reproduction is more likely with a larger number of clients per fixture, like 10.
1. buildscripts/resmoke.py --suite=jstestfuzz_sharded --storageEngine=mmapv1 --numClientsPerFixture=10
2. buildscripts/resmoke.py --suite=jstestfuzz_replication --storageEngine=mmapv1 --numClientsPerFixture=10

I was able to determine that the following cycle exists:

LWP 27095 ----> (waiting for mutex) 0x55a85f914d38 ----> (held by) LWP 27099 ----> (waiting for MongoDB lock) 0x55a8612a1380 ----> (held by) LWP 27095
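As an aside, the holder of a stdx::mutex like ViewCatalog::_mutex can be confirmed directly from gdb, because glibc records the owning thread's LWP inside the pthread mutex. A sketch against the mutex address from the stacks above (this assumes glibc's nptl layout, where std::mutex wraps a pthread_mutex_t at offset zero):

(gdb) print ((pthread_mutex_t *) 0x55a85f914d38)->__data.__owner
$1 = 27099

The printed owner matches LWP 27099 in the cycle above.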
james.wahlin@10gen.com commented on Fri, 21 Apr 2017 19:19:13 +0000:

To summarize the conditions required to trigger this deadlock:
- mongod must be running a collection-level-locking storage engine (MMAPv1 in this case).
- A direct write must be performed on the system.views collection. This write must:
  - use $where to find the entry to modify, and
  - be executed when the system.views collection doesn't exist.
- In parallel with the write, a second operation must be performed within the same database against either a non-existent collection or a view. This operation must encounter a dirty view catalog, forcing a reload.

james.wahlin@10gen.com commented on Fri, 21 Apr 2017 19:13:05 +0000:

Created SERVER-28915 to address the deadlock triggered by the above FSM test. This ticket handles the original jstestfuzz deadlock scenario.

james.wahlin@10gen.com commented on Fri, 21 Apr 2017 18:19:49 +0000:

The second deadlock scenario found is due to a deadlock on the MMAPv1 flush lock. While this code expects to encounter deadlock in some scenarios and has a mechanism for deadlock detection, it does not consider ViewCatalog::_mutex, which is a stdx::mutex rather than a LockManager-managed lock.

james.wahlin@10gen.com commented on Thu, 20 Apr 2017 19:57:36 +0000:

A similar deadlock involving the ViewCatalog::_mutex and collection lock is possible without the use of $where. As with the above, triggering this requires direct writes to the system.views collection (rather than writing via the collection CRUD commands). The following FSM test will trigger a deadlock when run under the mmapv1 storage engine:

'use strict';

load('jstests/concurrency/fsm_workload_helpers/drop_utils.js');  // for dropCollections

var $config = (function() {
    var data = {
        // Use the workload name as a prefix for the view name, since the workload name is
        // assumed to be unique.
        prefix: 'system_views_write'
    };

    var states = (function() {
        function init(db, collName) {
            this.threadViewName = this.prefix + '_' + this.tid;
            this.threadViewNS = db.getName() + '.' + this.threadViewName;
            this.collNS1 = db[collName].getFullName();
            this.collNS2 = db[collName].getFullName() + "_2";
        }

        function insert(db, collName) {
            const pipeline = [];
            assertAlways.writeOK(db.system.views.insert(
                {_id: this.threadViewNS, viewOn: this.collNS1, pipeline: pipeline}));
        }

        function update(db, collName) {
            assertAlways.writeOK(
                db.system.views.update({_id: this.threadViewNS}, {$set: {viewOn: this.collNS2}}));
        }

        function remove(db, collName) {
            assertAlways.writeOK(db.system.views.remove({_id: this.threadViewNS}));
        }

        function read(db, collName) {
            // This is a read on a non-existent collection, which will require ViewCatalog lookup.
            assertAlways.commandWorked(db.runCommand({find: collName}));
        }

        return {init: init, insert: insert, update: update, read: read, remove: remove};
    })();

    var transitions = {
        init: {insert: 1},
        insert: {update: 0.5, read: 0.5},
        update: {remove: 0.5, read: 0.5},
        read: {remove: 0.5, read: 0.5},
        remove: {insert: 1}
    };

    function setup(db, collName, cluster) {
        const coll = db[collName];
        const viewsColl = db["system.views"];
        coll.drop();
        viewsColl.drop();
    }

    function teardown(db, collName, cluster) {
        var pattern = new RegExp('^system.views$');
        dropCollections(db, pattern);
    }

    return {
        threadCount: 10,
        iterations: 10000,
        data: data,
        states: states,
        transitions: transitions,
        setup: setup,
        teardown: teardown
    };
})();
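Such an FSM workload can be run on its own through resmoke. A sketch of the invocation, assuming the test is saved as jstests/concurrency/fsm_workloads/system_views_write.js (the file name is illustrative, and suite options may differ by branch):

buildscripts/resmoke.py --suite=concurrency --storageEngine=mmapv1 jstests/concurrency/fsm_workloads/system_views_write.js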
james.wahlin@10gen.com commented on Tue, 18 Apr 2017 18:52:05 +0000:

The remove operation that triggered the deadlock (from the attached fuzzer file) is this one:

invalidDB.system.views.remove('NumberDecimal(NaN)')

james.wahlin@10gen.com commented on Tue, 18 Apr 2017 13:39:13 +0000:

This is an ABBA deadlock condition, where two threads acquire ViewCatalog::_mutex and the collection lock in opposite orders. This occurs only under MMAPv1, as MODE_S / MODE_X collection locks are required to cause the deadlock. The two deadlocked threads under this test are performing the following operations:

1. A user thread performs a delete on a system.views document, using a $where match expression. This is a direct removal rather than a view namespace drop.
   - CmdDelete acquires a MODE_IX lock on the system.views collection. As MMAPv1 provides collection-level locking, this lock is upgraded to MODE_X.
   - The $where is evaluated and executed as an OP_QUERY. Lookup of the system.views document under OP_QUERY requires namespace resolution. In order to access the ViewCatalog, the ViewCatalog::_mutex is acquired.
2. A second user thread performs an insert into a collection within the same database as user thread 1.
   - When resolving the namespace for the insert, the collection is not found. We then look to determine whether it is a view. On access of the ViewCatalog, the ViewCatalog::_mutex is acquired.
   - The ViewCatalog is found to be stale and requires a reload. This requires a MODE_IS collection lock on the system.views collection, which is upgraded to MODE_S.

In the above scenario, the deadlock occurs when lock acquisition happens in the following order:

1. Thread 1 acquires a MODE_X lock on the system.views namespace.
2. Thread 2 acquires the ViewCatalog::_mutex for the same database.
3. Thread 1 attempts to acquire the ViewCatalog::_mutex and is blocked on thread 2.
4. Thread 2 attempts to acquire a MODE_S lock on the system.views namespace and is blocked on thread 1.
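Condensed to the client operations involved, the race is roughly the following (a sketch only: the second collection name is illustrative, and the preconditions are a missing system.views collection plus a stale view catalog):

// Thread 1: direct delete on system.views. The legacy shell treats a string
// predicate as {$where: ...}; building the JS scope for $where issues a
// nested query against system.js, and that query's namespace resolution takes
// ViewCatalog::_mutex after CmdDelete already holds the system.views
// collection lock (MODE_IX, upgraded to MODE_X under MMAPv1).
db.system.views.remove('NumberDecimal(NaN)');

// Thread 2: write to a collection in the same database. The collection is not
// found, so namespace resolution consults the ViewCatalog (taking
// ViewCatalog::_mutex); the catalog is stale, and reloading it requires a
// MODE_S collection lock on system.views.
db.someOtherColl.insert({x: 1});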
I tried to minimize the steps required to reproduce this hang, but was not successful. Running two threads in parallel on a single mongod does not readily reproduce it, even though it appears that only two threads are required to generate the deadlock. However, running this in a sharded cluster on mmapv1 with journaling does reproduce it consistently when I use the original fuzzer-generated file that I've attached.