...
Problem: On a long run, open file limit was reached, with this in the logs 2014-10-15T02:47:26.388-0400 I NETWORK [conn19835] end connection 127.0.0.1:43859 (37 connections now open) 2014-10-15T02:47:35.350-0400 E STORAGE WiredTiger (24) [1413355655:350274][981310047f7b8187f0000], lsm-worker: /home/ec2-user/db/collection-8096527837003006657-2064-000019.lsm: open: Too many open files 2014-10-15T02:47:35.350-0400 E STORAGE WiredTiger (24) [1413355655:350431][981310047f7b8187f0000], lsm-worker: Failed doing LSM switch: Too many open files 2014-10-15T02:47:35.350-0400 E STORAGE WiredTiger (0) [1413355655:350449][981310047f7b8187f0000], lsm-worker: the WiredTiger library cannot continue; the process must exit and restart 2014-10-15T02:47:35.350-0400 E STORAGE WiredTiger (-31804) [1413355655:350472][981310047f7b8187f0000], lsm-worker: Error in LSM worker thread 3: WT_PANIC: WiredTiger library panic 2014-10-15T02:47:35.351-0400 E STORAGE WiredTiger (0) [1413355655:351496][98131005777b9187f0000], cursor.set_key: the WiredTiger library cannot continue; the process must exit and restart 2014-10-15T02:47:35.351-0400 E STORAGE WiredTiger (0) [1413355655:351543][98131005777b9187f0000], file:WiredTiger.wt, cursor.search: the WiredTiger library cannot continue; the process must exit and restart 2014-10-15T02:47:35.351-0400 E STORAGE WiredTiger (0) [1413355655:351560][98131005777b9187f0000], file:WiredTiger.wt, cursor.close: the WiredTiger library cannot continue; the process must exit and restart 2014-10-15T02:47:35.351-0400 E STORAGE WiredTiger (-31804) [1413355655:351570][98131005777b9187f0000], lsm-worker: Failed doing LSM switch: WT_PANIC: WiredTiger library panic 2014-10-15T02:47:35.351-0400 E STORAGE WiredTiger (0) [1413355655:351580][98131005777b9187f0000], lsm-worker: the WiredTiger library cannot continue; the process must exit and restart 2014-10-15T02:47:35.351-0400 E STORAGE WiredTiger (-31804) [1413355655:351596][98131005777b9187f0000], lsm-worker: Error in LSM worker thread 2: WT_PANIC: WiredTiger library panic 2014-10-15T02:47:35.352-0400 E STORAGE WiredTiger (0) [1413355655:352124][981310067f7b9187f0000], cursor.set_key: the WiredTiger library cannot continue; the process must exit and restart 2014-10-15T02:47:35.352-0400 E STORAGE WiredTiger (0) [1413355655:352166][981310067f7b9187f0000], file:WiredTiger.wt, cursor.search: the WiredTiger library cannot continue; the process must exit and restart 2014-10-15T02:47:35.352-0400 E STORAGE WiredTiger (0) [1413355655:352179][981310067f7b9187f0000], file:WiredTiger.wt, cursor.close: the WiredTiger library cannot continue; the process must exit and restart 2014-10-15T02:47:35.352-0400 E STORAGE WiredTiger (-31804) [1413355655:352188][981310067f7b9187f0000], lsm-worker: Failed doing LSM switch: WT_PANIC: WiredTiger library panic 2014-10-15T02:47:35.352-0400 E STORAGE WiredTiger (0) [1413355655:352197][981310067f7b9187f0000], lsm-worker: the WiredTiger library cannot continue; the process must exit and restart 2014-10-15T02:47:35.352-0400 E STORAGE WiredTiger (-31804) [1413355655:352206][981310067f7b9187f0000], lsm-worker: Error in LSM worker thread 1: WT_PANIC: WiredTiger library panic 2014-10-15T02:48:08.150-0400 E STORAGE WiredTiger (0) [1413355688:150467][98131009777bb187f0000], cursor.set_key: the WiredTiger library cannot continue; the process must exit and restart 2014-10-15T02:48:08.150-0400 E STORAGE WiredTiger (0) [1413355688:150550][98131009777bb187f0000], file:WiredTiger.wt, cursor.search: the WiredTiger library cannot continue; the process must exit and restart 2014-10-15T02:48:08.150-0400 E STORAGE WiredTiger (0) [1413355688:150564][98131009777bb187f0000], file:WiredTiger.wt, cursor.close: the WiredTiger library cannot continue; the process must exit and restart 2014-10-15T02:48:08.150-0400 E STORAGE WiredTiger (-31804) [1413355688:150575][98131009777bb187f0000], sweep-server: handle sweep server error: WT_PANIC: WiredTiger library panic 2014-10-15T12:06:37.356-0400 I NETWORK [initandlisten] connection accepted from 127.0.0.1:43896 #19872 (38 connections now open) 2014-10-15T12:06:37.356-0400 I NETWORK [conn19872] end connection 127.0.0.1:43896 (37 connections now open) 2014-10-15T12:06:37.591-0400 I NETWORK [conn3] end connection 127.0.0.1:52133 (36 connections now open) However, mongod left running.
eliot commented on Mon, 10 Nov 2014 13:54:52 +0000: Keith - take a look at this: https://github.com/mongodb/mongo/commit/4ab029e3e68ea5f7aa89b90a47fd814d17b142bf Assuming I did what you meant correctly, then yes, I think we're good to go. keith@wiredtiger.com commented on Mon, 10 Nov 2014 13:41:49 +0000: Mark, we made a minor change to pass WT_PANIC as the error to the error handler, which means applications can figure out if a panic has happened by testing the error value. Is that sufficient for your needs? Or are there additional advantages to having a separate panic handler? keith@wiredtiger.com commented on Mon, 10 Nov 2014 12:11:19 +0000: We've pushed changes into the WiredTiger develop branch that panic the engine if a thread fails, causing all future calls to return failure, reference https://github.com/wiredtiger/wiredtiger/pull/1356. xgen-internal-githook commented on Mon, 10 Nov 2014 05:08:10 +0000: Author: {u'username': u'erh', u'name': u'Eliot Horowitz', u'email': u'eliot@10gen.com'} Message: SERVER-16023: abort mongod on WT_PANIC Branch: master https://github.com/mongodb/mongo/commit/4ab029e3e68ea5f7aa89b90a47fd814d17b142bf keith@wiredtiger.com commented on Fri, 7 Nov 2014 22:17:39 +0000: > A handler_panic function is what we are looking for. OK, let us think about that a little and get back to you. > In this state, would you expect us to call storage engine shutdown, or would it be a problem if we just called _exit()? It's safer not to call storage engine shutdown, something very, very bad has happened and continuing to run is risky at best. mark.benvenuto commented on Fri, 7 Nov 2014 22:11:42 +0000: A handler_panic function is what we are looking for. Something like (mocked up from handle_error) int(* WT_EVENT_HANDLER::handle_panic)(WT_EVENT_HANDLER *handler, WT_SESSION *session, int error, const char *message) We just want the callback for logging purposes, and so that we can go through our own abrupt shutdown routines. We were not going to continue the process at this point. In this state, would you expect us to call storage engine shutdown, or would it be a problem if we just called _exit()? keith@wiredtiger.com commented on Fri, 7 Nov 2014 22:05:54 +0000: Generally, there's no choice about what to do (for example, a panic might indicate corrupted memory, unexpected data format, or lack of resources); the only option is to exit and restart. We could probably offer a special "we're going to panic" notification that would allow the application to immediately throw up its hands, but once we decide to panic, no future calls to the engine will succeed, they will all immediately return WT_PANIC. milkie commented on Fri, 7 Nov 2014 21:53:33 +0000: Could we have a special fatal or severe message callback function, in which we could abort the process? Then you could allow engine consumers the choice of what to do when such things happen. keith@wiredtiger.com commented on Fri, 7 Nov 2014 21:49:07 +0000: I suspect that for now, the right approach for us is to panic the engine if a thread fails so that all future calls return failure. It sounds to me like that is acceptable to you? Working threads that see resource failure won't fail the engine (for example, an attempt to open a new table might fail), this is only the backing server threads. Does that make sense? mark.benvenuto commented on Fri, 7 Nov 2014 20:58:16 +0000: A simple repro ulimit -n 100 ./mongod --storageEngine=wiredtiger In mongo shell, run the following command for(var i = 0; i < 200; i++ ) { db.createCollection("coll" + i); } Here are the errors I see in the log: 2014-11-07T15:11:18.491-0500 I QUERY [conn1] command test.$cmd command: create { create: "coll42" } keyUpdates:0 numYields:0 reslen:94 5040ms 2014-11-07T15:11:18.851-0500 E STORAGE WiredTiger (24) [1415391078:851854][1951:0x7fab72791700], archive-server: /tmp/foo/journal: opendir: Too many open files 2014-11-07T15:11:18.851-0500 E STORAGE WiredTiger (24) [1415391078:851979][1951:0x7fab72791700], archive-server: dirlist journal prefix WiredTigerLog: Too many open files 2014-11-07T15:11:18.852-0500 E STORAGE WiredTiger (24) [1415391078:852026][1951:0x7fab72791700], archive-server: log archive server error: Too many open files 2014-11-07T15:11:23.497-0500 E STORAGE [conn1] WiredTiger (24) [1415391083:497585][1951:0x7fab7df47700], session.create: /tmp/foo/collection-85-2268107986619751322.wt: open: Too many open files 2014-11-07T15:11:23.497-0500 I - [conn1] Assertion: 8:24: Too many open files 2014-11-07T15:11:23.532-0500 I CONTROL [conn1] There are errors in both the archive thread, and the thread allocating the table that backs the collection.