...
Not sure how to reproduce; it happened on Dan's test server after running a heavy workload.

dan:PRIMARY> db.serverStatus().wiredTiger.cache
{
    "tracked dirty bytes in the cache" : 108113698,
    "bytes currently in the cache" : NumberLong("2305843128198539"),
    "maximum bytes configured" : 2147483648,
    "bytes read into cache" : 74306876990,
    "bytes written from cache" : 55231308853,
    "pages evicted by application threads" : 100627896064,
    "checkpoint blocked page eviction" : 1930,
    "unmodified pages evicted" : 812357,
    "page split during eviction deepened the tree" : 4,
    "modified pages evicted" : 1753265,
    "pages selected for eviction unable to be evicted" : 1651488,
    "pages evicted because they exceeded the in-memory maximum" : 1619,
    "pages evicted because they had chains of deleted items" : 253949,
    "failed eviction of pages that exceeded the in-memory maximum" : 3363,
    "hazard pointer blocked page eviction" : 3565,
    "internal pages evicted" : 385,
    "maximum page size at eviction" : 147633879,
    "eviction server candidate queue empty when topping up" : 28707,
    "eviction server candidate queue not empty when topping up" : 255279,
    "eviction server evicting pages" : 185,
    "eviction server populating queue, but not evicting pages" : 283801,
    "eviction server unable to reach eviction goal" : 1,
    "pages split during eviction" : 6618,
    "pages walked for eviction" : 159135276,
    "in-memory page splits" : 545,
    "tracked dirty pages in the cache" : 588,
    "pages currently held in the cache" : 1145,
    "pages read into cache" : 2401943,
    "pages written from cache" : 1840049
}

mongostat shows that the cache is totally full of dirty bytes:

insert  query  update  delete  getmore  command  % dirty  % used  flushes  vsize   res  qr|qw  ar|aw  netIn  netOut  conn  set  repl      time
  2723   7960    5648      *0        0      0|0     72.8   100.0        0   3.4G  2.8G    0|0   8|84    16m     14m   103  dan   PRI  13:28:10
  3779  10952    7308      *0        0      1|0     89.3    99.9        0   3.4G  2.8G    0|0  23|66    21m     19m   103  dan   PRI  13:28:12
  1787   5054    3349      *0        0      0|0    100.1   100.9        0   3.4G  2.8G    0|0  28|72    10m      9m   103  dan   PRI  13:28:14
    *0     *0      *0      *0        0      1|0    100.1   100.9        0   3.4G  2.8G    0|0  27|73   160b      9k   103  dan   PRI  13:28:16
    *0     *0      *0      *0        0      0|0    100.1   100.9        0   3.4G  2.8G    0|0  27|73     1k      9k   103  dan   PRI  13:28:18
    *0     *0      *0      *0        0      0|0    100.1   100.9        0   3.4G  2.8G    0|0  27|73    39b      8k   103  dan   PRI  13:28:20
     7     12      12      *0        0      1|0    100.2   100.9        0   3.4G  2.8G    0|0  20|80    31k     35k   103  dan   PRI  13:28:22
     6     20      13      *0        0      0|0    100.2   101.0        0   3.4G  2.8G    0|0  22|78    39k     42k   103  dan   PRI  13:28:24
    *0      7       4      *0        0      1|0    100.2   101.0        0   3.4G  2.8G    0|0  20|80     9k     22k   103  dan   PRI  13:28:26
    *0     *0      *0      *0        0      0|0    100.2   101.0        0   3.4G  2.8G    0|0  21|79   921b      8k   103  dan   PRI  13:28:28
    *0     *0      *0      *0        0      0|0    100.2   101.0        0   3.4G  2.8G    0|0  21|79    39b      8k   103  dan   PRI  13:28:30
    *0     *0      *0      *0        0      1|0    100.2   101.0        0   3.4G  2.8G    0|0  21|79    95b      8k   103  dan   PRI  13:28:32
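For context on those numbers, here is a minimal mongo shell sketch (my addition, not part of the original report) that recomputes cache utilization from the same serverStatus() fields; the two ratios are an assumption about how mongostat derives its "% used" and "% dirty" columns:

    var c = db.serverStatus().wiredTiger.cache;
    // NumberLong fields need an explicit conversion before doing floating-point math
    function num(x) { return (typeof x === "number") ? x : x.toNumber(); }
    var max = num(c["maximum bytes configured"]);
    print("% used:  " + (100 * num(c["bytes currently in the cache"]) / max).toFixed(1));
    print("% dirty: " + (100 * num(c["tracked dirty bytes in the cache"]) / max).toFixed(1));

With the snapshot above, "bytes currently in the cache" (about 2.3 PB) is roughly a million times the 2GB maximum, so any check based on that counter will report the cache as full no matter how much is evicted.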
alexander.gorrod commented on Fri, 23 Jan 2015 05:30:13 +0000:

> I wonder whether this hang is due to the new code to handle WT_ROLLBACK rather than the memory accounting issue.

This has been reproduced with code that doesn't contain those changes, so they aren't the cause.

dan@10gen.com commented on Fri, 23 Jan 2015 05:18:31 +0000:

Attaching another view of stats gathered earlier this afternoon. I'm able to trigger this with just 101.5% cache utilization; it doesn't need to go very far over the limit.

alexander.gorrod commented on Fri, 23 Jan 2015 03:28:32 +0000:

Stats graph

dan@10gen.com commented on Thu, 22 Jan 2015 23:04:59 +0000:

To set up the workload that I'm running which triggers this:

yum install -y git gcc automake autoconf libtool
git clone https://github.com/mongodb/mongo-c-driver
cd mongo-c-driver
./autogen.sh
make
make install
git clone https://github.com/johnlpage/WorkLoad
cd WorkLoad
make

Then run:

./loadsrv -h hostname:port -p 100 -d 600

dan@10gen.com commented on Thu, 22 Jan 2015 20:12:25 +0000:

Able to make this happen much faster by dropping wiredTigerCacheSizeGB to 1GB.

keith.bostic commented on Thu, 22 Jan 2015 19:30:20 +0000:

My guess is this isn't a deadlock. The bytes-in-cache accounting has gone wrong:

"bytes currently in the cache" : NumberLong("2305843128198539"),

There aren't any pages left to evict because they've all been evicted, and the system is just thrashing.

dan@10gen.com commented on Thu, 22 Jan 2015 19:11:50 +0000:

Second occurrence attached as gdb2.txt

geert.bosch commented on Thu, 22 Jan 2015 18:22:55 +0000:

We have hundreds of threads waiting on the evict_waiter_cond while an evict_server is waiting in __wt_thread_join at evict_lru.c:181. How is conn->evict_workers protected there?

dan@10gen.com commented on Thu, 22 Jan 2015 18:09:04 +0000:

Attaching backtrace as gdb.txt

dan@10gen.com commented on Thu, 22 Jan 2015 17:46:50 +0000:

Server version: 612f66c61c12b705fe3c5ed251d63525f1bf650a

Startup:

mongod --slowms 5000 --dbpath /home/dan/wt-data/ --storageEngine wiredTiger --replSet dan --oplogSize 100 --wiredTigerCacheSizeGB 2

Then run rs.initiate() from the shell.

John Page's WorkLoad (using commit 81338e3d5589fb5e1cd5a066cc50c9a72a92def6):

./loadsrv -h localhost:27017 -p 100 -d 6000
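A small mongo shell loop like the following (again my addition, not from the ticket) can be left running alongside the loadsrv workload above to catch the moment the accounting goes wrong; it simply polls the statistics quoted earlier every two seconds:

    // Hypothetical monitoring loop; stop it with Ctrl-C.
    while (true) {
        var c = db.serverStatus().wiredTiger.cache;
        print(new Date().toISOString() +
              "  bytes in cache: " + c["bytes currently in the cache"] +
              "  (max " + c["maximum bytes configured"] + ")" +
              "  app-thread evictions: " + c["pages evicted by application threads"]);
        sleep(2000);   // mongo shell helper, milliseconds
    }

Once "bytes currently in the cache" jumps far past the configured maximum while the insert/query columns in mongostat drop to zero, the server is in the thrashing state described above rather than a true deadlock.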