...
Issue Status as of Sep 30, 2016

ISSUE SUMMARY
MongoDB with WiredTiger may experience excessive memory fragmentation. This was mainly caused by the difference between the way dirty and clean data is represented in WiredTiger. Dirty data involves smaller allocations (at the size of individual documents and index entries), which are rewritten in the background into page images (typically 16-32KB). In 3.2.10 and above (and 3.3.11 and above), the WiredTiger storage engine only allows 20% of the cache to become dirty. Eviction works in the background to write dirty data and keep the cache from being filled with small allocations. The changes in WT-2665 and WT-2764 limit the overhead from tcmalloc caching and fragmentation to 20% of the cache size (from fragmentation) plus 1GB of cached free memory with default settings.

USER IMPACT
Memory fragmentation caused MongoDB to use more memory than expected, leading to swapping and/or out-of-memory errors.

WORKAROUNDS
Configure a smaller WiredTiger cache than the default.

AFFECTED VERSIONS
MongoDB 3.0.0 to 3.2.9 with WiredTiger.

FIX VERSION
The fix is included in the 3.2.10 production release.

This ticket is a spin-off from SERVER-17456, relating to the last issue discussed there.

Under certain workloads a large amount of memory in excess of allocated memory is used. This appears to be due to fragmentation, or some related memory allocation inefficiency.

Repro consists of:
- mongod running with a 10 GB cache (no journal, to simplify the situation)
- create a 10 GB collection of small documents called "ping", filling the cache
- create a second 10 GB collection, "pong", replacing the first in the cache
- issue a query to read the first collection "ping" back into the cache, replacing "pong"

Memory stats over the course of the run:
- From A-B "ping" is being created, and from C-D "pong" is being created, replacing "ping" in the cache.
- Starting at D, "ping" is being read back into the cache, evicting "pong". As "pong" is evicted from the cache, in principle the memory so freed should be usable for reading "ping" into the cache.
- However, from D-E we see heap size and central cache free bytes increasing. It appears that for some reason the memory freed by evicting "pong" cannot be used to hold "ping", so it is being returned to the central free list, and new memory is instead being obtained from the OS to hold "ping".
- At E, while "ping" is still being read into memory, we see a change in behavior: free memory appears to have been moved from the central free list to the page heap, and WT reports that the number of pages is no longer increasing. I suspect that at this point "ping" has filled the cache and we are successfully recycling memory freed by evicting older "ping" pages to hold newer "ping" pages.
- But the net is still about 7 GB of memory in use by the process beyond the 9.5 GB allocated and 9.2 GB in the WT cache, or about a 75% excess.

Theories (the first is illustrated by the standalone sketch below):
- smaller buffers freed by evicting "pong" are discontiguous and cannot hold the larger buffers required for reading in "ping"
- the buffers freed by evicting "pong" are contiguous, but adjacent buffers are not coalesced by the allocator
- buffers are eventually coalesced by the allocator, but not in time to be used for reading in "ping"
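The first theory is straightforward to demonstrate outside of mongod. The following is a minimal, self-contained sketch (hypothetical code, not from MongoDB; Linux-only, and it assumes 4 KiB pages) that interleaves small allocations from two "collections", frees one of them, and shows that resident memory barely shrinks, because every allocator span/region still contains live objects from the other collection:

// fragdemo.cpp - illustrative only; build with: g++ -O2 fragdemo.cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

// Resident set size in MiB, read from /proc/self/statm (Linux-specific).
static double rss_mib() {
    long pages = 0, resident = 0;
    FILE* f = std::fopen("/proc/self/statm", "r");
    if (f) {
        if (std::fscanf(f, "%ld %ld", &pages, &resident) != 2) resident = 0;
        std::fclose(f);
    }
    return resident * 4096.0 / (1024 * 1024);
}

int main() {
    const size_t n = 4 * 1000 * 1000;  // ~4M small objects per "collection"
    std::vector<char*> ping, pong;
    ping.reserve(n);
    pong.reserve(n);

    // Interleave 48-byte allocations, much as the small WT structures for
    // two collections are interleaved while "ping" and "pong" are written.
    for (size_t i = 0; i < n; i++) {
        ping.push_back((char*)std::malloc(48));
        pong.push_back((char*)std::malloc(48));
    }
    std::printf("after interleaved allocations: %6.0f MiB RSS\n", rss_mib());

    // Free all of "pong": half the bytes are now free, but every span still
    // holds live "ping" objects, so the freed memory can be neither returned
    // to the OS nor reused for larger (e.g. 32 kB page image) buffers.
    for (char* p : pong) std::free(p);
    std::printf("after freeing pong:            %6.0f MiB RSS\n", rss_mib());

    // Only once "ping" is freed too do spans become empty and reclaimable.
    for (char* p : ping) std::free(p);
    std::printf("after freeing ping:            %6.0f MiB RSS\n", rss_mib());
    return 0;
}

Exact numbers depend on the allocator - with tcmalloc the freed objects accumulate on the central free list, which matches the "central cache free bytes" growth seen from D-E above - but in each case it is the interleaving, not the number of live bytes, that determines how much memory stays pinned.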
bruce.lucas@10gen.com commented on Tue, 12 Jan 2016 04:30:14 +0000:

Would need to check this, but one thing that may be exacerbating the issue is that I believe when inserting to a collection there will be writes to at least two tables - collection and index(es) - interspersed in memory, so that tcmalloc spans will be tied up by small data structures until the corresponding WT pages for both tables are freed.

Would it be possible to allocate WT_INSERT and WT_UPDATE structures in per-page, or at least per-table, batches, to avoid this interspersing, so that tcmalloc spans will tend to hold only WT_INSERT and WT_UPDATE structures for a single WT page? That might allow tcmalloc spans to be freed for re-use (for the required larger memory allocations) sooner, when the WT_INSERT and/or WT_UPDATE structures for a given WT page that occupy a particular tcmalloc span are freed.

alexander.gorrod commented on Mon, 11 Jan 2016 23:05:40 +0000:

bruce.lucas Thanks for the insightful information. It would be interesting to get your diagnostic code into the WiredTiger tree somehow. Your analysis looks correct to me. I can give some context on the different structure types you reference:

Structure type | Description | Life cycle summary
WT_INSERT | Allocated when a new item is inserted into a collection or index | Freed when the page the insert was included in is reconciled (evicted)
WT_UPDATE | Allocated when an existing item is updated | Freed when the page is reconciled, or when a newer update is present and this update is no longer required
WT_REF | Allocated when a new page is created | Freed only when an internal page is evicted from cache

In terms of helping to reduce fragmentation of allocations when a workload switches from insert/update to read, I think we can improve the current situation, though some fragmentation will be inevitable. The following are changes I think could help:

Change | Potential benefits | Potential penalties
Have eviction more aggressively trickle out dirty pages | When inserts stop we will continue to free WT_INSERT and WT_UPDATE structures; the closer to zero they get, the fewer allocations remain, so fewer spans should be pinned | Potentially increases write amplification (the number of times each page is written); potentially degrades the on-disk fill factor of pages
Allocate WT_REF structures from a separate allocation pool | WT_REF structures are generally long lived, so allocating them from a different pool avoids sparsely populated spans where WT_REF allocations are interleaved with other allocations | Additional code complexity, and different behavior with different allocators

bruce.lucas@10gen.com commented on Mon, 11 Jan 2016 19:36:55 +0000:

Instrumented the code to record a string tag with each allocated block to identify its origin (specifically filename and line number). Wrote some code to scan the heap to collect statistics related to allocated blocks.
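The diagnostic code itself is not attached to the ticket; as a rough sketch of the approach (hypothetical names throughout - real span accounting would additionally require walking tcmalloc's internal span map), the tagging might look like:

// tagged_alloc.h - hypothetical sketch of the tagging instrumentation
#include <cstdio>
#include <cstdlib>
#include <map>
#include <mutex>
#include <string>
#include <utility>

struct TagStats { size_t objs = 0, bytes = 0; };  // live totals per site

static std::mutex g_mu;
static std::map<std::string, TagStats> g_by_tag;                  // "file:line" -> stats
static std::map<void*, std::pair<std::string, size_t>> g_blocks;  // block -> (tag, size)

// Allocate a block and remember which source location asked for it.
inline void* tagged_alloc(size_t n, const char* file, int line) {
    void* p = std::malloc(n);
    std::string tag = std::string(file) + ":" + std::to_string(line);
    std::lock_guard<std::mutex> lk(g_mu);
    g_by_tag[tag].objs += 1;
    g_by_tag[tag].bytes += n;
    g_blocks[p] = {std::move(tag), n};
    return p;
}

inline void tagged_free(void* p) {
    std::lock_guard<std::mutex> lk(g_mu);
    auto it = g_blocks.find(p);
    if (it != g_blocks.end()) {
        TagStats& s = g_by_tag[it->second.first];
        s.objs -= 1;
        s.bytes -= it->second.second;
        g_blocks.erase(it);
    }
    std::free(p);
}

// Drop-in wrapper for the allocation sites under investigation.
#define TAGGED_ALLOC(n) tagged_alloc((n), __FILE__, __LINE__)

// Dump live objects/bytes per allocation site, analogous to the
// "tagged allocation info" output below.
inline void dump_tags() {
    std::lock_guard<std::mutex> lk(g_mu);
    for (const auto& kv : g_by_tag)
        std::printf("tag %s: alloc: %zu objs, %zu bytes\n",
                    kv.first.c_str(), kv.second.objs, kv.second.bytes);
}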
Running the repro described above, where we create a tree in memory and then replace it with a tree read from disk, at the peak fragmentation just before the old tree is completely evicted, selected stats show the worst offenders:

MALLOC: +   2951831800 ( 2815.1 MiB) Bytes in central cache freelist

=============
Total size of freelists
class   1 [        8 bytes ] :     1964 objs;    0.0 MiB;    0.0 cum MiB
class   2 [       16 bytes ] :      789 objs;    0.0 MiB;    0.0 cum MiB
class   3 [       32 bytes ] :    41356 objs;    1.3 MiB;    1.3 cum MiB
class   4 [       48 bytes ] : 24726216 objs; 1131.9 MiB; 1133.2 cum MiB
class   5 [       64 bytes ] : 25030513 objs; 1527.7 MiB; 2660.9 cum MiB
class   6 [       80 bytes ] :  1133994 objs;   86.5 MiB; 2747.4 cum MiB
class   7 [       96 bytes ] :    87440 objs;    8.0 MiB; 2755.4 cum MiB
...

=============
tagged allocation info
class 4 (48 bytes); 39131 spans, 641122304 bytes, 611.422 MiB
tag src/third_party/wiredtiger/src/btree/row_modify.c:274: alloc: 1255532 objs, 60265536 bytes, 57.474 MiB; 38391 spans, 628998144 span bytes, 599.859 MiB
tag src/third_party/wiredtiger/src/btree/row_key.c:479: alloc: 30910 objs, 1483680 bytes, 1.415 MiB; 7922 spans, 129794048 span bytes, 123.781 MiB
class 5 (64 bytes); 51450 spans, 842956800 bytes, 803.906 MiB
tag src/third_party/wiredtiger/src/btree/row_modify.c:246: alloc: 1173382 objs, 75096448 bytes, 71.618 MiB; 50009 spans, 819347456 span bytes, 781.391 MiB
tag src/third_party/wiredtiger/src/btree/bt_split.c:765: alloc: 87277 objs, 5585728 bytes, 5.327 MiB; 19882 spans, 325746688 span bytes, 310.656 MiB

In other words, there are 2815 MB in the central free list, mostly tiny objects in class 4 (48 bytes) and class 5 (64 bytes). This is because as we read the new tree in from disk and evict the old tree currently in memory, we need to allocate 32 kB buffers, but we can't use the memory that held the tiny buffers until essentially all of them are freed, because that memory is fragmented by the remaining small buffers.

The tagged allocation info gives us file name and line number, tells us for each allocation site how many bytes of that allocation are currently active, and also tells us how sparsely allocated they are, that is, how many bytes of spans those allocated blocks are spread out among. Converting filename and line number to the name of the data structure, and reformatting the last four lines above a bit:

71 MiB of WT_INSERT are spread out among 781 MiB of spans
57 MiB of WT_UPDATE are spread out among 599 MiB of spans
 5 MiB of WT_REF    are spread out among 310 MiB of spans
 1 MiB of WT_IKEY   are spread out among 123 MiB of spans

In other words, we have freed almost all of the WT_INSERT, WT_UPDATE, WT_REF, and WT_IKEY structures at this point - the amount of memory those buffers are using is small - but they are spread out among a large amount of memory, tying it up and preventing it from being used for 32 kB allocations. To see how sparse this is, take the WT_INSERT line: 819347456 span bytes across 50009 spans is 16 kB per span, or 256 64-byte slots each, while 1173382 live objects averages only about 23 live objects per span - so each span is roughly 9% occupied, with the remaining ~91% free but pinned.

abcfy2 commented on Thu, 12 Nov 2015 01:48:35 +0000:

MongoDB 3.2 will be released soon - is there any new progress on this issue?

alexander.gorrod commented on Mon, 12 Oct 2015 04:02:31 +0000:

For the record, I was hoping that jemalloc would show a different performance profile than tcmalloc for this workload. It did not - the timeseries plot was almost identical.

bruce.lucas@10gen.com commented on Fri, 9 Oct 2015 18:43:24 +0000:

I think it helps some, but it doesn't completely solve the problem:
- From a memory-pressure performance perspective, the fact that it eventually releases the memory is good, but to some extent by that point the harm has already been done: the kernel will already have had to evict data from the file cache due to the increased memory pressure, even if transient.
- The amount of time that it's in that state may not be that brief - in my test below with a 10 GB cache it was a couple of minutes, and I would expect it to be longer with the larger caches that are more typical.
- If my analysis below is correct, it helps with this particular test because the test is extreme: it eventually evicts all of the old collection, allowing the memory to be decommitted. A less extreme workload may never evict an entire collection, leaving memory fragmented by the partial contents of that collection.
- From an OOM perspective it doesn't help - OOM is OOM, no matter how brief.

Here are some stats from a 10 GB run with aggressive decommit enabled:

At A we begin reading the collection in from disk, evicting the collection that is filling the cache. From A to B we see a lot of frees, and the central free list builds to 7 GB, but that space:
- is not reused for the data being read from disk, I guess because it is dedicated to buffer sizes too small for that purpose?
- is not decommitted, I assume because the pages in the free list are not yet completely empty.

At B however this changes: the central free cache begins to drop, and unmapped bytes correspondingly rise, presumably because entire pages are now becoming empty and so are decommitted. Coinciding with this, starting at B we see that we are evicting internal pages.

Theory: as we allocate internal pages, they end up interspersed among the leaf pages. When we begin evicting, we first evict a lot of leaf pages, and only at B do we begin to evict the internal pages. This means that from A to B we accumulate a lot of pages that are mostly empty except for some sparse internal pages (the last stat shows that of the 10 GB of cache only 44 MB is internal pages), and these pages can neither be decommitted nor reused for the new collection.

If that is correct, some possibly naive thoughts about how it could be fixed:
- Use a memory allocator that supports separate heaps, and allocate the internal pages from a separate heap. (Unknown how that would interact with separate thread heaps as in tcmalloc...)
- Use some kind of sub-allocation scheme for internal pages, where large buffers are obtained for internal pages and then subdivided by WT, in order to keep all the internal pages together.
- More aggressively evict internal pages.

alexander.gorrod commented on Fri, 9 Oct 2015 06:02:24 +0000:

I've done some testing with this use case, and I believe there is a TCMalloc setting that can help.

I reduced the size of the test case to 3GB, and with current MongoDB I see memory growth similar to that reported with the 10GB test case. The issue is that towards the end of the test, memory use spikes well above the configured 3GB cache size - in my testing, to 4.84GB. That additional memory appears in the tcmalloc pageheap_free_bytes statistic. The TCMalloc page heap is the heap used to service memory allocations greater than 32k. There is a configuration option to TCMalloc called "aggressive decommit" that causes the page heap to not hold memory.

I ran the reproducer with the head of the MongoDB master branch, and generated the following timeseries: you can see that towards the end of the run the resident memory bumps up to 4.8GB and stays there. I did another run where I turned on the aggressive reclaim flag for TCMalloc, and it generated the following timeseries graph: the resident set size still bumped up at the end of the run, but it quickly returned to the baseline level.
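For reference, with gperftools the same flag can also be flipped at runtime through the MallocExtension numeric property; a minimal sketch, assuming the gperftools headers are available and that this property is the knob the tcmallocAggressiveMemoryDecommit parameter in the patch below wraps:

// Sketch: enabling tcmalloc aggressive decommit at runtime via gperftools.
#include <gperftools/malloc_extension.h>

void enableAggressiveDecommit() {
    // Ask tcmalloc to release free page-heap spans back to the OS
    // immediately instead of caching them (on by default in gperftools 2.4).
    MallocExtension::instance()->SetNumericProperty(
        "tcmalloc.aggressive_memory_decommit", 1);
}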
The change to enable the aggressive reclaim flag is simple:

--- a/src/mongo/util/tcmalloc_set_parameter.cpp
+++ b/src/mongo/util/tcmalloc_set_parameter.cpp
@@ -125,6 +125,10 @@ MONGO_INITIALIZER_GENERAL(TcmallocConfigurationDefaults,
     if (getenv("TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES")) {
         return Status::OK();
     }
+    Status status = tcmallocAggressiveMemoryDecommit.setFromString("1");
+    if (!status.isOK()) {
+        return status;
+    }
     return tcmallocMaxTotalThreadCacheBytesParameter.setFromString("0x40000000" /* 1024MB */);
 }

The aggressive reclaim flag is also enabled by default in the newer release of TCMalloc (gperftools 2.4). bruce.lucas Do you think that enabling the aggressive reclaim flag would be enough to resolve this issue?

alexander.gorrod commented on Fri, 9 Oct 2015 05:49:05 +0000:

TCMalloc aggressive reclaim on/off

bruce.lucas@10gen.com commented on Sun, 6 Sep 2015 21:30:27 +0000:

Issue also occurs under 3.1.7.

bruce.lucas@10gen.com commented on Sun, 6 Sep 2015 20:24:16 +0000:

Repro script:

db=/ssd/db
gb=10
threads=50

function start {
    killall -9 -w mongod
    rm -rf $db $db.log
    mkdir -p $db
    mongod --dbpath $db --logpath $db.log --storageEngine wiredTiger --nojournal \
        --wiredTigerCacheSizeGB $gb --fork
}

function monitor {
    mongo >ss.log --eval \
        "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(1000*1)}" &
}

# generate a collection
function make {
    cn=$1
    (
        for t in $(seq $threads); do
            mongo --eval "
                c = db['$cn']
                c.insert({})
                every = 10000
                for (var i=0; c.stats().size < $gb*1000*1000*1000; i++) {
                    var bulk = c.initializeUnorderedBulkOp();
                    for (var j=0; j<every; j++, i++)
                        bulk.insert({})
                    bulk.execute();
                    if ($t==1)
                        print(c.stats(1024*1024).size)
                }
            " &
        done
        wait
    )
}

# scan a collection to load it
function load {
    cn=$1
    mongo --eval "
        c = db['$cn']
        print(c.find({x:0}).itcount())
    "
}

start       # start mongod
monitor     # monitor serverStatus
make ping   # generate a 10 GB collection
make pong   # generate another 10 GB collection, filling 10 GB cache
sleep 120   # sleep a bit to wait for writes; makes stats clearer
load ping   # scan first 10 GB collection to load it back into cache