...
#0 tcmalloc::SLL_Next (t=0x0) at src/third_party/gperftools-2.5/src/linked_list.h:45
#1 tcmalloc::SLL_PopRange (end=, start=, N=128, head=0x559aa7bbc7a8) at src/third_party/gperftools-2.5/src/linked_list.h:76
#2 tcmalloc::ThreadCache::FreeList::PopRange (end=, start=, N=128, this=0x559aa7bbc7a8) at src/third_party/gperftools-2.5/src/thread_cache.h:225
#3 tcmalloc::ThreadCache::ReleaseToCentralCache (this=this@entry=0x559aa7bbc700, src=src@entry=0x559aa7bbc7a8, cl=, N=N@entry=128) at src/third_party/gperftools-2.5/src/thread_cache.cc:195
#4 0x000055966f8bdd8c in tcmalloc::ThreadCache::ListTooLong (this=this@entry=0x559aa7bbc700, list=0x559aa7bbc7a8, cl=) at src/third_party/gperftools-2.5/src/thread_cache.cc:157
#5 0x000055966f8c6a0a in tcmalloc::ThreadCache::Deallocate (cl=, ptr=0x55b849f2a840, this=0x559aa7bbc700) at src/third_party/gperftools-2.5/src/thread_cache.h:393
#6 (anonymous namespace)::do_free_helper (invalid_free_fn=0x55966f8bf4b0 , size_hint=0, use_hint=false, heap_must_be_valid=true, heap=0x559aa7bbc700, ptr=0x55b849f2a840) at src/third_party/gperftools-2.5/src/tcmalloc.cc:1383
#7 (anonymous namespace)::do_free_with_callback (invalid_free_fn=0x55966f8bf4b0 , size_hint=0, use_hint=false, ptr=0x55b849f2a840) at src/third_party/gperftools-2.5/src/tcmalloc.cc:1415
#8 (anonymous namespace)::do_free (ptr=0x55b849f2a840) at src/third_party/gperftools-2.5/src/tcmalloc.cc:1423
#9 tc_free (ptr=0x55b849f2a840) at src/third_party/gperftools-2.5/src/tcmalloc.cc:1688
#10 0x000055966dfe4864 in __wt_free_int (session=session@entry=0x559671e720b0, p_arg=p_arg@entry=0x7f6949b8b6b8) at src/third_party/wiredtiger/src/os_common/os_alloc.c:327
#11 0x000055966e0449c8 in __wt_free_ref (session=session@entry=0x559671e720b0, ref=0x0, page_type=6, free_pages=free_pages@entry=false) at src/third_party/wiredtiger/src/btree/bt_discard.c:292
#12 0x000055966e043b4d in __wt_free_ref_index (session=session@entry=0x559671e720b0, page=page@entry=0x55ab30eff540, pindex=0x55bdaff1aa00, free_pages=free_pages@entry=false) at src/third_party/wiredtiger/src/btree/bt_discard.c:309
#13 0x000055966e043ef6 in __free_page_int (page=, session=0x559671e720b0) at src/third_party/wiredtiger/src/btree/bt_discard.c:234
#14 __wt_page_out (session=session@entry=0x559671e720b0, pagep=pagep@entry=0x55c4ad5cd940) at src/third_party/wiredtiger/src/btree/bt_discard.c:119
#15 0x000055966e04481a in __wt_ref_out (session=session@entry=0x559671e720b0, ref=ref@entry=0x55c4ad5cd940) at src/third_party/wiredtiger/src/btree/bt_discard.c:44
#16 0x000055966dfd87df in __evict_page_dirty_update (closing=false, ref=0x55c4ad5cd940, session=0x559671e720b0) at src/third_party/wiredtiger/src/evict/evict_page.c:433
#17 __wt_evict (session=session@entry=0x559671e720b0, ref=ref@entry=0x55c4ad5cd940, closing=closing@entry=false, previous_state=previous_state@entry=5) at src/third_party/wiredtiger/src/evict/evict_page.c:222
#18 0x000055966dfd05eb in __evict_page (session=session@entry=0x559671e720b0, is_server=is_server@entry=false) at src/third_party/wiredtiger/src/evict/evict_lru.c:2334
#19 0x000055966dfd0b43 in __evict_lru_pages (session=session@entry=0x559671e720b0, is_server=is_server@entry=false) at src/third_party/wiredtiger/src/evict/evict_lru.c:1185
#20 0x000055966dfd3957 in __wt_evict_thread_run (session=0x559671e720b0, thread=0x5596755760a0) at src/third_party/wiredtiger/src/evict/evict_lru.c:318
#21 0x000055966e02a9b9 in __thread_run (arg=0x5596755760a0) at src/third_party/wiredtiger/src/support/thread_group.c:31
#22 0x00007f694f18a6ba in start_thread (arg=0x7f6949b8c700) at pthread_create.c:333
#23 0x00007f694eec04dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb)
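For readers unfamiliar with tcmalloc internals, the crashing frames #0-#2 are walking a per-size-class free list that is threaded through the freed objects themselves: SLL_PopRange follows "next" pointers from the list head N-1 times, so SLL_Next (t=0x0) means the chain ended before the thread cache's length counter said it should, i.e. the free list had been corrupted (typically by a double free or a write to already-freed memory somewhere else in the process). The sketch below is a minimal, self-contained illustration of that failure mode; `sll_next` and `sll_pop_range` are simplified stand-ins written for this ticket, not the gperftools-2.5 functions themselves.

```cpp
// Simplified illustration only -- not the gperftools-2.5 implementation.
#include <cstdio>

// In tcmalloc, each freed object stores the pointer to the next free object
// in its own first word; the free list has no separate node allocations.
static void* sll_next(void* t) { return *static_cast<void**>(t); }

// Pop up to n objects from the list headed at *head. Unlike the real code,
// this sketch checks for a premature null "next" pointer instead of trusting
// the caller's length bookkeeping, so corruption is reported rather than
// turning into a segfault inside sll_next().
static int sll_pop_range(void** head, int n, void** start, void** end) {
  void* cur = *head;
  *start = cur;
  for (int i = 1; i < n; ++i) {
    void* next = sll_next(cur);
    if (next == nullptr) {  // chain ended early: list shorter than claimed
      *end = cur;
      *head = nullptr;
      return i;
    }
    cur = next;
  }
  *end = cur;
  *head = sll_next(cur);
  return n;
}

int main() {
  // Build a tiny free list of three "objects"; each slot holds the address
  // of the next one, and the last holds nullptr.
  void* objs[3] = {&objs[1], &objs[2], nullptr};
  void* head = &objs[0];

  // Pretend the thread cache believes 128 objects are cached (the N=128 seen
  // in frame #1) while the chain really holds only three.
  void* start = nullptr;
  void* end = nullptr;
  int got = sll_pop_range(&head, 128, &start, &end);
  std::printf("asked to pop 128 objects, chain actually held %d\n", got);
  return 0;
}
```

In the real crash the walk simply dereferences the null pointer, which is consistent with free-list corruption originating elsewhere rather than necessarily a bug in the WiredTiger eviction frames that happen to sit above it on the stack.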
dmitry.agranat commented on Wed, 27 Jan 2021 14:28:44 +0000:
Hi surajn.vnit@gmail.com, I will go ahead and close this ticket. Please reopen if the issue recurs after upgrading to 4.4.3.

dmitry.agranat commented on Wed, 20 Jan 2021 12:45:35 +0000:
Hi surajn.vnit@gmail.com, we believe the reported issue is related to WT-7049 and, based on the preliminary results, does not reproduce in 4.4. We suggest testing with 4.4.3 and reporting back with the results.

sergey@stripe.com commented on Tue, 19 Jan 2021 20:44:51 +0000:
@Dmitry Thanks! Re reproduction: we do not yet have a synthetic repro, and it's not possible to deploy an unpatched version into production where we're able to repro this group of crashes consistently. (However, I'm 95% certain our patches are not relevant here: we do not touch any WT code, and other high-load situations have not turned up any similar issues.)

dmitry.agranat commented on Tue, 19 Jan 2021 17:50:18 +0000:
surajn.vnit@gmail.com, we are still investigating this and I expect to have an update for you tomorrow.

dmitry.agranat commented on Mon, 18 Jan 2021 10:43:33 +0000:
Thanks surajn.vnit@gmail.com for the provided data and all the explanation. Given you are using a custom MongoDB binary, can you reproduce the same under the default and unchanged MongoDB binary?

sergey@stripe.com commented on Sat, 16 Jan 2021 01:52:58 +0000:
Hi Dima, I've uploaded an archive of the relevant data to the portal:
- the diagnostic.data archive (sadly, the files for the date of the crash seem to have been truncated, so I'm not sure if this is useful)
- the crash backtrace in mongod.log (I can attach any other specific info from the log if you can specify it)
- the `bt full` backtrace from gdb on the core file
- a directory of other gdb backtraces ('other-stacks') that I've collected from other crashes that appear related

Also, some context that we now have:
- This is a build of mongo 3.6.20 (with some small local patches that are known to be safe and stable).
- We think all of the segfaults are happening on machines that are experiencing a high amount of deletes. (The crash this ticket was opened for came from a one-off delete workload. The 'other-stacks' are from a pool of machines that has a regular deletion-heavy workload.)
- The machine with the crashed mongod is a 48-core instance.
- For the crashes in 'other-stacks', we noticed the segfaults started when the instance size was changed from 16-CPU machines to 36-CPU machines (with no other configuration changes), and they are happening at a steady rate in the cluster. (We do not have a synthetic repro yet.)

dmitry.agranat commented on Tue, 12 Jan 2021 12:15:58 +0000:
Hi surajn.vnit@gmail.com, would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location? In addition, please also attach the syslog covering the time of the reported event. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Thanks, Dima
Unknown at this point, but happening multiple times in production.