...
Observed in v3.6. Unable to reproduce on v3.4. $sample command never returns when run with explain(true). db.currentOps() shows: "planSummary": "MULTI_ITERATOR", "numYields": 18619890, numYields continuously increases.
xgen-internal-githook commented on Mon, 30 Jul 2018 22:37:42 +0000: Author: {'name': 'Ian Boros', 'email': 'ian.boros@10gen.com'} Message: SERVER-35967 Ban explain() on $sample when storage engine supports random cursors Branch: v3.6 https://github.com/mongodb/mongo/commit/9c3f90d24a36f07072a45dbc8c4694dbe1fc664b david.storch commented on Thu, 5 Jul 2018 19:02:09 +0000: One more note: In the process of investigating this issue, I found a separate problem with explaining $sample that affects 4.0 and master. See SERVER-35973. david.storch commented on Thu, 5 Jul 2018 18:31:57 +0000: After investigating, I can confirm that this problem was introduced by SERVER-19758, which added "executionStats" and "allPlansExecution" explain modes for aggregation explain. I can also confirm that it was fixed by SERVER-29421, meaning that this issue affects only the 3.6 branch. The problem occurs when the explain code attempts to run its underlying PlanExecutor in order to gather execution statistics: https://github.com/mongodb/mongo/blob/6405d65b1d6432e138b44c13085d0c2fe235d6bd/src/mongo/db/pipeline/document_source_cursor.cpp#L190 This PlanExecutor encapsulates an execution plan which simply reads from a random cursor—a RecordCursor obtained by calling RecordStore::getRandomCursor(). Existing random cursor implementations sample via a pseudorandom btree walk. They also never return EOF, but rather will sample infinitely (with replacement) as a caller iterates the cursor. The explain code is doing just this—in its attempt to gather stats from the cursor, it is stuck in an infinite loop here checking for an EOF that will never come. The problem is that the explain code path iterates the underlying random cursor directly, circumventing the code in DocumentSourceSampleFromRandomCursor which is responsible for duplicate elimination, and, importantly, for signaling EOF for $sample queries. SERVER-29421 fixes the problem by changing our strategy for explain execution. Specifically, this fix consolidates the explain code path with the regular execution code path, such that exec stats are gathered by running the query in much the same way as the regular, non-explain code path. david.storch commented on Thu, 5 Jul 2018 17:17:57 +0000: I can reproduce this issue on the 3.6 branch but not on the master branch. My initial guess is that this is a regression due to SERVER-19758 that was subsequently fixed by SERVER-29421. Assigning to myself to produce a more concrete root cause diagnosis. ramon.fernandez commented on Thu, 5 Jul 2018 16:58:51 +0000: I can confirm this behavior: in 3.4.4 the explain() command returns, in 3.6.5 doesn't.
db.c.drop() for (i = 0; i < 1000; i++ ) { db.c.insertOne({size:1})} db.c.explain(true).aggregate({$sample: {size: 10}})