...
Under an update-only workload on the primary it is possible to starve out readers on a secondary and cause large replication delays.

Workload:

On Primary:
  1000 small updates (integer sets, increments and an unset)
  10 massive (500KB) updates

On Secondary:
  10000 findOne's
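For illustration, a rough mongo shell sketch of the primary-side workload described above (the actual load is driven by the attached repro.rb, so the field names, document shape and exact update mix here are assumptions):

// Hypothetical sketch only; the real workload is driven by repro.rb.
var INT_MAX = 10000000;                            // same _id range as the seed script below
var bigString = new Array(500 * 1024).join("x");   // roughly 500KB of payload
while (true) {
    // 1000 small updates: an integer set, an increment and an unset
    for (var i = 0; i < 1000; i++) {
        var id = Math.floor(Math.random() * INT_MAX);
        db.test.update({ _id: id }, { $set: { x: i }, $inc: { y: 1 }, $unset: { big: "" } });
    }
    // 10 massive (~500KB) updates
    for (var j = 0; j < 10; j++) {
        var id2 = Math.floor(Math.random() * INT_MAX);
        db.test.update({ _id: id2 }, { $set: { big: bigString } });
    }
}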
bruce.lucas@10gen.com commented on Mon, 14 Dec 2015 19:38:57 +0000: david.hows, there was a bug (SERVER-21305) in the computation of timeAcquiringMicros that makes the actual number incorrect; fixed in 3.0.8. However, you can still use it to determine whether there is queuing on the global lock; you just won't be able to easily determine the length of the queue. SERVER-21859 means that you can't use the globalLock.currentQueue stat to determine that, because time spent queueing on every lock but the global lock shows up in globalLock.currentQueue. zhifan, you are correct that it is expected that a continuous replication workload can starve secondary reads. I think the goal is to determine why, even when there seems to be excess capacity at the secondary to serve reads, that is not the case: as it appears, the secondary sometimes struggles to keep up with the primary, as indicated by growing replication lag, at which point secondary reads become starved.

zhifan commented on Mon, 14 Dec 2015 08:05:46 +0000: Hi Hows: Why don't we consider this expected, since whenever the slave is applying an oplog batch it blocks all reads on the slave? In your test case there are continuous updates on the primary, so there are probably continuous updates on the slave, and the reads are therefore extremely likely to be blocked (waiting for the global lock).

david.hows commented on Mon, 14 Dec 2015 06:59:56 +0000: Just to untangle things and provide a checkpoint on what I've found: there are currently two metrics that I have been using to track the issue at hand. I will focus on the single-threaded case that I have been working with today, running MongoDB 3.2:
- When load is being applied to the primary, replication lag of 30+ seconds is seen.
- When load is being applied to the primary, secondary reads sporadically take very long times to execute.
Long Operations on Secondary under 3.2

2015-12-14T17:11:49.567+1100 I COMMAND [conn8] command test.test command: find { find: "test", filter: { _id: 5865940.0 }, limit: 1.0, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 idhack:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:143 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 814498 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 814ms
2015-12-14T17:11:50.205+1100 I COMMAND [conn8] command test.test command: find { find: "test", filter: { _id: 5348929.0 }, limit: 1.0, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 idhack:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:143 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 1003 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_command 569ms
2015-12-14T17:11:51.086+1100 I COMMAND [conn8] command test.test command: find { find: "test", filter: { _id: 9562842.0 }, limit: 1.0, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 idhack:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:143 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 1399 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_command 785ms
2015-12-14T17:11:52.624+1100 I COMMAND [conn8] command test.test command: find { find: "test", filter: { _id: 6098631.0 }, limit: 1.0, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 idhack:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:143 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 1530898 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 1531ms
2015-12-14T17:11:53.366+1100 I COMMAND [conn8] command test.test command: find { find: "test", filter: { _id: 8450642.0 }, limit: 1.0, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 idhack:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:143 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 741045 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 741ms
2015-12-14T17:11:54.843+1100 I COMMAND [conn8] command test.test command: find { find: "test", filter: { _id: 3164850.0 }, limit: 1.0, singleBatch: true } planSummary: IDHACK keysExamined:1 docsExamined:1 idhack:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:143 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 1472757 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_command 1472ms

The most interesting thing about these operations is that four of the six show almost all of the operation's execution time spent in timeAcquiringMicros. I'm uncertain whether the other two are due to another bug in how timeAcquiringMicros is calculated, whether those operations are legitimately taking hundreds of milliseconds for an _id lookup, or whether this is a variation on SERVER-21859 or something similar.
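One way to watch this queueing while the workload runs is to poll serverStatus on the secondary once a second; a minimal sketch (note that, per the comment above, the absolute timeAcquiringMicros values are unreliable before 3.0.8 because of SERVER-21305):

// Sketch: sample global lock queueing and acquisition stats on the secondary.
while (true) {
    var ss = db.serverStatus();
    var g = (ss.locks && ss.locks.Global) || {};   // Global section may omit zero-valued modes
    print(new Date().toISOString()
          + " currentQueue r/w: " + ss.globalLock.currentQueue.readers
          + "/" + ss.globalLock.currentQueue.writers
          + " Global timeAcquiringMicros: " + tojson(g.timeAcquiringMicros));
    sleep(1000);
}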
I also tested with the latest MongoDB master and develop, and this showed some initial improvements, but over several runs it turned out there was little difference from 3.2. The only noticeable difference was that after a longer run-time the repl lag seems to stabilise under the Master+Develop version. Under the single-threaded version, the slowdown on the secondaries is very strongly associated with checkpoints in WT (as per the attached ss-singlethread.html); this may also explain the lack of high timeAcquiringMicros on some of the above queries. As it stands I'm not certain whether this is SERVER-18908/SERVER-18983, but my suspicion is growing that bruce.lucas is correct, as the high timeAcquiringMicros is for a global lock and is consistent with the explanation above.

david.hows commented on Mon, 14 Dec 2015 03:45:57 +0000: alexander.gorrod, I have amended my comment to make my thoughts clearer. I suspect it is less likely that SERVER-18983 is involved, as that ticket's current focus (as I read it) is on adding a method to apply the oplog on secondaries in parallel. Other than the size of the batch jobs that I am working with, there should be no other parallelisation within the workload I tested with.

alexander.gorrod commented on Mon, 14 Dec 2015 03:25:19 +0000: Thanks for digging deeper, david.hows. Does the above information make it more or less likely that this is the same problem being tracked by SERVER-18983?

david.hows commented on Mon, 14 Dec 2015 01:49:14 +0000: Modified the script today to do some further diving. Even with only the one thread inserting, I can see times where it takes over 60 seconds to finish the small burst of 10K reads (baseline is between 1 and 2 seconds), and I can see repl lag spiking greatly during that time. It does seem that the system is able to catch up eventually, with at most somewhere near a minute's delay, but this load is one primary receiving only one thread's worth of update operations, with the secondary still having trouble keeping pace, which means that this may not exclusively be an issue of parallel on primaries vs serial on secondaries.

bruce.lucas@10gen.com commented on Fri, 11 Dec 2015 12:34:06 +0000: Replication lag builds under the insert workload, as indicated by the repl buffer stats. (The 3.2 ftdc data will have a more direct indication of this.) This means replication is struggling, and that is probably what is starving the reads, because we "stop the world" using the global lock during replication. This does not show up as read queues because of SERVER-21859, but the global lock time acquiring stat supports this, as you point out. The WT concurrentTransactions metric is also consistent with this - the reads are not queuing in WT. The main operations done by replication are applying the ops and inserting them into the oplog. Time spent applying the ops is reported in "repl apply batches totalMillis", but this is not large, so the replication lag is most likely related to inserting ops into the oplog, and that is also consistent with the customer's observation. If the above is correct, this is a known issue, and there are a couple of tickets tracking work in this area - SERVER-18908, SERVER-18983. In this case, for example, you have 20 threads on the primary inserting into the oplog in parallel whereas that is serialized on the secondary, so that may account for the lag. In any case stack traces would of course help diagnose this.
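For reference, the replication-side counters mentioned above ("repl apply batches totalMillis" and the repl buffer stats) can be sampled the same way; a minimal sketch, assuming the metrics.repl layout of serverStatus in 3.0/3.2:

// Sketch: watch oplog apply time and replication buffer depth on the secondary.
while (true) {
    var r = db.serverStatus().metrics.repl;
    print(new Date().toISOString()
          + " apply batches totalMillis: " + r.apply.batches.totalMillis
          + " buffer count: " + r.buffer.count
          + " buffer sizeBytes: " + r.buffer.sizeBytes);
    sleep(1000);
}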
david.hows commented on Fri, 11 Dec 2015 05:39:17 +0000: The current working hypothesis is that some delay in the processing of replication operations is the cause. The aim with the reproduction workload was to generate a large number of deletes at once when truncating the oplog, to allow for the insertion of a new very large document in an update-only workload. Initially, I suspected this may have been related to tombstones within the oplog collection in WiredTiger, hence the large/small update workload in a replica set. However, as SERVER-21027 is fixed in 3.2.0, I am uncertain whether this is the cause.

david.hows commented on Fri, 11 Dec 2015 05:18:26 +0000: Attaching the results of timeseries on the serverStatus captures for 2 repro runs. The 3.0.7 run is far longer as it was the initial reproduction and some changes were made to the reproducer to get things going. You can also see a point when I stopped the reproduction to confirm that execution eventually returned to normal (it did). The 3.2.0 run is far more indicative, as it was just load and reproduce. Probably the most telling symptom at this stage is the Global timeAcquiringMicros r in the 3.2.0 timeseries attachment; it begins spiking as soon as reads start to fall behind.

david.hows commented on Fri, 11 Dec 2015 05:08:55 +0000: The repro script requires the MongoDB Ruby driver v2.2.0.
Launch the mongod's:

./mongod --replSet foo --storageEngine wiredTiger --dbpath /mnt/work/data/rs0 --oplogSize 1000 --wiredTigerCacheSizeGB 4 --logpath /mnt/work/data/rs0/mongod.log --fork
./mongod --replSet foo --storageEngine wiredTiger --dbpath /mnt/work/data/rs1 --oplogSize 1000 --wiredTigerCacheSizeGB 4 --port 27018 --logpath /mnt/work/data/rs1/mongod.log --fork

Set up the replset:

rs.initiate()
rs.add("localhost:27018")

Run the following javascript against the primary to seed documents:

var INT_MAX = 10000000;
for (x = 0; x < INT_MAX / 1000; x++) {
    var bulk = db.test.initializeUnorderedBulkOp();
    for (i = 0; i < 1000; i++) {
        var num = (x * 1000) + i;
        bulk.insert({ _id: num, x: num, y: num });
    }
    bulk.execute();
}

Execute the attached repro.rb.

Run the following javascript on the secondary to monitor performance. You should see initially stable query times, which will then blow out to seconds and minutes:

while (true) {
    var INT_MAX = 10000000;
    var start = new Date();
    rs.slaveOk();
    for (i = 0; i < 10000; i++) {
        var docNum = Math.floor(Math.random() * INT_MAX);
        db.test.findOne({ _id: docNum });
    }
    print("Took: " + (new Date() - start) + "ms");
    sleep(1000);
}
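Optionally, the other symptom (replication lag) can be watched from a second shell while repro.rb runs; a minimal sketch that computes per-member lag from the optimeDate fields reported by rs.status():

// Sketch: print each member's lag behind the most recent optime every 5 seconds.
while (true) {
    var members = rs.status().members;
    var latest = Math.max.apply(null, members.map(function(m) { return m.optimeDate; }));
    members.forEach(function(m) {
        print(m.name + " lag: " + ((latest - m.optimeDate) / 1000) + "s");
    });
    sleep(5000);
}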