...
While running the load phase of a YCSB test, WiredTiger encountered an error and panicked. This has happened 3 out of 4 times, so it seems fairly reproducible. The workload does a large number of inserts of documents that are 1K each. The mongod log reports some slow inserts (>100 msec) and then shows the following lines about WiredTiger. This was seen on an EC2 instance running the Enterprise version of mongod (3.1.7-pre, git version: 1b5b55cb48ff9daca74177b86d336613d207c5d1).

    2015-08-11T08:53:47.184+0000 E STORAGE [thread1] WiredTiger (-31802) [1439283227:184443][18776:0x7fdf70a65700], file:WiredTiger.wt, WT_SESSION.checkpoint: WiredTiger.wt read error: failed to read 4096 bytes at offset 32768: WT_ERROR: non-specific WiredTiger error
    2015-08-11T08:53:47.184+0000 E STORAGE [thread1] WiredTiger (-31802) [1439283227:184629][18776:0x7fdf70a65700], checkpoint-server: checkpoint server error: WT_ERROR: non-specific WiredTiger error
    2015-08-11T08:53:47.184+0000 E STORAGE [thread1] WiredTiger (-31804) [1439283227:184661][18776:0x7fdf70a65700], checkpoint-server: the process must exit and restart: WT_PANIC: WiredTiger library panic
    2015-08-11T08:53:47.184+0000 I - [conn9] Fatal Assertion 28559
    2015-08-11T08:53:47.184+0000 I - [conn16] Fatal Assertion 28559

On the console where the YCSB test was run, the following errors repeated many times around the time of the problem and are most likely related. The interesting part of the message is the "Caused by: java.io.EOFException", which could correspond to the error in the mongod log.

    com.mongodb.MongoException$Network: Operation on server localhost:27017 failed
        at com.mongodb.DBTCPConnector.doOperation(DBTCPConnector.java:215)
        at com.mongodb.DBCollectionImpl.writeWithCommandProtocol(DBCollectionImpl.java:461)
        at com.mongodb.DBCollectionImpl.insertWithCommandProtocol(DBCollectionImpl.java:426)
        at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:188)
        at com.mongodb.DBCollectionImpl.insert(DBCollectionImpl.java:167)
        at com.mongodb.DBCollection.insert(DBCollection.java:93)
        at com.mongodb.DBCollection.insert(DBCollection.java:78)
        at com.mongodb.DBCollection.insert(DBCollection.java:120)
        at com.yahoo.ycsb.db.MongoDbClient.insert(MongoDbClient.java:246)
        at com.yahoo.ycsb.DBWrapper.insert(DBWrapper.java:148)
        at com.yahoo.ycsb.workloads.CoreWorkload.doInsert(CoreWorkload.java:461)
        at com.yahoo.ycsb.ClientThread.run(Client.java:269)
    Caused by: java.io.EOFException
        at org.bson.io.Bits.readFully(Bits.java:75)
        at org.bson.io.Bits.readFully(Bits.java:50)
        at org.bson.io.Bits.readFully(Bits.java:37)
        at com.mongodb.Response.<init>(Response.java:42)
        at com.mongodb.DBCollectionImpl.receiveWriteCommandMessage(DBCollectionImpl.java:532)
        at com.mongodb.DBCollectionImpl.access$300(DBCollectionImpl.java:48)
        at com.mongodb.DBCollectionImpl$2.execute(DBCollectionImpl.java:474)
        at com.mongodb.DBCollectionImpl$2.execute(DBCollectionImpl.java:461)
        at com.mongodb.DBPort.doOperation(DBPort.java:187)
        at com.mongodb.DBTCPConnector.doOperation(DBTCPConnector.java:208)
        ... 11 more
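The client-side EOFException is consistent with mongod aborting mid-operation; when checking whether a run hit this same failure, something along the following lines pulls the relevant lines out of the server log. This is only a sketch; the log path here matches the dbpath shown in the directory listing later in the ticket and would need adjusting for other setups:

    # Look for the WiredTiger panic and the fatal assertion that follows it.
    grep -E 'WT_PANIC|WT_ERROR|Fatal Assertion' ~/data/testdb/mongod.log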
chung-yen.chang commented on Tue, 18 Aug 2015 20:23:23 +0000: I tried the suggestion by michael.cahill to use "killall -w" in the script. After that I have rerun the same script multiple times without hitting this issue again. At this point, I think it is safe to assume that the problem was a duplicate of the earlier issue Michael mentioned: the mongod being killed didn't die cleanly and somehow messed up the new one that was being started.

chung-yen.chang commented on Fri, 14 Aug 2015 13:12:41 +0000: michael.cahill Mm, that sounds possible, but I do have some sleep time between steps because of some other issues I ran into before. Let me give this a try and hopefully that's it. I will convert the killall in my script to include "-w". Fingers crossed; I will update when I am back online. Thanks.

michael.cahill commented on Fri, 14 Aug 2015 10:33:02 +0000: chung-yen.chang, we have seen errors similar to this if the old mongod has not actually shut down before the new one starts. An old mongod can overwrite WiredTiger.turtle because the manual rm -rf has removed the lock file that normally protects against this kind of thing. Can you check if killall -w avoids this error?

keith.bostic commented on Fri, 14 Aug 2015 02:06:42 +0000: chung-yen.chang, are you running a standard YCSB workload, or is this something a little different? Also, would it be possible for me to get ssh access to your instance and run it myself? Thanks!

michael.cahill commented on Thu, 13 Aug 2015 23:41:19 +0000: Thanks for the additional information, chung-yen.chang. From the log, it looks like WiredTiger recovery ran for 15 minutes on startup, but MongoDB didn't detect an unclean shutdown. That is a little odd, but doesn't explain what is happening later. Then the workload runs for over two hours (meaning probably ~100 checkpoints complete successfully) before a checkpoint fails. This failure indicates that, as part of a checkpoint, WiredTiger is expecting to read a block from its metadata (the "WiredTiger.wt" file). The file is 32KB and we're trying to read a block of 4KB that starts at the 32KB offset, so the read fails. This indicates that either WiredTiger's metadata has become corrupted (specifically, WiredTiger.turtle is out of sync with WiredTiger.wt) or that the WiredTiger.wt file has been truncated incorrectly since the last checkpoint completed. keith.bostic, how would you suggest tracking this down? Could chung-yen.chang turn on some categories of verbose messages while reproducing?

chung-yen.chang commented on Thu, 13 Aug 2015 23:05:43 +0000: michael.cahill, this turned out to be uglier than I thought. I was about to give up because I hadn't seen this problem for 5 tries in a row, and then it hit again. These transient problems are the worst to debug. Anyway, I have attached the mongod log from this recent occurrence. You also asked for the size of the files; here is the "ls -l" output of the database directory. There is about 64G of space left on the volume.
    [ec2-user@ip-10-2-1-226 16th_1b5b55c_1]$ ls -l ~/data/testdb
    total 4481244
    -rw-rw-r-- 1 ec2-user ec2-user      16384 Aug 13 00:17 collection-0-7422038257504144638.wt
    -rw-rw-r-- 1 ec2-user ec2-user 4468166656 Aug 13 00:18 collection-2-7422038257504144638.wt
    -rw-rw-r-- 1 ec2-user ec2-user      16384 Aug 13 00:17 index-1-7422038257504144638.wt
    -rw-rw-r-- 1 ec2-user ec2-user  112893952 Aug 13 00:18 index-3-7422038257504144638.wt
    drwxrwxr-x 2 ec2-user ec2-user       4096 Aug 13 00:18 journal
    -rw-rw-r-- 1 ec2-user ec2-user      16384 Aug 13 00:17 _mdb_catalog.wt
    -rwxrwxr-x 1 ec2-user ec2-user          6 Aug 13 00:15 mongod.lock
    -rw-rw-r-- 1 ec2-user ec2-user    3090617 Aug 13 00:18 mongod.log
    -rw-rw-r-- 1 ec2-user ec2-user      32768 Aug 13 00:18 sizeStorer.wt
    -rw-rw-r-- 1 ec2-user ec2-user         95 Aug 13 00:15 storage.bson
    -rw-rw-r-- 1 ec2-user ec2-user         45 Aug 13 00:15 WiredTiger
    -rw-rw-r-- 1 ec2-user ec2-user        533 Aug 13 00:15 WiredTiger.basecfg
    -rw-rw-r-- 1 ec2-user ec2-user         21 Aug 13 00:15 WiredTiger.lock
    -rw-rw-r-- 1 ec2-user ec2-user        924 Aug 13 00:17 WiredTiger.turtle
    -rw-rw-r-- 1 ec2-user ec2-user      32768 Aug 13 00:18 WiredTiger.wt

My next step is to try to reproduce this again and capture a core dump when it happens (I just learned how to enable that after this last attempt). If there is any other information that would help, please let me know.

chung-yen.chang commented on Wed, 12 Aug 2015 17:23:19 +0000: michael.cahill, after hitting this 3 out of 4 times, it stopped showing up. I tried to create a simple insert script for a repro; it didn't hit. I re-ran the original YCSB load twice, and it didn't hit. When I hit it 3 out of 4 times, I thought it would be easy to reproduce, but now it's proving otherwise. I will try it again and see if I can reproduce it, then capture the log and check the file size when it happens.

michael.cahill commented on Wed, 12 Aug 2015 11:15:44 +0000: chung-yen.chang, if you can reproduce this, the full mongod.log would be helpful to start with. In particular, was there ever an unclean shutdown? It would also be good to know the size of the WiredTiger.wt file after this failure: is it 32KB or less in size?
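Tying the diagnosis and the eventual fix together: a restart sequence along the lines michael.cahill suggests waits for the old mongod to actually exit before the data directory is removed and a new one is started. The test harness itself is not shown in this ticket, so the paths and mongod options below are assumptions for illustration only:

    # Wait (-w) for every mongod to exit before touching the dbpath; a plain
    # "killall mongod" can return while the old process is still running, and
    # the rm -rf then removes the lock file that would otherwise stop that old
    # process from overwriting WiredTiger.turtle under the new instance.
    killall -w mongod
    rm -rf ~/data/testdb
    mkdir -p ~/data/testdb
    mongod --dbpath ~/data/testdb --logpath ~/data/testdb/mongod.log --fork

    # After a failure, check the metadata file size as requested above: a
    # 4096-byte read at offset 32768 runs past the end of a 32768-byte file,
    # which matches the panic message in the description.
    stat -c %s ~/data/testdb/WiredTiger.wt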
Run the YCSB load phase (I suspect any workload that does a large number of inserts with similar-sized documents of ~1K would do the same) on a standalone mongod.
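For concreteness, a load phase of this shape can be driven with the stock YCSB mongodb binding roughly as follows; the record count, thread count, and connection URL are assumptions rather than values taken from the report. The default CoreWorkload settings (10 fields of 100 bytes each) produce documents of roughly 1KB, matching the description above:

    # Hypothetical invocation from a YCSB checkout; adjust paths and counts.
    ./bin/ycsb load mongodb -s \
        -P workloads/workloada \
        -p recordcount=5000000 \
        -p mongodb.url=mongodb://localhost:27017/ycsb \
        -threads 16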