...
The multiversion tests on the mongo v4.4 waterfall page kept failing after the recent v4.2 WiredTiger drop (with WT-5892) and the merge of SERVER-47006. WT-5926 was created to address the WiredTiger error captured by do_upgrade_downgrade.js in the latest commit build. There are other 10+ tests failed in the same task which we should investigate to understand.
sue.loverso commented on Thu, 26 Mar 2020 21:40:30 +0000: FYI, milkie says it really creates under 4.2, upgrades to 4.4 and does a write or two, then downgrades back to 4.2. sue.loverso commented on Thu, 26 Mar 2020 21:12:43 +0000: The suggested verbose should show what 4.2 (i.e. V4) log file it creates and what log files it archives/removes. The point of the downgrade, switching log files and forcing the checkpoint is to make all "other" version log files able to be removed. alexander.gorrod mentioned that we need log before the checkpoint now with DH. I don't know if that is there is something disallowing archiving other log files. So, the verbose plus the copied directory should let us know what's going on. alex.cameron commented on Thu, 26 Mar 2020 21:04:54 +0000: Sorry sue.loverso, my comment was missing some information. milkie is correct, we appear to successfully downgrade. And then when we start back up with 4.2, we see a new log that we can't handle yet. Thanks for the tips, I'll continue looking today. sue.loverso commented on Thu, 26 Mar 2020 18:20:41 +0000: alex.cameron once we understand what MongoDB is doing differently than my WT-only testing, we should make a test that reproduces this. The CONFIG I've been using is using timestamps and uses a small cache so the HS store is in play. sue.loverso commented on Thu, 26 Mar 2020 17:45:28 +0000: alex.cameron I have a couple suggestions for debugging this through MongoDB: First, turn on verbose output for logging immediately prior to the downgrade so that we can see what WiredTiger is doing. You need something like this (untested) diff: diff --git a/src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp b/src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp index 714e35a356..831b1c4c45 100644 --- a/src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp +++ b/src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp @@ -948,6 +948,7 @@ void WiredTigerKVEngine::cleanShutdown() { 1, "Downgrade compatibility configuration: {fileVersion_getDowngradeString}", "fileVersion_getDowngradeString"_attr = _fileVersion.getDowngradeString()); + invariantWTOK(_conn->reconfigure(_conn, "verbose=(log)")); invariantWTOK(_conn->reconfigure(_conn, _fileVersion.getDowngradeString().c_str())); } You need to add that additional line (as opposed to appending it the the downgrade string) because in reconfigure the verbose options are parsed last. So if it is part of one reconfigure call, the compatibility stuff will be complete before verbose logging is enabled. Second, modify (the js test? the resmoke script?) to save a copy of the database directory before the startup that gives the log version error. Then we can look at the verbose output and compare it to the log files as they were after shutdown and maybe get more understanding about the issue. sue.loverso commented on Thu, 26 Mar 2020 17:21:01 +0000: Ok, I found the original error you're seeing in some output: [js_test:3_upgrade_replset] 2020-03-26T14:45:42.178+0000 d20272| 2020-03-26T14:45:42.177+0000 E STORAGE [initandlisten] WiredTiger error (-31802) [1585233942:177917][13961:0x7fbcf2107bc0], connection: __log_open_verify, 926: unsupported WiredTiger file version: this build only supports versions up to 4, and the file is version 5: WT_ERROR: non-specific WiredTiger error Looking closer, I see this in the log for WiredTiger open. [js_test:3_upgrade_replset] 2020-03-26T14:45:41.505+0000 d20272| 2020-03-26T14:45:41.505+0000 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=3234M,cache_overflow=(file_max=0M),session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress],compatibility=(release=3.2,require_max="3.3.0"), Trying that myself it still works for me from the mongodb-4.2 tree: mongodb-4.2 % ./wt -h /tmp/RUNDIR -C "compatibility=(release=3.2,require_max=3.3.0),verbose=(log,recovery)" list -v [1585242709:16980][15512:0x7fff8dc13380], wt, txn-recover: Recovering file:WiredTiger.wt with id 0 @ (39, 768384) [1585242709:17235][15512:0x7fff8dc13380], wt, txn-recover: Recovering file:WiredTigerHS.wt with id 1 @ (39, 768384) [1585242709:17264][15512:0x7fff8dc13380], wt, txn-recover: Recovering file:wt.wt with id 2 @ (39, 768384) [1585242709:17339][15512:0x7fff8dc13380], wt, txn-recover: Set global recovery timestamp: (0, 0) metadata: access_pattern_hint=none,allocation_size=4KB,app_metadata=,assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,block_compressor=,cache_resident=false,checksum=uncompressed,collator=,columns=,dictionary=0,encryption=(keyid=,name=),format=btree,huffman_key=,huffman_value=,id=0,ignore_in_memory_cache_size=false,internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,key_format=S,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=0,log=(enabled=true),memory_page_image_max=0,memory_page_max=5MB,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=false,prefix_compression_min=4,split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,value_format=S,version=(major=1,minor=1),checkpoint=(WiredTigerCheckpoint.6=(addr="018581e4d74a769f8681e4424107b48781e4f27b2c96808080e26fc0cfc0",order=6,time=1585242709,size=8192,newest_durable_ts=0,oldest_start_ts=0,oldest_start_txn=0,newest_stop_ts=-1,newest_stop_txn=-11,write_gen=15)),checkpoint_backup_info=,checkpoint_lsn=(4294967295,2147483647) ... sue.loverso commented on Thu, 26 Mar 2020 16:44:48 +0000: There isn't any information about what errors you're actually seeing in here. I will describe here in detail what I did and what I'm seeing (via WiredTiger only mechanisms). Essentially, a simple printlog and metadata list test works. I have two trees: Current checkout of WiredTiger mongodb-4.2 branch, changeset 4c2230d9. Current checkout of WiredTiger develop, changeset b2e25e5. The mongodb-4.2 branch is unchanged. In the develop branch, I only modified test/format/wts.c to issue a reconfigure call just before connection close: diff --git a/test/format/wts.c b/test/format/wts.c index a01210b88..901b8bbc6 100644 --- a/test/format/wts.c +++ b/test/format/wts.c @@ -451,6 +451,7 @@ wts_close(void) const char *config; conn = g.wts_conn; + testutil_check(conn->reconfigure(conn, "compatibility=(release=3.3)")); config = g.c_leak_memory ? "leak_memory" : NULL; Build both trees. I then run test/format with a select CONFIG, but here's one generated CONFIG that yields a useful directory (a superset of mine with other values randomly generated): ############################################ # RUN PARAMETERS ############################################ abort=0 alter=0 assert_commit_timestamp=0 assert_read_timestamp=0 auto_throttle=1 backups=1 backup_incremental=log bitcnt=6 bloom=1 bloom_bit_count=15 bloom_hash_count=13 bloom_oldest=0 cache=50 cache_minimum=0 checkpoints=on checkpoint_log_size=37 checkpoint_wait=14 checksum=uncompressed chunk_size=4 compaction=0 compression=none data_extend=0 data_source=table delete_pct=23 dictionary=1 direct_io=0 encryption=none evict_max=2 file_type=row-store firstfit=0 huffman_key=0 huffman_value=0 independent_thread_rng=1 in_memory=0 insert_pct=0 internal_key_truncation=1 internal_page_max=10 isolation=snapshot key_gap=1 key_max=118 key_min=21 leaf_page_max=16 leak_memory=0 logging=1 logging_archive=0 logging_compression=none logging_file_max=1024 logging_prealloc=1 lsm_worker_threads=3 major_timeout=0 memory_page_max=8 merge_max=5 mmap=1 mmap_all=0 modify_pct=69 ops=0 prefix_compression=1 prefix_compression_min=1 prepare=0 quiet=1 random_cursor=0 read_pct=0 rebalance=0 repeat_data_pct=77 reverse=0 rows=230271 runs=1 salvage=0 split_pct=96 statistics=0 statistics_server=0 threads=15 timer=1 timing_stress_aggressive_sweep=0 timing_stress_checkpoint=0 timing_stress_hs_sweep=0 timing_stress_split_1=0 timing_stress_split_2=0 timing_stress_split_3=0 timing_stress_split_4=0 timing_stress_split_5=0 timing_stress_split_6=0 timing_stress_split_7=0 timing_stress_split_8=0 transaction_timestamps=1 transaction-frequency=100 truncate=1 value_max=1003 value_min=6 verify=0 wiredtiger_config=config_base=false,verbose=(log) write_pct=8 ############################################ Here's the verbose logging output from the first line when we do the reconfigure call until the test completes: 0x110b1e800:[1585239796:76782][15029:0x7fff8dc13380], t, WT_CONNECTION.reconfigure: log_remove: remove log ./WiredTigerPreplog.0000000061 0x110b1e800:[1585239796:77108][15029:0x7fff8dc13380], t, WT_CONNECTION.reconfigure: log_remove: remove log ./WiredTigerPreplog.0000000060 0x110b1fb60:[1585239796:77453][15029:0x70000d9d4000], t, log-server: opening log ./WiredTigerTmplog.0000000063 0x110b1fb60:[1585239796:97169][15029:0x70000d9d4000], t, log-server: log_allocfile: rename ./WiredTigerTmplog.0000000063 to ./WiredTigerLog.0000000061 0x110b1fb60:[1585239796:100801][15029:0x70000d9d4000], t, log-server: opening log ./WiredTigerLog.0000000061 0x110b1e800:[1585239796:101151][15029:0x7fff8dc13380], t, WT_CONNECTION.reconfigure: log_printf: COMPATIBILITY: Version now 4 0x110b1fb60:[1585239796:101477][15029:0x70000d9d4000], t, log-server: Missed 1. Now pre-allocating up to 2 0x110b1fb60:[1585239796:101518][15029:0x70000d9d4000], t, log-server: opening log ./WiredTigerTmplog.0000000064 0x110b1fb60:[1585239796:123450][15029:0x70000d9d4000], t, log-server: log_allocfile: rename ./WiredTigerTmplog.0000000064 to ./WiredTigerPreplog.0000000062 0x110b1fb60:[1585239796:123627][15029:0x70000d9d4000], t, log-server: opening log ./WiredTigerTmplog.0000000065 0x110b1fb60:[1585239796:145577][15029:0x70000d9d4000], t, log-server: log_allocfile: rename ./WiredTigerTmplog.0000000065 to ./WiredTigerPreplog.0000000063 0x110b20ec0:[1585239796:206326][15029:0x7fff8dc13380], t, WT_SESSION.checkpoint: log_force_sync: sync directory . to LSN 61/384 0x110b20ec0:[1585239796:214491][15029:0x7fff8dc13380], t, WT_SESSION.checkpoint: opening log ./WiredTigerLog.0000000061 0x110b20ec0:[1585239796:214538][15029:0x7fff8dc13380], t, WT_SESSION.checkpoint: log_force_sync: sync ./WiredTigerLog.0000000061 to LSN 61/384 0x110b20ec0:[1585239796:323461][15029:0x7fff8dc13380], t, compatibility-reconfig: log_truncate_files: Archive once up to 0 0x110b20ec0:[1585239796:323498][15029:0x7fff8dc13380], t, compatibility-reconfig: log_archive: archive to log number 61 0x110b20ec0:[1585239796:323901][15029:0x7fff8dc13380], t, compatibility-reconfig: log_remove: remove log ./WiredTigerLog.0000000038 ... [Other log_remove messages deleted for brevity.] 0x110b20ec0:[1585239796:337128][15029:0x7fff8dc13380], t, compatibility-reconfig: log_remove: remove log ./WiredTigerLog.0000000043 0x110b209e8:[1585239796:337943][15029:0x7fff8dc13380], t, close_ckpt: opening log ./WiredTigerLog.0000000061 0x110b209e8:[1585239796:337956][15029:0x7fff8dc13380], t, close_ckpt: log_force_sync: sync ./WiredTigerLog.0000000061 to LSN 61/3584 0x110b1e800:[1585239796:393911][15029:0x7fff8dc13380], t, WT_CONNECTION.close: closing log ./WiredTigerLog.0000000061 0x110b1e800:[1585239796:402255][15029:0x7fff8dc13380], t, WT_CONNECTION.close: closing log directory . 1: table, row-store (60 seconds) t: successful run completed The database directory looks like this: develop % ls RUNDIR CONFIG WiredTiger.wt WiredTigerPreplog.0000000063 WiredTiger WiredTigerHS.wt rand WiredTiger.lock WiredTigerLog.0000000061 wt.wt WiredTiger.turtle WiredTigerPreplog.0000000062 The log files all look correct to me. The only existing log file is correct. It is log version 4: develop % hexdump -C RUNDIR/WiredTigerLog.0000000061 00000000 80 00 00 00 7a fa 3c 46 00 00 00 00 00 00 00 00 |....z.<F........| 00000010 64 10 10 00 04 00 00 00 00 00 10 00 00 00 00 00 |d...............| <--- The 04 here is the version. 00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| * 00000080 80 00 00 00 1d 87 b4 d8 00 00 00 00 00 00 00 00 |................| 00000090 84 88 87 b8 e3 05 ee c0 00 00 00 00 00 00 00 00 |................| 000000a0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| * 00000100 80 00 00 00 c9 54 d8 8c 00 00 00 00 00 00 00 00 |.....T..........| 00000110 83 43 4f 4d 50 41 54 49 42 49 4c 49 54 59 3a 20 |.COMPATIBILITY: | 00000120 56 65 72 73 69 6f 6e 20 6e 6f 77 20 34 00 00 00 |Version now 4...| 00000130 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................| * A hexdump of the two WiredTigerPreplog files also show that both have version 4 in the header. Now, run the wt command from the mongodb-4.2 tree I built: mongodb-4.2 % ./wt -h /tmp/RUNDIR printlog [ { "lsn" : [61,128], "hdr_flags" : "", "rec_len" : 128, "mem_len" : 128, "type" : "system", "ops": [ { "optype": "prev_lsn", "prev_lsn": [60, 544256] } ] }, { "lsn" : [61,256], "hdr_flags" : "", "rec_len" : 128, "mem_len" : 128, "type" : "message", "message" : "COMPATIBILITY: Version now 4" }, { "lsn" : [61,384], "hdr_flags" : "", "rec_len" : 128, "mem_len" : 128, "type" : "system", "ops": [ { "optype": "checkpoint_start", } ] }, ... And listing the metadata also works: mongodb-4.2 % ./wt -h /tmp/RUNDIR list -v metadata: access_pattern_hint=none,allocation_size=4KB,app_metadata=,assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,block_compressor=,cache_resident=false,checksum=uncompressed,collator=,columns=,dictionary=0,encryption=(keyid=,name=),format=btree,huffman_key=,huffman_value=,id=0,ignore_in_memory_cache_size=false,internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,key_format=S,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=0,log=(enabled=true),memory_page_image_max=0,memory_page_max=5MB,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=false,prefix_compression_min=4,split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,value_format=S,version=(major=1,minor=1),checkpoint=(mine.2=(addr="018581e4f40fd00a8b81e4a5e727028881e4b5a9f07e808080e27fc0cfc0",order=2,time=1585239766,size=8192,newest_durable_ts=0,oldest_start_ts=0,oldest_start_txn=2,newest_stop_ts=-1,newest_stop_txn=-11,write_gen=4),WiredTigerCheckpoint.12=(addr="018081e4fe1f53728181e425dc3ab28381e49d7ef4d98281e46a90d4c0e2cfc0cfc0",order=12,time=1585240078,size=8192,newest_durable_ts=0,oldest_start_ts=0,oldest_start_txn=0,newest_stop_ts=-1,newest_stop_txn=-11,write_gen=30)),checkpoint_backup_info=,checkpoint_lsn=(4294967295,2147483647) colgroup:wt app_metadata=,collator=,columns=,source="file:wt.wt",type=file file:WiredTigerLAS.wt access_pattern_hint=none,allocation_size=4KB,app_metadata=,assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,block_compressor=snappy,cache_resident=false,checkpoint=,checkpoint_backup_info=,checkpoint_lsn=,checksum=uncompressed,collator=,columns=,dictionary=0,encryption=(keyid=,name=),format=btree,huffman_key=,huffman_value=,id=4,ignore_in_memory_cache_size=false,internal_item_max=0,internal_key_max=0,internal_key_truncate=true,internal_page_max=4KB,key_format=QIQu,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=64MB,log=(enabled=true),memory_page_image_max=0,memory_page_max=5MB,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=true,prefix_compression_min=4,split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,value_format=QQQBBu,version=(major=1,minor=1) file:wt.wt access_pattern_hint=none,allocation_size=512,app_metadata=,assert=(commit_timestamp=none,durable_timestamp=none,read_timestamp=none),block_allocation=best,block_compressor=,cache_resident=false,checksum="uncompressed",collator=,columns=,dictionary=203,encryption=(keyid=,name=),format=btree,huffman_key=,huffman_value=,id=2,ignore_in_memory_cache_size=false,internal_item_max=0,internal_key_max=24,internal_key_truncate=true,internal_page_max=1024,key_format=u,key_gap=1,leaf_item_max=0,leaf_key_max=1438,leaf_page_max=65536,leaf_value_max=0,log=(enabled=true),memory_page_image_max=0,memory_page_max=8388608,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=false,prefix_compression_min=1,split_deepen_min_child=0,split_deepen_per_child=0,split_pct=96,value_format=u,version=(major=1,minor=1),checkpoint=(mine.2=(addr="01be82e49300b726c01284e45ac68481c00888e4a20f7def808080e4033efbc0e401a259c0",order=3,time=1585239766,size=27426304,start_durable_ts=159538,oldest_start_ts=0,oldest_start_txn=0,stop_durable_ts=159532,newest_stop_ts=-1,newest_stop_txn=-11,write_gen=125061),WiredTigerCheckpoint.5=(addr="01c56382e40310b4a7c56584e4b657a835c56d8ee4d84f2cf4c56984e468f8f01ce4042713c0e4017a4bc0",order=5,time=1585239796,size=24801280,start_durable_ts=336282,oldest_start_ts=0,oldest_start_txn=0,stop_durable_ts=336184,newest_stop_ts=-1,newest_stop_txn=-11,write_gen=240975)),checkpoint_backup_info=,checkpoint_lsn=(61,384) table:wt app_metadata=,colgroups=,collator=,columns=,key_format=u,value_format=u So it does appear the basic mechanism is working. milkie commented on Thu, 26 Mar 2020 15:07:06 +0000: I think it's the latter: WT does not do all the right things, yet isn't returning an error. sue.loverso commented on Thu, 26 Mar 2020 14:34:50 +0000: WiredTiger definitely realises that we're in a downgrade situation and attempts to downgrade the log file versions (the version numbers it thinks it is downgrading to and from look correct). I'm assuming that something in that logging machinery is interacting badly with the new durable history changes and we're not doing this successfully. So are you saying that the reconfigure is returning an error? Or that WiredTiger does not do (all) the right things yet still returns success? I will take a look at this today. It seems like it should be easy to reproduce with just WiredTiger calls/tests and then trying to start up on an earlier branch. I will let you know what I find. alex.cameron commented on Thu, 26 Mar 2020 08:48:57 +0000: vamsi.krishna alexander.gorrod daniel.gottlieb Just checking in where I'm at with this. The scenario that is hitting most of the issues here is downgrading from 4.4 to 4.2. I can confirm that MongoD 4.4 reconfigures to compatibility_version=3.3 when shutting down with 4.2 FCV as expected. However, when we restart with MongoD 4.2, WiredTiger detects log files with a version of 5 (which corresponds to 4.4's compatibility_version=10.0) and errors out. I've mostly been testing with: buildscripts/resmoke.py --suites=multiversion --storageEngine=wiredTiger --jobs=4 --repeat=1 --shuffleMode=on --storageEngineCacheSizeGB=1 jstests/multiVersion/3_upgrade_replset.js I've added some logging to trace what is happening when shutting down with 4.2 FCV. The reconfigure call appears to do the right thing: we realise that we're downgrading log versions and go all the way down to __logmgr_force_archive and appear to do work to force a checkpoint, write a new log file with version number 4 and blow away any old log files. So WiredTiger definitely realises that we're in a downgrade situation and attempts to downgrade the log file versions (the version numbers it thinks it is downgrading to and from look correct). I'm assuming that something in that logging machinery is interacting badly with the new durable history changes and we're not doing this successfully. I'm going to continue looking at this however, I don't anticipate that I'll find get to the bottom of this tonight since it will involve learning more about the logging subsystem. sue.loverso, if you get the chance could you please take a look at this? I suspect that you'll have a much better idea of what is going on. I also had a try of just running a MongoD instance manually but I didn't see any issues. I just started 4.4, executed: db.adminCommand({setFeatureCompatibilityVersion: "4.2"})} Shut it down, then successfully started up 4.2 with the same dbpath. I also hacked wt printlog to print the log version number and it also confirms that it is 4 after shutting down. ./wt -h ~/work/mongo/data/db -C "log=(compressor=snappy)" printlog | less