...
Problem Description

We have found that after an unexpected MongoDB service shutdown, the database becomes corrupted, and it happens randomly.

MongoDB version: 4.0.4 (2008R2Plus)
OS: Windows 10

After logging data for a couple of days we are seeing data corruption issues, which result in data loss for us.

Error in log file:

2021-05-11T12:27:48.920+0530 I CONTROL [main] Trying to start Windows service 'MongoDB'
2021-05-11T12:27:49.577+0530 I CONTROL [initandlisten] MongoDB starting : pid=4064 port=27017 dbpath=D:\ItantaDB 64-bit host=ESOS-2
2021-05-11T12:27:49.578+0530 I CONTROL [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
2021-05-11T12:27:49.578+0530 I CONTROL [initandlisten] db version v4.0.4
2021-05-11T12:27:49.578+0530 I CONTROL [initandlisten] git version: f288a3bdf201007f3693c58e140056adf8b04839
2021-05-11T12:27:49.578+0530 I CONTROL [initandlisten] allocator: tcmalloc
2021-05-11T12:27:49.578+0530 I CONTROL [initandlisten] modules: none
2021-05-11T12:27:49.578+0530 I CONTROL [initandlisten] build environment:
2021-05-11T12:27:51.394+0530 I CONTROL [initandlisten]     distmod: 2008plus-ssl
2021-05-11T12:27:51.394+0530 I CONTROL [initandlisten]     distarch: x86_64
2021-05-11T12:27:51.394+0530 I CONTROL [initandlisten]     target_arch: x86_64
2021-05-11T12:27:51.394+0530 I CONTROL [initandlisten] options: { net: { bindIpAll: true }, security: { authorization: "enabled" }, service: true, storage: { dbPath: "D:\ItantaDB", wiredTiger: { engineConfig: { cacheSizeGB: 1.0 } } }, systemLog: { destination: "file", path: "C:\ProgramData\ItantaAnalytics\mongodb.log" } }
2021-05-11T12:27:51.765+0530 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=1024M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),statistics_log=(wait=0),verbose=(recovery_progress),
2021-05-11T12:27:54.334+0530 I STORAGE [initandlisten] WiredTiger message [1620716274:334351][4064:140735794009424], txn-recover: Main recovery loop: starting at 39/36112640 to 40/256
2021-05-11T12:27:54.436+0530 I STORAGE [initandlisten] WiredTiger message [1620716274:435293][4064:140735794009424], txn-recover: Recovering log 39 through 40
2021-05-11T12:27:54.644+0530 I STORAGE [initandlisten] WiredTiger message [1620716274:644163][4064:140735794009424], txn-recover: Recovering log 40 through 40
2021-05-11T12:27:54.706+0530 I STORAGE [initandlisten] WiredTiger message [1620716274:706126][4064:140735794009424], txn-recover: Set global recovery timestamp: 0
2021-05-11T12:27:55.782+0530 I RECOVERY [initandlisten] WiredTiger recoveryTimestamp. Ts: Timestamp(0, 0)
2021-05-11T12:27:56.769+0530 E STORAGE [initandlisten] WiredTiger error (2) [1620716276:740127][4064:140735794009424], file:collection-107--7568313936654431168.wt, WT_SESSION.open_cursor: __win_open_file, 565: D:\ItantaDB\collection-107--7568313936654431168.wt: handle-open: CreateFileW: The system cannot find the file specified. : No such file or directory Raw: [1620716276:740127][4064:140735794009424], file:collection-107--7568313936654431168.wt, WT_SESSION.open_cursor: __win_open_file, 565: D:\ItantaDB\collection-107--7568313936654431168.wt: handle-open: CreateFileW: The system cannot find the file specified. : No such file or directory
2021-05-11T12:27:56.769+0530 E STORAGE [initandlisten] Failed to open a WiredTiger cursor: table:collection-107--7568313936654431168
2021-05-11T12:27:56.769+0530 E STORAGE [initandlisten] This may be due to data corruption. Please read the documentation for starting MongoDB with --repair here: http://dochub.mongodb.org/core/repair
2021-05-11T12:27:56.769+0530 F - [initandlisten] Fatal Assertion 50882 at src\mongo\db\storage\wiredtiger\wiredtiger_session_cache.cpp 137
2021-05-11T12:27:56.769+0530 F - [initandlisten] ***aborting after fassert() failure

Additional Notes

We tried to repair the data using the --repair command. After the repair, we are able to access MongoDB again.

We have uploaded diagnostic data on Google Drive: https://drive.google.com/file/d/1aMnqwqXSAUhiGbvALLboIXJFr3-jXt_-/view?usp=sharing

Additional MongoDB service logs are added in the attachment section.

We would like to know why this issue might occur, and whether there is anything we are missing in our configuration.

Regards,
Nandkishor
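Since this kind of corruption only surfaces when mongod next starts up, it can help to watch the service log for the storage-error and fatal-assertion patterns shown above and alert on them early. Below is a minimal sketch of such a check (not part of this ticket; the message patterns are taken from the log excerpt above, and the function name is our own):

```python
import re

# Patterns seen in the log excerpt above: WiredTiger open/cursor failures
# and fatal assertions both indicate possible data-file corruption.
CORRUPTION_PATTERNS = [
    re.compile(r"E STORAGE .*No such file or directory"),
    re.compile(r"E STORAGE .*Failed to open a WiredTiger cursor"),
    re.compile(r"F .*Fatal Assertion \d+"),
]

def find_corruption_events(log_lines):
    """Return the log lines that look like corruption or fatal-assertion events."""
    return [line for line in log_lines
            if any(p.search(line) for p in CORRUPTION_PATTERNS)]
```

A scheduled task could run this over the configured systemLog file (C:\ProgramData\ItantaAnalytics\mongodb.log in this setup) and raise an alert, so the team learns about an incident before the next manual site visit.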
eric.sedor commented on Wed, 28 Jul 2021 22:29:47 +0000:
Hi ndchavan4289@gmail.com, I'm going to close this ticket out for now, but please reach out again if the issue recurs and chkdsk reveals nothing. Thank you!

JIRAUSER1260335 commented on Tue, 6 Jul 2021 06:01:00 +0000:
Hi Eric, so far we have not faced a similar issue again on any site; I will get back to you once I face the issue again. Also, I am not sure how effective Procmon will be in this case: we usually don't have access to the sites at all times, so we won't be able to run Procmon at the moment the issue occurs, and once the event has occurred, Procmon won't be able to give us details on the process that may be causing it. Also, if the issue is occurring because of physical disk corruption, then how does the database work again after repairing? We will still check with chkdsk if the issue occurs again.

eric.sedor commented on Thu, 1 Jul 2021 17:51:47 +0000:
Hi ndchavan4289@gmail.com, have you had any success ruling out file access issues via Procmon, or identifying disk issues via chkdsk?

eric.sedor commented on Thu, 10 Jun 2021 17:10:28 +0000:
Hi ndchavan4289@gmail.com,
It's not precisely clear which deployments are experiencing which symptoms, but it's possible that an attempt to --repair would result in the loss of user metadata if corruption affected a system.users collection.
For the latest logs and dbpath you uploaded, the errors continue to suggest an issue with physical disk corruption, or with other processes interfering with files in the MongoDB dbPath while it is running:

2021-06-09T10:07:27.319+0530 E STORAGE [WTCheckpointThread] WiredTiger error (-31802) [1623213447:318829][10832:140729928406352], file:WiredTiger.wt, WT_SESSION.checkpoint: __win_fs_rename, 105: C:\ProgramData\ItantaDB\WiredTiger.turtle.set to C:\ProgramData\ItantaDB\WiredTiger.turtle: file-rename: MoveFileExW: The disk structure is corrupted and unreadable. : WT_ERROR: non-specific WiredTiger error Raw: [1623213447:318829][10832:140729928406352], file:WiredTiger.wt, WT_SESSION.checkpoint: __win_fs_rename, 105: C:\ProgramData\ItantaDB\WiredTiger.turtle.set to C:\ProgramData\ItantaDB\WiredTiger.turtle: file-rename: MoveFileExW: The disk structure is corrupted and unreadable. : WT_ERROR: non-specific WiredTiger error
2021-06-09T10:07:27.319+0530 E STORAGE [WTCheckpointThread] WiredTiger error (-31802) [1623213447:319863][10832:140729928406352], file:WiredTiger.wt, WT_SESSION.checkpoint: __wt_turtle_update, 397: WiredTiger.turtle: fatal turtle file update error: WT_ERROR: non-specific WiredTiger error Raw: [1623213447:319863][10832:140729928406352], file:WiredTiger.wt, WT_SESSION.checkpoint: __wt_turtle_update, 397: WiredTiger.turtle: fatal turtle file update error: WT_ERROR: non-specific WiredTiger error
2021-06-09T10:07:27.320+0530 E STORAGE [WTCheckpointThread] WiredTiger error (-31804) [1623213447:319863][10832:140729928406352], file:WiredTiger.wt, WT_SESSION.checkpoint: __wt_panic, 523: the process must exit and restart: WT_PANIC: WiredTiger library panic Raw: [1623213447:319863][10832:140729928406352], file:WiredTiger.wt, WT_SESSION.checkpoint: __wt_panic, 523: the process must exit and restart: WT_PANIC: WiredTiger library panic
2021-06-09T10:07:27.347+0530 F - [WTCheckpointThread] Fatal Assertion 50853 at src\mongo\db\storage\wiredtiger\wiredtiger_util.cpp 409
2021-06-09T10:07:27.347+0530 F - [WTCheckpointThread]

In at least one case in the past, one of our users had success using Procmon to identify a process that was causing similar (but not the same) errors. Do you happen to have run chkdsk after the 2021-06-09T10:07:27.319+0530 incident on ItantaDB?

JIRAUSER1260335 commented on Wed, 9 Jun 2021 07:36:49 +0000:
Hi Eric, we have not encountered the same issue again on the same site. But we have encountered a similar kind of issue on another site, hence I am attaching the requested files from this new site.
We have uploaded the archived db folder and log files to the given location for troubleshooting purposes.
Additionally, we have faced one more issue in the recent scenario: we run the MongoDB service in authentication mode, but when we repaired the data, we found that the user we use for authentication was no longer present in our db, so we had to create it again manually.
Please let me know if you need any more information from our side.
-Nandkishor

eric.sedor commented on Fri, 4 Jun 2021 22:55:08 +0000:
Hi ndchavan4289@gmail.com,
Unfortunately we aren't able to readily identify a specific process that could be responsible. Would you be willing to provide the data files for this cluster for examination? This will help us obtain more information about what state MongoDB expects the data files to be in at the time of the error. The process would be:
1) After an incident occurs, ensure the server is stopped and archive a copy of the dbpath (including the log files).
2) Upload that archive to this location: https://mongodb.app.box.com/f/7c2667be0f4248188fab332f2a27659f. Files shared to this location are private and will be accessible only by MongoDB employees involved with the investigation.
Eric

JIRAUSER1260335 commented on Mon, 31 May 2021 08:50:58 +0000:
Hi Eric, thanks for your response. As per your comment, a virus scanner might be accessing the files in the dbpath. However, that's not the case: the antivirus might be scanning the files in the dbpath as usual, but it should not affect or corrupt our db, as it is not writing or editing the files. Can you share the name of the process that is accessing the files in the dbpath (with a screenshot), so that we can check for that process?

eric.sedor commented on Thu, 27 May 2021 15:27:17 +0000:
Hi ndchavan4289@gmail.com,
I am responding here and closing SERVER-57183 as a duplicate. The logs you've provided suggest that another process on the machine, such as a virus scanner, is accessing files in the dbpath.
Can you confirm with certainty that another process is not accessing these files? You can also upgrade to MongoDB 4.0.7 to obtain the benefits of an improvement in WT-4528, but that improvement is not a guarantee against issues with virus scanners or other processes like them on Windows file systems.
Sincerely,
Eric
EDIT: Sorry for introducing confusion. WT-4528 only introduces retries when removing files, not when accessing them, as is the case here.
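As a footnote to the data-collection procedure Eric describes earlier in the thread (stop the server, archive a copy of the dbpath including log files, then upload the archive), the archiving step could be scripted so that on-site staff only have to run one command after an incident. This is only a sketch: it assumes the mongod service has already been stopped (e.g. via `net stop MongoDB` on Windows), and the function name and paths are illustrative, not part of the ticket:

```python
import shutil
from pathlib import Path

def archive_dbpath(dbpath, out_dir):
    """Zip a stopped server's dbpath (data files plus any logs inside it)
    into out_dir and return the path of the created archive.

    The mongod service must be stopped first so the data files are not
    being modified while they are read.
    """
    dbpath = Path(dbpath)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # shutil.make_archive appends the ".zip" extension itself.
    return shutil.make_archive(str(out_dir / dbpath.name), "zip", root_dir=dbpath)
```

For the deployment in this ticket one would call, for example, `archive_dbpath(r"D:\ItantaDB", r"D:\diagnostics")` and then upload the resulting zip to the Box link above.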