...
Server - AWS EC2 t2.small, CentOS Linux release 7.2.1511 (Core)
DB Version - MongoDB Community 3.2.6
Topology: Replica Set

When running the mongodump command on the secondary node, the database crashes about 10% of the time. I stressed the memory to reproduce the issue, and it looks like the crash is caused by low memory (though there is enough free memory when starting and during the dump).

The line where the dump fails:

2016-06-22T12:19:18.871+0300 Failed: error reading collection: EOF && Mux ending but selectCases still open 2

The last messages in mongod.log:

2016-06-22T12:19:13.162+0300 D STORAGE [conn10] WT begin_transaction
2016-06-22T12:19:13.187+0300 D QUERY [conn12] Running getMore, cursorid: 225913259574
2016-06-22T12:19:13.187+0300 D STORAGE [conn12] WT begin_transaction
2016-06-22T12:19:13.224+0300 D STORAGE [WTJournalFlusher] flushed journal
2016-06-22T12:19:13.324+0300 D STORAGE [WTJournalFlusher] flushed journal
2016-06-22T12:19:13.394+0300 D EXECUTOR [ReplicationExecutor] Scheduling remote request: RemoteCommand 115 -- target:production-app03:27017 db:admin cmd:{ replSetHeartbeat: "prod", configVersion: 3, from: "production-db02:27017", fromId: 1, term: 5 }
2016-06-22T12:19:13.394+0300 D ASIO [ReplicationExecutor] startCommand: RemoteCommand 115 -- target:production-app03:27017 db:admin expDate:2016-06-22T12:19:23.393+0300 cmd:{ replSetHeartbeat: "prod", configVersion: 3, from: "production-db02:27017", fromId: 1, term: 5 }
2016-06-22T12:19:13.394+0300 D ASIO [NetworkInterfaceASIO-Replication-0] Initiating asynchronous command: RemoteCommand 115 -- target:production-app03:27017 db:admin expDate:2016-06-22T12:19:23.393+0300 cmd:{ replSetHeartbeat: "prod", configVersion: 3, from: "production-db02:27017", fromId: 1, term: 5 }
2016-06-22T12:19:13.394+0300 D ASIO [NetworkInterfaceASIO-Replication-0] Starting asynchronous command 115 on host production-app03:27017
2016-06-22T12:19:13.396+0300 D EXECUTOR [NetworkInterfaceASIO-Replication-0] Received remote response: RemoteResponse -- cmd:{ ok: 1.0, state: 7, v: 3, hbmsg: "", set: "prod", term: 5, primaryId: 0, durableOpTime: { ts: Timestamp 1466525772000|2012, t: 5 }, opTime: { ts: Timestamp 1466525772000|2012, t: 5 } }
2016-06-22T12:19:13.396+0300 D ASIO [NetworkInterfaceASIO-Replication-0] Failed to time operation 115 out: Operation aborted.
2016-06-22T12:19:13.396+0300 D REPL [ReplicationExecutor] setUpValues: heartbeat response good for member _id:2, msg:
2016-06-22T12:19:13.396+0300 D REPL [ReplicationExecutor] Scheduling heartbeat to production-app03:27017 at 2016-06-22T09:19:18.396Z
2016-06-22T12:19:13.415+0300 D EXECUTOR [ReplicationExecutor] Scheduling remote request: RemoteCommand 117 -- target:production-db01:27017 db:admin cmd:{ replSetHeartbeat: "prod", configVersion: 3, from: "production-db02:27017", fromId: 1, term: 5 }
2016-06-22T12:19:13.415+0300 D ASIO [ReplicationExecutor] startCommand: RemoteCommand 117 -- target:production-db01:27017 db:admin expDate:2016-06-22T12:19:23.415+0300 cmd:{ replSetHeartbeat: "prod", configVersion: 3, from: "production-db02:27017", fromId: 1, term: 5 }
2016-06-22T12:19:13.415+0300 D ASIO [NetworkInterfaceASIO-Replication-0] Initiating asynchronous command: RemoteCommand 117 -- target:production-db01:27017 db:admin expDate:2016-06-22T12:19:23.415+0300 cmd:{ replSetHeartbeat: "prod", configVersion: 3, from: "production-db02:27017", fromId: 1, term: 5 }
2016-06-22T12:19:13.415+0300 D ASIO [NetworkInterfaceASIO-Replication-0] Starting asynchronous command 117 on host production-db01:27017
2016-06-22T12:19:13.417+0300 D EXECUTOR [NetworkInterfaceASIO-Replication-0] Received remote response: RemoteResponse -- cmd:{ ok: 1.0, electionTime: new Date(6293366302669012993), state: 1, v: 3, hbmsg: "", set: "prod", term: 5, primaryId: 0, durableOpTime: { ts: Timestamp 1466525772000|2012, t: 5 }, opTime: { ts: Timestamp 1466525772000|2012, t: 5 } }
2016-06-22T12:19:13.417+0300 D ASIO [NetworkInterfaceASIO-Replication-0] Failed to time operation 117 out: Operation aborted.
2016-06-22T12:19:13.417+0300 D REPL [ReplicationExecutor] Canceling election timeout callback at 2016-06-22T12:19:19.650+0300
2016-06-22T12:19:13.417+0300 D REPL [ReplicationExecutor] Scheduling election timeout callback at 2016-06-22T12:19:24.755+0300
2016-06-22T12:19:13.833+0300 D REPL [ReplicationExecutor] setUpValues: heartbeat response good for member _id:0, msg:
2016-06-22T12:19:13.833+0300 D REPL [ReplicationExecutor] Scheduling heartbeat to production-db01:27017 at 2016-06-22T09:19:18.417Z
2016-06-22T12:19:13.833+0300 D STORAGE [WTJournalFlusher] flushed journal
2016-06-22T12:19:13.833+0300 D STORAGE [conn13] WT rollback_transaction
2016-06-22T12:19:13.833+0300 D STORAGE [conn13] WT begin_transaction
2016-06-22T12:19:13.846+0300 D STORAGE [conn10] WT rollback_transaction
2016-06-22T12:19:13.846+0300 D STORAGE [conn10] WT begin_transaction
2016-06-22T12:19:13.846+0300 D STORAGE [conn12] WT rollback_transaction
2016-06-22T12:19:13.846+0300 D STORAGE [conn12] WT begin_transaction
2016-06-22T12:19:13.846+0300 D STORAGE [conn12] WT rollback_transaction
2016-06-22T12:19:13.846+0300 D STORAGE [conn12] WT begin_transaction
2016-06-22T12:19:13.859+0300 D STORAGE [conn12] WT rollback_transaction
2016-06-22T12:19:13.859+0300 D STORAGE [conn12] WT begin_transaction
2016-06-22T12:19:13.859+0300 D STORAGE [conn13] WT rollback_transaction
2016-06-22T12:19:13.859+0300 D STORAGE [conn13] WT begin_transaction
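The ticket does not include the exact mongodump invocation. A minimal sketch of dumping directly from the secondary named in the log above (the database name and output directory below are placeholders, not taken from the ticket):

# Dump taken directly from the secondary member shown in the log excerpt.
# The database name (mydb) and output path are placeholders.
mongodump --host production-db02:27017 --db mydb --out /backup/$(date +%F)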
mango commented on Tue, 4 Apr 2017 09:09:54 +0000:
I also have memory problems while dumping the database. But unlike Or Khafi, I am already using 3.4, though with the MMAPv1 engine:

db version v3.4.3
git version: f07437fb5a6cca07c10bafa78365456eb1d6d5e1
OpenSSL version: OpenSSL 1.0.2g 1 Mar 2016
allocator: tcmalloc
modules: none
build environment:
    distmod: ubuntu1604
    distarch: x86_64
    target_arch: x86_64

The VM has 3.9 GB RAM and mongo [...]. While dumping the database, no other database operations were running. The system has 2 GB of (nearly unused) swap. I uploaded "metrics.2017-04-04T07-56-28Z-00000" using your upload portal.

thomas.schubert commented on Mon, 13 Mar 2017 16:54:42 +0000:
Hi KafKafOwn,
Unfortunately, we have not yet been able to conclusively determine the root cause of this behavior from the diagnostic.data you provided. However, we have made significant improvements to MongoDB since 3.2.6 that may correct this issue. If this is still a concern for you, would you please upgrade to MongoDB 3.2.12? MongoDB 3.2.12 also includes a new parameter that will allow us to collect additional diagnostic.data to better understand what is happening. If you encounter this issue on MongoDB 3.2.12, would you please reproduce it with the following parameter set?

mongod --setParameter heapProfilingEnabled=true

Then, please upload the complete log files and the diagnostic.data to this new secure upload portal.
Thank you again for your help,
Thomas

thomas.schubert commented on Thu, 8 Dec 2016 18:25:14 +0000:
Hi Paolo,
I'd recommend upgrading to MongoDB 3.4 to take advantage of SERVER-23391.
Kind regards,
Thomas

paolo commented on Thu, 8 Dec 2016 04:25:21 +0000:
Hi guys, any update on this? I'm having the same issue on a DigitalOcean instance with SSD and 1 GB RAM, running:

db version v3.2.10
git version: 79d9b3ab5ce20f51c272b4411202710a082d0317
OpenSSL version: OpenSSL 1.0.1f 6 Jan 2014
allocator: tcmalloc
modules: none
build environment:
    distmod: ubuntu1404
    distarch: x86_64
    target_arch: x86_64

kafkafown commented on Wed, 22 Jun 2016 15:18:01 +0000:
Uploaded.

thomas.schubert commented on Wed, 22 Jun 2016 15:08:43 +0000:
Certainly, I've created this secure portal for you to use.
Kind regards,
Thomas

kafkafown commented on Wed, 22 Jun 2016 14:52:31 +0000:
Hey Thomas,
I know you are not collecting any secret data in your diagnostics, but can you please give me a private link to upload the data?
Thanks,
Or

thomas.schubert commented on Wed, 22 Jun 2016 14:30:36 +0000:
Hi KafKafOwn,
Thanks for the additional information. So we can continue to investigate, would you please archive (tar or zip) the $dbpath/diagnostic.data directory and attach it to this ticket? Also, would you please attach the output of top so we can see what other processes are running on your machine?
Thank you,
Thomas

kafkafown commented on Wed, 22 Jun 2016 13:12:34 +0000:

Out of memory: Kill process 16402 (mongod) score 765 or sacrifice child
Killed process 16402 (mongod) total-vm:2232172kB, anon-rss:1437888kB, file-rss:0kB

A few more details to help understand the Linux OOM killer's decision:
1) We have no swap (EC2 instance default), 2 GB of memory.
2) We are running only mongod and mongodump on the server.
3) /proc/sys/vm/overcommit_memory = 0.

ramon.fernandez commented on Wed, 22 Jun 2016 12:20:43 +0000:
KafKafOwn, can you check your system logs for the following string?

Out of memory: Kill process XXXXX (mongod)
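To confirm the OOM-killer behavior described in the last two comments, the kernel messages and memory settings can be checked directly on the secondary. A sketch, assuming the usual CentOS 7 log location (/var/log/messages); adjust the path for other distributions:

# Search for OOM-killer activity against mongod.
sudo grep "Out of memory: Kill process" /var/log/messages
sudo dmesg | grep -i "killed process"

# Inspect swap and the overcommit setting mentioned in the comments.
free -m
cat /proc/sys/vm/overcommit_memory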
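For the diagnostics requested in the comments above, a sketch of archiving diagnostic.data and restarting with heap profiling enabled. The dbPath (/var/lib/mongo) and config file path are the CentOS package defaults and are assumptions here; substitute your own storage.dbPath and configuration:

# Archive the diagnostic.data directory (assumes the default CentOS dbPath /var/lib/mongo).
tar -czvf diagnostic-data.tar.gz /var/lib/mongo/diagnostic.data

# On MongoDB 3.2.12+, restart with heap profiling enabled, as suggested above
# (the config file path is an assumption).
mongod --config /etc/mongod.conf --setParameter heapProfilingEnabled=true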
Run mongodump and then stress the memory of the OS.
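The ticket does not name the tool used to stress memory. One way to reproduce the pressure on a 2 GB instance with no swap is the stress utility (an assumption; the host, output path, and sizes below are placeholders):

# Start the dump from the secondary in the background.
mongodump --host production-db02:27017 --out /backup/test &

# While the dump runs, allocate ~1.5 GB to force memory pressure.
stress --vm 1 --vm-bytes 1536M --vm-keep --timeout 120s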