...
cat /var/log/mongodb/mongod.log | grep "Successfully connected to mongo1.production.internal:27017" | sort | uniq 2020-10-16T00:06:57.310+0000 I NETWORK [LogicalSessionCacheRefresh] Successfully connected to mongo1.production.internal:27017 (527 connections now open to mongo1.production.internal:27017 with a 0 second timeout) 2020-10-16T04:06:57.422+0000 I NETWORK [LogicalSessionCacheRefresh] Successfully connected to mongo1.production.internal:27017 (528 connections now open to mongo1.production.internal:27017 with a 0 second timeout) 2020-10-16T08:06:56.913+0000 I NETWORK [LogicalSessionCacheRefresh] Successfully connected to mongo1.production.internal:27017 (529 connections now open to mongo1.production.internal:27017 with a 0 second timeout) mongod --version db version v4.0.5 git version: 3739429dd92b92d1b0ab120911a23d50bf03c412 OpenSSL version: OpenSSL 1.0.2n 7 Dec 2017 allocator: tcmalloc modules: none build environment: distmod: ubuntu1404 distarch: x86_64 target_arch: x86_64 mongo --version git version: 3739429dd92b92d1b0ab120911a23d50bf03c412 OpenSSL version: OpenSSL 1.0.2n 7 Dec 2017 allocator: tcmalloc modules: none build environment: distmod: ubuntu1404 distarch: x86_64 target_arch: x86_64
JIRAUSER1257302 commented on Tue, 27 Oct 2020 06:03:40 +0000: Hi @edwin.zhou@mongodb.com Thanks for sharing the analysis, and guiding me in the right direction Will look into it Regards, Prakhar JIRAUSER1257066 commented on Fri, 23 Oct 2020 16:46:08 +0000: Hi pmakhija@ask-fast.com, Thank you for sharing your diagnostics with us. After some investigating, we suspect that an external process is increasing your system's reported CPU load. Here is an analysis of the timeline as evidence of this: Point A System memory and CPU usage begins to increase at exactly 00:00 indicated by system memory anon. The mongod does not perform scheduled tasks like this. As system memory increases, memory allocated by the mongod stays unchanged as indicated by current_allocated_bytes. There is also an increase in CPU iowait, which isn't exactly CPU usage. Instead, this means that the CPU is needing to wait for i/o, as confirmed by the disk usage. However, looking at block-manager read/write, the mongod isn't doing more i/o. Point B Resident memory used by the mongod begins to decrease but allocated memory remains unchanged. We suspect an external process has allocated memory which caused paging and memory used by the mongod is forced out. Point C Memory used by the mongod returns to normal and it appears the external process has completed. The mongod does not perform any tasks that occur at an exact time, i.e. exactly at midnight and every 4 hours. Since this appears to be a scheduled task, you should check to see if there are any running cron jobs that may be affecting your system. You can also run ps or top during the times that CPU iowait is high to identify running processes that increase load. Best, Edwin JIRAUSER1257302 commented on Tue, 20 Oct 2020 14:16:05 +0000: Hi @Bruce Lucas @Eric Sedor My bad, the link was not public then, it is now diagnostic.data.tar.gz _____________________________________________ Regarding TimeZones I fall in IST (GMT+5:30) The SNMP alert email specifies CEST (GMT+2:00) The mongod server is residing in Europe (Ireland) - Amazon Region (eu-west-1) The mongod server replica node logs are being generated in GMT _____________________________________________ SNMP Email Alerts Date/Time: Tue Oct 20 14:09:03 CEST 2020 CRITICAL - CPU Load (5 min average): 128 % Date/Time: Tue Oct 20 10:10:51 CEST 2020 CRITICAL - CPU Load (5 min average): 162 % Date/Time: Tue Oct 20 02:13:28 CEST 2020 CRITICAL - CPU Load (5 min average): 138 % Date/Time: Mon Oct 19 22:10:16 CEST 2020 CRITICAL - CPU Load (5 min average): 154 % Date/Time: Mon Oct 19 18:12:04 CEST 2020 CRITICAL - CPU Load (5 min average): 182 % Date/Time: Mon Oct 19 14:08:52 CEST 2020 CRITICAL - CPU Load (5 min average): 130 % Date/Time: Mon Oct 19 10:10:42 CEST 2020 CRITICAL - CPU Load (5 min average): 176 % Date/Time: Mon Oct 19 06:12:26 CEST 2020 CRITICAL - CPU Load (5 min average): 170 % Date/Time: Mon Oct 19 02:14:10 CEST 2020 CRITICAL - CPU Load (5 min average): 145 % Date/Time: Sun Oct 18 22:10:59 CEST 2020 CRITICAL - CPU Load (5 min average): 180 % Date/Time: Sun Oct 18 18:12:44 CEST 2020 CRITICAL - CPU Load (5 min average): 208 % Date/Time: Sun Oct 18 14:09:30 CEST 2020 CRITICAL - CPU Load (5 min average): 122 % Date/Time: Sun Oct 18 10:11:19 CEST 2020 CRITICAL - CPU Load (5 min average): 206 % Date/Time: Sun Oct 18 06:13:05 CEST 2020 CRITICAL - CPU Load (5 min average): 164 % _____________________________________________ Server Mongo Logs mongo-open-connections.out mongod-2020-10-20.log mongod-2020-10-19.log bruce.lucas@10gen.com commented on Tue, 20 Oct 2020 12:57:43 +0000: Hi pmakhija@ask-fast.com, thanks for collecting the data. Unfortunately the link you provided produces an access denied error when we try to use it. Can you please attach the data to this ticket? If it is too large to attach, you can upload it to this secure private upload portal. Also, can you please provide a timeline of your issue: which node is seeing the problem at what specific time (including timezone)? JIRAUSER1257302 commented on Tue, 20 Oct 2020 10:38:17 +0000: Hi @eric.sedor , please find the tarball attached diagnostic.data.tar.gz Feel free to let me know if anything else is needed Thanks, Prakhar eric.sedor commented on Mon, 19 Oct 2020 23:08:51 +0000: Hi pmakhija@ask-fast.com, can you please provide more information about the issue you are reporting? High CPU use could have many causes. As well, would you please archive (tar or zip) the $dbpath/diagnostic.data directory (the contents are described here) and attach it to this ticket? Thanks, Eric