...
Hi, since I migrated a replica set to MongoDB 3.2.21, I can't add a member because the initial sync does not work. This looks like a bug that was supposedly corrected in 3.2.21, but apparently it was not. In the logs of the member I am trying to add, I see these messages:

2018-10-03T16:28:43.553+0200 I NETWORK [rsSync] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.50.22:27017]
2018-10-03T16:28:43.553+0200 I NETWORK [rsSync] DBClientCursor::init call() failed
2018-10-03T16:28:43.555+0200 E REPL [rsSync] 10276 DBClientBase::findN: transport error: sf06-b.essos.lan:27017 ns: local.oplog.rs query: { ts: { $gte: Timestamp 1538576857000|17 } }
2018-10-03T16:28:43.555+0200 E REPL [rsSync] initial sync attempt failed, 9 attempts remaining
2018-10-03T16:28:48.555+0200 I REPL [rsSync] initial sync pending
2018-10-03T16:28:48.556+0200 I REPL [ReplicationExecutor] syncing from: sf06-b.essos.lan:27017
2018-10-03T16:28:48.559+0200 I REPL [rsSync] initial sync ensuring correct oplog visibility. Starting at { ts: Timestamp 1538576928000|18, t: -1 }
2018-10-03T16:29:18.559+0200 I NETWORK [rsSync] Socket recv() timeout 192.168.50.22:27017
2018-10-03T16:29:18.559+0200 I NETWORK [rsSync] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.50.22:27017]
2018-10-03T16:29:18.559+0200 I NETWORK [rsSync] DBClientCursor::init call() failed
2018-10-03T16:29:18.560+0200 E REPL [rsSync] 10276 DBClientBase::findN: transport error: sf06-b.essos.lan:27017 ns: local.oplog.rs query: { ts: { $gte: Timestamp 1538576928000|18 } }
2018-10-03T16:29:18.560+0200 E REPL [rsSync] initial sync attempt failed, 8 attempts remaining
2018-10-03T16:29:23.561+0200 I REPL [rsSync] initial sync pending
2018-10-03T16:29:23.561+0200 I REPL [ReplicationExecutor] syncing from: sf06-b.essos.lan:27017
2018-10-03T16:29:23.566+0200 I REPL [rsSync] initial sync ensuring correct oplog visibility. Starting at { ts: Timestamp 1538576932000|7, t: -1 }
2018-10-03T16:29:53.566+0200 I NETWORK [rsSync] Socket recv() timeout 192.168.50.22:27017
2018-10-03T16:29:53.566+0200 I NETWORK [rsSync] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.50.22:27017]
2018-10-03T16:29:53.566+0200 I NETWORK [rsSync] DBClientCursor::init call() failed
2018-10-03T16:29:53.568+0200 E REPL [rsSync] 10276 DBClientBase::findN: transport error: sf06-b.essos.lan:27017 ns: local.oplog.rs query: { ts: { $gte: Timestamp 1538576932000|7 } }
2018-10-03T16:29:53.568+0200 E REPL [rsSync] initial sync attempt failed, 7 attempts remaining
2018-10-03T16:29:58.568+0200 I REPL [rsSync] initial sync pending
2018-10-03T16:29:58.568+0200 I REPL [ReplicationExecutor] syncing from: sf06-b.essos.lan:27017
2018-10-03T16:29:58.572+0200 I REPL [rsSync] initial sync ensuring correct oplog visibility. Starting at { ts: Timestamp 1538576998000|18, t: -1 }
2018-10-03T16:30:00.586+0200 E REPL [rsSync] 11602 operation was interrupted
2018-10-03T16:30:00.587+0200 E REPL [rsSync] initial sync attempt failed, 6 attempts remaining
2018-10-03T16:30:02.420+0200 I NETWORK [conn14] end connection 192.168.50.22:57160 (19 connections now open)
2018-10-03T16:30:02.601+0200 I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to sf06-b.essos.lan:27017
2018-10-03T16:30:02.601+0200 I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2018-10-03T16:30:02.602+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to sf06-b.essos.lan:27017
2018-10-03T16:30:02.603+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to sf06-b.essos.lan:27017, took 2ms (1 connections now open to sf06-b.essos.lan:27017)
2018-10-03T16:30:02.604+0200 I REPL [ReplicationExecutor] Member sf06-b.essos.lan:27017 is now in state SECONDARY
2018-10-03T16:30:04.424+0200 I REPL [ReplicationExecutor] replSetElect voting yea for sf06-a.essos.lan:27017 (4)
2018-10-03T16:30:04.603+0200 I REPL [ReplicationExecutor] Member sf06-a.essos.lan:27017 is now in state PRIMARY
2018-10-03T16:30:05.587+0200 I REPL [rsSync] initial sync pending
2018-10-03T16:30:05.587+0200 I REPL [ReplicationExecutor] syncing from: sf06-b.essos.lan:27017
2018-10-03T16:30:05.592+0200 I REPL [rsSync] initial sync ensuring correct oplog visibility. Starting at { ts: Timestamp 1538576999000|19, t: -1 }
2018-10-03T16:30:08.720+0200 I NETWORK [initandlisten] connection accepted from 192.168.50.22:57178 #21 (20 connections now open)
2018-10-03T16:30:15.617+0200 I NETWORK [initandlisten] connection accepted from 192.168.50.23:47300 #22 (21 connections now open)
^C
[root@sf06-c ~]# tail /var/log/mongodb/mongod.log -F
2018-10-03T16:30:40.745+0200 I NETWORK [initandlisten] connection accepted from 192.168.50.185:45334 #24 (23 connections now open)
2018-10-03T16:31:06.007+0200 I NETWORK [initandlisten] connection accepted from 127.0.0.1:43886 #25 (24 connections now open)
2018-10-03T16:31:10.599+0200 I NETWORK [rsSync] Socket recv() timeout 192.168.50.22:27017
2018-10-03T16:31:10.599+0200 I NETWORK [rsSync] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.50.22:27017]
2018-10-03T16:31:10.599+0200 I NETWORK [rsSync] DBClientCursor::init call() failed
2018-10-03T16:31:10.600+0200 E REPL [rsSync] 10276 DBClientBase::findN: transport error: sf06-b.essos.lan:27017 ns: local.oplog.rs query: { ts: { $gte: Timestamp 1538576999000|19 } }
2018-10-03T16:31:10.600+0200 E REPL [rsSync] initial sync attempt failed, 4 attempts remaining
2018-10-03T16:31:15.600+0200 I REPL [rsSync] initial sync pending
2018-10-03T16:31:15.600+0200 I REPL [ReplicationExecutor] syncing from: sf06-a.essos.lan:27017 by request
2018-10-03T16:31:15.604+0200 I REPL [rsSync] initial sync ensuring correct oplog visibility. Starting at { ts: Timestamp 1538577047000|79, t: -1 }
2018-10-03T16:31:45.604+0200 I NETWORK [rsSync] Socket recv() timeout 192.168.50.23:27017
2018-10-03T16:31:45.604+0200 I NETWORK [rsSync] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.50.23:27017]
2018-10-03T16:31:45.604+0200 I NETWORK [rsSync] DBClientCursor::init call() failed
2018-10-03T16:31:45.606+0200 E REPL [rsSync] 10276 DBClientBase::findN: transport error: sf06-a.essos.lan:27017 ns: local.oplog.rs query: { ts: { $gte: Timestamp 1538577047000|79 } }
2018-10-03T16:31:45.606+0200 E REPL [rsSync] initial sync attempt failed, 3 attempts remaining
2018-10-03T16:31:50.606+0200 I REPL [rsSync] initial sync pending
2018-10-03T16:31:50.607+0200 I REPL [ReplicationExecutor] syncing from: sf06-b.essos.lan:27017
2018-10-03T16:31:50.712+0200 I REPL [rsSync] initial sync ensuring correct oplog visibility. Starting at { ts: Timestamp 1538577110000|31, t: -1 }
2018-10-03T16:32:20.712+0200 I NETWORK [rsSync] Socket recv() timeout 192.168.50.22:27017
2018-10-03T16:32:20.712+0200 I NETWORK [rsSync] SocketException: remote: (NONE):0 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.50.22:27017]
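One detail visible in the excerpt above: every failed attempt stalls for exactly 30 seconds between "initial sync ensuring correct oplog visibility" and the Socket recv() timeout (e.g. 16:28:48.559 to 16:29:18.559, again 16:29:23.566 to 16:29:53.566), which suggests a fixed socket receive timeout is being hit every time, rather than random packet loss. The gap can be checked directly from the pasted lines; this is a hypothetical sketch for inspecting the log, not MongoDB tooling:

```python
from datetime import datetime

def parse_ts(line: str) -> datetime:
    """Extract the leading ISO-8601 timestamp from a mongod 3.2 log line."""
    return datetime.strptime(line.split()[0], "%Y-%m-%dT%H:%M:%S.%f%z")

# Two consecutive entries from one sync attempt in the excerpt above.
start = parse_ts("2018-10-03T16:28:48.559+0200 I REPL [rsSync] initial sync ensuring correct oplog visibility.")
fail = parse_ts("2018-10-03T16:29:18.559+0200 I NETWORK [rsSync] Socket recv() timeout 192.168.50.22:27017")

gap = (fail - start).total_seconds()
print(gap)  # 30.0 — an exact, repeated interval, i.e. a configured timeout
```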
danny11 commented on Thu, 11 Oct 2018 10:58:53 +0000:
Hi Guillaume LAPOUGE, I had a very similar (almost identical) issue, and this is a bug for sure. I downgraded MongoDB to version 3.2.19 (I chose that version at random; maybe 3.2.20 would also work in your case) and the "C" instance started replicating without any problems. Interestingly, in my situation/configuration the "A" and "B" instances still run 3.2.21 and only the problematic "C" got 3.2.19. My commands:
$ yum erase $(rpm -qa | grep mongodb)
$ yum install mongodb-enterprise-3.2.19
Tip: restore your "mongod.conf" file after the installation. Good luck! Regards, Danny

glapouge commented on Tue, 9 Oct 2018 07:50:49 +0000:
Hi, I have uploaded log files for both nodes. In the sf06-b log file you can see that the sf06-c node isn't online yet; it is just at the point where the mongod service starts on sf06-c.

nick.brewer commented on Mon, 8 Oct 2018 16:43:33 +0000:
glapouge Can you provide the corresponding logs as well? You can upload them to our secure portal if you'd prefer not to make them publicly accessible. -Nick

glapouge commented on Mon, 8 Oct 2018 14:44:31 +0000:
The diagnostic data files are posted.

nick.brewer commented on Mon, 8 Oct 2018 13:52:11 +0000:
glapouge Thanks. To determine the cause of the slowness, we'll need a complete set of logs from both of the mongods while the failed initial sync is occurring, as well as archives (tar or zip) of the dbpath/diagnostic.data directory on each of the nodes. -Nick

glapouge commented on Mon, 8 Oct 2018 10:09:49 +0000:
When testing with mtr between nodes B and C, I did not detect any network performance issue.
HOST: sf06-b.essos.lan      Loss%  Snt  Last  Avg  Best  Wrst  StDev
  1.|-- sf06-c.essos.lan     0.0%   10   0.3   0.2   0.2   0.3    0.0
HOST: sf06-c.essos.lan      Loss%  Snt  Last  Avg  Best  Wrst  StDev
  1.|-- sf06-b.essos.lan     0.0%   10   0.2   0.2   0.2   0.3    0.0

nick.brewer commented on Fri, 5 Oct 2018 17:17:23 +0000:
glapouge The connection is initially successful, but it appears to be quite slow:
2018-10-05T10:03:11.623+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to sf06-c.essos.lan:27017, took 2023ms (1 connections now open to sf06-c.essos.lan:27017)
Given that these machines appear to be on the same local network, a connection time of 2023ms seems unusual. You can use a tool like mtr to determine the average response time from the affected node to the primary and secondary. -Nick

glapouge commented on Fri, 5 Oct 2018 08:11:52 +0000:
I also tried changing the sync source in the replica set from B (secondary) to A (primary); I get the same issue.

glapouge commented on Fri, 5 Oct 2018 08:06:48 +0000:
TCP port 27017 is completely open. We already have the A and B nodes in this replica set, and replication works perfectly between these nodes. Node C needs to perform its initial sync. The first log excerpt in the bug report is from node C. Node C tries to synchronise with node B.
This is an extract from node B's log:
2018-10-05T10:03:09.600+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to sf06-c.essos.lan:27017
2018-10-05T10:03:11.043+0200 I SHARDING [LockPinger] cluster sgridfscfg01.essos.lan:27019,sgridfscfg02.essos.lan:27019,sgridfscfg03.essos.lan:27019 pinged successfully at 2018-10-05T10:03:10.796+0200 by distributed lock pinger 'sgridfscfg01.essos.lan:27019,sgridfscfg02.essos.lan:27019,sgridfscfg03.essos.lan:27019/sf06-b.essos.lan:27017:1538576543:738138034', sleeping for 30000ms
2018-10-05T10:03:11.623+0200 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to sf06-c.essos.lan:27017, took 2023ms (1 connections now open to sf06-c.essos.lan:27017)
2018-10-05T10:03:11.623+0200 I REPL [ReplicationExecutor] Member sf06-c.essos.lan:27017 is now in state STARTUP
2018-10-05T10:03:11.625+0200 I NETWORK [initandlisten] connection accepted from 192.168.50.65:60814 #730 (24 connections now open)
2018-10-05T10:03:11.634+0200 I NETWORK [ReplicaSetMonitorWatcher] Successfully connected to sf06-c.essos.lan:27017 (1 connections now open to sf06-c.essos.lan:27017 with a 5 second timeout)
2018-10-05T10:03:11.635+0200 I NETWORK [initandlisten] connection accepted from 192.168.50.65:60818 #731 (25 connections now open)
2018-10-05T10:03:11.636+0200 I NETWORK [conn731] end connection 192.168.50.65:60818 (24 connections now open)
2018-10-05T10:03:11.644+0200 I NETWORK [initandlisten] connection accepted from 192.168.50.65:60820 #732 (25 connections now open)
2018-10-05T10:03:11.653+0200 I NETWORK [conn730] end connection 192.168.50.65:60814 (24 connections now open)
2018-10-05T10:03:13.624+0200 I REPL [ReplicationExecutor] Member sf06-c.essos.lan:27017 is now in state STARTUP2
2018-10-05T10:03:13.744+0200 I NETWORK [initandlisten] connection accepted from 192.168.50.65:60822 #733 (25 connections now open)
2018-10-05T10:03:41.325+0200 I SHARDING [LockPinger] cluster sgridfscfg01.essos.lan:27019,sgridfscfg02.essos.lan:27019,sgridfscfg03.essos.lan:27019 pinged successfully at 2018-10-05T10:03:41.043+0200 by distributed lock pinger 'sgridfscfg01.essos.lan:27019,sgridfscfg02.essos.lan:27019,sgridfscfg03.essos.lan:27019/sf06-b.essos.lan:27017:1538576543:738138034', sleeping for 30000ms
2018-10-05T10:03:49.107+0200 I NETWORK [initandlisten] connection accepted from 192.168.50.65:60824 #734 (26 connections now open)
2018-10-05T10:04:11.606+0200 I SHARDING [LockPinger] cluster sgridfscfg01.essos.lan:27019,sgridfscfg02.essos.lan:27019,sgridfscfg03.essos.lan:27019 pinged successfully at 2018-10-05T10:04:11.325+0200 by distributed lock pinger 'sgridfscfg01.essos.lan:27019,sgridfscfg02.essos.lan:27019,sgridfscfg03.essos.lan:27019/sf06-b.essos.lan:27017:1538576543:738138034', sleeping for 30000ms
2018-10-05T10:04:24.114+0200 I NETWORK [initandlisten] connection accepted from 192.168.50.65:60826 #735 (27 connections now open)
2018-10-05T10:04:41.886+0200 I SHARDING [LockPinger] cluster sgridfscfg01.essos.lan:27019,sgridfscfg02.essos.lan:27019,sgridfscfg03.essos.lan:27019 pinged successfully at 2018-10-05T10:04:41.606+0200 by distributed lock pinger 'sgridfscfg01.essos.lan:27019,sgridfscfg02.essos.lan:27019,sgridfscfg03.essos.lan:27019/sf06-b.essos.lan:27017:1538576543:738138034', sleeping for 30000ms
Node C was correctly detected.

nick.brewer commented on Thu, 4 Oct 2018 21:17:55 +0000:
glapouge This is often the result of a connectivity issue - have you tested connectivity between the primary and this node? It would be useful to see the mongod logs from both the primary and the node you were attempting to add. If you'd prefer to keep this information private, we can generate a secure portal for you to upload it to. -Nick
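The connectivity checks discussed in this thread (mtr typically needs raw-socket privileges and measures per-hop ICMP; the 2023ms figure in the logs is a TCP connection time) can also be approximated with a plain TCP connect to port 27017, which is closer to what the replication layer actually does. A minimal sketch; `connect_ms` is a hypothetical helper, not part of any MongoDB tooling, and the hostnames in the comment are the ones from this thread:

```python
import socket
import time

def connect_ms(host: str, port: int, timeout: float = 5.0) -> float:
    """Time a single TCP connect to host:port and return the latency in ms."""
    t0 = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        return (time.monotonic() - t0) * 1000.0

# Hypothetical usage against the replica set members from this thread:
# for h in ("sf06-a.essos.lan", "sf06-b.essos.lan", "sf06-c.essos.lan"):
#     print(h, f"{connect_ms(h, 27017):.1f} ms")
```

On a healthy LAN this should report single-digit milliseconds; a result anywhere near the 2023ms seen in the node B log would confirm the slowness independently of mongod.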