...
Issue #1: The following output is seen in service-console after running an upgrade on ECS from version 3.0.X or earlier to version 3.1 or later:

20180309 01:49:28.456: | | | PASS (21 min 29 sec)
20180309 01:49:28.462: | | PASS (21 min 29 sec)
20180309 01:49:28.463: | Run Keyword If
20180309 01:49:28.464: | | Node Service Upgrade
Initializing...
Executing Program: NODE_SERVICE_UPGRADE
|-Disable CallHome
| +-[0.0.0.0] SetCallHomeEnabled PASS (1/7, 1 sec)
|-Push Service Image To Registries
| |-Push Service Image to Head Registry
| | |-[169.254.1.1] LoadImage PASS (2/7, 1 sec)
| | +-[169.254.1.1] PushImage PASS (3/7)
| +-Push Service Image to Remote Registries
|-Upgrade Object On Specified Nodes
| +-Initiate Object Upgrade if Required
|   +-[0.0.0.0] UpdateApplicationOnNodes PASS (4/7, 1 sec)
|-Update Services Ownership To Lifecycle Manager on Specified Nodes
| +-Update Ownership For Object
|   +-[169.254.1.1] UpdateOwnership PASS (5/7)
|-Post-check Services Health
| +-Validate Object Service on Specified Nodes
|   +-[169.254.1.1] ServiceHealth PASS (6/7, 21 sec)
+-Enable CallHome
  +-[0.0.0.0] SetCallHomeEnabled PASS (7/7, 3 sec)
Elapsed time is 30 sec.
NODE_SERVICE_UPGRADE completed successfully
Collecting data from cluster
Information has been written to the
Information has been written to the
Executing /configure.sh --start action in object-main container which may take up to 600 seconds.
20180309 01:52:51.711: | | | PASS (3 min 23 sec)
20180309 01:52:51.720: | | PASS (3 min 23 sec)
20180309 01:52:51.722: | Run Keyword If
20180309 01:52:51.724: | | Update manifest file
[ERROR] On node 169.254.1.1, Lifecycle Jetty server is not up and running on port 9241!
20180309 01:58:45.068: | | | FAIL (5 min 53 sec)
20180309 01:58:45.071: | | FAIL (5 min 53 sec)
20180309 01:58:45.072: | FAIL (45 min 43 sec)
20180309 01:58:45.075: Service Console Teardown
20180309 01:58:46.973: | PASS (1 sec)
================================================================================
Status: FAIL
Time Elapsed: 45 min 56 sec
Debug log: /
HTML log: /
================================================================================
Messages:
fabric-lifecycle service should be up and running
================================================================================

Issue #2: xDoctor may report the following:

Timestamp = 2015-09-25_092907
Category  = health
Source    = fcli
Severity  = WARNING
Message   = Fabric Lifecycle Service not Healthy
Extra     =

Monitoring the Fabric Lifecycle Service with "sudo docker ps -a" shows that the service is restarting:

venus2:~ # docker ps -a
CONTAINER ID   IMAGE                                                          COMMAND                 CREATED       STATUS         PORTS   NAMES
7995f18ba27f   ip.ip.ip.ip:5000/emcvipr/object:2.0.1.0-62267.db4d4a8          "/opt/vipr/boot/boot    4 weeks ago   Up 21 hours            object-main
73f00ed0b6df   ip.ip.ip.ip:5000/caspian/fabric:1.1.1.0-1998.1391e7e           "./boot.sh lifecycle    4 weeks ago   Up 3 seconds           fabric-lifecycle
ba19a3c95151   ip.ip.ip.ip:5000/caspian/fabric-zookeeper:1.1.0.0-54.54a204e   "./boot.sh 2 1=169.2    4 weeks ago   Up 21 hours            fabric-zookeeper

venus2:~ # docker ps -a
CONTAINER ID   IMAGE                                                          COMMAND                 CREATED       STATUS                     PORTS   NAMES
7995f18ba27f   ip.ip.ip.ip:5000/emcvipr/object:2.0.1.0-62267.db4d4a8          "/opt/vipr/boot/boot    4 weeks ago   Up 21 hours                        object-main
73f00ed0b6df   ip.ip.ip.ip:5000/caspian/fabric:1.1.1.0-1998.1391e7e           "./boot.sh lifecycle    4 weeks ago   Exited (1) 2 seconds ago           fabric-lifecycle
ba19a3c95151   ip.ip.ip.ip:5000/caspian/fabric-zookeeper:1.1.0.0-54.54a204e   "./boot.sh 2 1=169.2    4 weeks ago   Up 21 hours                        fabric-zookeeper
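To confirm the restart loop, the container state can be sampled a few times in a row on the affected node. The following is a minimal sketch only; the container name "fabric-lifecycle" is taken from the docker ps output above, and the interval and iteration count are arbitrary.

# Watch the fabric-lifecycle container state on one node (run as a user with docker access):
for i in $(seq 1 6); do
    date
    sudo docker ps -a | grep fabric-lifecycle
    sleep 10
done

A healthy service shows a STATUS that keeps increasing (for example "Up 21 hours"); a restart loop shows the STATUS column cycling between "Up a few seconds" and "Exited (1) ... seconds ago", as in the example above.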
Cause Issue #1: The ZooKeeper container could not start properly due to the snapshot size.

Cause Issue #2: The ECS node IP address is resolving to an incorrect hostname.
Solution Issue #1: Resolved in ECS version 3.0. ECS 3.0 improves compaction and enables retention for ZooKeeper messages.

Note: This resolution works only when this build is already installed on the host; it will not help when an upgrade to this build is performed on a system that is already degraded. If this issue is seen, contact ECS Support.

How to verify this issue, run the following command:

# viprexec 'cat /opt/emc/caspian/fabric/agent/services/fabric/zookeeper/log/zookeeper.log | grep "GC overhead limit exceeded"'

Example output:

admin@:~> viprexec 'cat /opt/emc/caspian/fabric/agent/services/fabric/zookeeper/log/zookeeper.log | grep "GC overhead limit exceeded"'

Output from host : 192.168.219.4
java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded

Output from host : 192.168.219.5
java.lang.OutOfMemoryError: GC overhead limit exceeded

Output from host : 192.168.219.3
java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded

Output from host : 192.168.219.7
cat: /opt/emc/caspian/fabric/agent/services/fabric/zookeeper/log/zookeeper.log: No such file or directory

Output from host : 192.168.219.2

Output from host : 192.168.219.8
cat: /opt/emc/caspian/fabric/agent/services/fabric/zookeeper/log/zookeeper.log: No such file or directory

Output from host : 192.168.219.6
cat: /opt/emc/caspian/fabric/agent/services/fabric/zookeeper/log/zookeeper.log: No such file or directory

Output from host : 192.168.219.1

admin@:~>

This message is seen in the ZooKeeper log files:

java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOf(Arrays.java:3236)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
        at java.io.DataOutputStream.writeLong(DataOutputStream.java:224)
        at org.apache.jute.BinaryOutputArchive.writeLong(BinaryOutputArchive.java:59)
        at org.apache.zookeeper.data.Stat.serialize(Stat.java:129)
        at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123)
        at org.apache.zookeeper.proto.GetDataResponse.serialize(GetDataResponse.java:49)
        at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1067)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
        at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)

Solution Issue #2: Verify DNS resolution using nslookup with the IP address of the ECS node whose fabric-lifecycle service is restarting:

# nslookup <ECS node IP address>

If DNS is correct and the fabric-lifecycle service is still having issues, contact ECS Support.
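As a rough illustration of the DNS check for Issue #2, the reverse lookup of the node IP can be compared against the hostname the node reports for itself. This is a minimal sketch only; the NODE_IP value is a placeholder taken from the example output above and must be replaced with the actual public IP of the affected node, and the commands are assumed to be run on that node.

# Placeholder: IP of the node whose fabric-lifecycle service is restarting
NODE_IP=192.168.219.4

# Reverse lookup: the IP should resolve to the node's own hostname
nslookup "$NODE_IP"

# Forward lookup: the hostname the node reports for itself should resolve
# back to the same IP
nslookup "$(hostname)"

If the reverse lookup returns a hostname belonging to a different system, or the forward lookup does not return the node's own IP, correct the DNS records before investigating further.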