...
Issue #1: The following output is seen in service-console after running an upgrade on ECS from version 3.0.X or earlier to version 3.1 or later:

20180309 01:49:28.456: | | | PASS (21 min 29 sec)
20180309 01:49:28.462: | | PASS (21 min 29 sec)
20180309 01:49:28.463: | Run Keyword If
20180309 01:49:28.464: | | Node Service Upgrade
Initializing...
Executing Program: NODE_SERVICE_UPGRADE
|-Disable CallHome
| +-[0.0.0.0] SetCallHomeEnabled PASS (1/7, 1 sec)
|-Push Service Image To Registries
| |-Push Service Image to Head Registry
| | |-[169.254.1.1] LoadImage PASS (2/7, 1 sec)
| | +-[169.254.1.1] PushImage PASS (3/7)
| +-Push Service Image to Remote Registries
|-Upgrade Object On Specified Nodes
| +-Initiate Object Upgrade if Required
|   +-[0.0.0.0] UpdateApplicationOnNodes PASS (4/7, 1 sec)
|-Update Services Ownership To Lifecycle Manager on Specified Nodes
| +-Update Ownership For Object
|   +-[169.254.1.1] UpdateOwnership PASS (5/7)
|-Post-check Services Health
| +-Validate Object Service on Specified Nodes
|   +-[169.254.1.1] ServiceHealth PASS (6/7, 21 sec)
+-Enable CallHome
  +-[0.0.0.0] SetCallHomeEnabled PASS (7/7, 3 sec)
Elapsed time is 30 sec.
NODE_SERVICE_UPGRADE completed successfully
Collecting data from cluster
Information has been written to the
Information has been written to the
Executing /configure.sh --start action in object-main container which may take up to 600 seconds.
20180309 01:52:51.711: | | | PASS (3 min 23 sec)
20180309 01:52:51.720: | | PASS (3 min 23 sec)
20180309 01:52:51.722: | Run Keyword If
20180309 01:52:51.724: | | Update manifest file
[ERROR] On node 169.254.1.1, Lifecycle Jetty server is not up and running on port 9241!
20180309 01:58:45.068: | | | FAIL (5 min 53 sec)
20180309 01:58:45.071: | | FAIL (5 min 53 sec)
20180309 01:58:45.072: | FAIL (45 min 43 sec)
20180309 01:58:45.075: Service Console Teardown
20180309 01:58:46.973: | PASS (1 sec)
================================================================================
Status: FAIL
Time Elapsed: 45 min 56 sec
Debug log: /
HTML log: /
================================================================================
Messages:
fabric-lifecycle service should be up and running
================================================================================

Issue #2: xDoctor may report the following:

Timestamp = 2015-09-25_092907
Category  = health
Source    = fcli
Severity  = WARNING
Message   = Fabric Lifecycle Service not Healthy
Extra     =

Monitoring the Fabric Lifecycle Service with "sudo docker ps -a" shows that the service is restarting:

venus2:~ # docker ps -a
CONTAINER ID   IMAGE                                                          COMMAND                 CREATED       STATUS         PORTS   NAMES
7995f18ba27f   ip.ip.ip.ip:5000/emcvipr/object:2.0.1.0-62267.db4d4a8          "/opt/vipr/boot/boot    4 weeks ago   Up 21 hours            object-main
73f00ed0b6df   ip.ip.ip.ip:5000/caspian/fabric:1.1.1.0-1998.1391e7e           "./boot.sh lifecycle    4 weeks ago   Up 3 seconds           fabric-lifecycle
ba19a3c95151   ip.ip.ip.ip:5000/caspian/fabric-zookeeper:1.1.0.0-54.54a204e   "./boot.sh 2 1=169.2    4 weeks ago   Up 21 hours            fabric-zookeeper

venus2:~ # docker ps -a
CONTAINER ID   IMAGE                                                          COMMAND                 CREATED       STATUS                     PORTS   NAMES
7995f18ba27f   ip.ip.ip.ip:5000/emcvipr/object:2.0.1.0-62267.db4d4a8          "/opt/vipr/boot/boot    4 weeks ago   Up 21 hours                        object-main
73f00ed0b6df   ip.ip.ip.ip:5000/caspian/fabric:1.1.1.0-1998.1391e7e           "./boot.sh lifecycle    4 weeks ago   Exited (1) 2 seconds ago           fabric-lifecycle
ba19a3c95151   ip.ip.ip.ip:5000/caspian/fabric-zookeeper:1.1.0.0-54.54a204e   "./boot.sh 2 1=169.2    4 weeks ago   Up 21 hours                        fabric-zookeeper
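To confirm the restart loop, the container state can be sampled a few times in a row on the affected node. The following is a minimal sketch only; the container name "fabric-lifecycle" is taken from the docker ps output above, and the interval and iteration count are arbitrary.

# Watch the fabric-lifecycle container state on one node (run as a user with docker access):
for i in $(seq 1 6); do
    date
    sudo docker ps -a | grep fabric-lifecycle
    sleep 10
done

A healthy service shows a STATUS that keeps increasing (for example "Up 21 hours"); a restart loop shows the STATUS column cycling between "Up a few seconds" and "Exited (1) ... seconds ago", as in the example above.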
Cause Issue #1: The ZooKeeper container could not start properly due to the snapshot size.

Cause Issue #2: The ECS node IP address is resolving to an incorrect hostname.
Solution Issue #1: Resolved in ECS version 3.0. ECS 3.0 improves compaction and enables retention for ZooKeeper messages.

Note: This resolution works only when this build is already installed on the host; it will not help when an upgrade to this build is performed on a system that is already degraded. If this issue is seen, contact ECS Support.

How to verify this issue, run the following command:

# viprexec 'cat /opt/emc/caspian/fabric/agent/services/fabric/zookeeper/log/zookeeper.log | grep "GC overhead limit exceeded"'

Example output:

admin@:~> viprexec 'cat /opt/emc/caspian/fabric/agent/services/fabric/zookeeper/log/zookeeper.log | grep "GC overhead limit exceeded"'

Output from host : 192.168.219.4
java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded

Output from host : 192.168.219.5
java.lang.OutOfMemoryError: GC overhead limit exceeded

Output from host : 192.168.219.3
java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.OutOfMemoryError: GC overhead limit exceeded

Output from host : 192.168.219.7
cat: /opt/emc/caspian/fabric/agent/services/fabric/zookeeper/log/zookeeper.log: No such file or directory

Output from host : 192.168.219.2

Output from host : 192.168.219.8
cat: /opt/emc/caspian/fabric/agent/services/fabric/zookeeper/log/zookeeper.log: No such file or directory

Output from host : 192.168.219.6
cat: /opt/emc/caspian/fabric/agent/services/fabric/zookeeper/log/zookeeper.log: No such file or directory

Output from host : 192.168.219.1

admin@:~>

This message is seen in the ZooKeeper log files:

java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Arrays.copyOf(Arrays.java:3236)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
        at java.io.DataOutputStream.writeLong(DataOutputStream.java:224)
        at org.apache.jute.BinaryOutputArchive.writeLong(BinaryOutputArchive.java:59)
        at org.apache.zookeeper.data.Stat.serialize(Stat.java:129)
        at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123)
        at org.apache.zookeeper.proto.GetDataResponse.serialize(GetDataResponse.java:49)
        at org.apache.jute.BinaryOutputArchive.writeRecord(BinaryOutputArchive.java:123)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1067)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
        at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)

Solution Issue #2: Verify DNS resolution using nslookup with the IP address of the ECS node whose fabric-lifecycle service is restarting:

# nslookup <ECS node IP address>

If DNS is correct and the fabric-lifecycle service is still having issues, contact ECS Support.
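As a rough illustration of the DNS check for Issue #2, the reverse lookup of the node IP can be compared against the hostname the node reports for itself. This is a minimal sketch only; the NODE_IP value is a placeholder taken from the example output above and must be replaced with the actual public IP of the affected node, and the commands are assumed to be run on that node.

# Placeholder: IP of the node whose fabric-lifecycle service is restarting
NODE_IP=192.168.219.4

# Reverse lookup: the IP should resolve to the node's own hostname
nslookup "$NODE_IP"

# Forward lookup: the hostname the node reports for itself should resolve
# back to the same IP
nslookup "$(hostname)"

If the reverse lookup returns a hostname belonging to a different system, or the forward lookup does not return the node's own IP, correct the DNS records before investigating further.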