Symptoms
The following issues are seen on a vRealize Automation or vRealize Orchestrator 8.1 Patch 2 (8.1.0.9583) system:
Restarting a node gets stuck on "starting docker engine". This may take a long time, but eventually completes successfully.
The user can log in to the system in parallel via SSH. When running the command journalctl -u docker -f, messages similar to the following are seen:
<MONTH> <DATE> HH:MM:SS <HOSTNAME> dockerd[679]: time="YYYY-MM-DDTHH:MM:SS.970559016Z" level=info msg="Removing stale sandbox badfb5b205f959b40f7eb587106b9a8d62f86393876ab18783926e0b116700d5 (bb731fa1ed1a5cc552487a69bbf2cefab536d252877997bd2e212c7f2b2b467e)"
<MONTH> <DATE> HH:MM:SS <HOSTNAME> dockerd[679]: time="YYYY-MM-DDTHH:MM:SS.128605700Z" level=info msg="Removing stale sandbox 2bba66f9d6eef514c2815622c4f10a7c538f6a344c596740622ae7f4cee1c5f1 (e24cd80099d57af42d82d00a95b72f177d44f8d76482265095b060fee65cb14c)"
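To gauge how many of these messages a node has logged, the journal output can be piped through a small counter. The following is a minimal sketch, not part of the official procedure; the function name count_stale_sandboxes is hypothetical and introduced only for illustration.

```shell
# Count "Removing stale sandbox" messages read from standard input.
# grep -c prints 0 and exits non-zero when nothing matches, so "|| true"
# keeps the function's exit status at 0 in that case.
count_stale_sandboxes() {
    grep -c 'Removing stale sandbox' || true
}

# Example on the appliance (messages from the current boot only):
#   journalctl -u docker -b --no-pager | count_stale_sandboxes
```

A large and still growing count while the node sits on "starting docker engine" is consistent with the symptom described here.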
This issue can also occur during an upgrade. The system may remain for a long time on the step 'Deactivating cluster of appliance nodes. This might take several minutes.' In the meantime, LCM may time out even though the upgrade eventually succeeds on vRealize Automation.
Cause
This is a known bug in vRealize Automation 8.1 Patch 2 that prevents the Docker batch clean-up from operating normally.
Resolution
VMware is aware of this issue. See the workaround below.
A fix for this issue will be rolled into a later release of Cumulative Update for vRealize Automation 8.1.
Notes:
A reboot (without this workaround) can take anywhere between 5 minutes and 8 hours.
This should be executed BEFORE rebooting.
This only needs to be completed ONCE.
This should NOT be executed during a "long startup". This must be executed BEFORE the reboot.
There is no downtime associated with running this procedure.
Workaround
To work around the issue, run the following steps on vRealize Automation / vRealize Orchestrator 8.1 Patch 2 (8.1.0.9583):
Take simultaneous snapshots, without memory, on all cluster nodes.
On one of the nodes, run:
vracli cluster exec -- bash -c "[[ -f '/etc/systemd/system/docker.service.d/20-10-k8s-config.conf' ]] && echo 'W1NlcnZpY2VdCkV4ZWNTdGFydFByZT0vYmluL2Jhc2ggLWMgJ1tbIC1mICIvb3B0L3NjcmlwdHMvY2xlYW51cF9kb2NrZXJfc3RvcmFnZS5zaCIgXV0gJiYgL29wdC9zY3JpcHRzL2NsZWFudXBfZG9ja2VyX3N0b3JhZ2Uuc2ggfHwgdHJ1ZScKRXhlY1N0YXJ0UG9zdD0vYmluL2Jhc2ggLWMgJ1tbIC1mICIvb3B0L3NjcmlwdHMvY2xlYW51cF9kb2NrZXJfc3RvcmFnZS5zaCIgXV0gJiYgL29wdC9zY3JpcHRzL3Jlc3RvcmVfZG9ja2VyX2ltYWdlcy5zaCAib24tZGVtYW5kIiB8fCB0cnVlJwo=' | base64 -d > /etc/systemd/system/docker.service.d/20-10-k8s-config.conf; systemctl daemon-reload"
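Because the command writes a base64-encoded payload into a systemd drop-in file, it can be useful to decode that payload first and review what will be written. The following is a minimal sketch for inspection only; it changes nothing on the system. The variable names payload and decoded are introduced here for illustration, and the string is copied unchanged from the command above.

```shell
# Base64 payload copied verbatim from the workaround command.
payload='W1NlcnZpY2VdCkV4ZWNTdGFydFByZT0vYmluL2Jhc2ggLWMgJ1tbIC1mICIvb3B0L3NjcmlwdHMvY2xlYW51cF9kb2NrZXJfc3RvcmFnZS5zaCIgXV0gJiYgL29wdC9zY3JpcHRzL2NsZWFudXBfZG9ja2VyX3N0b3JhZ2Uuc2ggfHwgdHJ1ZScKRXhlY1N0YXJ0UG9zdD0vYmluL2Jhc2ggLWMgJ1tbIC1mICIvb3B0L3NjcmlwdHMvY2xlYW51cF9kb2NrZXJfc3RvcmFnZS5zaCIgXV0gJiYgL29wdC9zY3JpcHRzL3Jlc3RvcmVfZG9ja2VyX2ltYWdlcy5zaCAib24tZGVtYW5kIiB8fCB0cnVlJwo='

# Decode to plain text; nothing is written to disk.
decoded=$(printf '%s' "$payload" | base64 -d)

# Print the drop-in content for review.
printf '%s\n' "$decoded"
```

The decoded text should be a [Service] section with an ExecStartPre entry and an ExecStartPost entry, each of which calls a script under /opt/scripts only if /opt/scripts/cleanup_docker_storage.sh exists.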