...
You recently deployed NSX Application Platform (NAPP) in version 4.1.1 or upgraded NAPP to version 4.1.1.
In the VMware NSX UI, under alarms, 'Metrics Delivery Failure' alarms are shown for ESXi transport nodes.
On the ESXi host that the alarm is for, /var/run/log/nsx-syslog shows the following error:

    Wa(180) nsx-sha: NSX 2315153 - [nsx@6876 comp="nsx-esx" subcomp="nsx-sha" username="root" level="WARNING" s2comp="tsdb-sender-napp"] Failed to send one msg node_id: "xxxxxxxx-8635-4487-88bf-xxxxxxxxxxxx"
    Wa(180)[+] nsx-sha: timestamp: 1693143034
    Wa(180)[+] nsx-sha: health_check_poll: false
    Wa(180)[+] nsx-sha: :
    Wa(180)[+] nsx-sha: <_InactiveRpcError of RPC that terminated with:
    Wa(180)[+] nsx-sha: status = StatusCode.UNAUTHENTICATED
    Wa(180)[+] nsx-sha: details = ""
    Wa(180)[+] nsx-sha: debug_error_string = "{"created":"@xxxxxxxxxx.480732227","description":"Error received from peer ipv4:192.168.1.1:443","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"","grpc_status":16}"
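To confirm that a host is hitting this symptom, the log can be scanned for tsdb-sender-napp entries that end in StatusCode.UNAUTHENTICATED. The following is a minimal Python sketch, not a supported tool. It assumes that continuation lines carry a "[+]" marker after the level tag, as in the excerpt above; the function names are hypothetical.

```python
import re

# Assumption: continuation lines of one log entry look like "Wa(180)[+] ...",
# possibly preceded by a timestamp, as in the excerpt above.
CONTINUATION = re.compile(r"\w+\(\d+\)\[\+\]\s")


def group_entries(lines):
    """Group raw log lines into entries: a first line plus its "[+]" continuations."""
    entries = []
    for line in lines:
        if entries and CONTINUATION.search(line):
            entries[-1].append(line)
        else:
            entries.append([line])
    return entries


def find_unauthenticated_entries(lines):
    """Return entries logged by tsdb-sender-napp that report UNAUTHENTICATED."""
    return [
        entry
        for entry in group_entries(lines)
        if 's2comp="tsdb-sender-napp"' in entry[0]
        and any("StatusCode.UNAUTHENTICATED" in part for part in entry)
    ]
```

The function accepts any iterable of lines, so an open file object for the nsx-syslog file can be passed in directly. A non-empty result means the host matches the symptom described here.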
Issue 1. NSX_UA_TN is missing in authserver. The NAPP authentication server is missing an entity ID for the transport node; in this case the transport node is an ESXi host. Because the ID is missing, the metrics service cannot authenticate and therefore fails to deliver metrics.
Issue 2. The alarm occurs for every new Edge or ESXi node that is added to NSX after the NAPP deployment.
Issue 3. The API cert on NSX Manager has changed after the NAPP deployment.
This is a known issue impacting VMware NSX NAPP.
Steps to validate that the certs are in sync between NAPP, the TN or Edge node, and NSX:

NAPP:

Authserver: The certs added to authserver from trust manager after a restart can be found by grepping the authserver logs. Example commands:

    napp-k logs authserver-<podname> | grep "NSX_UA_TN"
    napp-k logs authserver-<podname> | grep "NSX_UA_EDGE"

Trust-manager: The certs present in trust manager can be queried using the following API. A cert of type NSX_UA_TN / NSX_UA_EDGE should be present in the result of the GET call. In the result, the alias field is the UUID of the TN/Edge node, which can be obtained by executing get node-uuid on the node.

    GET https://<NSX_MANAGER_IP>/napp/api/v1/platform/trust-management/certificates

Example TN node cert from the trust manager get certs API call:

    {
      "uuid": "45040503-188b-4757-84cd-7e74YYYYYY",
      "alias": "0af35bd1-f42e-4039-b48c-396650aXXXX",
      "pem_encoded": "-----BEGIN CERTIFICATE-----\nMIIEEDCCAvgCCQC...sFADCXXXXXXshCSk\n-----END CERTIFICATE-----",
      "used_by": "NSX_UA_TN"
    },

TN / Edge node: The following command can be used on the TN/Edge node to get the host cert, which must match the cert held for that node in the NAPP trust manager.

    cat /etc/vmware/nsx/host-cert.pem

The UUID of the node can be obtained with the CLI command get node-uuid. It should match the alias of the TN/Edge node in the trust manager get certs API call. Example:

In TN node:
    >> /bin/nsxcli -c get node-uuid
In Edge node:
    >> su admin
    >> get node-uuid

NSX manager side: The TN certs present in NSX manager can be queried using the following API.

    GET "https://<NSX_MANAGER_IP>/api/v1/messaging/clients"

Common agent is the service that pushes the certs into trust manager on the NAPP side. To check whether common agent has synced properly, check the logs in /var/log/proton/nsxapi.log around the time the TN/Edge node was added.

Steps to identify the leader node of the common agent service: Figure out which of the 3 manager nodes has the common agent leadership role.
The following commands show which node is the common agent leader, identified by node ID.

1. su admin -c get clus stat verb | grep "COMMON_AGENT_SERVICE"
2. To figure out which node the ID belongs to: su admin -c get clus stat

Issue 1 - NSX_UA_TN is missing in authserver:

On the NAPP side: SSH to the NSX manager as root and run the below:

    napp-k edit deployment authserver

Then search for the below line:

    --trustmanager-entities=NSX_UA_SVM, NSX_UA_EDGE

Edit the line and add the missing entity ID:

    --trustmanager-entities=NSX_UA_SVM, NSX_UA_EDGE, NSX_UA_TN

Note: There should be 3 entity IDs: NSX_UA_SVM, NSX_UA_EDGE, NSX_UA_TN

Restart authserver:

    napp-k -n nsxi-platform edit deployment authserver

(If the napp-k alias is not functional, you can point directly to the Kubernetes config file, like: kubectl --kubeconfig /config/vmware/napps/.kube/config)

With this, the authserver should restart and sync the entity certificates from trust manager, and the auth issue on the NAPP side should be resolved. To validate, check that the certs in trust manager and authserver are in sync.

Issue 2 - For every new Edge or ESXi node that gets added to NSX after NAPP deployment:

Validate that the Edge/TN node cert (/etc/vmware/nsx/host-cert.pem) is present in the NAPP trust manager certs (GET certs API). Once that is validated, restart the authserver pod so that authserver refreshes its certs.

Issue 3 - The API cert on NSX Manager has changed after the NAPP deployment:

Check which certificate is in use for the SHA agent on NSX manager: Get the root certificate and node certificate by searching syslog via command `zgrep nsx-sha /var/log/syslog* | grep "NAPP Profile"`. The certificate may not be found in the log if the log entry about the connection has been rotated. In this case, restart the SHA agent following the below steps.

Get the API certificate: Get the certificate ID from the NSX UI:

a. Log in to the NSX Manager UI and navigate to System > Certificates
b. Locate the API certificate for the node ID used by the manager in question (find the Manager node ID by running get nodes in the Manager CLI, or in the NSX UI at System > Appliances > NSX Manager > VIEW DETAILS > Copy the UUID)
c. Expand this certificate item and note its UUID

Use API GET /api/v1/trust-management/certificates/[certificate ID from last step]

    curl -k -i -H "Accept: application/json" -u admin -X GET https://<Manager IP>/api/v1/trust-management/certificates/<certificate ID>

    {
      "pem_encoded": "-----BEGIN CERTIFICATE-----\nMIIrdzCCAl+gAwIBAgIJAP/fpd0dacyuMA0GCSqGSIb3DQEBCwUAMGgxFDASBgNV\nBAMMC25zeC1tYW5hZ2VyMQwwCgYDVQQLDANOU1gxFDASBgNVBAoMC1ZNd2FyZSBJ\nbmMuMRIwEAYDVQQHDAlQYWxvIEFsdG8xCzAJBgNVBAgMAkNBMQswCQYDVQQGEwJV\nUzAWerSWQertQxMTM2MzVaFw0yNTEwMjgxMTM2MzVaMGgxFDASBgNVBAMMC25z\neC1tYW5hZ2VyMQwwCgYDVQQLDANOU1gxFDASBgNVBAoMC1ZNd2FyZSBJbmMuMRIw\nEAYDVQQHDAlQYWxvIEFsdG8xCzAJBgNVBAgMAkNBMQswCQYDVQQGEwJVUzCCASIw\nDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAMJbSP3uVlVmkyp0pR95+ppbuTaL\ncvxMgCxvgTQ0LifTJFX3wraatPXgwBo4r8cpXeP+aLn/KxR8jTWnXbZyK+4ssaW9\n/X5tFYabn0TNyQl6aPO03mmJNLZOKUcIXXP7DKtkt6TEaWH1X4C45gxteXponbc9\nCFnVmArci0pkkBFng+l9fASu35P4LuBkHmFbspOA23JNmCCTtvW0n+Ry0NqP6mw8\nbRqAkymlQI6q2aVPcPUChmptdNqx1gEWXnGaxlfK2hu6tvTnC4jeTirG5Yepv2yP2n5DTGog0GPHP1k9f7bQkNwDkQ7YlvC3AvJUt/b4b3WMeWlnHhEMyFEwMmDaECAwEA\nAaMkMCIwEwYDVR0lBAwwCgYIKwYBBQUHAwEwCwYDVR0RBAQwAoIAMA0GCSqGSIb3\nDQEBCwUAA4IBAQBQq/XN1HkYAnENYkxlwjuzlxzDkYsnr82E7PEVyJ5yP4m6sF85\nzz5FsCt7Y6kHt+2xrgNi0UHuSByvYxtkOzBTOqqLoDltBEng+HOTW2Cd6zD+xvHL\nxg41K6ykfMvjBc+wb+h2JQfUiL8yh10g4Uvpv1HKtCCUJb00kLRK4TIm5+KHtIB4\nF8uWHtwBnz93G0PO1/K89gybQgy+WitjM0NYExytIiLcWVETQc9rVd2ubLxxExJ3\nBxQsEMOviB6I6KjCmDtk69vOSvrZGXxUBQhve3BQku44jWVUg5AWJRZKm9sRMAHu\nOyE2ycIrToxsFuiwpWOzyTMReq2NQIuh0F2Q\n-----END CERTIFICATE-----\n",
      "has_private_key": true,
      "used_by": [
        {
          "node_id": "<UUID>",
          "service_types": [ "API" ]
        }
      ],
      "leaf_certificate_sha_256_thumbprint": "15:B5:93:F0:35:77:91:3B:22:B6:D3:24:6F:F1:9D:15:DE:4E:D3:C4:EB:51:2D:D2:0D:66:D1:65:2B:7F:18:BE",
      "resource_type": "certificate_self_signed",
      "id": "<UUID>",
      "display_name": "API certificate for node <UUID>",
      "_create_time": 1690371634010,
      "_create_user": "system",
      "_last_modified_time": 1690377045873,
      "_last_modified_user": "admin",
      "_system_owned": false,
      "_protection": "NOT_PROTECTED",
      "_revision": 2
    }

If the node certificate is different from the updated API certificate, restart the SHA agent via command `service nsx-sha restart`. If restarting the SHA agent does not work, restart proton via command `service proton restart`.

The last step of remediation is to restart proton on the common agent leader node, which forces a full sync. Allow a few minutes for common agent to complete the full sync, and then restart authserver on the NAPP side. This resyncs the cert from trust manager and resolves the issue.

Restart proton:

    systemctl restart proton

Restart authserver on the NAPP side:

    napp-k delete pod authserver-<podname>