...
This issue is seen when Smart Licensing Using Policy [SLP] is implemented and Resource Utilization Measurement (RUM) reports accumulate in large quantities on the device.

The device generates RUM reports at the reporting interval configured on it. It also generates new reports on events such as configuration changes, reloads, changes in license count, and HA configuration changes. According to that interval, the device pushes its reports to the On-Prem server, which forwards them to CSSM. CSSM processes the reports and sends acknowledgments (ACKs) back to the On-Prem server, which forwards them to the product instance. Once the device receives the ACK, it moves the reports from the "Unacknowledged" state to the "Acknowledged" state. All Acknowledged reports are then purged from the device automatically.

If communication fails anywhere in this end-to-end process, RUM reports do not receive the ACK they need and accumulate on the device. When the backlog grows large enough, it can cause high CPU utilization by the Smart Agent processes.

Customers may see CPUHOG messages such as the following, which call out Smart Agent ("SA") processes:

Jan 1 01:02:03: %SYS-3-CPUHOG: Task is running for (2057)msecs, more than (2000)msecs (19/19),process = SAUtilReport.

The high CPU utilization is due to Smart Agent processes within IOSd:

Router#show process cpu sorted
CPU utilization for five seconds: 100%/1%; one minute: 61%; five minutes: 28%
 PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
 662   272007421    19185503      14177 60.00% 45.62% 13.41%   0 SAGetRUMIds
 150   310763355    20821285      14925 38.03% 12.50% 13.43%   0 SAUtilRepSave
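As a rough aid for spotting this symptom in captured output, the sketch below parses the process rows of "show process cpu sorted" and flags Smart Agent processes whose five-second CPU share is at or above a threshold. The function name, the 10% default threshold, and the set of process names (taken only from the outputs quoted above, so not necessarily exhaustive) are assumptions made for illustration; this is not a Cisco tool.

```python
import re

# Process names seen in the outputs quoted above. This set is an assumption
# for illustration and may not cover every Smart Agent process.
SMART_AGENT_PROCESSES = {"SAGetRUMIds", "SAUtilRepSave", "SAUtilReport"}

# One process row: PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
ROW = re.compile(
    r"^\s*(?P<pid>\d+)\s+(?P<runtime>\d+)\s+(?P<invoked>\d+)\s+(?P<usecs>\d+)\s+"
    r"(?P<five_sec>[\d.]+)%\s+(?P<one_min>[\d.]+)%\s+(?P<five_min>[\d.]+)%\s+"
    r"(?P<tty>\d+)\s+(?P<process>.+?)\s*$"
)


def busy_smart_agent_processes(output, threshold=10.0):
    """Return (process, five_sec_cpu) pairs for Smart Agent rows >= threshold."""
    hits = []
    for line in output.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # summary line, column header, or blank line
        name = match.group("process")
        five_sec = float(match.group("five_sec"))
        if name in SMART_AGENT_PROCESSES and five_sec >= threshold:
            hits.append((name, five_sec))
    return hits


SAMPLE = """\
CPU utilization for five seconds: 100%/1%; one minute: 61%; five minutes: 28%
 PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
 662   272007421    19185503      14177 60.00% 45.62% 13.41%   0 SAGetRUMIds
 150   310763355    20821285      14925 38.03% 12.50% 13.43%   0 SAUtilRepSave
"""

if __name__ == "__main__":
    for name, cpu in busy_smart_agent_processes(SAMPLE):
        print(f"{name}: {cpu:.2f}% (5 sec)")
```

Matching on exact process names keeps the check conservative: unrelated processes with high CPU are ignored, and rows that do not fit the expected column layout are skipped rather than guessed at.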
Smart Licensing Using Policy [SLP] is being used in an On-Prem deployment, and there's a buildup of unacknowledged reports.
A "license smart factory reset" (along with the reload required for it to take effect) should clear out the buildup of RUM reports. However, this is only a temporary mitigation because it does not correct the source of the buildup. Over time, devices could end up back in this problematic state.

Customers can try switching to the offline method of SLP (rather than using On-Prem) to help avoid the periodic generation of RUM reports. For details, please refer to the "Workflow for Topology: No Connectivity to CSSM and No CSLU" section of the "Smart Licensing Using Policy for Cisco Enterprise Routing Platforms" guide:
https://www.cisco.com/c/en/us/td/docs/routers/sl_using_policy/b-sl-using-policy/how_to_configure_workflows.html#Cisco_Concept.dita_7057e18c-3c69-4d91-841b-0b5beb7a2d88
The commits from this defect will purge excess reports.

The accumulation of RUM reports can lead to high CPU and memory utilization.

High CPU utilization was addressed in CSCwa85199:
CSCwa85199 - Unacknowledged Reports can cause High CPU Utilization due to Smart Agent

High memory utilization under MallocLite was addressed in CSCwa85525:
CSCwa85525 - Memory leak in *MallocLite* due to growing Smart Agent Memory Utilization

Starting in IOS-XE 17.9, the way SLP handles RUM reports has been updated, and this issue does not affect those newer releases.

Please ensure that your licensing workflow is operating as expected. Devices should be able to sync with On-Prem, On-Prem should be able to sync with CSSM, and the acknowledgment (ACK) from CSSM should make its way back to On-Prem and ultimately to the product instance.
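To illustrate why every hop of the workflow matters, the toy model below tracks the unacknowledged backlog on a product instance while either hop is down. It is a deliberately simplified sketch and not how the Smart Agent is implemented: the class, its method names, and the assumption that one successful sync acknowledges and purges the whole backlog are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProductInstance:
    """Toy model of a device holding unacknowledged RUM reports (hypothetical)."""
    unacknowledged: list = field(default_factory=list)
    next_report_id: int = 1

    def generate_report(self):
        # A new report starts in the "Unacknowledged" state.
        self.unacknowledged.append(self.next_report_id)
        self.next_report_id += 1

    def sync(self, device_to_onprem_ok, onprem_to_cssm_ok):
        # An ACK only returns if both hops work end to end.
        # Assumption: one successful sync acknowledges the whole backlog,
        # and acknowledged reports are purged immediately.
        if device_to_onprem_ok and onprem_to_cssm_ok:
            self.unacknowledged.clear()
        return len(self.unacknowledged)


def simulate(intervals, device_to_onprem_ok, onprem_to_cssm_ok):
    """Return the backlog size after each reporting interval."""
    device = ProductInstance()
    backlog = []
    for _ in range(intervals):
        device.generate_report()
        backlog.append(device.sync(device_to_onprem_ok, onprem_to_cssm_ok))
    return backlog


if __name__ == "__main__":
    print("healthy workflow:   ", simulate(5, True, True))
    print("On-Prem cannot sync:", simulate(5, True, False))
```

In this model the backlog stays at zero only when both hops succeed; a failure at either hop makes it grow by one report per interval, which mirrors the accumulation described above.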